From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: fstests generic/441 -- occasional bcachefs failure
Date: Sat, 4 Feb 2023 16:33:41 -0500
Message-ID: <Y97PNdqx82rot2WC@bfoster>
In-Reply-To: <Y9w/oMzjVp90c/Pk@moria.home.lan>

On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is that
> > > > we can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shut down by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error-handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs-specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> 
> I suppose it doesn't make sense to flush the journal if
> file_write_and_wait_range() didn't successfully do anything - but I
> would prefer to keep the behaviour where we do flush the journal on
> partial writeback.
> 

That seems reasonable to me in principle...

> What if we plumbed a did_work parameter through?
> 

... but I'm not sure how that's supposed to work. Tracking submits
from the current flush wouldn't be hard, but it wouldn't tell us about
writeback that might have occurred in the background before fsync was
called. It also doesn't give us any information about which I/Os
succeeded or failed.
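
Just to make that concrete, the plumbing would have to look something
like the sketch below. All of these names are hypothetical (nothing
like this exists in the tree); the comments mark exactly where the
idea runs out of information:

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Hypothetical sketch of the did_work idea; none of these names exist
 * in the tree.  It shows what this path can observe (submissions from
 * this flush) and what it can't (background writeback that happened
 * before fsync, or which I/Os failed).
 */
static int flush_range_did_work(struct address_space *mapping,
				loff_t start, loff_t end,
				bool *did_work)
{
	int ret;

	*did_work = false;

	/* Kick off writeback for dirty pages in the range. */
	ret = filemap_fdatawrite_range(mapping, start, end);
	if (ret)
		return ret;

	/*
	 * Assumption for illustration only: treat a clean return as
	 * "work was done".  In reality there is no interface that says
	 * whether anything was actually submitted, and nothing here can
	 * see writeback that completed (or failed) before fsync ran,
	 * which is exactly the problem described above.
	 */
	*did_work = true;

	/* Wait for the I/O; the first recorded error surfaces here. */
	return filemap_fdatawait_range(mapping, start, end);
}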

I was thinking about whether we could look at PageError pages or
perhaps sample errseq or something in the error case, but fdatawait
can clear page errors, and errseq only bumps the sequence value if
it's been sampled since the last update (so there's no way to
distinguish "one failure" from "everything failed").

The only other thing that comes to mind is explicit
submit/completion/error counting and tracking somewhere in the mapping
or inode or some such, but that seems like overkill given fsync error
semantics. I.e., when fsync returns an error, we're basically telling
the user "data loss somewhere in the file," without any indication of
where or how much, so the user basically has to recreate the file or
toss it anyways. It would be nice to improve on that, but that's a
much bigger problem than I'm trying to address here. ;P
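
For completeness, the shape of that overkill option would be roughly
the below; purely hypothetical, nothing like it hangs off the mapping
or inode today:

#include <linux/types.h>

/*
 * Hypothetical per-mapping writeback accounting.  Even with counters
 * like these, fsync could still only report that data was lost
 * somewhere in the file, not where or how much, unless the tracking
 * were made per-range.
 */
struct wb_error_stats {
	atomic_t submitted;	/* writeback bios submitted */
	atomic_t completed;	/* completed successfully */
	atomic_t failed;	/* completed with an error */
};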

So I dunno.. I'll try to think a bit more about it, but atm I'm not
seeing any sort of elegant way to fix this test and also preserve the
inode flush in the event of writeback failure. Thoughts? ISTM that
it's reasonable to skip the flush on wb error, given that fsync
behavior and error semantics seem common across all major fs', and
since background writeback should still try to handle it in short
order anyways, but it's your tree.. :)
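
FWIW, in generic VFS terms the early return I'm describing amounts to
no more than the below. This is a sketch, not the actual patch:
generic helpers stand in for the bcachefs-specific calls, and the
journal flush is left as a placeholder:

#include <linux/fs.h>

/*
 * Sketch of tweak 1: return on the first error instead of always
 * doing flush -> sync meta -> flush log and returning the first
 * error afterwards.
 */
static int fsync_first_error_sketch(struct file *file, loff_t start,
				    loff_t end, int datasync)
{
	int ret;

	/* Flush and wait on dirty pages; bail if writeback failed. */
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;

	/* Only sync inode metadata once data writeback succeeded. */
	ret = sync_inode_metadata(file_inode(file), 1);
	if (ret)
		return ret;

	/* The journal/log flush would follow here in the real fs. */
	return 0;
}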

Brian


Thread overview: 24+ messages
2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
2023-01-26 15:08 ` Kent Overstreet
2023-01-27  7:21   ` Kent Overstreet
2023-01-27 14:50   ` Brian Foster
2023-01-30 17:06     ` Kent Overstreet
2023-01-31 16:04       ` Brian Foster
2023-02-01 14:34         ` Kent Overstreet
2023-02-02 15:50           ` Brian Foster
2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
2023-02-02 20:04               ` Brian Foster
2023-02-02 22:39                 ` Kent Overstreet
2023-02-03  0:51               ` Dave Chinner
2023-02-04  0:35                 ` Kent Overstreet
2023-02-07  0:03                   ` Dave Chinner
2023-02-16 20:04                     ` Eric Wheeler
2023-02-20 22:19                       ` Dave Chinner
2023-02-20 23:23                         ` Kent Overstreet
2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
2023-02-04 21:33           ` Brian Foster [this message]
2023-02-04 22:15             ` Kent Overstreet
2023-02-06 15:33               ` Brian Foster
2023-02-06 22:18                 ` Kent Overstreet
2023-02-09 12:57                   ` Brian Foster
2023-02-09 14:58                     ` Kent Overstreet
