Re: fstests generic/441 -- occasional bcachefs failure

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: fstests generic/441 -- occasional bcachefs failure
Date: Mon, 6 Feb 2023 10:33:14 -0500	[thread overview]
Message-ID: <Y+EduoshRHXec+XU@bfoster> (raw)
In-Reply-To: <Y97Y76dSCVkF0WIE@moria.home.lan>

On Sat, Feb 04, 2023 at 05:15:11PM -0500, Kent Overstreet wrote:
> On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote:
> > On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> > > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > > > Something else that occurred to me while looking further at this is we
> > > > > > can also avoid the error in this case fairly easily by bailing out of
> > > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > > > anyways. That would prevent marking the inode with a new sequence number
> > > > > > when I/Os are obviously failing. The caveat is that the test still
> > > > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > > > runs.
> > > > > 
> > > > > If some pages did write successfully we don't want to skip the rest of
> > > > > the fsync, though.
> > > > > 
> > > > 
> > > > What does it matter if the fsync() has already failed? ISTM this is
> > > > pretty standard error handling behavior across major fs', but it's not
> > > > clear to me if there's some bcachefs specific quirk that warrants
> > > > different handling..
> > > > 
> > > > FWIW, I think I've been able to fix this test with a couple small
> > > > tweaks:
> > > > 
> > > > 1. Change bch2_fsync() to return on first error.
> > > 
> > > I suppose it doesn't make sense to flush the journal if
> > > file_write_and_wait_range() didn't successfully do anything - but I
> > > would prefer to keep the behaviour where we do flush the journal on
> > > partial writeback.
> > > 
> > 
> > That seems reasonable to me in principle...
> > 
> > > What if we plumbed a did_work parameter through?
> > > 
> > 
> > ... but I'm not sure how that is supposed to work..? Tracking submits
> > from the current flush wouldn't be hard, but wouldn't tell us about
> > writeback that might have occurred in the background before fsync was
> > called. It also doesn't give any information about what I/Os succeeded
> > or failed.
> 
> Good point.
> 
> Now that I think about it, inode->bi_journal_seq won't be getting
> updated unless a write actually did successfully complete - meaning
> bch2_flush_inode() won't do anything unless there was a successful
> transaction commit; either data was written or the inode was updated for
> some other reason.
> 
> Maybe it's the sync_inode_metadata() call that's causing the journal
> flush to happen?
> 

Yeah, the sync_inode_metadata() -> bch2_vfs_write_inode() ->
bch2_write_inode() path is what commits a transaction and bumps the
->bi_journal_seq on the inode. Essentially I think there are a couple
issues related to this test and perhaps easier to reason about them
separately:

The first issue is that a shutdown occurs because the test basically
does a pwrite() -> fsync() -> dm-error error table switch sequence, and
the error table switch races with journal reclaim I/O that kicks in
after the fsync completes. This problem is resolved by the bcachefs
specific sleep during the dm-table switch (hopefully to be replaced by a
freeze cycle in the future), so no bcachefs changes are required for
that one.

The second issue is that bcachefs still shuts down with the
aforementioned delay in place for a slightly different reason. This
occurs because after the switch to an error table, the test does another
pwrite() -> fsync() sequence to test the "fsync should return error"
case. What I see happen in bcachefs here is that writepages occurs
(because the buffered write is able to dirty the pages and inode and
whatnot since everything is buffered) and fails, bch2_vfs_write_inode()
commits a transaction (again successful, and marks the inode with the
latest journal seq), and then bch2_flush_inode() grabs the updated inode
seq, attempts to flush the journal, fails due to I/O error, and then
shuts down the fs. The shutdown leads to general test failure because
the test eventually switches back to a functional dm table, but bcachefs
reports errors because it has shutdown.

So yes, it seems the sync_inode_metadata() path is what bumps the
journal seq and the explicit journal flush is what shuts down the fs,
but I assume sync_inode_metadata() attempts the inode flush simply
because the buffered write that precedes it works perfectly fine. This
is what had me thinking that bch2_fsync() is being more aggressive than
it really needs to be here. With the proposed logic tweak, the second
issue is resolved because bcachefs no longer attempts the inode flush
when page writeback has failed. So what happens in the test is that it
switches back over to the functional dm table, the vfs inode flush
occurs then and succeeds, and so the test sees no (unexpected) errors
and the fs has survived without shutting down.

All in all I think this test basically builds on some minor assumptions
about how fsync error behavior is implemented in Linux, and bcachefs
happens to slightly diverge in a way that leads to this shutdown. IOW,
I'd expect the same problematic behavior out of XFS if it implemented
this sort of fsync logic, but afaict no other fs does that, so the test
is functional in practice. I admit that's not the best pure engineering
justification for the change in bcachefs (so I understand any
hesitation), but IMO it's reasonable in practice and worthwhile enough
to improve test coverage. I haven't audited fstests for this or
anything, but it wouldn't surprise me much if there are other tests that
rely on this sort of "assumed behavior" for testing I/O failures.
Thoughts?

Brian