* fstests generic/441 -- occasional bcachefs failure
@ 2023-01-25 15:45 Brian Foster
  2023-01-26 15:08 ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-01-25 15:45 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Hi Kent, list,

I've noticed occasional failures of generic/441 on bcachefs that look
something like this:

generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
    --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
    +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
    @@ -1,3 +1,3 @@
     QA output created by 441
     Format and mount
    -Test passed!
    +First fsync after reopen of fd[0] failed: Input/output error
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)

This test covers fsync error reporting across multiple fds. What it does
in a nutshell is open a bunch of fds, write to them and fsync them, and
then repeat that a couple times over after switching to an error table
(i.e. dm-error) and back to a working table, verifying error reporting
expectations at each step along the way.
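
For reference, the important part of the test boils down to roughly the
following sequence. This is paraphrased from memory rather than the actual
fsync-err.c source, so names and details are illustrative only:

/* Paraphrased sketch of the test flow; not the real xfstests code. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NR_FDS 10

static void check(int cond, const char *msg)
{
	if (!cond) {
		fprintf(stderr, "%s\n", msg);
		exit(1);
	}
}

static void run(const char *path)
{
	char buf[4096];
	int fd[NR_FDS];
	int i;

	memset(buf, 0xaa, sizeof(buf));

	for (i = 0; i < NR_FDS; i++)
		fd[i] = open(path, O_RDWR | O_CREAT, 0644);

	/* working table: write + fsync on every fd must succeed */
	for (i = 0; i < NR_FDS; i++) {
		pwrite(fd[i], buf, sizeof(buf), (off_t)i * sizeof(buf));
		check(fsync(fd[i]) == 0, "fsync failed on working table");
	}

	/* the test harness switches the device to dm-error here */

	/* each fd should now report the writeback error exactly once */
	for (i = 0; i < NR_FDS; i++) {
		pwrite(fd[i], buf, sizeof(buf), (off_t)i * sizeof(buf));
		check(fsync(fd[i]) != 0, "fsync unexpectedly succeeded");
	}

	/* the harness switches back to the working table and the fds are
	 * reopened; the first fsync on each reopened fd is then expected
	 * to succeed, which is the step failing above */
	for (i = 0; i < NR_FDS; i++)
		check(fsync(fd[i]) == 0, "first fsync after reopen failed");
}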

The error above is associated with an fsync failure after switching from
an error table back to a working table, and essentially occurs because
the filesystem has shutdown and the bcachefs inode has a journal
sequence number ahead of what the journal shows has been flushed to
disk. That makes sense in general because even though the vfs error
reporting might not expect an error, the filesystem has been shutdown
and fsync has work to do. I've been digging into it a bit regardless to
try and understand why the behavior seems transient..

What I'm seeing looks like a matter of timing at the point the test
switches to an error table and issues the first fsync (where failure is
expected). IIUC, what I think is happening here in the scenario where
the test passes is something like this:

1. With the working table in place, the bch2_vfs_write_inode() ->
bch2_write_inode() path commits a transaction with pending changes to
the inode. These changes end up in some sort of cache (i.e. key cache?).

2. The bch2_flush_inode() path flushes the journal based on the sequence
number in the inode.

3. Sometime shortly after that, journal reclaim kicks in presumably to
write out pending btree node data and release the item from the journal.
Since the test switches to an error table after fsync() returns, this
async (workqueue) journal reclaim context is what typically first
encounters I/O failures and essentially shuts down the fs (i.e. we end
up in bch2_fs_emergency_read_only()).

4. The test switches back to working context. The next fsync loop
appears to succeed even though the fs is shutdown because
bch2_vfs_write_inode() transaction commit fails with -EROFS in the
commit path, which means the inode sequence number is not updated by the
bch2_mark_inode() trigger, and thus bch2_flush_inode() appears to have
nothing to do and returns 0.

In the case where the test fails, the journal reclaim context in step 3
appears to race with the transaction commit in step 4 such that
bch2_mark_inode() is called, updates the inode seq number, and
transaction commit actually succeeds before the fs is shutdown. By the
time bch2_flush_inode() is called, however, the journal->err_seq value
has been set and the flush returns -EIO back to userspace.
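
To make that ordering concrete, here is a toy model of the decision point as
I understand it - my own sketch with simplified names, not actual bcachefs
code:

/* toy model of the race described above - not bcachefs code */
#include <errno.h>
#include <stdbool.h>

struct toy_journal { unsigned long long flushed_seq, err_seq; };
struct toy_inode   { unsigned long long journal_seq; };

/* step 4's write_inode: bumps the inode seq only if the commit wins the race */
static int toy_write_inode(struct toy_inode *i, struct toy_journal *j,
			   bool fs_read_only)
{
	if (fs_read_only)
		return -EROFS;	/* commit fails, inode seq not bumped */
	i->journal_seq = j->flushed_seq + 1;
	return 0;
}

/* stand-in for bch2_flush_inode() / the journal flush */
static int toy_flush_inode(struct toy_inode *i, struct toy_journal *j)
{
	if (i->journal_seq <= j->flushed_seq)
		return 0;	/* nothing to flush: test passes */
	if (j->err_seq && i->journal_seq >= j->err_seq)
		return -EIO;	/* journal already errored: test fails */
	return 0;		/* otherwise a flush/fua journal write goes out */
}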

So all in all this doesn't really seem like a bug. The fs is shutdown
either way and the error depends on whether fsync needs to flush the
log. I was trying to think about whether this could be improved slightly
to perhaps be more consistent and avoid spurious test failure.  For
example, an artificial delay after the dm-table switchover seems to
always prevent the problem because it ensures journal reclaim has
processed the error before a subsequent fsync has a chance to bump the
inode sequence number.

There's still a lot I don't understand here on the bcachefs side, so
after poking around a bit I didn't see anything terribly obvious that
might prevent this in the fs itself. AFAICS there is no serialization
between journal reclaim and the transaction commit, so something like
rechecking journal error state after the transaction is locked (but
before the mem triggers are run) seems like it could help reduce the
race window, but not necessarily prevent the race. Is that accurate? If
so, any other thoughts on how we might prevent this particular failure?

Sorry for the longish writeup on what likely amounts to a minor
behavioral quirk... I dug into it as sort of a proxy to help get more
familiar with the code, so if the right approach is basically to leave
things as is (and/or potentially tweak the test to keep it quiet), then
that seems reasonable to me as well.

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
@ 2023-01-26 15:08 ` Kent Overstreet
  2023-01-27  7:21   ` Kent Overstreet
  2023-01-27 14:50   ` Brian Foster
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-01-26 15:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote:
> Hi Kent, list,
> 
> I've noticed occasional failures of generic/441 on bcachefs that look
> something like this:
> 
> generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
>     --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
>     +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
>     @@ -1,3 +1,3 @@
>      QA output created by 441
>      Format and mount
>     -Test passed!
>     +First fsync after reopen of fd[0] failed: Input/output error
>     ...
>     (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)
> 
> This test covers fsync error reporting across multiple fds. What it does
> in a nutshell is open a bunch of fds, write to them and fsync them, and
> then repeat that a couple times over after switching to an error table
> (i.e. dm-error) and back to a working table, verifying error reporting
> expectations at each step along the way.
> 
> The error above is associated with an fsync failure after switching from
> an error table back to a working table, and essentially occurs because
> the filesystem has shutdown and the bcachefs inode has a journal
> sequence number ahead of what the journal shows has been flushed to
> disk. That makes sense in general because even though the vfs error
> reporting might not expect an error, the filesystem has been shutdown
> and fsync has work to do. I've been digging into it a bit regardless to
> try and understand why the behavior seems transient..
> 
> What I'm seeing looks like a matter of timing at the point the test
> switches to an error table and issues the first fsync (where failure is
> expected). IIUC, what I think is happening here in the scenario where
> the test passes is something like this:

Hey Brian - my understanding of this bug was that it was even simpler.

As I understand it, the decreed semantics of fsync by the Powers That Be
are that it should return a given error _once_  - if a subsequent fsync
returns an error, then there must have been another write that errored.

So we just need to plumb our errors from the journal through the
errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
or check if it's actually doing a journal flush or if the journal flush
was already done.

So I think we'd need to add another sequence number to
bch_inode_info to track the sequence that was flushed, and then only
call bch2_journal_flush_seq() if flushed_seq !=
journal_seq_of_last_update, and then to make this work correctly when an
inode has been evicted it might need to go in struct bch_inode, stored
in the btree, as well.
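
Very roughly, and with made-up field names (ei_flushed_journal_seq doesn't
exist today, and the existing names are approximate), I'm picturing
something like:

/* sketch only - field names made up, error plumbing hand-waved */
static int bch2_flush_inode_sketch(struct bch_fs *c,
				   struct bch_inode_info *inode)
{
	int ret = 0;

	if (inode->ei_flushed_journal_seq != inode->ei_journal_seq) {
		ret = bch2_journal_flush_seq(&c->journal,
					     inode->ei_journal_seq);
		if (!ret)
			inode->ei_flushed_journal_seq = inode->ei_journal_seq;
		else
			/* hand the error to the errseq_t machinery so each
			 * open fd sees it exactly once on a later fsync */
			mapping_set_error(inode->v.i_mapping, ret);
	}

	return ret;
}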

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-26 15:08 ` Kent Overstreet
@ 2023-01-27  7:21   ` Kent Overstreet
  2023-01-27 14:50   ` Brian Foster
  1 sibling, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-01-27  7:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Jan 26, 2023 at 10:08:29AM -0500, Kent Overstreet wrote:
> Hey Brian - my understanding of this bug was that it was even simpler.
> 
> As I understand it, the decreed semantics of fsync by the Powers That Be
> are that it should return a given error _once_  - if a subsequent fsync
> returns an error, then there must have been another write that errored.
> 
> So we just need to plumb our errors from the journal through the
> errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
> or check if it's actually doing a journal flush or if the journal flush
> was already done.
> 
> So I think we'd need to add another sequence number to
> bch_inode_info to track the sequence that was flushed, and then only
> call bch2_journal_flush_seq() if flushed_seq !=
> journal_seq_of_last_update, and then to make this work correctly when an
> inode has been evicted it might need to go in struct bch_inode, stored
> in the btree, as well.

BTW, I'm not suggesting we do this right now - I agree with you in that
the current behaviour makes sense to me. It might be worth changing at
some point for consistency, but I'm not in a hurry here.

Thanks for looking at this :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-26 15:08 ` Kent Overstreet
  2023-01-27  7:21   ` Kent Overstreet
@ 2023-01-27 14:50   ` Brian Foster
  2023-01-30 17:06     ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-01-27 14:50 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Jan 26, 2023 at 10:08:25AM -0500, Kent Overstreet wrote:
> On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote:
> > Hi Kent, list,
> > 
> > I've noticed occasional failures of generic/441 on bcachefs that look
> > something like this:
> > 
> > generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
> >     --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
> >     +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
> >     @@ -1,3 +1,3 @@
> >      QA output created by 441
> >      Format and mount
> >     -Test passed!
> >     +First fsync after reopen of fd[0] failed: Input/output error
> >     ...
> >     (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)
> > 
> > This test covers fsync error reporting across multiple fds. What it does
> > in a nutshell is open a bunch of fds, write to them and fsync them, and
> > then repeat that a couple times over after switching to an error table
> > (i.e. dm-error) and back to a working table, verifying error reporting
> > expectations at each step along the way.
> > 
> > The error above is associated with an fsync failure after switching from
> > an error table back to a working table, and essentially occurs because
> > the filesystem has shutdown and the bcachefs inode has a journal
> > sequence number ahead of what the journal shows has been flushed to
> > disk. That makes sense in general because even though the vfs error
> > reporting might not expect an error, the filesystem has been shutdown
> > and fsync has work to do. I've been digging into it a bit regardless to
> > try and understand why the behavior seems transient..
> > 
> > What I'm seeing looks like a matter of timing at the point the test
> > switches to an error table and issues the first fsync (where failure is
> > expected). IIUC, what I think is happening here in the scenario where
> > the test passes is something like this:
> 
> Hey Brian - my understanding of this bug was that it was even simpler.
> 
> As I understand it, the decreed semantics of fsync by the Powers That Be
> are that it should return a given error _once_  - if a subsequent fsync
> returns an error, then there must have been another write that errored.
> 

Yep, that's my understanding as well. This particular test is just
checking that the error reporting spreads across distinct fds for the
same underlying file. I.e., so when writeback fails and sets an error on
the mapping, each open fd/file should see an instance of that error on
fsync (but presumably not repeatedly unless pages are written and failed
again).

> So we just need to plumb our errors from the journal through the
> errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
> or check if it's actually doing a journal flush or if the journal flush
> was already done.
> 

That's an interesting idea. I'd have to poke around the errseq stuff and
think about it a bit..

Something else that occurred to me while looking further at this is we
can also avoid the error in this case fairly easily by bailing out of
bch2_fsync() if page writeback fails, as opposed to the unconditional
flush -> sync meta -> flush log sequence that returns the first error
anyways. That would prevent marking the inode with a new sequence number
when I/Os are obviously failing. The caveat is that the test still
fails, now with a "Read-only file system" error instead of EIO, because
the filesystem is shutdown by the time the vfs write inode path actually
runs.
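
Concretely, I mean something like the following shape for bch2_fsync(),
paraphrased from memory rather than a literal diff (argument details
approximate):

/* sketch of the early-bail idea, not a literal patch */
int bch2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct bch_inode_info *inode = file_bch_inode(file);
	struct bch_fs *c = inode->v.i_sb->s_fs_info;
	int ret;

	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;	/* bail: don't go on to bump the inode seq */

	ret = sync_inode_metadata(&inode->v, 1);
	if (ret)
		return ret;

	return bch2_flush_inode(c, inode);	/* args approximate */
}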

All in all ISTM the shutdown is the fundamental issue wrt this test.
I'm not sure it's worth complicating the log flush error handling with
errseq and whatnot just to filter out certain errors when the fs has
seen a catastrophic failure/shutdown (also agree with your followup that
isn't the most critical bug in the world anyways).

Either way, I am trying to grok the commit/writeback/journaling path a
bit better here to try and understand why the fs shuts down in this test
whereas other fs' don't seem to. One thing I'm noticing is that this
whole path is rather asynchronous, making it a bit hard to follow. Any
tips on the best way to work through the big picture steps involved in
the path from a trans commit, through journaling, to eventual btree
writeback?

Thanks for the feedback..

Brian

> So I think we'd need to add another sequence number to
> bch_inode_info to track the sequence that was flushed, and then only
> call bch2_journal_flush_seq() if flushed_seq !=
> journal_seq_of_last_update, and then to make this work correctly when an
> inode has been evicted it might need to go in struct bch_inode, stored
> in the btree, as well.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-27 14:50   ` Brian Foster
@ 2023-01-30 17:06     ` Kent Overstreet
  2023-01-31 16:04       ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-01-30 17:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> Something else that occurred to me while looking further at this is we
> can also avoid the error in this case fairly easily by bailing out of
> bch2_fsync() if page writeback fails, as opposed to the unconditional
> flush -> sync meta -> flush log sequence that returns the first error
> anyways. That would prevent marking the inode with a new sequence number
> when I/Os are obviously failing. The caveat is that the test still
> fails, now with a "Read-only file system" error instead of EIO, because
> the filesystem is shutdown by the time the vfs write inode path actually
> runs.

If some pages did write successfully we don't want to skip the rest of
the fsync, though.

The test failing because the filesystem went read-only is because a
btree node write fails - I'm assuming without that change the test just
fails before any btree node writes are attempted.

> All in all ISTM the shutdown is the fundamental issue wrt this test.
> I'm not sure it's worth complicating the log flush error handling with
> errseq and whatnot just to filter out certain errors when the fs has
> seen a catastrophic failure/shutdown (also agree with your followup that
> isn't the most critical bug in the world anyways).

Well, our default setting on serious error is to go read-only, and a
btree node write failing (in a single device, non replicated filesystem)
definitely qualifies.

It looks like our behaviour on btree node write error does need to be
looked at and improved/fixed: btree_io.c line 1734 has a clearly
unfinished case. It looks like we're missing whatever logic should be
there for a replicated filesystem - we definitely need to decide what to
do when some but not all btree node writes fail.

More relevant to this test, we also have a setting for error behaviour -
errors=(continue|ro|panic), but that explicitly does _not_ apply to
write IO errors; if we can't write out metadata we always go read-only.

I'm not in a hurry to change that, because one of the things bcachefs
currently does really well is stopping the world if things ever go so
wrong that the filesystem would be inconsistent - we haven't had too
much in the way of unrecoverable filesystems, and I don't want that to
change :)

> Either way, I am trying to grok the commit/writeback/journaling path a
> bit better here to try and understand why the fs shuts down in this test
> whereas other fs' don't seem to. One thing I'm noticing is that this
> whole path is rather asynchronous, making it a bit hard to follow. Any
> tips on the best way to work through the big picture steps involved in
> the path from a trans commit, through journaling, to eventual btree
> writeback?

Transaction commit path is pretty much all straight line synchronous
code - that's btree_update_leaf.c.

Basically, the transaction commit path takes a list of updates, and
 - runs triggers (which will then generate more updates!)
 - ensures all btree paths are fully traversed and have intent locks
 - takes write locks on btree nodes
 - gets a journal reservation
 - does updates to leaf nodes/btree key cache (and soon the btree write
   buffer as well)
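
From a caller's point of view the shape is roughly this (sketch from memory;
iterator setup and transaction restart handling elided):

	/* rough sketch of a caller - declarations/restarts elided */
	struct btree_trans trans;
	u64 journal_seq = 0;
	int ret;

	bch2_trans_init(&trans, c, 0, 0);

	ret =   bch2_trans_update(&trans, &iter, &new_key->k_i, 0) ?:
		bch2_trans_commit(&trans, NULL, &journal_seq,
				  BTREE_INSERT_NOFAIL);

	bch2_trans_exit(&trans);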

Btree node merging and splitting is also driven by the transaction
commit path - before doing the update we check if we need to merge
btree nodes (it has to be before the transaction commit in order to
happen reliably), and btree node splitting is done in an error path -
after taking write locks if we don't have space for the insert we bail
out and do the split in bch2_trans_commit_error().
 
When we update a leaf node or btree key cache key, we mark it as dirty
and create a journal pin for the journal sequence number we got a
journal reservation for. A journal pin has a callback for journal
reclaim to invoke - typically, writeback for the btree node cache and
btree key cache is driven entirely by journal reclaim, and the journal
pin also keeps the relevant journal entry dirty so that it'll be
replayed by journal replay.
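
In code that's roughly the following (names and signatures from memory, so
treat this as approximate rather than the real interface):

/* approximate sketch - see journal_reclaim.h for the real interface */
static int my_flush_fn(struct journal *j,
		       struct journal_entry_pin *pin, u64 seq)
{
	/* write back the dirty btree node / key cache entry here,
	 * then release the pin so the journal entry can be reclaimed */
	bch2_journal_pin_drop(j, pin);
	return 0;
}

/* when dirtying, pin the seq we got a reservation in: */
bch2_journal_pin_add(&c->journal, res.seq, &pin, my_flush_fn);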

(That writeback behaviour will be changing at some point: the behaviour
up until recently was that journal reclaim monitored the dirty
percentage of the btree node cache and key cache and would run to those
numbers under some threshold, but it turns out that really hurts
performance on purely random write benchmarks - we writeback too
aggressively, and we need to allow those caches to be 100% dirty if
there isn't any memory pressure).

The journalling code is pretty self contained w.r.t. the rest of
bcachefs. It is indeed a big pile of tricky asynchronous code (the fast
paths need to be lockless - right now I'm working on making it even more
lockless - and there's a lot of pipelining it has to do). At a high
level it's pretty simple though, and you should be able to treat it as a
black box. The main operations are:

 - getting a reservation: this gets you a reservation in a specific
   journal entry - you're supposed to memcpy() whatever you want
   journalled into the entry to be written and then quickly drop your
   reservation.

   Blocking with a reservation will block the entire journal - note that
   we get our journal reservation after taking btree write locks in the
   transaction commit path (but we do it in nonblocking mode; if we have
   to block on the journal we drop btree locks and do the waiting in
   bch2_trans_commit_error()).
 
 - getting a pre-reservation: trying to get a journal reservation while
   holding a journal pin would deadlock the journal if it ever
   completely filled up - a pre-reservation allows reserving space in
   the journal so that we can get a reservation later without
   deadlocking. This is for e.g. flushing the btree key cache: when we
   do a btree key cache update we update the key cache in memory and
   journal the update, to flush the key cache we have to update the
   btree which requires (for simplicity) re-journalling the update.

 - flush_seq: given a journal sequence number we previously got a
   journal reservation on, request that it be written out (with a
   flush/fua write, not all journal entries are written flush/fua).
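
Strung together, a caller looks roughly like this (sketch from memory; the
helpers for copying into the reservation are elided):

	/* rough sketch - names/flags approximate */
	struct journal_res res = { 0 };
	int ret;

	ret = bch2_journal_res_get(&c->journal, &res, u64s,
				   JOURNAL_RES_GET_NONBLOCK);
	if (ret) {
		/* drop btree locks, wait for journal space, retry */
	}

	/* memcpy whatever needs journalling into the reserved space,
	 * then drop the reservation quickly: */
	bch2_journal_res_put(&c->journal, &res);

	/* fsync-style callers can later force that entry out: */
	ret = bch2_journal_flush_seq(&c->journal, res.seq);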

Then there's btree node writes, which it sounds like you might be
getting into soon. Btree node writes are more complicated: btree nodes
are log structured, so we have to differentiate between the first write
to a new node and subsequent appending writes, and before returning a
completion for an appending btree node write we have to update the
pointer to that btree node (btree node pointers record the number of
sectors currently written; this is important for multi device
filesystems, and also for ensuring consistent ordering of updates after
a crash - updates can't be visible if they weren't completed in the
journal, so we have to ignore btree node writes that are newer than the
newest completed journal write.)

Originally this was handled with the journal sequence number blacklist
mechanism - so you might see a few references to that still in
btree_io.c - but now it's handled by updating parent pointers after
every write, up to the root.

Hope that helps - and feel free to ask me more questions on IRC,
irc.oftc.net#bcache.

Cheers,
Kent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-30 17:06     ` Kent Overstreet
@ 2023-01-31 16:04       ` Brian Foster
  2023-02-01 14:34         ` Kent Overstreet
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
  0 siblings, 2 replies; 24+ messages in thread
From: Brian Foster @ 2023-01-31 16:04 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > Something else that occurred to me while looking further at this is we
> > can also avoid the error in this case fairly easily by bailing out of
> > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > flush -> sync meta -> flush log sequence that returns the first error
> > anyways. That would prevent marking the inode with a new sequence number
> > when I/Os are obviously failing. The caveat is that the test still
> > fails, now with a "Read-only file system" error instead of EIO, because
> > the filesystem is shutdown by the time the vfs write inode path actually
> > runs.
> 
> If some pages did write successfully we don't want to skip the rest of
> the fsync, though.
> 

What does it matter if the fsync() has already failed? ISTM this is
pretty standard error handling behavior across major fs', but it's not
clear to me if there's some bcachefs specific quirk that warrants
different handling..

FWIW, I think I've been able to fix this test with a couple small
tweaks:

1. Change bch2_fsync() to return on first error.
2. Introduce a bcachefs specific delay in the working -> error table
switchover.

I've not run a full regression test on the bcachefs change yet, but if
you're willing to take a patch along those lines I can do that and then
if it survives, follow up with an upstream fstests patch. That would
allow this test to pass fairly reliably and also no longer
unconditionally shutdown bcachefs.

> The test failing because the filesystem went read-only is because a
> btree node write fails - I'm assuming without that change the test just
> fails before any btree node writes are attempted.
> 

Not sure I parse the above. Generally speaking, the test fails because
after the last successful fsync, it immediately switches over to the
error dm table. The fs still has some reclaim/journal work to do after
this point, however, so this essentially always results in emergency
shutdown. Whether the test passes or fails at this point depends on a
race between this shutdown sequence in the background and an fsync call
successfully committing a transaction (calling bch2_mark_inode()).

The winner of the race dictates whether the first fsync() after
switching back to the working dm table can avoid the journal flush
because the seq associated with the inode is already on disk, or whether
the fs wants to flush the seq on a filesystem that has been shutdown
(and so will always fail).

> > All in all ISTM the shutdown is the fundamental issue wrt this test.
> > I'm not sure it's worth complicating the log flush error handling with
> > errseq and whatnot just to filter out certain errors when the fs has
> > seen a catastrophic failure/shutdown (also agree with your followup that
> > isn't the most critical bug in the world anyways).
> 
> Well, our default setting on serious error is to go read-only, and a
> btree node write failing (in a single device, non replicated filesystem)
> definitely qualifies.
> 
> It looks like our behaviour on btree node write error does need to be
> looked at and improved/fixed: btree_io.c line 1734 has a clearly
> unfinished case. It looks like we're missing whatever logic should be
> there for a replicated filesystem - we definitely need to decide what to
> do when some but not all btree node writes fail.
> 

Interesting. I noticed that hunk and thought it looked odd, but didn't
have enough context to grok it.

> More relevant to this test, we also have a setting for error behaviour -
> errors=(continue|ro|panic), but that explicitly does _not_ apply to
> write IO errors; if we can't write out metadata we always go read-only.
> 
> I'm not in a hurry to change that, because one of the things bcachefs
> currently does really well is stopping the world if things ever go so
> wrong that the filesystem would be inconsistent - we haven't had too
> much in the way of unrecoverable filesystems, and I don't want that to
> change :)
> 

Indeed, that makes sense. I wonder if it would make sense to issue
retries at some point?

As another XFS example, it also has an emergency shutdown mechanism for
similar purposes. Certain I/O failures result in immediate shutdown of
the fs and no further changes make it to disk. Because shutdown itself
is a catastrophic failure, metadata write failures are retried at least
once before initiating shutdown to filter out transient errors. On top
of that, there's a sysfs section for each mount that allows an admin to
configure error handling behavior as needed. This allows setting the
number of retries before an error is considered permanent (including
infinite retries for scenarios where errors are expected to be manually
recoverable), or how long to delay retries.

I'm not sure a simple retry-once sequence would help with this test, but
an infinite retry option (with some delay) probably would.

OTOH, I wonder if a more elegant solution is to explicitly quiesce the
fs before the working -> error table switchover. That might typically be
done with a freeze/unfreeze cycle so the test can serialize against
whatever background work needs to complete before putting the block
device into error mode. However, I notice freeze isn't currently
supported on bcachefs. Any thoughts on freeze support?

> > Either way, I am trying to grok the commit/writeback/journaling path a
> > bit better here to try and understand why the fs shuts down in this test
> > whereas other fs' don't seem to. One thing I'm noticing is that this
> > whole path is rather asynchronous, making it a bit hard to follow. Any
> > tips on the best way to work through the big picture steps involved in
> > the path from a trans commit, through journaling, to eventual btree
> > writeback?
> 
> Transaction commit path is pretty much all straight line synchronous
> code - that's btree_update_leaf.c.
> 
> Basically, the transaction commit path takes a list of updates, and
>  - runs triggers (which will then generate more updates!)
>  - ensures all btree paths are fully traversed and have intent locks
>  - takes write locks on btree nodes
>  - gets a journal reservation
>  - does updates to leaf nodes/btree key cache (and soon the btree write
>    buffer as well)
> 

Yep, I was able to follow most of this with the exception of some
confusion over the key cache bits. IIUC, the initial trans commit might
add a key update to the key cache, whereas reclaim might commit another
transaction to flush the key cache to the btree (via the journal
callback you mention below).

> Btree node merging and splitting is also driven by the transaction
> commit path - before doing the update we check if we need to merge
> btree nodes (it has to be before the transaction commit in order to
> happen reliably), and btree node splitting is done in an error path -
> after taking write locks if we don't have space for the insert we bail
> out and do the split in bch2_trans_commit_error().
>  
> When we update a leaf node or btree key cache key, we mark it as dirty
> and create a journal pin for the journal sequence number we got a
> journal reservation for. A journal pin has a callback for journal
> reclaim to invoke - typically, writeback for the btree node cache and
> btree key cache is driven entirely by journal reclaim, and the journal
> pin also keeps the relevant journal entry dirty so that it'll be
> replayed by journal replay.
> 
> (That writeback behaviour will be changing at some point: the behaviour
> up until recently was that journal reclaim monitored the dirty
> percentage of the btree node cache and key cache and would run to those
> numbers under some threshold, but it turns out that really hurts
> performance on purely random write benchmarks - we writeback too
> aggressively, and we need to allow those caches to be 100% dirty if
> there isn't any memory pressure).
> 
> The journalling code is pretty self contained w.r.t. the rest of
> bcachefs. It is indeed a big pile of tricky asynchronous code (the fast
> paths need to be lockless - right now I'm working on making it even more
> lockless - and there's a lot of pipelining it has to do). At a high
> level it's pretty simple though, and you should be able to treat it as a
> black box. The main operations are:
> 
>  - getting a reservation: this gets you a reservation in a specific
>    journal entry - you're supposed to memcpy() whatever you want
>    journalled into the entry to be written and then quickly drop your
>    reservation.
> 
>    Blocking with a reservation will block the entire journal - note that
>    we get our journal reservation after taking btree write locks in the
>    transaction commit path (but we do it in nonblocking mode; if we have
>    to block on the journal we drop btree locks and do the waiting in
>    bch2_trans_commit_error()).
>  
>  - getting a pre-reservation: trying to get a journal reservation while
>    holding a journal pin would deadlock the journal if it ever
>    completely filled up - a pre-reservation allows reserving space in
>    the journal so that we can get a reservation later without
>    deadlocking. This is for e.g. flushing the btree key cache: when we
>    do a btree key cache update we update the key cache in memory and
>    journal the update, to flush the key cache we have to update the
>    btree which requires (for simplicity) re-journalling the update.
> 

Ah.. so does "for simplicity" here essentially mean "for reuse of the
transaction code?"

>  - flush_seq: given a journal sequence number we previously got a
>    journal reservation on, request that it be written out (with a
>    flush/fua write, not all journal entries are written flush/fua).
> 

Thanks for this. Something that initially confused me wrt the journal
is that I was seeing journal writes after btree node writes. The docs
mention the journal is primarily an optimization and ordering is not
important for crash consistency. With that in mind, does that mean a
journal "pin" refers to entries held in the journal until written back
to the btree (as opposed to the journal pinning keys in memory)? IOW, it
doesn't really matter in which order that journal/btree writes occur,
but only that entries remain pinned in the journal until btree writes
complete..?

(I suspect this confused me because XFS uses the same pinning
terminology, but in that context it refers to metadata items being
pinned by the log, until the log item is flushed and thus unpins the
metadata item for final writeback. So ordering is a critical factor in
this context.).

> Then there's btree node writes, which it sounds like you might be
> getting into soon. Btree node writes are more complicated: btree nodes
> are log structured, so we have to differentiate between the first write
> to a new node and subsequent appending writes, and before returning a
> completion for an appending btree node write we have to update the
> pointer to that btree node (btree node pointers record the number of
> sectors currently written; this is important for multi device
> filesystems, and also for ensuring consistent ordering of updates after
> a crash - updates can't be visible if they weren't completed in the
> journal, so we have to ignore btree node writes that are newer than the
> newest completed journal write.)
> 
> Originally this was handled with the journal sequence number blacklist
> mechanism - so you might see a few references to that still in
> btree_io.c - but now it's handled by updating parent pointers after
> every write, up to the root.
> 
> Hope that helps - and feel free to ask me more questions on IRC,
> irc.oftc.net#bcache.
> 

#bcache or #bachefs?

This is all extremely helpful. Thanks again.

Brian

> Cheers,
> Kent
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-31 16:04       ` Brian Foster
@ 2023-02-01 14:34         ` Kent Overstreet
  2023-02-02 15:50           ` Brian Foster
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-01 14:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > Something else that occurred to me while looking further at this is we
> > > can also avoid the error in this case fairly easily by bailing out of
> > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > flush -> sync meta -> flush log sequence that returns the first error
> > > anyways. That would prevent marking the inode with a new sequence number
> > > when I/Os are obviously failing. The caveat is that the test still
> > > fails, now with a "Read-only file system" error instead of EIO, because
> > > the filesystem is shutdown by the time the vfs write inode path actually
> > > runs.
> > 
> > If some pages did write successfully we don't want to skip the rest of
> > the fsync, though.
> > 
> 
> What does it matter if the fsync() has already failed? ISTM this is
> pretty standard error handling behavior across major fs', but it's not
> clear to me if there's some bcachefs specific quirk that warrants
> different handling..
> 
> FWIW, I think I've been able to fix this test with a couple small
> tweaks:
> 
> 1. Change bch2_fsync() to return on first error.
> 2. Introduce a bcachefs specific delay in the working -> error table
> switchover.
> 
> I've not run a full regression test on the bcachefs change yet, but if
> you're willing to take a patch along those lines I can do that and then
> if it survives, follow up with an upstream fstests patch. That would
> allow this test to pass fairly reliably and also no longer
> unconditionally shutdown bcachefs.

If you link me your git repo I'll add it to the CI so all the automated
tests run.

> 
> > The test failing because the filesystem went read-only is because a
> > btree node write fails - I'm assuming without that change the test just
> > fails before any btree node writes are attempted.
> > 
> 
> Not sure I parse the above. Generally speaking, the test fails because
> after the last successful fsync, it immediately switches over to the
> error dm table. The fs still has some reclaim/journal work to do after
> this point, however, so this essentially always results in emergency
> shutdown. Whether the test passes or fails at this point depends on a
> race between this shutdown sequence in the background and an fsync call
> successfully committing a transaction (calling bch2_mark_inode()).

*nod* That's more accurate.

> Indeed, that makes sense. I wonder if it would make sense to issue
> retries at some point?

Possibly. AFAIK, on typical block devices we shouldn't see write errors
unless the device is really not working anymore - both spinning rust and
SSDs will remap on write error, and on transport error IMO the block
layer is the more correct place to be issuing retries, if so desired.

But we do have aspirations of running on raw flash in the
not-too-distant future (ZNS SSDs), and I would expect we'll need retry
logic at that point.

> I'm not sure a simple retry-once sequence would help with this test, but
> an infinite retry option (with some delay) probably would.

I'd think we'd want a timeout, not a specific number of retries to
attempt (or perhaps both).

> OTOH, I wonder if a more elegant solution is to explicitly quiesce the
> fs before the working -> error table switchover. That might typically be
> done with a freeze/unfreeze cycle so the test can serialize against
> whatever background work needs to complete before putting the block
> device into error mode. However, I notice freeze isn't currently
> supported on bcachefs. Any thoughts on freeze support?

Freeze definitely needs to happen. It's been _ages_ since I was looking
at it so I couldn't say offhand where we'd need to start, but if you're
interested I'd be happy to look at what it'd take.

> Ah.. so does "for simplicity" here essentially mean "for reuse of the
> transaction code?"

Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos()
-> bch2_trans_commit().

> 
> >  - flush_seq: given a journal sequence number we previously got a
> >    journal reservation on, request that it be written out (with a
> >    flush/fua write, not all journal entries are written flush/fua).
> > 
> 
> Thanks for this. Something that initially confused me wrt the journal
> is that I was seeing journal writes after btree node writes. The docs
> mention the journal is primarily an optimization and ordering is not
> important for crash consistency.

Wherever you read that must date to well before bcachefs was a thing -
that was true in bcache when the journalling code was still new. In
bcachefs ordering of updates is important and the journal is indeed our
global time ordering.

> With that in mind, does that mean a
> journal "pin" refers to entries held in the journal until written back
> to the btree (as opposed to the journal pinning keys in memory)? IOW, it
> doesn't really matter in which order that journal/btree writes occur,
> but only that entries remain pinned in the journal until btree writes
> complete..?

Correct.

> (I suspect this confused me because XFS uses the same pinning
> terminology, but in that context it refers to metadata items being
> pinned by the log, until the log item is flushed and thus unpins the
> metadata item for final writeback. So ordering is a critical factor in
> this context.).

I see -  in XFS metadata items can't be written until the relevant log
entry is written? Yeah we never did it that way in bcachefs, blocking
writeback which already blocks memory reclaim sounded like a bad idea :)

Originally, before we started updating pointers to btree nodes after
every write, we'd note in every btree node write the newest journal
sequence number it had updates for - then in recovery we'd blacklist the
next N journal sequence numbers and if we ever saw a btree node entry
for a blacklisted journal sequence number we just ignored it.

> #bcache or #bachefs?

#bcache :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-01 14:34         ` Kent Overstreet
@ 2023-02-02 15:50           ` Brian Foster
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-02 15:50 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Feb 01, 2023 at 09:34:50AM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is we
> > > > can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> > 2. Introduce a bcachefs specific delay in the working -> error table
> > switchover.
> > 
> > I've not run a full regression test on the bcachefs change yet, but if
> > you're willing to take a patch along those lines I can do that and then
> > if it survives, follow up with an upstream fstests patch. That would
> > allow this test to pass fairly reliably and also no longer
> > unconditionally shutdown bcachefs.
> 
> If you link me your git repo I'll add it to the CI so all the automated
> tests run.
> 

I don't have a public repo atm but I've posted the patch if you have
somewhere to land it for CI testing..? It survived my regression tests
so far, FWIW. (I also had posted that random cleanup patch a bit ago if
you hadn't noticed..).

Is there a reporting dashboard or something available for the test
infrastructure for bcachefs?

> > 
> > > The test failing because the filesystem went read-only is because a
> > > btree node write fails - I'm assuming without that change the test just
> > > fails before any btree node writes are attempted.
> > > 
> > 
> > Not sure I parse the above. Generally speaking, the test fails because
> > after the last successful fsync, it immediately switches over to the
> > error dm table. The fs still has some reclaim/journal work to do after
> > this point, however, so this essentially always results in emergency
> > shutdown. Whether the test passes or fails at this point depends on a
> > race between this shutdown sequence in the background and an fsync call
> > successfully committing a transaction (calling bch2_mark_inode()).
> 
> *nod* That's more accurate.
> 
> > Indeed, that makes sense. I wonder if it would make sense to issue
> > retries at some point?
> 
> Possibly. AFAIK, on typical block devices we shouldn't see write errors
> unless the device is really not working anymore - both spinning rust and
> SSDs will remap on write error, and on transport error IMO the block
> layer is the more correct place to be issuing retries, if so desired.
> 
> But we do have aspirations of running on raw flash in the
> not-too-distant future (ZNS SSDs), and I would expect we'll need retry
> logic at that point.
> 
> > I'm not sure a simple retry-once sequence would help with this test, but
> > an infinite retry option (with some delay) probably would.
> 
> I'd think we'd want a timeout, not a specific number of retries to
> attempt (or perhaps both).
> 
> > OTOH, I wonder if a more elegant solution is to explicitly quiesce the
> > fs before the working -> error table switchover. That might typically be
> > done with a freeze/unfreeze cycle so the test can serialize against
> > whatever background work needs to complete before putting the block
> > device into error mode. However, I notice freeze isn't currently
> > supported on bcachefs. Any thoughts on freeze support?
> 
> Freeze definitely needs to happen. It's been _ages_ since I was looking
> at it so I couldn't say offhand where we'd need to start, but if you're
> interested I'd be happy to look at what it'd take.
> 

Yeah, that would be interesting. Thanks.

> > Ah.. so does "for simplicity" here essentially mean "for reuse of the
> > transaction code?"
> 
> Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos()
> -> bch2_trans_commit().
> 
> > 
> > >  - flush_seq: given a journal sequence number we previously got a
> > >    journal reservation on, request that it be written out (with a
> > >    flush/fua write, not all journal entries are written flush/fua).
> > > 
> > 
> > Thanks for this. Something that initially confused me wrt the journal
> > is that I was seeing journal writes after btree node writes. The docs
> > mention the journal is primarily an optimization and ordering is not
> > important for crash consistency.
> 
> Wherever you read that must date to well before bcachefs was a thing -
> that was true in bcache when the journalling code was still new. In
> bcachefs ordering of updates is important and the journal is indeed our
> global time ordering.
> 

FWIW, I was reading through the "Journaling" and "Sequential
consistency" sections of the architecture documentation [1]. It's
possible I just misinterpreted some wording.

[1] https://bcachefs.org/Architecture/#Journaling

> > With that in mind, does that mean a
> > journal "pin" refers to entries held in the journal until written back
> > to the btree (as opposed to the journal pinning keys in memory)? IOW, it
> > doesn't really matter in which order that journal/btree writes occur,
> > but only that entries remain pinned in the journal until btree writes
> > complete..?
> 
> Correct.
> 

Ok, I think I get the general idea then. I'll have to dig more into it
at some point.

> > (I suspect this confused me because XFS uses the same pinning
> > terminology, but in that context it refers to metadata items being
> > pinned by the log, until the log item is flushed and thus unpins the
> > metadata item for final writeback. So ordering is a critical factor in
> > this context.).
> 
> I see -  in XFS metadata items can't be written until the relevant log
> entry is written? Yeah we never did it that way in bcachefs, blocking
> writeback which already blocks memory reclaim sounded like a bad idea :)
> 

This is all a completely different design and architecture of course.
XFS has its own buffer cache for metadata, with aging heuristics, and
shrinker hooks and whatnot with feedback into log and metadata flushing
to manage all this and keep things moving along.

I was mainly just trying to wrap my head around the disk write ordering
rules for bcachefs journaling wrt to fs consistency, using "traditional
logging" (i.e. XFS) as a point of reference.

Brian

> Originally, before we started updating pointers to btree nodes after
> every write, we'd note in every btree node write the newest journal
> sequence number it had updates for - then in recovery we'd blacklist the
> next N journal sequence numbers and if we ever saw a btree node entry
> for a blacklisted journal sequence number we just ignored it.
> 
> > #bcache or #bachefs?
> 
> #bcache :)
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 15:50           ` Brian Foster
@ 2023-02-02 17:09             ` Kent Overstreet
  2023-02-02 20:04               ` Brian Foster
  2023-02-03  0:51               ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 17:09 UTC (permalink / raw)
  To: Brian Foster, dchinner; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> I don't have a public repo atm but I've posted the patch if you have
> somewhere to land it for CI testing..? It survived my regression tests
> so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> you hadn't noticed..).

Must have missed it, sorry. I can host a git repo for you on my server,
or github works fine - I generally prefer git repo links, git am is
always a bit of a hassle.

> Is there a reporting dashboard or something available for the test
> infrastructure for bcachefs?

https://evilpiepirate.org/~testdashboard/ci

I've got a small server farm that watches git branches and runs the
entire test suite on every commit starting from the recent - once you've
got a git branch up I'll add yours to the list it watches.

You'll probably want to get acquainted with ktest, it's what both the CI
uses for running tests, and what we use for local development:

https://evilpiepirate.org/git/ktest.git

> > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > at it so I couldn't say offhand where we'd need to start, but if you're
> > interested I'd be happy to look at what it'd take.
> > 
> 
> Yeah, that would be interesting. Thanks.

Maybe we could get Dave to give us a brief rundown of freezing? It's
been ages since I was thinking about that and it's all fallen out of my
brain, but Dave was the one who was able to explain it to me before :)

I think getting freezing done would up our test coverage by a good bit,
so it's worth prioritizing.

> FWIW, I was reading through the "Journaling" and "Sequential
> consistency" sections of the architecture documentation [1]. It's
> possible I just misinterpreted some wording.
> 
> [1] https://bcachefs.org/Architecture/#Journaling

*nod* Thanks for the link, that's old but not ancient - that's from when
the journal seq blacklist mechanism was how we did sequential
consistency.

Whenever I find the time to work on documentation again, I want to move
more stuff (including that sort of high level developer documentation)
to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf

Right now it's more end user documentation, but last I was working on
that I was starting to add documentation centered around on disk data
structures - i.e. btrees and key types, and it seemed to be going in a
useful direction.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
@ 2023-02-02 20:04               ` Brian Foster
  2023-02-02 22:39                 ` Kent Overstreet
  2023-02-03  0:51               ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-02 20:04 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: dchinner, linux-bcachefs

On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote:
> On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> > I don't have a public repo atm but I've posted the patch if you have
> > somewhere to land it for CI testing..? It survived my regression tests
> > so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> > you hadn't noticed..).
> 
> Must have missed it, sorry. I can host a git repo for you on my server,
> or github works fine - I generally prefer git repo links, git am is
> always a bit of a hassle.
> 

Sounds good.

> > Is there a reporting dashboard or something available for the test
> > infrastruture for bcachefs?
> 
> https://evilpiepirate.org/~testdashboard/ci
> 
> I've got a small server farm that watches git branches and runs the
> entire test suite on every commit starting from the recent - once you've
> got a git branch up I'll add yours to the list it watches.
> 

Cool. I see that it reproduces the generic/441 failure.

I need to use GitLab for other things so I managed to create a project
cloning your repo there and pushed a testing branch with both of those
commits. I think it should be accessible here:

https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing

... but let me know if it doesn't work.

> You'll probably want to get acquainted with ktest, it's what both the CI
> uses for running tests, and what we use for local development:
> 
> https://evilpiepirate.org/git/ktest.git
> 

I'll check it out, thanks!

Brian

> > > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > > at it so I couldn't say offhand where we'd need to start, but if you're
> > > interested I'd be happy to look at what it'd take.
> > > 
> > 
> > Yeah, that would be interesting. Thanks.
> 
> Maybe we could get Dave to give us a brief rundown of freezing? It's
> been ages since I was thinking about that and it's all fallen out of my
> brain, but Dave was the one who was able to explain it to me before :)
> 
> I think getting freezing done would up our test coverage by a good bit,
> so it's worth prioritizing.
> 
> > FWIW, I was reading through the "Journaling" and "Sequential
> > consistency" sections of the architecture documentation [1]. It's
> > possible I just misinterpreted some wording.
> > 
> > [1] https://bcachefs.org/Architecture/#Journaling
> 
> *nod* Thanks for the link, that's old but not ancient - that's from when
> the journal seq blacklist mechanism was how we did sequential
> consistency.
> 
> Whenever I find the time to work on documentation again, I want to move
> more stuff (including that sort of high level developer documentation)
> to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf
> 
> Right now it's more end user documentation, but last I was working on
> that I was starting to add documentation centered around on disk data
> structures - i.e. btrees and key types, and it seemed to be going in a
> useful direction.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 20:04               ` Brian Foster
@ 2023-02-02 22:39                 ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 22:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 03:04:22PM -0500, Brian Foster wrote:
> I need to use GitLab for other things so I managed to create a project
> cloning your repo there and pushed a testing branch with both of those
> commits. I think it should be accessible here:
> 
> https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing

You've been added:
https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

If results don't show up right away it's because I just rewrote the CI's
configuration file handling - generally only takes ~20 minutes though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-31 16:04       ` Brian Foster
  2023-02-01 14:34         ` Kent Overstreet
@ 2023-02-02 22:56         ` Kent Overstreet
  2023-02-04 21:33           ` Brian Foster
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 22:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > Something else that occurred to me while looking further at this is we
> > > can also avoid the error in this case fairly easily by bailing out of
> > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > flush -> sync meta -> flush log sequence that returns the first error
> > > anyways. That would prevent marking the inode with a new sequence number
> > > when I/Os are obviously failing. The caveat is that the test still
> > > fails, now with a "Read-only file system" error instead of EIO, because
> > > the filesystem is shutdown by the time the vfs write inode path actually
> > > runs.
> > 
> > If some pages did write successfully we don't want to skip the rest of
> > the fsync, though.
> > 
> 
> What does it matter if the fsync() has already failed? ISTM this is
> pretty standard error handling behavior across major fs', but it's not
> clear to me if there's some bcachefs specific quirk that warrants
> different handling..
> 
> FWIW, I think I've been able to fix this test with a couple small
> tweaks:
> 
> 1. Change bch2_fsync() to return on first error.

I suppose it doesn't make sense to flush the journal if
file_write_and_wait_range() didn't successfully do anything - but I
would prefer to keep the behaviour where we do flush the journal on
partial writeback.

What if we plumbed a did_work parameter through?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
  2023-02-02 20:04               ` Brian Foster
@ 2023-02-03  0:51               ` Dave Chinner
  2023-02-04  0:35                 ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-03  0:51 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote:
> On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> > I don't have a public repo atm but I've posted the patch if you have
> > somewhere to land it for CI testing..? It survived my regression tests
> > so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> > you hadn't noticed..).
> 
> Must have missed it, sorry. I can host a git repo for you on my server,
> or github works fine - I generally prefer git repo links, git am is
> always a bit of a hassle.
> 
> > Is there a reporting dashboard or something available for the test
> > infrastructure for bcachefs?
> 
> https://evilpiepirate.org/~testdashboard/ci
> 
> I've got a small server farm that watches git branches and runs the
> entire test suite on every commit starting from the most recent - once you've
> got a git branch up I'll add yours to the list it watches.
> 
> You'll probably want to get acquainted with ktest, it's what both the CI
> uses for running tests, and what we use for local development:
> 
> https://evilpiepirate.org/git/ktest.git
> 
> > > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > > at it so I couldn't say offhand where we'd need to start, but if you're
> > > interested I'd be happy to look at what it'd take.
> > > 
> > 
> > Yeah, that would be interesting. Thanks.
> 
> Maybe we could get Dave to give us a brief rundown of freezing? It's
> been ages since I was thinking about that and it's all fallen out of my
> brain, but Dave was the one who was able to explain it to me before :)

What do you need to know? The vast majority of the freeze
infrastructure is generic and the filesystem doesn't need to do
anything special. The only thing it needs to implement is
->freeze_fs to essentially quiesce the filesystem - by this stage
all the data has been written back and all user-driven operations
have either been stalled or drained at either the VFS or transaction
reservation points.

This requires the filesystem transaction start point to call
sb_start_intwrite() in a location the transaction start can block
safely forever, and to call sb_end_intwrite() when the transaction
is complete and being torn down. 

[Note that bcachefs might also require
sb_{start/end}_{write/pagefault} calls in paths that it has custom
handlers for and to protect against ioctl operations triggering
modifications during freezes.]

This allows freeze_super() to set a barrier to prevent new
transactions from starting, and to wait on transactions in flight to
drain. Once all transactions have drained, it will then call
->freeze_fs if it is defined so the filesystem can flush its
journal and dirty in-memory metadata so that it becomes consistent
on disk without requiring journal recovery to be run.

This basically means that once ->freeze_fs completes, the filesystem
should be in the same state on-disk as if it were unmounted cleanly,
and the fs will not issue any more IO until the filesystem is
thawed. Thaw will call ->unfreeze_fs if defined before unblocking
tasks so that the filesystem can restart things that may be needed
for normal operation that were stopped during the freeze.

It's not all that complex anymore - a few hooks to enable
modification barriers to be placed and running the
writeback part of unmount in ->freeze_fs is the main component
of the work that needs to be done....
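
As a rough sketch (the example_* names below are purely illustrative;
only sb_start_intwrite()/sb_end_intwrite(), freeze_super() and the
->freeze_fs/->unfreeze_fs hooks are the real generic pieces):

  #include <linux/fs.h>

  /* transaction start: may block here for as long as the fs is frozen */
  static void example_trans_start(struct super_block *sb)
  {
          sb_start_intwrite(sb);
          /* ... take locks, get a journal reservation, etc ... */
  }

  /* transaction teardown */
  static void example_trans_end(struct super_block *sb)
  {
          /* ... commit or abort ... */
          sb_end_intwrite(sb);
  }

  static int example_freeze_fs(struct super_block *sb)
  {
          /*
           * freeze_super() has already written data back and drained
           * transactions; flush the journal and dirty in-memory metadata
           * here so the on-disk state looks like a clean unmount.
           */
          return 0;
  }

  static const struct super_operations example_super_ops = {
          .freeze_fs      = example_freeze_fs,
          /* .unfreeze_fs and the rest elided */
  };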

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-03  0:51               ` Dave Chinner
@ 2023-02-04  0:35                 ` Kent Overstreet
  2023-02-07  0:03                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-04  0:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-bcachefs

On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> What do you need to know? The vast majority of the freeze
> infrastructure is generic and the filesystem doesn't need to do
> anything special. The only thing it needs to implement is
> ->freeze_fs to essentially quiesce the filesystem - by this stage
> all the data has been written back and all user-driven operations
> have either been stalled or drained at either the VFS or transaction
> reservation points.
> 
> This requires the filesystem transaction start point to call
> sb_start_intwrite() in a location the transaction start can block
> safely forever, and to call sb_end_intwrite() when the transaction
> is complete and being torn down. 
> 
> [Note that bcachefs might also require
> sb_{start/end}_{write/pagefault} calls in paths that it has custom
> handlers for and to protect against ioctl operations triggering
> modifications during freezes.]
> 
> This allows freeze_super() to set a barrier to prevent new
> transactions from starting, and to wait on transactions in flight to
> drain. Once all transactions have drained, it will then call
> ->freeze_fs if it is defined so the filesystem can flush its
> journal and dirty in-memory metadata so that it becomes consistent
> on disk without requiring journal recovery to be run.
> 
> This basically means that once ->freeze_fs completes, the filesystem
> should be in the same state on-disk as if it were unmounted cleanly,
> and the fs will not issue any more IO until the filesystem is
> thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> tasks so that the filesystem can restart things that may be needed
> for normal operation that were stopped during the freeze.
> 
> It's not all that complex anymore - a few hooks to enable
> modification barriers to be placed and running the
> writeback part of unmount in ->freeze_fs is the main component
> of the work that needs to be done....

Thanks, that is a lot simpler than I was thinking - I guess I was
thinking about task freezing for suspend, that definitely had some
tricky bits. Sounds like treating it as if we were remounting read-only
is all we need to do.

I'll get on this when I get done with backpointers, unless someone else
is willing to work on it :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
@ 2023-02-04 21:33           ` Brian Foster
  2023-02-04 22:15             ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-04 21:33 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is we
> > > > can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> 
> I suppose it doesn't make sense to flush the journal if
> file_write_and_wait_range() didn't successfully do anything - but I
> would prefer to keep the behaviour where we do flush the journal on
> partial writeback.
> 

That seems reasonable to me in principle...

> What if we plumbed a did_work parameter through?
> 

... but I'm not sure how that is supposed to work..? Tracking submits
from the current flush wouldn't be hard, but wouldn't tell us about
writeback that might have occurred in the background before fsync was
called. It also doesn't give any information about what I/Os succeeded
or failed.

I was thinking about whether we could look at PageError pages or perhaps
sample errseq or something in the error case, but the fdatawait can
clear page errors and errseq only bumps the sequence value if it's been
sampled since the last update (so no way to distinguish between "one
failure" and "everything failed").

The only other thing that comes to mind is explicit
submit/completion/error counting and tracking somewhere in the mapping
or inode or some such, but that seems like overkill given fsync error
semantics. I.e.  when fsync returns an error, we're basically telling
the user "data loss somewhere in the file," without indication of where
or how much, so the user basically has to recreate the file or toss it
anyways. It would be nice to improve on that, but that's a much bigger
problem than I'm trying to address here. ;P

So I dunno.. I'll try to think a bit more about it but atm I'm not
seeing any sort of elegant way to fix this test and also preserve the
inode flush in the event of writeback failure. Thoughts? ISTM that it's
reasonable to skip the flush on wb error given fsync behavior and error
semantics seem common across all major fs', and as background writeback
should still try to handle it in short order anyways, but it's your
tree.. :)

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-04 21:33           ` Brian Foster
@ 2023-02-04 22:15             ` Kent Overstreet
  2023-02-06 15:33               ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-04 22:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote:
> On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > > Something else that occurred to me while looking further at this is we
> > > > > can also avoid the error in this case fairly easily by bailing out of
> > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > > anyways. That would prevent marking the inode with a new sequence number
> > > > > when I/Os are obviously failing. The caveat is that the test still
> > > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > > runs.
> > > > 
> > > > If some pages did write successfully we don't want to skip the rest of
> > > > the fsync, though.
> > > > 
> > > 
> > > What does it matter if the fsync() has already failed? ISTM this is
> > > pretty standard error handling behavior across major fs', but it's not
> > > clear to me if there's some bcachefs specific quirk that warrants
> > > different handling..
> > > 
> > > FWIW, I think I've been able to fix this test with a couple small
> > > tweaks:
> > > 
> > > 1. Change bch2_fsync() to return on first error.
> > 
> > I suppose it doesn't make sense to flush the journal if
> > file_write_and_wait_range() didn't successfully do anything - but I
> > would prefer to keep the behaviour where we do flush the journal on
> > partial writeback.
> > 
> 
> That seems reasonable to me in principle...
> 
> > What if we plumbed a did_work parameter through?
> > 
> 
> ... but I'm not sure how that is supposed to work..? Tracking submits
> from the current flush wouldn't be hard, but wouldn't tell us about
> writeback that might have occurred in the background before fsync was
> called. It also doesn't give any information about what I/Os succeeded
> or failed.

Good point.

Now that I think about it, inode->bi_journal_seq won't be getting
updated unless a write actually did successfully complete - meaning
bch2_flush_inode() won't do anything unless there was a successful
transaction commit; either data was written or the inode was updated for
some other reason.

Maybe it's the sync_inode_metadata() call that's causing the journal
flush to happen?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-04 22:15             ` Kent Overstreet
@ 2023-02-06 15:33               ` Brian Foster
  2023-02-06 22:18                 ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-06 15:33 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Sat, Feb 04, 2023 at 05:15:11PM -0500, Kent Overstreet wrote:
> On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote:
> > On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> > > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > > > Something else that occurred to me while looking further at this is we
> > > > > > can also avoid the error in this case fairly easily by bailing out of
> > > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > > > anyways. That would prevent marking the inode with a new sequence number
> > > > > > when I/Os are obviously failing. The caveat is that the test still
> > > > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > > > runs.
> > > > > 
> > > > > If some pages did write successfully we don't want to skip the rest of
> > > > > the fsync, though.
> > > > > 
> > > > 
> > > > What does it matter if the fsync() has already failed? ISTM this is
> > > > pretty standard error handling behavior across major fs', but it's not
> > > > clear to me if there's some bcachefs specific quirk that warrants
> > > > different handling..
> > > > 
> > > > FWIW, I think I've been able to fix this test with a couple small
> > > > tweaks:
> > > > 
> > > > 1. Change bch2_fsync() to return on first error.
> > > 
> > > I suppose it doesn't make sense to flush the journal if
> > > file_write_and_wait_range() didn't successfully do anything - but I
> > > would prefer to keep the behaviour where we do flush the journal on
> > > partial writeback.
> > > 
> > 
> > That seems reasonable to me in principle...
> > 
> > > What if we plumbed a did_work parameter through?
> > > 
> > 
> > ... but I'm not sure how that is supposed to work..? Tracking submits
> > from the current flush wouldn't be hard, but wouldn't tell us about
> > writeback that might have occurred in the background before fsync was
> > called. It also doesn't give any information about what I/Os succeeded
> > or failed.
> 
> Good point.
> 
> Now that I think about it, inode->bi_journal_seq won't be getting
> updated unless a write actually did successfully complete - meaning
> bch2_flush_inode() won't do anything unless there was a successful
> transaction commit; either data was written or the inode was updated for
> some other reason.
> 
> Maybe it's the sync_inode_metadata() call that's causing the journal
> flush to happen?
> 

Yeah, the sync_inode_metadata() -> bch2_vfs_write_inode() ->
bch2_write_inode() path is what commits a transaction and bumps the
->bi_journal_seq on the inode. Essentially I think there are a couple of
issues related to this test, and it's perhaps easier to reason about
them separately:

The first issue is that a shutdown occurs because the test basically
does a pwrite() -> fsync() -> dm-error error table switch sequence, and
the error table switch races with journal reclaim I/O that kicks in
after the fsync completes. This problem is resolved by the bcachefs
specific sleep during the dm-table switch (hopefully to be replaced by a
freeze cycle in the future), so no bcachefs changes are required for
that one.

The second issue is that bcachefs still shuts down with the
aforementioned delay in place for a slightly different reason. This
occurs because after the switch to an error table, the test does another
pwrite() -> fsync() sequence to test the "fsync should return error"
case. What I see happen in bcachefs here is that writepages occurs
(because the buffered write is able to dirty the pages and inode and
whatnot since everything is buffered) and fails, bch2_vfs_write_inode()
commits a transaction (again successful, and marks the inode with the
latest journal seq), and then bch2_flush_inode() grabs the updated inode
seq, attempts to flush the journal, fails due to I/O error, and then
shuts down the fs. The shutdown leads to general test failure because
the test eventually switches back to a functional dm table, but bcachefs
reports errors because it has shutdown.

So yes, it seems the sync_inode_metadata() path is what bumps the
journal seq and the explicit journal flush is what shuts down the fs,
but I assume sync_inode_metadata() attempts the inode flush simply
because the buffered write that precedes it works perfectly fine. This
is what had me thinking that bch2_fsync() is being more aggressive than
it really needs to be here. With the proposed logic tweak, the second
issue is resolved because bcachefs no longer attempts the inode flush
when page writeback has failed. So what happens in the test is that it
switches back over to the functional dm table, the vfs inode flush
occurs then and succeeds, and so the test sees no (unexpected) errors
and the fs has survived without shutting down.
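
(For reference, roughly what I mean by the logic tweak - a sketch, not
the actual patch, and bch2_flush_inode()'s arguments are elided/assumed
here:)

  #include <linux/fs.h>

  /* sketch: skip the metadata/journal flush once writeback has failed */
  int bch2_fsync_sketch(struct file *file, loff_t start, loff_t end,
                        int datasync)
  {
          struct inode *inode = file_inode(file);
          int ret;

          ret = file_write_and_wait_range(file, start, end);
          if (ret)
                  return ret;     /* bail before the inode update below
                                     bumps ->bi_journal_seq */

          ret = sync_inode_metadata(inode, 1);
          if (ret)
                  return ret;

          return bch2_flush_inode(inode); /* arguments elided/assumed */
  }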

All in all I think this test basically builds on some minor assumptions
about how fsync error behavior is implemented in Linux, and bcachefs
happens to slightly diverge in a way that leads to this shutdown. IOW,
I'd expect the same problematic behavior out of XFS if it implemented
this sort of fsync logic, but afaict no other fs does that, so the test
is functional in practice. I admit that's not the best pure engineering
justification for the change in bcachefs (so I understand any
hesitation), but IMO it's reasonable in practice and worthwhile enough
to improve test coverage. I haven't audited fstests for this or
anything, but it wouldn't surprise me much if there are other tests that
rely on this sort of "assumed behavior" for testing I/O failures.
Thoughts?

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-06 15:33               ` Brian Foster
@ 2023-02-06 22:18                 ` Kent Overstreet
  2023-02-09 12:57                   ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-06 22:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote:
> So yes, it seems the sync_inode_metadata() path is what bumps the
> journal seq and the explicit journal flush is what shuts down the fs,
> but I assume sync_inode_metadata() attempts the inode flush simply
> because the buffered write that precedes it works perfectly fine. This
> is what had me thinking that bch2_fsync() is being more aggressive than
> it really needs to be here. With the proposed logic tweak, the second
> issue is resolved because bcachefs no longer attempts the inode flush
> when page writeback has failed. So what happens in the test is that it
> switches back over to the functional dm table, the vfs inode flush
> occurs then and succeeds, and so the test sees no (unexpected) errors
> and the fs has survived without shutting down.

So, sync_inode_metadata() is just writing out timestamp updates.

But, there's no reason we couldn't be doing that as part of the write
path, if it was plumbed through - the write path has to do an inode
update anyways. Although that would involve an unpack/repack of the
inode, so maybe we don't want to do that - we'd want to change the inode
format to not encode timestamps as varints, and I'm not sure that's
worth the hassle and the increased inode size.

I was eyeing changing the way timestamp updates work to make them more
direct, it would get rid of the need to call sync_inode_metadata() and
regularize timestamp updates with the way inode updates work for
everything else - ISTR that's how it works in XFS as well. But it'd make
atime updates more expensive, and that's not something I've looked into
yet.

> All in all I think this test basically builds on some minor assumptions
> about how fsync error behavior is implemented in Linux, and bcachefs
> happens to slightly diverge in a way that leads to this shutdown. IOW,
> I'd expect the same problematic behavior out of XFS if it implemented
> this sort of fsync logic, but afaict no other fs does that, so the test
> is functional in practice. I admit that's not the best pure engineering
> justification for the change in bcachefs (so I understand any
> hesitation), but IMO it's reasonable in practice and worthwhile enough
> to improve test coverage. I haven't audited fstests for this or
> anything, but it wouldn't surprise me much if there are other tests that
> rely on this sort of "assumed behavior" for testing I/O failures.
> Thoughts?

Yeah, maybe we should hold off on making any decisions until we see
where else this comes up, I hoped this test was going to be easier to
fix but it doesn't seem like there's any clean obvious solutions. Ah
well :)

LSF is coming up and as I recall that's where the errseq stuff was
originally discussed - I'd like to hear from the people behind those
changes if they think leaving our journal flush out of the errseq path
is reasonable (it seems so to me, but it's worth bringing up). I can
pencil it in to the bcachefs talk... need to start working on that...

We can just leave it failing in the CI for now - maybe make some notes
as to what our options are when we decide to come back to it, there's
definitely more straightforward and more impactful bugs to work on in
the meantime.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-04  0:35                 ` Kent Overstreet
@ 2023-02-07  0:03                   ` Dave Chinner
  2023-02-16 20:04                     ` Eric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-07  0:03 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > What do you need to know? The vast majority of the freeze
> > infrastructure is generic and the filesystem doesn't need to do
> > anything special. The only thing it needs to implement is
> > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > all the data has been written back and all user-driven operations
> > have either been stalled or drained at either the VFS or transaction
> > reservation points.
> > 
> > This requires the filesystem transaction start point to call
> > sb_start_intwrite() in a location the transaction start can block
> > safely forever, and to call sb_end_intwrite() when the transaction
> > is complete and being torn down. 
> > 
> > [Note that bcachefs might also require
> > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > handlers for and to protect against ioctl operations triggering
> > modifications during freezes.]
> > 
> > This allows freeze_super() to set a barrier to prevent new
> > transactions from starting, and to wait on transactions in flight to
> > drain. Once all transactions have drained, it will then call
> > ->freeze_fs if it is defined so the filesystem can flush its
> > journal and dirty in-memory metadata so that it becomes consistent
> > on disk without requiring journal recovery to be run.
> > 
> > This basically means that once ->freeze_fs completes, the filesystem
> > should be in the same state on-disk as if it were unmounted cleanly,
> > and the fs will not issue any more IO until the filesystem is
> > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > tasks so that the filesystem can restart things that may be needed
> > for normal operation that were stopped during the freeze.
> > 
> > It's not all that complex anymore - a few hooks to enable
> > modification barriers to be placed and running the
> > writeback part of unmount in ->freeze_fs is the main component
> > of the work that needs to be done....
> 
> Thanks, that is a lot simpler than I was thinking - I guess I was
> thinking about task freezing for suspend, that definitely had some
> tricky bits.

"freezing for suspend" is a mess - it should just be calling
freeze_super() and letting the filesystem take care of everything to
do with freezing the filesystem. The whole idea that we can suspend
a filesystem safely by running sync() and stopping kernel threads and
workqueues from running is .... broken.

> Sounds like treating it as if we were remounting read-only
> is all we need to do.

Yup, pretty much. XFS shares all the log quiescing code between
->freeze_fs and remount_ro. The only difference is that remount_ro
has to do all the work to write dirty data back to disk before
it quiesces the log....

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-06 22:18                 ` Kent Overstreet
@ 2023-02-09 12:57                   ` Brian Foster
  2023-02-09 14:58                     ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-09 12:57 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Feb 06, 2023 at 05:18:35PM -0500, Kent Overstreet wrote:
> On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote:
> > So yes, it seems the sync_inode_metadata() path is what bumps the
> > journal seq and the explicit journal flush is what shuts down the fs,
> > but I assume sync_inode_metadata() attempts the inode flush simply
> > because the buffered write that precedes it works perfectly fine. This
> > is what had me thinking that bch2_fsync() is being more aggressive than
> > it really needs to be here. With the proposed logic tweak, the second
> > issue is resolved because bcachefs no longer attempts the inode flush
> > when page writeback has failed. So what happens in the test is that it
> > switches back over to the functional dm table, the vfs inode flush
> > occurs then and succeeds, and so the test sees no (unexpected) errors
> > and the fs has survived without shutting down.
> 
> So, sync_inode_metadata() is just writing out timestamp updates.
> 

Yeah.. I hadn't confirmed, but that's what I figured.

> But, there's no reason we couldn't be doing that as part of the write
> path, if it was plumbed through - the write path has to do an inode
> update anyways. Although that would involve an unpack/repack of the
> inode, so maybe we don't want to do that - we'd want to change the inode
> format to not encode timestamps as varints, and I'm not sure that's
> worth the hassle and the increased inode size.
> 
> I was eyeing changing the way timestamp updates work to make them more
> direct, it would get rid of the need to call sync_inode_metadata() and
> regularize timestamp updates with the way inode updates work for
> everything else - ISTR that's how it works in XFS as well. But it'd make
> atime updates more expensive, and that's not something I've looked into
> yet.
> 
> > All in all I think this test basically builds on some minor assumptions
> > about how fsync error behavior is implemented in Linux, and bcachefs
> > happens to slightly diverge in a way that leads to this shutdown. IOW,
> > I'd expect the same problematic behavior out of XFS if it implemented
> > this sort of fsync logic, but afaict no other fs does that, so the test
> > is functional in practice. I admit that's not the best pure engineering
> > justification for the change in bcachefs (so I understand any
> > hesitation), but IMO it's reasonable in practice and worthwhile enough
> > to improve test coverage. I haven't audited fstests for this or
> > anything, but it wouldn't surprise me much if there are other tests that
> > rely on this sort of "assumed behavior" for testing I/O failures.
> > Thoughts?
> 
> Yeah, maybe we should hold off on making any decisions until we see
> where else this comes up, I hoped this test was going to be easier to
> fix but it doesn't seem like there's any clean obvious solutions. Ah
> well :)
> 

Indeed. While the proposed logic change seems reasonable to me (for
reasons already explained)...

> LSF is coming up and as I recall that's where the errseq stuff was
> originally discussed - I'd like to hear from the people behind those
> changes if they think leaving our journal flush out of the errseq path
> is reasonable (it seems so to me, but it's worth bringing up). I can
> pencil it in to the bcachefs talk... need to start working on that...
> 

... I also think it's reasonable to sit on it, take stock of the scope,
and gain more input wrt errseq and whatnot, should any of that lead
to potentially better solutions.

> We can just leave it failing in the CI for now - maybe make some notes
> as to what our options are when we decide to come back to it, there's
> definitely more straightforward and more impactful bugs to work on in
> the meantime.
> 

Ok. FWIW, I was just trying to pick off something that looked
approachable because that helps get more comfortable with the code,
etc., even if it doesn't result in an immediate fix. I.e., this
investigation/root-cause was helpful to grok journaling a bit more,
uncover (to me, at least) a use case for freeze, etc.

If you have any other open bugs that you think might be useful to look
into, feel free to send them along..

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-09 12:57                   ` Brian Foster
@ 2023-02-09 14:58                     ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-09 14:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 09, 2023 at 07:57:31AM -0500, Brian Foster wrote:
> If you have any other open bugs that you think might be useful to look
> into, feel free to send them along..

https://evilpiepirate.org/~testdashboard/c/f99e51d956a74e1cd78f88648b27b0dac7c13a02/xfstests.generic.388/log.br

This might be a good one to start with: a file got truncated but still
has extents in it. I'd start by running it in a loop locally in ktest,
and seeing how easily it reproduces and what else turns up.

It's starting to look like there might be a btree ordering issue -
there have been a number of unusual bug reports of filesystem
inconsistencies that are generally repaired fine by fsck but aren't
showing up in our tests either; this is the closest to those bugs that
I've come across in the tests.

truncate calls bch2_fpunch(), which deletes extents using
bch2_extent_update(); it failing to delete some extents shouldn't be
possible, and if i_size/i_sectors somehow got out of sync I'd expect it
to be in the other direction; still, there might be some asserts we can
add.

'bcachefs list_journal -a' should be one of the first things you look
at, you can grep through the output to find the most recent transaction
commit(s) that updated that inode. See if anything looks funny - was
this one of the most recent updates when the test stopped, or was it
further back?

If it's a lower level btree inconsistency, that might show up by
verifying what the journal says we have against what's in the btree -
'bcachefs list' command.

If it's a btree ordering issue, there are two main possibilities:
 - after a crash, the btree has updates it shouldn't have, because
   they're newer than the newest update that made it into the journal
 - the btree failed to persist some updates that the journal didn't
   think it needed to replay anymore

Keys have a version number field: we can use it for distinguishing
between these two possibilities. In the transaction commit path, you'll
see code for a debug mode where we initialize it based on the sequence
number of the journal reservation we obtained. There may still be some
other related code for that debug mode; the idea is to check for keys
newer than the newest journal sequence number we replayed after a crash.

We should also check for keys with blacklisted journal sequence numbers.

I think one of those approaches will turn up something :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-07  0:03                   ` Dave Chinner
@ 2023-02-16 20:04                     ` Eric Wheeler
  2023-02-20 22:19                       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Wheeler @ 2023-02-16 20:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs

On Tue, 7 Feb 2023, Dave Chinner wrote:
> On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > > What do you need to know? The vast majority of the freeze
> > > infrastructure is generic and the filesystem doesn't need to do
> > > anything special. The only thing it needs to implement is
> > > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > > all the data has been written back and all user-driven operations
> > > have either been stalled or drained at either the VFS or transaction
> > > reservation points.
> > > 
> > > This requires the filesystem transaction start point to call
> > > sb_start_intwrite() in a location the transaction start can block
> > > safely forever, and to call sb_end_intwrite() when the transaction
> > > is complete and being torn down. 
> > > 
> > > [Note that bcachefs might also require
> > > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > > handlers for and to protect against ioctl operations triggering
> > > modifications during freezes.]
> > > 
> > > This allows freeze_super() to set a barrier to prevent new
> > > transactions from starting, and to wait on transactions in flight to
> > > drain. Once all transactions have drained, it will then call
> > > ->freeze_fs if it is defined so the filesystem can flush its
> > > journal and dirty in-memory metadata so that it becomes consistent
> > > on disk without requiring journal recovery to be run.
> > > 
> > > This basically means that once ->freeze_fs completes, the filesystem
> > > should be in the same state on-disk as if it were unmounted cleanly,
> > > and the fs will not issue any more IO until the filesystem is
> > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > > tasks so that the filesystem can restart things that may be needed
> > > for normal operation that were stopped during the freeze.
> > > 
> > > It's not all that complex anymore - a few hooks to enable
> > > modification barriers to be placed and running the
> > > writeback part of unmount in ->freeze_fs is the main component
> > > of the work that needs to be done....
> > 
> > Thanks, that is a lot simpler than I was thinking - I guess I was
> > thinking about task freezing for suspend, that definitely had some
> > tricky bits.
> 
> "freezing for suspend" is a mess - it should just be calling
> freeze_super() and letting the filesystem take care of everything to
> do with freezing the filesystem. The whole idea that we can suspend
> a filesystem safely by running sync() and stopping kernel threads and
> workqueues from running is .... broken.
> 
> > Sounds like treating it as if we were remounting read-only
> > is all we need to do.
> 
> Yup, pretty much. XFS shares all the log quiescing code between
> ->freeze_fs and remount_ro. The only difference is that remount_ro
> has to do all the work to write dirty data back to disk before
> it quiesces the log....

Wait, so are you saying that XFS does not commit dirty buffers for 
sleeping, only on remount_ro? In an ideal case I suppose the laptop ram is 
still hot... But sometimes (ahem, far too often) I close my laptop and 
forget about it, in which case the battery dies and of course then any 
dirty pages are lost.  IMHO sleep should always be crash-safe.


--
Eric Wheeler



> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> dchinner@redhat.com
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-16 20:04                     ` Eric Wheeler
@ 2023-02-20 22:19                       ` Dave Chinner
  2023-02-20 23:23                         ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-20 22:19 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs

On Thu, Feb 16, 2023 at 12:04:34PM -0800, Eric Wheeler wrote:
> On Tue, 7 Feb 2023, Dave Chinner wrote:
> > On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> > > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > > > What do you need to know? The vast majority of the freeze
> > > > infrastructure is generic and the filesystem doesn't need to do
> > > > anything special. The only thing it needs to implement is
> > > > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > > > all the data has been written back and all user-driven operations
> > > > have either been stalled or drained at either the VFS or transaction
> > > > reservation points.
> > > > 
> > > > This requires the filesystem transaction start point to call
> > > > sb_start_intwrite() in a location the transaction start can block
> > > > safely forever, and to call sb_end_intwrite() when the transaction
> > > > is complete and being torn down. 
> > > > 
> > > > [Note that bcachefs might also require
> > > > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > > > handlers for and to protect against ioctl operations triggering
> > > > modifications during freezes.]
> > > > 
> > > > This allows freeze_super() to set a barrier to prevent new
> > > > transactions from starting, and to wait on transactions in flight to
> > > > drain. Once all transactions have drained, it will then call
> > > > ->freeze_fs if it is defined so the filesystem can flush its
> > > > journal and dirty in-memory metadata so that it becomes consistent
> > > > on disk without requiring journal recovery to be run.
> > > > 
> > > > This basically means that once ->freeze_fs completes, the filesystem
> > > > should be in the same state on-disk as if it were unmounted cleanly,
> > > > and the fs will not issue any more IO until the filesystem is
> > > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > > > tasks so that the filesystem can restart things that may be needed
> > > > for normal operation that were stopped during the freeze.
> > > > 
> > > > It's not all that complex anymore - a few hooks to enable
> > > > modification barriers to be placed and running the
> > > > writeback part of unmount in ->freeze_fs is the main component
> > > > of the work that needs to be done....
> > > 
> > > Thanks, that is a lot simpler than I was thinking - I guess I was
> > > thinking about task freezing for suspend, that definitely had some
> > > tricky bits.
> > 
> > "freezing for suspend" is a mess - it should just be calling
> > freeze_super() and letting the filesystem take care of everything to
> > do with freezing the filesystem. The whole idea that we can suspend
> > a filesystem safely by running sync() and stopping kernel threads and
> > workqueues from running is .... broken.
> > 
> > > Sounds like treating it as if we were remounting read-only
> > > is all we need to do.
> > 
> > Yup, pretty much. XFS shares all the log quiescing code between
> > ->freeze_fs and remount_ro. The only difference is that remount_ro
> > has to do all the work to write dirty data back to disk before
> > it quiesces the log....
> 
> Wait, so are you saying that XFS does not commit dirty buffers for 
> sleeping, only on remount_ro?

Yup. But this is not unique to XFS - every journalling filesystem
(ext3, ext4, jfs, etc.) has exactly the same problem:
sync_filesystem() only guarantees that the filesystem is consistent
on disk, not that it is clean in memory.

And by "consistent on disk", that means all dirty metadata has been
written -to the journal- so that if a crash occurs immediately
afterwards, journal recovery on the next mount will ensure that the
filesystem is consistent.

IOWs, after sync_filesystem(), the filesystem is most definitely
*not idle* and *not clean in memory*, and that's where all the
issues with suspend end up coming from - it assumes sync() is all
that is needed to put a filesystem in an idle state....

> In an ideal case I suppose the laptop ram is 
> still hot... But sometimes (ahem, far too often) I close my laptop and 
> forget about it, in which case the battery dies and of course then any 
> dirty pages are lost.  IMHO sleep should always be crash-safe.

suspend is generally considered crash safe. The problems with
suspend stem from inconsistent in-memory vs on-disk filesystem state
in the suspend image - this causes problems on resume of the
suspended image, not on the next cold boot of the system.

-Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-20 22:19                       ` Dave Chinner
@ 2023-02-20 23:23                         ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-20 23:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Wheeler, Brian Foster, linux-bcachefs

On Tue, Feb 21, 2023 at 09:19:37AM +1100, Dave Chinner wrote:
> > Wait, so are you saying that XFS does not commit dirty buffers for 
> > sleeping, only on remount_ro?
> 
> Yup. But this is not unique to XFS - every journalling filesystem
> (ext3, ext4, jfs, etc.) has exactly the same problem:
> sync_filesystem() only guarantees that the filesystem is consistent
> on disk, not that it is clean in memory.
> 
> And by "consistent on disk", that means all dirty metadata has been
> written -to the journal- so that if a crash occurs immediately
> afterwards, journal recovery on the next mount will ensure that the
> filesystem is consistent.
> 
> IOWs, after sync_filesystem(), the filesystem is most definitely
> *not idle* and *not clean in memory*, and that's where all the
> issues with suspend end up coming from - it assumes sync() is all
> that is needed to put a filesystem in an idle state....
> 
> > In an ideal case I suppose the laptop ram is 
> > still hot... But sometimes (ahem, far too often) I close my laptop and 
> > forget about it, in which case the battery dies and of course then any 
> > dirty pages are lost.  IMHO sleep should always be crash-safe.
> 
> suspend is generally considered crash safe. The problems with
> suspend stem from inconsistent in-memory vs on-disk filesystem state
> in the suspend image - this causes problems on resume of the
> suspended image, not on the next cold boot of the system.

Ok, so that means we'll want to do things differently.

I don't think we _quite_ have a straightforward mechanism for quiescing
things without completely flushing the btree (which is actually driven
by journal reclaim).

There is bch2_journal_block()/bch2_journal_unblock() which blocks new
journal reservations - I think that can be our starting point.
bch2_journal_block() also waits for all in flight journal entries to be
written, so that's what we want here.

If we want all dirty metadata to be written _and visible_ in the
journal, we need to make sure the last journal entry written is a flush
entry - in general we do this by setting journal_buf->must_flush for the
entry to be written; see bch2_journal_meta() for a simple example.

We'll also need to prevent new btree node writes from being issued, and
wait for in flight ones to be finished. We don't have a mechanism for
that currently, we can probably create something like
bch2_journal_block() for that.

Other IO sources:
 - superblock writes (rare in normal operation, they happen when e.g. we
   start writing data to a new set of devices... we need that in the
   superblock so that mount knows which devices we need to mount).

 - data reads/writes

 - erasure coding writes (parity blocks when creating stripes,
   reconstruct reads).

I expect we'll have to block reads from being submitted as well? Then
we'll want to block data reads/btree node reads/ec reconstruct reads
after blocking all the sources of writes...

(this is starting to feel more like the old freezer stuff.. blech)
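
Very roughly, as a sketch (only bch2_journal_block()/unblock() and
bch2_journal_meta() are real; the ordering and everything else here is
just placeholder):

  /* assumes user-visible writes are already drained, e.g. by freeze_super() */
  static int bch2_quiesce_sketch(struct bch_fs *c)
  {
          int ret;

          /* make sure the last journal entry written is a flush entry */
          ret = bch2_journal_meta(&c->journal);
          if (ret)
                  return ret;

          /* stop new journal reservations; also waits for in-flight entries */
          bch2_journal_block(&c->journal);

          /* TODO: block new btree node writes, wait for in-flight ones */
          /* TODO: quiesce superblock, data and EC writes, then reads */

          return 0;
  }

  /* thaw would undo this with bch2_journal_unblock(&c->journal) */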

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-02-20 23:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
2023-01-26 15:08 ` Kent Overstreet
2023-01-27  7:21   ` Kent Overstreet
2023-01-27 14:50   ` Brian Foster
2023-01-30 17:06     ` Kent Overstreet
2023-01-31 16:04       ` Brian Foster
2023-02-01 14:34         ` Kent Overstreet
2023-02-02 15:50           ` Brian Foster
2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
2023-02-02 20:04               ` Brian Foster
2023-02-02 22:39                 ` Kent Overstreet
2023-02-03  0:51               ` Dave Chinner
2023-02-04  0:35                 ` Kent Overstreet
2023-02-07  0:03                   ` Dave Chinner
2023-02-16 20:04                     ` Eric Wheeler
2023-02-20 22:19                       ` Dave Chinner
2023-02-20 23:23                         ` Kent Overstreet
2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
2023-02-04 21:33           ` Brian Foster
2023-02-04 22:15             ` Kent Overstreet
2023-02-06 15:33               ` Brian Foster
2023-02-06 22:18                 ` Kent Overstreet
2023-02-09 12:57                   ` Brian Foster
2023-02-09 14:58                     ` Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.