* fstests generic/441 -- occasional bcachefs failure
@ 2023-01-25 15:45 Brian Foster
  2023-01-26 15:08 ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-01-25 15:45 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Hi Kent, list,

I've noticed occasional failures of generic/441 on bcachefs that look
something like this:

generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
    --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
    +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
    @@ -1,3 +1,3 @@
     QA output created by 441
     Format and mount
    -Test passed!
    +First fsync after reopen of fd[0] failed: Input/output error
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)

This test covers fsync error reporting across multiple fds. What it does
in a nutshell is open a bunch of fds, write to them and fsync them, and
then repeat that a couple times over after switching to an error table
(i.e. dm-error) and back to a working table, verifying error reporting
expectations at each step along the way.
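
For reference, the important part of the test boils down to roughly the
following sequence. This is paraphrased from memory rather than the actual
fsync-err.c source, so names and details are illustrative only:

/* Paraphrased sketch of the test flow; not the real xfstests code. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NR_FDS 10

static void check(int cond, const char *msg)
{
	if (!cond) {
		fprintf(stderr, "%s\n", msg);
		exit(1);
	}
}

static void run(const char *path)
{
	char buf[4096];
	int fd[NR_FDS];
	int i;

	memset(buf, 0xaa, sizeof(buf));

	for (i = 0; i < NR_FDS; i++)
		fd[i] = open(path, O_RDWR | O_CREAT, 0644);

	/* working table: write + fsync on every fd must succeed */
	for (i = 0; i < NR_FDS; i++) {
		pwrite(fd[i], buf, sizeof(buf), (off_t)i * sizeof(buf));
		check(fsync(fd[i]) == 0, "fsync failed on working table");
	}

	/* the test harness switches the device to dm-error here */

	/* each fd should now report the writeback error exactly once */
	for (i = 0; i < NR_FDS; i++) {
		pwrite(fd[i], buf, sizeof(buf), (off_t)i * sizeof(buf));
		check(fsync(fd[i]) != 0, "fsync unexpectedly succeeded");
	}

	/* the harness switches back to the working table and the fds are
	 * reopened; the first fsync on each reopened fd is then expected
	 * to succeed, which is the step failing above */
	for (i = 0; i < NR_FDS; i++)
		check(fsync(fd[i]) == 0, "first fsync after reopen failed");
}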

The error above is associated with an fsync failure after switching from
an error table back to a working table, and essentially occurs because
the filesystem has shutdown and the bcachefs inode has a journal
sequence number ahead of what the journal shows has been flushed to
disk. That makes sense in general because even though the vfs error
reporting might not expect an error, the filesystem has been shutdown
and fsync has work to do. I've been digging into it a bit regardless to
try and understand why the behavior seems transient..

What I'm seeing looks like a matter of timing at the point the test
switches to an error table and issues the first fsync (where failure is
expected). IIUC, what I think is happening here in the scenario where
the test passes is something like this:

1. With the working table in place, the bch2_vfs_write_inode() ->
bch2_write_inode() path commits a transaction with pending changes to
the inode. These changes end up in some sort of cache (i.e. key cache?).

2. The bch2_flush_inode() path flushes the journal based on the sequence
number in the inode.

3. Sometime shortly after that, journal reclaim kicks in presumably to
write out pending btree node data and release the item from the journal.
Since the test switches to an error table after fsync() returns, this
async (workqueue) journal reclaim context is what typically first
encounters I/O failures and essentially shuts down the fs (i.e. we end
up in bch2_fs_emergency_read_only()).

4. The test switches back to working context. The next fsync loop
appears to succeed even though the fs is shutdown because
bch2_vfs_write_inode() transaction commit fails with -EROFS in the
commit path, which means the inode sequence number is not updated by the
bch2_mark_inode() trigger, and thus bch2_flush_inode() appears to have
nothing to do and returns 0.

In the case where the test fails, the journal reclaim context in step 3
appears to race with the transaction commit in step 4 such that
bch2_mark_inode() is called, updates the inode seq number, and
transaction commit actually succeeds before the fs is shutdown. By the
time bch2_flush_inode() is called, however, the journal->err_seq value
has been set and the flush returns -EIO back to userspace.
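
To make that ordering concrete, here is a toy model of the decision point as
I understand it - my own sketch with simplified names, not actual bcachefs
code:

/* toy model of the race described above - not bcachefs code */
#include <errno.h>
#include <stdbool.h>

struct toy_journal { unsigned long long flushed_seq, err_seq; };
struct toy_inode   { unsigned long long journal_seq; };

/* step 4's write_inode: bumps the inode seq only if the commit wins the race */
static int toy_write_inode(struct toy_inode *i, struct toy_journal *j,
			   bool fs_read_only)
{
	if (fs_read_only)
		return -EROFS;	/* commit fails, inode seq not bumped */
	i->journal_seq = j->flushed_seq + 1;
	return 0;
}

/* stand-in for bch2_flush_inode() / the journal flush */
static int toy_flush_inode(struct toy_inode *i, struct toy_journal *j)
{
	if (i->journal_seq <= j->flushed_seq)
		return 0;	/* nothing to flush: test passes */
	if (j->err_seq && i->journal_seq >= j->err_seq)
		return -EIO;	/* journal already errored: test fails */
	return 0;		/* otherwise a flush/fua journal write goes out */
}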

So all in all this doesn't really seem like a bug. The fs is shutdown
either way and the error depends on whether fsync needs to flush the
log. I was trying to think about whether this could be improved slightly
to perhaps be more consistent and avoid spurious test failure.  For
example, an artificial delay after the dm-table switchover seems to
always prevent the problem because it ensures journal reclaim has
processed the error before a subsequent fsync has a chance to bump the
inode sequence number.

There's still a lot I don't understand here on the bcachefs side, so
after poking around a bit I didn't see anything terribly obvious that
might prevent this in the fs itself. AFAICS there is no serialization
between journal reclaim and the transaction commit, so something like
rechecking journal error state after the transaction is locked (but
before the mem triggers are run) seems like it could help reduce the
race window, but not necessarily prevent the race. Is that accurate? If
so, any other thoughts on how we might prevent this particular failure?

Sorry for the longish writeup on what likely amounts to a minor
behavioral quirk... I dug into it as sort of a proxy to help get more
familiar with the code, so if the right approach is basically to leave
things as is (and/or potentially tweak the test to keep it quiet), then
that seems reasonable to me as well.

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
@ 2023-01-26 15:08 ` Kent Overstreet
  2023-01-27  7:21   ` Kent Overstreet
  2023-01-27 14:50   ` Brian Foster
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-01-26 15:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote:
> Hi Kent, list,
> 
> I've noticed occasional failures of generic/441 on bcachefs that look
> something like this:
> 
> generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
>     --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
>     +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
>     @@ -1,3 +1,3 @@
>      QA output created by 441
>      Format and mount
>     -Test passed!
>     +First fsync after reopen of fd[0] failed: Input/output error
>     ...
>     (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)
> 
> This test covers fsync error reporting across multiple fds. What it does
> in a nutshell is open a bunch of fds, write to them and fsync them, and
> then repeat that a couple times over after switching to an error table
> (i.e. dm-error) and back to a working table, verifying error reporting
> expectations at each step along the way.
> 
> The error above is associated with an fsync failure after switching from
> an error table back to a working table, and essentially occurs because
> the filesystem has shutdown and the bcachefs inode has a journal
> sequence number ahead of what the journal shows has been flushed to
> disk. That makes sense in general because even though the vfs error
> reporting might not expect an error, the filesystem has been shutdown
> and fsync has work to do. I've been digging into it a bit regardless to
> try and understand why the behavior seems transient..
> 
> What I'm seeing looks like a matter of timing at the point the test
> switches to an error table and issues the first fsync (where failure is
> expected). IIUC, what I think is happening here in the scenario where
> the test passes is something like this:

Hey Brian - my understanding of this bug was that it was even simpler.

As I understand it, the decreed semantics of fsync by the Powers That Be
are that it should return a given error _once_  - if a subsequent fsync
returns an error, then there must have been another write that errored.

So we just need to plumb our errors from the journal through the
errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
or check if it's actually doing a journal flush or if the journal flush
was already done.

So I think we'd need to add another sequence number to
bch_inode_info to track the sequence that was flushed, and then only
call bch2_journal_flush_seq() if flushed_seq !=
journal_seq_of_last_update, and then to make this work correctly when an
inode has been evicted it might need to go in struct bch_inode, stored
in the btree, as well.
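
Very roughly, and with made-up field names (ei_flushed_journal_seq doesn't
exist today, and the existing names are approximate), I'm picturing
something like:

/* sketch only - field names made up, error plumbing hand-waved */
static int bch2_flush_inode_sketch(struct bch_fs *c,
				   struct bch_inode_info *inode)
{
	int ret = 0;

	if (inode->ei_flushed_journal_seq != inode->ei_journal_seq) {
		ret = bch2_journal_flush_seq(&c->journal,
					     inode->ei_journal_seq);
		if (!ret)
			inode->ei_flushed_journal_seq = inode->ei_journal_seq;
		else
			/* hand the error to the errseq_t machinery so each
			 * open fd sees it exactly once on a later fsync */
			mapping_set_error(inode->v.i_mapping, ret);
	}

	return ret;
}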

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-26 15:08 ` Kent Overstreet
@ 2023-01-27  7:21   ` Kent Overstreet
  2023-01-27 14:50   ` Brian Foster
  1 sibling, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-01-27  7:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Jan 26, 2023 at 10:08:29AM -0500, Kent Overstreet wrote:
> Hey Brian - my understanding of this bug was that it was even simpler.
> 
> As I understand it, the decreed semantics of fsync by the Powers That Be
> are that it should return a given error _once_  - if a subsequent fsync
> returns an error, then there must have been another write that errored.
> 
> So we just need to plumb our errors from the journal through the
> errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
> or check if it's actually doing a journal flush or if the journal flush
> was already done.
> 
> So I think we'd need to add another sequence number to
> bch_inode_info to track the sequence that was flushed, and then only
> call bch2_journal_flush_seq() if flushed_seq !=
> journal_seq_of_last_update, and then to make this work correctly when an
> inode has been evicted it might need to go in struct bch_inode, stored
> in the btree, as well.

BTW, I'm not suggesting we do this right now - I agree with you in that
the current behaviour makes sense to me. It might be worth changing at
some point for consistency, but I'm not in a hurry here.

Thanks for looking at this :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-26 15:08 ` Kent Overstreet
  2023-01-27  7:21   ` Kent Overstreet
@ 2023-01-27 14:50   ` Brian Foster
  2023-01-30 17:06     ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-01-27 14:50 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Jan 26, 2023 at 10:08:25AM -0500, Kent Overstreet wrote:
> On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote:
> > Hi Kent, list,
> > 
> > I've noticed occasional failures of generic/441 on bcachefs that look
> > something like this:
> > 
> > generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad)
> >     --- tests/generic/441.out   2022-10-11 14:20:36.986202162 -0400
> >     +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500
> >     @@ -1,3 +1,3 @@
> >      QA output created by 441
> >      Format and mount
> >     -Test passed!
> >     +First fsync after reopen of fd[0] failed: Input/output error
> >     ...
> >     (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad'  to see the entire diff)
> > 
> > This test covers fsync error reporting across multiple fds. What it does
> > in a nutshell is open a bunch of fds, write to them and fsync them, and
> > then repeat that a couple times over after switching to an error table
> > (i.e. dm-error) and back to a working table, verifying error reporting
> > expectations at each step along the way.
> > 
> > The error above is associated with an fsync failure after switching from
> > an error table back to a working table, and essentially occurs because
> > the filesystem has shutdown and the bcachefs inode has a journal
> > sequence number ahead of what the journal shows has been flushed to
> > disk. That makes sense in general because even though the vfs error
> > reporting might not expect an error, the filesystem has been shutdown
> > and fsync has work to do. I've been digging into it a bit regardless to
> > try and understand why the behavior seems transient..
> > 
> > What I'm seeing looks like a matter of timing at the point the test
> > switches to an error table and issues the first fsync (where failure is
> > expected). IIUC, what I think is happening here in the scenario where
> > the test passes is something like this:
> 
> Hey Brian - my understanding of this bug was that it was even simpler.
> 
> As I understand it, the decreed semantics of fsync by the Powers That Be
> are that it should return a given error _once_  - if a subsequent fsync
> returns an error, then there must have been another write that errored.
> 

Yep, that's my understanding as well. This particular test is just
checking that the error reporting spreads across distinct fds for the
same underlying file. I.e., so when writeback fails and sets an error on
the mapping, each open fd/file should see an instance of that error on
fsync (but presumably not repeatedly unless pages are written and failed
again).

> So we just need to plumb our errors from the journal through the
> errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record
> or check if it's actually doing a journal flush or if the journal flush
> was already done.
> 

That's an interesting idea. I'd have to poke around the errseq stuff and
think about it a bit..

Something else that occurred to me while looking further at this is we
can also avoid the error in this case fairly easily by bailing out of
bch2_fsync() if page writeback fails, as opposed to the unconditional
flush -> sync meta -> flush log sequence that returns the first error
anyways. That would prevent marking the inode with a new sequence number
when I/Os are obviously failing. The caveat is that the test still
fails, now with a "Read-only file system" error instead of EIO, because
the filesystem is shutdown by the time the vfs write inode path actually
runs.
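
Concretely, I mean something like the following shape for bch2_fsync(),
paraphrased from memory rather than a literal diff (argument details
approximate):

/* sketch of the early-bail idea, not a literal patch */
int bch2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
{
	struct bch_inode_info *inode = file_bch_inode(file);
	struct bch_fs *c = inode->v.i_sb->s_fs_info;
	int ret;

	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;	/* bail: don't go on to bump the inode seq */

	ret = sync_inode_metadata(&inode->v, 1);
	if (ret)
		return ret;

	return bch2_flush_inode(c, inode);	/* args approximate */
}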

All in all ISTM the shutdown is the fundamental issue wrt this test.
I'm not sure it's worth complicating the log flush error handling with
errseq and whatnot just to filter out certain errors when the fs has
seen a catastrophic failure/shutdown (also agree with your followup that
isn't the most critical bug in the world anyways).

Either way, I am trying to grok the commit/writeback/journaling path a
bit better here to try and understand why the fs shuts down in this test
whereas other fs' don't seem to. One thing I'm noticing is that this
whole path is rather asynchronous, making it a bit hard to follow. Any
tips on the best way to work through the big picture steps involved in
the path from a trans commit, through journaling, to eventual btree
writeback?

Thanks for the feedback..

Brian

> So I think we'd need to add another sequence number to
> bch_inode_info to track the sequence that was flushed, and then only
> call bch2_journal_flush_seq() if flushed_seq !=
> journal_seq_of_last_update, and then to make this work correctly when an
> inode has been evicted it might need to go in struct bch_inode, stored
> in the btree, as well.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-27 14:50   ` Brian Foster
@ 2023-01-30 17:06     ` Kent Overstreet
  2023-01-31 16:04       ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-01-30 17:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> Something else that occurred to me while looking further at this is we
> can also avoid the error in this case fairly easily by bailing out of
> bch2_fsync() if page writeback fails, as opposed to the unconditional
> flush -> sync meta -> flush log sequence that returns the first error
> anyways. That would prevent marking the inode with a new sequence number
> when I/Os are obviously failing. The caveat is that the test still
> fails, now with a "Read-only file system" error instead of EIO, because
> the filesystem is shutdown by the time the vfs write inode path actually
> runs.

If some pages did write successfully we don't want to skip the rest of
the fsync, though.

The test failing because the filesystem went read-only is because a
btree node write fails - I'm assuming without that change the test just
fails before any btree node writes are attempted.

> All in all ISTM the shutdown is the fundamental issue wrt this test.
> I'm not sure it's worth complicating the log flush error handling with
> errseq and whatnot just to filter out certain errors when the fs has
> seen a catastrophic failure/shutdown (also agree with your followup that
> isn't the most critical bug in the world anyways).

Well, our default setting on serious error is to go read-only, and a
btree node write failing (in a single device, non replicated filesystem)
definitely qualifies.

It looks like our behaviour on btree node write error does need to be
looked at and improved/fixed: btree_io.c line 1734 has a clearly
unfinished case. It looks like we're missing whatever logic should be
there for a replicated filesystem - we definitely need to decide what to
do when some but not all btree node writes fail.

More relevant to this test, we also have a setting for error behaviour -
errors=(continue|ro|panic), but that explicitly does _not_ apply to
write IO errors; if we can't write out metadata we always go read-only.

I'm not in a hurry to change that, because one of the things bcachefs
currently does really well is stopping the world if things ever go so
wrong that the filesystem would be inconsistent - we haven't had too
much in the way of unrecoverable filesystems, and I don't want that to
change :)

> Either way, I am trying to grok the commit/writeback/journaling path a
> bit better here to try and understand why the fs shuts down in this test
> whereas other fs' don't seem to. One thing I'm noticing is that this
> whole path is rather asynchronous, making it a bit hard to follow. Any
> tips on the best way to work through the big picture steps involved in
> the path from a trans commit, through journaling, to eventual btree
> writeback?

Transaction commit path is pretty much all straight line synchronous
code - that's btree_update_leaf.c.

Basically, the transaction commit path takes a list of updates, and
 - runs triggers (which will then generate more updates!)
 - ensures all btree paths are fully traversed and have intent locks
 - takes write locks on btree nodes
 - gets a journal reservation
 - does updates to leaf nodes/btree key cache (and soon the btree write
   buffer as well)
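
From a caller's point of view the shape is roughly this (sketch from memory;
iterator setup and transaction restart handling elided):

	/* rough sketch of a caller - declarations/restarts elided */
	struct btree_trans trans;
	u64 journal_seq = 0;
	int ret;

	bch2_trans_init(&trans, c, 0, 0);

	ret =   bch2_trans_update(&trans, &iter, &new_key->k_i, 0) ?:
		bch2_trans_commit(&trans, NULL, &journal_seq,
				  BTREE_INSERT_NOFAIL);

	bch2_trans_exit(&trans);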

Btree node merging and splitting is also driven by the transaction
commit path - before doing the update we check if we need to merge
btree nodes (it has to be before the transaction commit in order to
happen reliably), and btree node splitting is done in an error path -
after taking write locks if we don't have space for the insert we bail
out and do the split in bch2_trans_commit_error().
 
When we update a leaf node or btree key cache key, we mark it as dirty
and create a journal pin for the journal sequence number we got a
journal reservation for. A journal pin has a callback for journal
reclaim to invoke - typically, writeback for the btree node cache and
btree key cache is driven entirely by journal reclaim, and the journal
pin also keeps the relevant journal entry dirty so that it'll be
replayed by journal replay.
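
In code that's roughly the following (names and signatures from memory, so
treat this as approximate rather than the real interface):

/* approximate sketch - see journal_reclaim.h for the real interface */
static int my_flush_fn(struct journal *j,
		       struct journal_entry_pin *pin, u64 seq)
{
	/* write back the dirty btree node / key cache entry here,
	 * then release the pin so the journal entry can be reclaimed */
	bch2_journal_pin_drop(j, pin);
	return 0;
}

/* when dirtying, pin the seq we got a reservation in: */
bch2_journal_pin_add(&c->journal, res.seq, &pin, my_flush_fn);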

(That writeback behaviour will be changing at some point: the behaviour
up until recently was that journal reclaim monitored the dirty
percentage of the btree node cache and key cache and would run to those
numbers under some threshold, but it turns out that really hurts
performance on purely random write benchmarks - we writeback too
aggressively, and we need to allow those caches to be 100% dirty if
there isn't any memory pressure).

The journalling code is pretty self contained w.r.t. the rest of
bcachefs. It is indeed a big pile of tricky asynchronous code (the fast
paths need to be lockless - right now I'm working on making it even more
lockless - and there's a lot of pipelining it has to do). At a high
level it's pretty simple though, and you should be able to treat it as a
black box. The main operations are:

 - getting a reservation: this gets you a reservation in a specific
   journal entry - you're supposed to memcpy() whatever you want
   journalled into the entry to be written and then quickly drop your
   reservation.

   Blocking with a reservation will block the entire journal - note that
   we get our journal reservation after taking btree write locks in the
   transaction commit path (but we do it in nonblocking mode; if we have
   to block on the journal we drop btree locks and do the waiting in
   bch2_trans_commit_error()).
 
 - getting a pre-reservation: trying to get a journal reservation while
   holding a journal pin would deadlock the journal if it ever
   completely filled up - a pre-reservation allows reserving space in
   the journal so that we can get a reservation later without
   deadlocking. This is for e.g. flushing the btree key cache: when we
   do a btree key cache update we update the key cache in memory and
   journal the update, to flush the key cache we have to update the
   btree which requires (for simplicity) re-journalling the update.

 - flush_seq: given a journal sequence number we previously got a
   journal reservation on, request that it be written out (with a
   flush/fua write, not all journal entries are written flush/fua).
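
Strung together, a caller looks roughly like this (sketch from memory; the
helpers for copying into the reservation are elided):

	/* rough sketch - names/flags approximate */
	struct journal_res res = { 0 };
	int ret;

	ret = bch2_journal_res_get(&c->journal, &res, u64s,
				   JOURNAL_RES_GET_NONBLOCK);
	if (ret) {
		/* drop btree locks, wait for journal space, retry */
	}

	/* memcpy whatever needs journalling into the reserved space,
	 * then drop the reservation quickly: */
	bch2_journal_res_put(&c->journal, &res);

	/* fsync-style callers can later force that entry out: */
	ret = bch2_journal_flush_seq(&c->journal, res.seq);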

Then there's btree node writes, which it sounds like you might be
getting into soon. Btree node writes are more complicated: btree nodes
are log structured, so we have to differentiate between the first write
to a new node and subsequent appending writes, and before returning a
completion for an appending btree node write we have to update the
pointer to that btree node (btree node pointers record the number of
sectors currently written; this is important for multi device
filesystems, and also for ensuring consistent ordering of updates after
a crash - updates can't be visible if they weren't completed in the
journal, so we have to ignore btree node writes that are newer than the
newest completed journal write.)

Originally this was handled with the journal sequence number blacklist
mechanism - so you might see a few references to that still in
btree_io.c - but now it's handled by updating parent pointers after
every write, up to the root.

Hope that helps - and feel free to ask me more questions on IRC,
irc.oftc.net#bcache.

Cheers,
Kent

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-30 17:06     ` Kent Overstreet
@ 2023-01-31 16:04       ` Brian Foster
  2023-02-01 14:34         ` Kent Overstreet
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
  0 siblings, 2 replies; 24+ messages in thread
From: Brian Foster @ 2023-01-31 16:04 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > Something else that occurred to me while looking further at this is we
> > can also avoid the error in this case fairly easily by bailing out of
> > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > flush -> sync meta -> flush log sequence that returns the first error
> > anyways. That would prevent marking the inode with a new sequence number
> > when I/Os are obviously failing. The caveat is that the test still
> > fails, now with a "Read-only file system" error instead of EIO, because
> > the filesystem is shutdown by the time the vfs write inode path actually
> > runs.
> 
> If some pages did write successfully we don't want to skip the rest of
> the fsync, though.
> 

What does it matter if the fsync() has already failed? ISTM this is
pretty standard error handling behavior across major fs', but it's not
clear to me if there's some bcachefs specific quirk that warrants
different handling..

FWIW, I think I've been able to fix this test with a couple small
tweaks:

1. Change bch2_fsync() to return on first error.
2. Introduce a bcachefs specific delay in the working -> error table
switchover.

I've not run a full regression test on the bcachefs change yet, but if
you're willing to take a patch along those lines I can do that and then
if it survives, follow up with an upstream fstests patch. That would
allow this test to pass fairly reliably and also no longer
unconditionally shutdown bcachefs.

> The test failing because the filesystem went read-only is because a
> btree node write fails - I'm assuming without that change the test just
> fails before any btree node writes are attempted.
> 

Not sure I parse the above. Generally speaking, the test fails because
after the last successful fsync, it immediately switches over to the
error dm table. The fs still has some reclaim/journal work to do after
this point, however, so this essentially always results in emergency
shutdown. Whether the test passes or fails at this point depends on a
race between this shutdown sequence in the background and an fsync call
successfully committing a transaction (calling bch2_mark_inode()).

The winner of the race dictates whether the first fsync() after
switching back to the working dm table can avoid the journal flush
because the seq associated with the inode is already on disk, or whether
the fs wants to flush the seq on a filesystem that has been shutdown
(and so will always fail).

> > All in all ISTM the shutdown is the fundamental issue wrt this test.
> > I'm not sure it's worth complicating the log flush error handling with
> > errseq and whatnot just to filter out certain errors when the fs has
> > seen a catastrophic failure/shutdown (also agree with your followup that
> > isn't the most critical bug in the world anyways).
> 
> Well, our default setting on serious error is to go read-only, and a
> btree node write failing (in a single device, non replicated filesystem)
> definitely qualifies.
> 
> It looks like our behaviour on btree node write error does need to be
> looked at and improved/fixed: btree_io.c line 1734 has a clearly
> unfinished case. It looks like we're missing whatever logic should be
> there for a replicated filesystem - we definitely need to decide what to
> do when some but not all btree node writes fail.
> 

Interesting. I noticed that hunk and thought it looked odd, but didn't
have enough context to grok it.

> More relevant to this test, we also have a setting for error behaviour -
> errors=(continue|ro|panic), but that explicitly does _not_ apply to
> write IO errors; if we can't write out metadata we always go read-only.
> 
> I'm not in a hurry to change that, because one of the things bcachefs
> currently does really well is stopping the world if things ever go so
> wrong that the filesystem would be inconsistent - we haven't had too
> much in the way of unrecoverable filesystems, and I don't want that to
> change :)
> 

Indeed, that makes sense. I wonder if it would make sense to issue
retries at some point?

As another XFS example, it also has an emergency shutdown mechanism for
similar purposes. Certain I/O failures result in immediate shutdown of
the fs and no further changes make it to disk. Because shutdown itself
is a catastrophic failure, metadata write failures are retried at least
once before initiating shutdown to filter out transient errors. On top
of that, there's a sysfs section for each mount that allows an admin to
configure error handling behavior as needed. This allows setting the
number of retries before an error is considered permanent (including
infinite retries for scenarios where errors are expected to be manually
recoverable), or how long to delay retries.

I'm not sure a simple retry-once sequence would help with this test, but
an infinite retry option (with some delay) probably would.

OTOH, I wonder if a more elegant solution is to explicitly quiesce the
fs before the working -> error table switchover. That might typically be
done with a freeze/unfreeze cycle so the test can serialize against
whatever background work needs to complete before putting the block
device into error mode. However, I notice freeze isn't currently
supported on bcachefs. Any thoughts on freeze support?

> > Either way, I am trying to grok the commit/writeback/journaling path a
> > bit better here to try and understand why the fs shuts down in this test
> > whereas other fs' don't seem to. One thing I'm noticing is that this
> > whole path is rather asynchronous, making it a bit hard to follow. Any
> > tips on the best way to work through the big picture steps involved in
> > the path from a trans commit, through journaling, to eventual btree
> > writeback?
> 
> Transaction commit path is pretty much all straight line synchronous
> code - that's btree_update_leaf.c.
> 
> Basically, the transaction commit path takes a list of updates, and
>  - runs triggers (which will then generate more updates!)
>  - ensures all btree paths are fully traversed and have intent locks
>  - takes write locks on btree nodes
>  - gets a journal reservation
>  - does updates to leaf nodes/btree key cache (and soon the btree write
>    buffer as well)
> 

Yep, I was able to follow most of this with the exception of some
confusion over the key cache bits. IIUC, the initial trans commit might
add a key update to the key cache, whereas reclaim might commit another
transaction to flush the key cache to the btree (via the journal
callback you mention below).

> Btree node merging and splitting is also driven by the transaction
> commit path - before doing the update we check if we need to merge
> btree nodes (it has to be before the transaction commit in order to
> happen reliably), and btree node splitting is done in an error path -
> after taking write locks if we don't have space for the insert we bail
> out and do the split in bch2_trans_commit_error().
>  
> When we update a leaf node or btree key cache key, we mark it as dirty
> and create a journal pin for the journal sequence number we got a
> journal reservation for. A journal pin has a callback for journal
> reclaim to invoke - typically, writeback for the btree node cache and
> btree key cache is driven entirely by journal reclaim, and the journal
> pin also keeps the relevant journal entry dirty so that it'll be
> replayed by journal replay.
> 
> (That writeback behaviour will be changing at some point: the behaviour
> up until recently was that journal reclaim monitored the dirty
> percentage of the btree node cache and key cache and would run to those
> numbers under some threshold, but it turns out that really hurts
> performance on purely random write benchmarks - we writeback too
> aggressively, and we need to allow those caches to be 100% dirty if
> there isn't any memory pressure).
> 
> The journalling code is pretty self contained w.r.t. the rest of
> bcachefs. It is indeed a big pile of tricky asynchronous code (the fast
> paths need to be lockless - right now I'm working on making it even more
> lockless - and there's a lot of pipelining it has to do). At a high
> level it's pretty simple though, and you should be able to treat it as a
> black box. The main operations are:
> 
>  - getting a reservation: this gets you a reservation in a specific
>    journal entry - you're supposed to memcpy() whatever you want
>    journalled into the entry to be written and then quickly drop your
>    reservation.
> 
>    Blocking with a reservation will block the entire journal - note that
>    we get our journal reservation after taking btree write locks in the
>    transaction commit path (but we do it in nonblocking mode; if we have
>    to block on the journal we drop btree locks and do the waiting in
>    bch2_trans_commit_error()).
>  
>  - getting a pre-reservation: trying to get a journal reservation while
>    holding a journal pin would deadlock the journal if it ever
>    completely filled up - a pre-reservation allows reserving space in
>    the journal so that we can get a reservation later without
>    deadlocking. This is for e.g. flushing the btree key cache: when we
>    do a btree key cache update we update the key cache in memory and
>    journal the update, to flush the key cache we have to update the
>    btree which requires (for simplicity) re-journalling the update.
> 

Ah.. so does "for simplicity" here essentially mean "for reuse of the
transaction code?"

>  - flush_seq: given a journal sequence number we previously got a
>    journal reservation on, request that it be written out (with a
>    flush/fua write, not all journal entries are written flush/fua).
> 

Thanks for this. Something that initially confused me wrt the journal
is that I was seeing journal writes after btree node writes. The docs
mention the journal is primarily an optimization and ordering is not
important for crash consistency. With that in mind, does that mean a
journal "pin" refers to entries held in the journal until written back
to the btree (as opposed to the journal pinning keys in memory)? IOW, it
doesn't really matter in which order that journal/btree writes occur,
but only that entries remain pinned in the journal until btree writes
complete..?

(I suspect this confused me because XFS uses the same pinning
terminology, but in that context it refers to metadata items being
pinned by the log, until the log item is flushed and thus unpins the
metadata item for final writeback. So ordering is a critical factor in
this context.).

> Then there's btree node writes, which it sounds like you might be
> getting into soon. Btree node writes are more complicated: btree nodes
> are log structured, so we have to differentiate between the first write
> to a new node and subsequent appending writes, and before returning a
> completion for an appending btree node write we have to update the
> pointer to that btree node (btree node pointers record the number of
> sectors currently written; this is important for multi device
> filesystems, and also for ensuring consistent ordering of updates after
> a crash - updates can't be visible if they weren't completed in the
> journal, so we have to ignore btree node writes that are newer than the
> newest completed journal write.)
> 
> Originally this was handled with the journal sequence number blacklist
> mechanism - so you might see a few references to that still in
> btree_io.c - but now it's handled by updating parent pointers after
> every write, up to the root.
> 
> Hope that helps - and feel free to ask me more questions on IRC,
> irc.oftc.net#bcache.
> 

#bcache or #bachefs?

This is all extremely helpful. Thanks again.

Brian

> Cheers,
> Kent
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-31 16:04       ` Brian Foster
@ 2023-02-01 14:34         ` Kent Overstreet
  2023-02-02 15:50           ` Brian Foster
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-01 14:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > Something else that occurred to me while looking further at this is we
> > > can also avoid the error in this case fairly easily by bailing out of
> > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > flush -> sync meta -> flush log sequence that returns the first error
> > > anyways. That would prevent marking the inode with a new sequence number
> > > when I/Os are obviously failing. The caveat is that the test still
> > > fails, now with a "Read-only file system" error instead of EIO, because
> > > the filesystem is shutdown by the time the vfs write inode path actually
> > > runs.
> > 
> > If some pages did write successfully we don't want to skip the rest of
> > the fsync, though.
> > 
> 
> What does it matter if the fsync() has already failed? ISTM this is
> pretty standard error handling behavior across major fs', but it's not
> clear to me if there's some bcachefs specific quirk that warrants
> different handling..
> 
> FWIW, I think I've been able to fix this test with a couple small
> tweaks:
> 
> 1. Change bch2_fsync() to return on first error.
> 2. Introduce a bcachefs specific delay in the working -> error table
> switchover.
> 
> I've not run a full regression test on the bcachefs change yet, but if
> you're willing to take a patch along those lines I can do that and then
> if it survives, follow up with an upstream fstests patch. That would
> allow this test to pass fairly reliably and also no longer
> unconditionally shutdown bcachefs.

If you link me your git repo I'll add it to the CI so all the automated
tests run.

> 
> > The test failing because the filesystem went read-only is because a
> > btree node write fails - I'm assuming without that change the test just
> > fails before any btree node writes are attempted.
> > 
> 
> Not sure I parse the above. Generally speaking, the test fails because
> after the last successful fsync, it immediately switches over to the
> error dm table. The fs still has some reclaim/journal work to do after
> this point, however, so this essentially always results in emergency
> shutdown. Whether the test passes or fails at this point depends on a
> race between this shutdown sequence in the background and an fsync call
> successfully committing a transaction (calling bch2_mark_inode()).

*nod* That's more accurate.

> Indeed, that makes sense. I wonder if it would make sense to issue
> retries at some point?

Possibly. AFAIK, on typical block devices we shouldn't see write errors
unless the device is really not working anymore - both spinning rust and
SSDs will remap on write error, and on transport error IMO the block
layer is the more correct place to be issuing retries, if so desired.

But we do have aspirations of running on raw flash in the
not-too-distant future (ZNS SSDs), and I would expect we'll need retry
logic at that point.

> I'm not sure a simple retry-once sequence would help with this test, but
> an infinite retry option (with some delay) probably would.

I'd think we'd want a timeout, not a specific number of retries to
attempt (or perhaps both).

> OTOH, I wonder if a more elegant solution is to explicitly quiesce the
> fs before the working -> error table switchover. That might typically be
> done with a freeze/unfreeze cycle so the test can serialize against
> whatever background work needs to complete before putting the block
> device into error mode. However, I notice freeze isn't currently
> supported on bcachefs. Any thoughts on freeze support?

Freeze definitely needs to happen. It's been _ages_ since I was looking
at it so I couldn't say offhand where we'd need to start, but if you're
interested I'd be happy to look at what it'd take.

> Ah.. so does "for simplicity" here essentially mean "for reuse of the
> transaction code?"

Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos()
-> bch2_trans_commit().

> 
> >  - flush_seq: given a journal sequence number we previously got a
> >    journal reservation on, request that it be written out (with a
> >    flush/fua write, not all journal entries are written flush/fua).
> > 
> 
> Thanks for this. Something that initially confused me wrt the journal
> is that I was seeing journal writes after btree node writes. The docs
> mention the journal is primarily an optimization and ordering is not
> important for crash consistency.

Wherever you read that must date to well before bcachefs was a thing -
that was true in bcache when the journalling code was still new. In
bcachefs ordering of updates is important and the journal is indeed our
global time ordering.

> With that in mind, does that mean a
> journal "pin" refers to entries held in the journal until written back
> to the btree (as opposed to the journal pinning keys in memory)? IOW, it
> doesn't really matter in which order that journal/btree writes occur,
> but only that entries remain pinned in the journal until btree writes
> complete..?

Correct.

> (I suspect this confused me because XFS uses the same pinning
> terminology, but in that context it refers to metadata items being
> pinned by the log, until the log item is flushed and thus unpins the
> metadata item for final writeback. So ordering is a critical factor in
> this context.).

I see -  in XFS metadata items can't be written until the relevant log
entry is written? Yeah we never did it that way in bcachefs, blocking
writeback which already blocks memory reclaim sounded like a bad idea :)

Originally, before we started updating pointers to btree nodes after
every write, we'd note in every btree node write the newest journal
sequence number it had updates for - then in recovery we'd blacklist the
next N journal sequence numbers and if we ever saw a btree node entry
for a blacklisted journal sequence number we just ignored it.

> #bcache or #bachefs?

#bcache :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-01 14:34         ` Kent Overstreet
@ 2023-02-02 15:50           ` Brian Foster
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-02 15:50 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Feb 01, 2023 at 09:34:50AM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is we
> > > > can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> > 2. Introduce a bcachefs specific delay in the working -> error table
> > switchover.
> > 
> > I've not run a full regression test on the bcachefs change yet, but if
> > you're willing to take a patch along those lines I can do that and then
> > if it survives, follow up with an upstream fstests patch. That would
> > allow this test to pass fairly reliably and also no longer
> > unconditionally shutdown bcachefs.
> 
> If you link me your git repo I'll add it to the CI so all the automated
> tests run.
> 

I don't have a public repo atm but I've posted the patch if you have
somewhere to land it for CI testing..? It survived my regression tests
so far, FWIW. (I also had posted that random cleanup patch a bit ago if
you hadn't noticed..).

Is there a reporting dashboard or something available for the test
infrastructure for bcachefs?

> > 
> > > The test failing because the filesystem went read-only is because a
> > > btree node write fails - I'm assuming without that change the test just
> > > fails before any btree node writes are attempted.
> > > 
> > 
> > Not sure I parse the above. Generally speaking, the test fails because
> > after the last successful fsync, it immediately switches over to the
> > error dm table. The fs still has some reclaim/journal work to do after
> > this point, however, so this essentially always results in emergency
> > shutdown. Whether the test passes or fails at this point depends on a
> > race between this shutdown sequence in the background and an fsync call
> > successfully committing a transaction (calling bch2_mark_inode()).
> 
> *nod* That's more accurate.
> 
> > Indeed, that makes sense. I wonder if it would make sense to issue
> > retries at some point?
> 
> Possibly. AFAIK, on typical block devices we shouldn't see write errors
> unless the device is really not working anymore - both spinning rust and
> SSDs will remap on write error, and on transport error IMO the block
> layer is the more correct place to be issuing retries, if so desired.
> 
> But we do have aspirations of running on raw flash in the
> not-too-distant future (ZNS SSDs), and I would expect we'll need retry
> logic at that point.
> 
> > I'm not sure a simple retry-once sequence would help with this test, but
> > an infinite retry option (with some delay) probably would.
> 
> I'd think we'd want a timeout, not a specific number of retries to
> attempt (or perhaps both).
> 
> > OTOH, I wonder if a more elegant solution is to explicitly quiesce the
> > fs before the working -> error table switchover. That might typically be
> > done with a freeze/unfreeze cycle so the test can serialize against
> > whatever background work needs to complete before putting the block
> > device into error mode. However, I notice freeze isn't currently
> > supported on bcachefs. Any thoughts on freeze support?
> 
> Freeze definitely needs to happen. It's been _ages_ since I was looking
> at it so I couldn't say offhand where we'd need to start, but if you're
> interested I'd be happy to look at what it'd take.
> 

Yeah, that would be interesting. Thanks.

> > Ah.. so does "for simplicity" here essentially mean "for reuse of the
> > transaction code?"
> 
> Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos()
> -> bch2_trans_commit().
> 
> > 
> > >  - flush_seq: given a journal sequence number we previously got a
> > >    journal reservation on, request that it be written out (with a
> > >    flush/fua write, not all journal entries are written flush/fua).
> > > 
> > 
> > Thanks for this. Something that initially confused me wrt the journal
> > is that I was seeing journal writes after btree node writes. The docs
> > mention the journal is primarily an optimization and ordering is not
> > important for crash consistency.
> 
> Wherever you read that must date to well before bcachefs was a thing -
> that was true in bcache when the journalling code was still new. In
> bcachefs ordering of updates is important and the journal is indeed our
> global time ordering.
> 

FWIW, I was reading through the "Journaling" and "Sequential
consistency" sections of the architecture documentation [1]. It's
possible I just misinterpreted some wording.

[1] https://bcachefs.org/Architecture/#Journaling

> > With that in mind, does that mean a
> > journal "pin" refers to entries held in the journal until written back
> > to the btree (as opposed to the journal pinning keys in memory)? IOW, it
> > doesn't really matter in which order that journal/btree writes occur,
> > but only that entries remain pinned in the journal until btree writes
> > complete..?
> 
> Correct.
> 

Ok, I think I get the general idea then. I'll have to dig more into it
at some point.

> > (I suspect this confused me because XFS uses the same pinning
> > terminology, but in that context it refers to metadata items being
> > pinned by the log, until the log item is flushed and thus unpins the
> > metadata item for final writeback. So ordering is a critical factor in
> > this context.).
> 
> I see -  in XFS metadata items can't be written until the relevant log
> entry is written? Yeah we never did it that way in bcachefs, blocking
> writeback which already blocks memory reclaim sounded like a bad idea :)
> 

This is all a completely different design and architecture of course.
XFS has its own buffer cache for metadata, with aging heuristics, and
shrinker hooks and whatnot with feedback into log and metadata flushing
to manage all this and keep things moving along.

I was mainly just trying to wrap my head around the disk write ordering
rules for bcachefs journaling wrt to fs consistency, using "traditional
logging" (i.e. XFS) as a point of reference.

Brian

> Originally, before we started updating pointers to btree nodes after
> every write, we'd note in every btree node write the newest journal
> sequence number it had updates for - then in recovery we'd blacklist the
> next N journal sequence numbers and if we ever saw a btree node entry
> for a blacklisted journal sequence number we just ignored it.
> 
> > #bcache or #bachefs?
> 
> #bcache :)
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 15:50           ` Brian Foster
@ 2023-02-02 17:09             ` Kent Overstreet
  2023-02-02 20:04               ` Brian Foster
  2023-02-03  0:51               ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 17:09 UTC (permalink / raw)
  To: Brian Foster, dchinner; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> I don't have a public repo atm but I've posted the patch if you have
> somewhere to land it for CI testing..? It survived my regression tests
> so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> you hadn't noticed..).

Must have missed it, sorry. I can host a git repo for you on my server,
or github works fine - I generally prefer git repo links, git am is
always a bit of a hassle.

> Is there a reporting dashboard or something available for the test
> infrastructure for bcachefs?

https://evilpiepirate.org/~testdashboard/ci

I've got a small server farm that watches git branches and runs the
entire test suite on every commit starting from the recent - once you've
got a git branch up I'll add yours to the list it watches.

You'll probably want to get acquainted with ktest, it's what both the CI
uses for running tests, and what we use for local development:

https://evilpiepirate.org/git/ktest.git

> > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > at it so I couldn't say offhand where we'd need to start, but if you're
> > interested I'd be happy to look at what it'd take.
> > 
> 
> Yeah, that would be interesting. Thanks.

Maybe we could get Dave to give us a brief rundown of freezing? It's
been ages since I was thinking about that and it's all fallen out of my
brain, but Dave was the one who was able to explain it to me before :)

I think getting freezing done would up our test coverage by a good bit,
so it's worth prioritizing.

> FWIW, I was reading through the "Journaling" and "Sequential
> consistency" sections of the architecture documentation [1]. It's
> possible I just misinterpreted some wording.
> 
> [1] https://bcachefs.org/Architecture/#Journaling

*nod* Thanks for the link, that's old but not ancient - that's from when
the journal seq blacklist mechanism was how we did sequential
consistency.

Whenever I find the time to work on documentation again, I want to move
more stuff (including that sort of high level developer documentation)
to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf

Right now it's more end user documentation, but last I was working on
that I was starting to add documentation centered around on disk data
structures - i.e. btrees and key types, and it seemed to be going in a
useful direction.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
@ 2023-02-02 20:04               ` Brian Foster
  2023-02-02 22:39                 ` Kent Overstreet
  2023-02-03  0:51               ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-02 20:04 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: dchinner, linux-bcachefs

On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote:
> On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> > I don't have a public repo atm but I've posted the patch if you have
> > somewhere to land it for CI testing..? It survived my regression tests
> > so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> > you hadn't noticed..).
> 
> Must have missed it, sorry. I can host a git repo for you on my server,
> or github works fine - I generally prefer git repo links, git am is
> always a bit of a hassle.
> 

Sounds good.

> > Is there a reporting dashboard or something available for the test
> > infrastruture for bcachefs?
> 
> https://evilpiepirate.org/~testdashboard/ci
> 
> I've got a small server farm that watches git branches and runs the
> entire test suite on every commit starting from the recent - once you've
> got a git branch up I'll add yours to the list it watches.
> 

Cool. I see that it reproduces the generic/441 failure.

I need to use GitLab for other things so I managed to create a project
cloning your repo there and pushed a testing branch with both of those
commits. I think it should be accessible here:

https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing

... but let me know if it doesn't work.

> You'll probably want to get acquainted with ktest, it's what both the CI
> uses for running tests, and what we use for local development:
> 
> https://evilpiepirate.org/git/ktest.git
> 

I'll check it out, thanks!

Brian

> > > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > > at it so I couldn't say offhand where we'd need to start, but if you're
> > > interested I'd be happy to look at what it'd take.
> > > 
> > 
> > Yeah, that would be interesting. Thanks.
> 
> Maybe we could get Dave to give us a brief rundown of freezing? It's
> been ages since I was thinking about that and it's all fallen out of my
> brain, but Dave was the one who was able to explain it to me before :)
> 
> I think getting freezing done would up our test coverage by a good bit,
> so it's worth prioritizing.
> 
> > FWIW, I was reading through the "Journaling" and "Sequential
> > consistency" sections of the architecture documentation [1]. It's
> > possible I just misinterpreted some wording.
> > 
> > [1] https://bcachefs.org/Architecture/#Journaling
> 
> *nod* Thanks for the link, that's old but not ancient - that's from when
> the journal seq blacklist mechanism was how we did sequential
> consistency.
> 
> Whenever I find the time to work on documentation again, I want to move
> more stuff (including that sort of high level developer documentation)
> to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf
> 
> Right now it's more end user documentation, but last I was working on
> that I was starting to add documentation centered around on disk data
> structures - i.e. btrees and key types, and it seemed to be going in a
> useful direction.
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 20:04               ` Brian Foster
@ 2023-02-02 22:39                 ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 22:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 03:04:22PM -0500, Brian Foster wrote:
> I need to use GitLab for other things so I managed to create a project
> cloning your repo there and pushed a testing branch with both of those
> commits. I think it should be accessible here:
> 
> https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing

You've been added:
https://evilpiepirate.org/~testdashboard/ci?branch=bfoster

If results don't show up right away it's because I just rewrote the CI's
configuration file handling - generally only takes ~20 minutes though.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-01-31 16:04       ` Brian Foster
  2023-02-01 14:34         ` Kent Overstreet
@ 2023-02-02 22:56         ` Kent Overstreet
  2023-02-04 21:33           ` Brian Foster
  1 sibling, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-02 22:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > Something else that occurred to me while looking further at this is we
> > > can also avoid the error in this case fairly easily by bailing out of
> > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > flush -> sync meta -> flush log sequence that returns the first error
> > > anyways. That would prevent marking the inode with a new sequence number
> > > when I/Os are obviously failing. The caveat is that the test still
> > > fails, now with a "Read-only file system" error instead of EIO, because
> > > the filesystem is shutdown by the time the vfs write inode path actually
> > > runs.
> > 
> > If some pages did write successfully we don't want to skip the rest of
> > the fsync, though.
> > 
> 
> What does it matter if the fsync() has already failed? ISTM this is
> pretty standard error handling behavior across major fs', but it's not
> clear to me if there's some bcachefs specific quirk that warrants
> different handling..
> 
> FWIW, I think I've been able to fix this test with a couple small
> tweaks:
> 
> 1. Change bch2_fsync() to return on first error.

I suppose it doesn't make sense to flush the journal if
file_write_and_wait_range() didn't successfully do anything - but I
would prefer to keep the behaviour where we do flush the journal on
partial writeback.

What if we plumbed a did_work parameter through?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
  2023-02-02 20:04               ` Brian Foster
@ 2023-02-03  0:51               ` Dave Chinner
  2023-02-04  0:35                 ` Kent Overstreet
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-03  0:51 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote:
> On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote:
> > I don't have a public repo atm but I've posted the patch if you have
> > somewhere to land it for CI testing..? It survived my regression tests
> > so far, FWIW. (I also had posted that random cleanup patch a bit ago if
> > you hadn't noticed..).
> 
> Must have missed it, sorry. I can host a git repo for you on my server,
> or github works fine - I generally prefer git repo links, git am is
> always a bit of a hassle.
> 
> > Is there a reporting dashboard or something available for the test
> > infrastructure for bcachefs?
> 
> https://evilpiepirate.org/~testdashboard/ci
> 
> I've got a small server farm that watches git branches and runs the
> entire test suite on every commit starting from the most recent - once you've
> got a git branch up I'll add yours to the list it watches.
> 
> You'll probably want to get acquainted with ktest, it's what both the CI
> uses for running tests, and what we use for local development:
> 
> https://evilpiepirate.org/git/ktest.git
> 
> > > Freeze definitely needs to happen. It's been _ages_ since I was looking
> > > at it so I couldn't say offhand where we'd need to start, but if you're
> > > interested I'd be happy to look at what it'd take.
> > > 
> > 
> > Yeah, that would be interesting. Thanks.
> 
> Maybe we could get Dave to give us a brief rundown of freezing? It's
> been ages since I was thinking about that and it's all fallen out of my
> brain, but Dave was the one who was able to explain it to me before :)

What do you need to know? The vast majority of the freeze
infrastructure is generic and the filesystem doesn't need to do
anything special. The only thing it needs to implement is
->freeze_fs to essentially quiesce the filesystem - by this stage
all the data has been written back and all user-driven operations
have either been stalled or drained at either the VFS or transaction
reservation points.

This requires the filesystem transaction start point to call
sb_start_intwrite() in a location the transaction start can block
safely forever, and to call sb_end_intwrite() when the transaction
is complete and being torn down. 

[Note that bcachefs might also require
sb_{start/end}_{write/pagefault} calls in paths that it has custom
handlers for and to protect against ioctl operations triggering
modifications during freezes.]

This allows freeze_super() to set a barrier to prevent new
transactions from starting, and to wait on transactions in flight to
drain. Once all transactions have drained, it will then call
->freeze_fs if it is defined so the filesystem can flush its
journal and dirty in-memory metadata so that it becomes consistent
on disk without requiring journal recovery to be run.

This basically means that once ->freeze_fs completes, the filesystem
should be in the same state on-disk as if it were unmounted cleanly,
and the fs will not issue any more IO until the filesystem is
thawed. Thaw will call ->unfreeze_fs if defined before unblocking
tasks so that the filesystem can restart things that may be needed
for normal operation that were stopped during the freeze.

It's not all that complex anymore - a few hooks to enable
modification barriers to be placed and running the
writeback part of unmount in ->freeze_fs is the main component
of the work that needs to be done....
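
As a rough sketch (the example_* names below are purely illustrative;
only sb_start_intwrite()/sb_end_intwrite(), freeze_super() and the
->freeze_fs/->unfreeze_fs hooks are the real generic pieces):

  #include <linux/fs.h>

  /* transaction start: may block here for as long as the fs is frozen */
  static void example_trans_start(struct super_block *sb)
  {
          sb_start_intwrite(sb);
          /* ... take locks, get a journal reservation, etc ... */
  }

  /* transaction teardown */
  static void example_trans_end(struct super_block *sb)
  {
          /* ... commit or abort ... */
          sb_end_intwrite(sb);
  }

  static int example_freeze_fs(struct super_block *sb)
  {
          /*
           * freeze_super() has already written data back and drained
           * transactions; flush the journal and dirty in-memory metadata
           * here so the on-disk state looks like a clean unmount.
           */
          return 0;
  }

  static const struct super_operations example_super_ops = {
          .freeze_fs      = example_freeze_fs,
          /* .unfreeze_fs and the rest elided */
  };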

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-03  0:51               ` Dave Chinner
@ 2023-02-04  0:35                 ` Kent Overstreet
  2023-02-07  0:03                   ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-04  0:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-bcachefs

On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> What do you need to know? The vast majority of the freeze
> infrastructure is generic and the filesystem doesn't need to do
> anything special. The only thing it needs to implement is
> ->freeze_fs to essentially quiesce the filesystem - by this stage
> all the data has been written back and all user-driven operations
> have either been stalled or drained at either the VFS or transaction
> reservation points.
> 
> This requires the filesystem transaction start point to call
> sb_start_intwrite() in a location the transaction start can block
> safely forever, and to call sb_end_intwrite() when the transaction
> is complete and being torn down. 
> 
> [Note that bcachefs might also require
> sb_{start/end}_{write/pagefault} calls in paths that it has custom
> handlers for and to protect against ioctl operations triggering
> modifications during freezes.]
> 
> This allows freeze_super() to set a barrier to prevent new
> transactions from starting, and to wait on transactions in flight to
> drain. Once all transactions have drained, it will then call
> ->freeze_fs if it is defined so the filesystem can flush its
> journal and dirty in-memory metadata so that it becomes consistent
> on disk without requiring journal recovery to be run.
> 
> This basically means that once ->freeze_fs completes, the filesystem
> should be in the same state on-disk as if it were unmounted cleanly,
> and the fs will not issue any more IO until the filesystem is
> thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> tasks so that the filesystem can restart things that may be needed
> for normal operation that were stopped during the freeze.
> 
> It's not all that complex anymore - a few hooks to enable
> modification barriers to be placed and running the
> writeback part of unmount in ->freeze_fs is the main component
> of the work that needs to be done....

Thanks, that is a lot simpler than I was thinking - I guess I was
thinking about task freezing for suspend, that definitely had some
tricky bits. Sounds like treating it as if we were remounting read-only
is all we need to do.

I'll get on this when I get done with backpointers, unless someone else
is willing to work on it :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
@ 2023-02-04 21:33           ` Brian Foster
  2023-02-04 22:15             ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-04 21:33 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > Something else that occurred to me while looking further at this is we
> > > > can also avoid the error in this case fairly easily by bailing out of
> > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > anyways. That would prevent marking the inode with a new sequence number
> > > > when I/Os are obviously failing. The caveat is that the test still
> > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > runs.
> > > 
> > > If some pages did write successfully we don't want to skip the rest of
> > > the fsync, though.
> > > 
> > 
> > What does it matter if the fsync() has already failed? ISTM this is
> > pretty standard error handling behavior across major fs', but it's not
> > clear to me if there's some bcachefs specific quirk that warrants
> > different handling..
> > 
> > FWIW, I think I've been able to fix this test with a couple small
> > tweaks:
> > 
> > 1. Change bch2_fsync() to return on first error.
> 
> I suppose it doesn't make sense to flush the journal if
> file_write_and_wait_range() didn't successfully do anything - but I
> would prefer to keep the behaviour where we do flush the journal on
> partial writeback.
> 

That seems reasonable to me in principle...

> What if we plumbed a did_work parameter through?
> 

... but I'm not sure how that is supposed to work..? Tracking submits
from the current flush wouldn't be hard, but wouldn't tell us about
writeback that might have occurred in the background before fsync was
called. It also doesn't give any information about what I/Os succeeded
or failed.

I was thinking about whether we could look at PageError pages or perhaps
sample errseq or something in the error case, but the fdatawait can
clear page errors and errseq only bumps the sequence value if it's been
sampled since the last update (so no way to distinguish between "one
failure" and "everything failed").

The only other thing that comes to mind is explicit
submit/completion/error counting and tracking somewhere in the mapping
or inode or some such, but that seems like overkill given fsync error
semantics. I.e.  when fsync returns an error, we're basically telling
the user "data loss somewhere in the file," without indication of where
or how much, so the user basically has to recreate the file or toss it
anyways. It would be nice to improve on that, but that's a much bigger
problem than I'm trying to address here. ;P

So I dunno.. I'll try to think a bit more about it but atm I'm not
seeing any sort of elegant way to fix this test and also preserve the
inode flush in the event of writeback failure. Thoughts? ISTM that it's
reasonable to skip the flush on wb error given fsync behavior and error
semantics seem common across all major fs', and as background writeback
should still try to handle it in short order anyways, but it's your
tree.. :)

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-04 21:33           ` Brian Foster
@ 2023-02-04 22:15             ` Kent Overstreet
  2023-02-06 15:33               ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-04 22:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote:
> On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > > Something else that occurred to me while looking further at this is we
> > > > > can also avoid the error in this case fairly easily by bailing out of
> > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > > anyways. That would prevent marking the inode with a new sequence number
> > > > > when I/Os are obviously failing. The caveat is that the test still
> > > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > > runs.
> > > > 
> > > > If some pages did write successfully we don't want to skip the rest of
> > > > the fsync, though.
> > > > 
> > > 
> > > What does it matter if the fsync() has already failed? ISTM this is
> > > pretty standard error handling behavior across major fs', but it's not
> > > clear to me if there's some bcachefs specific quirk that warrants
> > > different handling..
> > > 
> > > FWIW, I think I've been able to fix this test with a couple small
> > > tweaks:
> > > 
> > > 1. Change bch2_fsync() to return on first error.
> > 
> > I suppose it doesn't make sense to flush the journal if
> > file_write_and_wait_range() didn't successfully do anything - but I
> > would prefer to keep the behaviour where we do flush the journal on
> > partial writeback.
> > 
> 
> That seems reasonable to me in principle...
> 
> > What if we plumbed a did_work parameter through?
> > 
> 
> ... but I'm not sure how that is supposed to work..? Tracking submits
> from the current flush wouldn't be hard, but wouldn't tell us about
> writeback that might have occurred in the background before fsync was
> called. It also doesn't give any information about what I/Os succeeded
> or failed.

Good point.

Now that I think about it, inode->bi_journal_seq won't be getting
updated unless a write actually did successfully complete - meaning
bch2_flush_inode() won't do anything unless there was a successful
transaction commit; either data was written or the inode was updated for
some other reason.

Maybe it's the sync_inode_metadata() call that's causing the journal
flush to happen?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-04 22:15             ` Kent Overstreet
@ 2023-02-06 15:33               ` Brian Foster
  2023-02-06 22:18                 ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-06 15:33 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Sat, Feb 04, 2023 at 05:15:11PM -0500, Kent Overstreet wrote:
> On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote:
> > On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote:
> > > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote:
> > > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote:
> > > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote:
> > > > > > Something else that occurred to me while looking further at this is we
> > > > > > can also avoid the error in this case fairly easily by bailing out of
> > > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional
> > > > > > flush -> sync meta -> flush log sequence that returns the first error
> > > > > > anyways. That would prevent marking the inode with a new sequence number
> > > > > > when I/Os are obviously failing. The caveat is that the test still
> > > > > > fails, now with a "Read-only file system" error instead of EIO, because
> > > > > > the filesystem is shutdown by the time the vfs write inode path actually
> > > > > > runs.
> > > > > 
> > > > > If some pages did write successfully we don't want to skip the rest of
> > > > > the fsync, though.
> > > > > 
> > > > 
> > > > What does it matter if the fsync() has already failed? ISTM this is
> > > > pretty standard error handling behavior across major fs', but it's not
> > > > clear to me if there's some bcachefs specific quirk that warrants
> > > > different handling..
> > > > 
> > > > FWIW, I think I've been able to fix this test with a couple small
> > > > tweaks:
> > > > 
> > > > 1. Change bch2_fsync() to return on first error.
> > > 
> > > I suppose it doesn't make sense to flush the journal if
> > > file_write_and_wait_range() didn't successfully do anything - but I
> > > would prefer to keep the behaviour where we do flush the journal on
> > > partial writeback.
> > > 
> > 
> > That seems reasonable to me in principle...
> > 
> > > What if we plumbed a did_work parameter through?
> > > 
> > 
> > ... but I'm not sure how that is supposed to work..? Tracking submits
> > from the current flush wouldn't be hard, but wouldn't tell us about
> > writeback that might have occurred in the background before fsync was
> > called. It also doesn't give any information about what I/Os succeeded
> > or failed.
> 
> Good point.
> 
> Now that I think about it, inode->bi_journal_seq won't be getting
> updated unless a write actually did successfully complete - meaning
> bch2_flush_inode() won't do anything unless there was a successful
> transaction commit; either data was written or the inode was updated for
> some other reason.
> 
> Maybe it's the sync_inode_metadata() call that's causing the journal
> flush to happen?
> 

Yeah, the sync_inode_metadata() -> bch2_vfs_write_inode() ->
bch2_write_inode() path is what commits a transaction and bumps the
->bi_journal_seq on the inode. Essentially I think there are a couple of
issues related to this test, and it's perhaps easier to reason about
them separately:

The first issue is that a shutdown occurs because the test basically
does a pwrite() -> fsync() -> dm-error error table switch sequence, and
the error table switch races with journal reclaim I/O that kicks in
after the fsync completes. This problem is resolved by the bcachefs
specific sleep during the dm-table switch (hopefully to be replaced by a
freeze cycle in the future), so no bcachefs changes are required for
that one.

The second issue is that bcachefs still shuts down with the
aforementioned delay in place for a slightly different reason. This
occurs because after the switch to an error table, the test does another
pwrite() -> fsync() sequence to test the "fsync should return error"
case. What I see happen in bcachefs here is that writepages occurs
(because the buffered write is able to dirty the pages and inode and
whatnot since everything is buffered) and fails, bch2_vfs_write_inode()
commits a transaction (again successful, and marks the inode with the
latest journal seq), and then bch2_flush_inode() grabs the updated inode
seq, attempts to flush the journal, fails due to I/O error, and then
shuts down the fs. The shutdown leads to general test failure because
the test eventually switches back to a functional dm table, but bcachefs
reports errors because it has shutdown.

So yes, it seems the sync_inode_metadata() path is what bumps the
journal seq and the explicit journal flush is what shuts down the fs,
but I assume sync_inode_metadata() attempts the inode flush simply
because the buffered write that precedes it works perfectly fine. This
is what had me thinking that bch2_fsync() is being more aggressive than
it really needs to be here. With the proposed logic tweak, the second
issue is resolved because bcachefs no longer attempts the inode flush
when page writeback has failed. So what happens in the test is that it
switches back over to the functional dm table, the vfs inode flush
occurs then and succeeds, and so the test sees no (unexpected) errors
and the fs has survived without shutting down.
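
(For reference, roughly what I mean by the logic tweak - a sketch, not
the actual patch, and bch2_flush_inode()'s arguments are elided/assumed
here:)

  #include <linux/fs.h>

  /* sketch: skip the metadata/journal flush once writeback has failed */
  int bch2_fsync_sketch(struct file *file, loff_t start, loff_t end,
                        int datasync)
  {
          struct inode *inode = file_inode(file);
          int ret;

          ret = file_write_and_wait_range(file, start, end);
          if (ret)
                  return ret;     /* bail before the inode update below
                                     bumps ->bi_journal_seq */

          ret = sync_inode_metadata(inode, 1);
          if (ret)
                  return ret;

          return bch2_flush_inode(inode); /* arguments elided/assumed */
  }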

All in all I think this test basically builds on some minor assumptions
about how fsync error behavior is implemented in Linux, and bcachefs
happens to slightly diverge in a way that leads to this shutdown. IOW,
I'd expect the same problematic behavior out of XFS if it implemented
this sort of fsync logic, but afaict no other fs does that, so the test
is functional in practice. I admit that's not the best pure engineering
justification for the change in bcachefs (so I understand any
hesitation), but IMO it's reasonable in practice and worthwhile enough
to improve test coverage. I haven't audited fstests for this or
anything, but it wouldn't surprise me much if there are other tests that
rely on this sort of "assumed behavior" for testing I/O failures.
Thoughts?

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-06 15:33               ` Brian Foster
@ 2023-02-06 22:18                 ` Kent Overstreet
  2023-02-09 12:57                   ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Kent Overstreet @ 2023-02-06 22:18 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote:
> So yes, it seems the sync_inode_metadata() path is what bumps the
> journal seq and the explicit journal flush is what shuts down the fs,
> but I assume sync_inode_metadata() attempts the inode flush simply
> because the buffered write that precedes it works perfectly fine. This
> is what had me thinking that bch2_fsync() is being more aggressive than
> it really needs to be here. With the proposed logic tweak, the second
> issue is resolved because bcachefs no longer attempts the inode flush
> when page writeback has failed. So what happens in the test is that it
> switches back over to the functional dm table, the vfs inode flush
> occurs then and succeeds, and so the test sees no (unexpected) errors
> and the fs has survived without shutting down.

So, sync_inode_metadata() is just writing out timestamp updates.

But, there's no reason we couldn't be doing that as part of the write
path, if it was plumbed through - the write path has to do an inode
update anyways. Although that would involve an unpack/repack of the
inode, so maybe we don't want to do that - we'd want to change the inode
format to not encode timestamps as varints, and I'm not sure that's
worth the hassle and the increased inode size.

I was eyeing changing the way timestamp updates work to make them more
direct, it would get rid of the need to call sync_inode_metadata() and
regularize timestamp updates with the way inode updates work for
everything else - ISTR that's how it works in XFS as well. But it'd make
atime updates more expensive, and that's not something I've looked into
yet.

> All in all I think this test basically builds on some minor assumptions
> about how fsync error behavior is implemented in Linux, and bcachefs
> happens to slightly diverge in a way that leads to this shutdown. IOW,
> I'd expect the same problematic behavior out of XFS if it implemented
> this sort of fsync logic, but afaict no other fs does that, so the test
> is functional in practice. I admit that's not the best pure engineering
> justification for the change in bcachefs (so I understand any
> hesitation), but IMO it's reasonable in practice and worthwhile enough
> to improve test coverage. I haven't audited fstests for this or
> anything, but it wouldn't surprise me much if there are other tests that
> rely on this sort of "assumed behavior" for testing I/O failures.
> Thoughts?

Yeah, maybe we should hold off on making any decisions until we see
where else this comes up, I hoped this test was going to be easier to
fix but it doesn't seem like there's any clean obvious solutions. Ah
well :)

LSF is coming up and as I recall that's where the errseq stuff was
originally discussed - I'd like to hear from the people behind those
changes if they think leaving our journal flush out of the errseq path
is reasonable (it seems so to me, but it's worth bringing up). I can
pencil it in to the bcachefs talk... need to start working on that...

We can just leave it failing in the CI for now - maybe make some notes
as to what our options are when we decide to come back to it, there's
definitely more straightforward and more impactful bugs to work on in
the meantime.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-04  0:35                 ` Kent Overstreet
@ 2023-02-07  0:03                   ` Dave Chinner
  2023-02-16 20:04                     ` Eric Wheeler
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-07  0:03 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs

On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > What do you need to know? The vast majority of the freeze
> > infrastructure is generic and the filesystem doesn't need to do
> > anything special. The only thing it needs to implement is
> > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > all the data has been written back and all user-driven operations
> > have either been stalled or drained at either the VFS or transaction
> > reservation points.
> > 
> > This requires the filesystem transaction start point to call
> > sb_start_intwrite() in a location the transaction start can block
> > safely forever, and to call sb_end_intwrite() when the transaction
> > is complete and being torn down. 
> > 
> > [Note that bcachefs might also require
> > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > handlers for and to protect against ioctl operations triggering
> > modifications during freezes.]
> > 
> > This allows freeze_super() to set a barrier to prevent new
> > transactions from starting, and to wait on transactions in flight to
> > drain. Once all transactions have drained, it will then call
> > ->freeze_fs if it is defined so the filesystem can flush its
> > journal and dirty in-memory metadata so that it becomes consistent
> > on disk without requiring journal recovery to be run.
> > 
> > This basically means that once ->freeze_fs completes, the filesystem
> > should be in the same state on-disk as if it were unmounted cleanly,
> > and the fs will not issue any more IO until the filesystem is
> > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > tasks so that the filesystem can restart things that may be needed
> > for normal operation that were stopped during the freeze.
> > 
> > It's not all that complex anymore - a few hooks to enable
> > modification barriers to be placed and running the
> > writeback part of unmount in ->freeze_fs is the main component
> > of the work that needs to be done....
> 
> Thanks, that is a lot simpler than I was thinking - I guess I was
> thinking about task freezing for suspend, that definitely had some
> tricky bits.

"freezing for suspend" is a mess - it should just be calling
freeze_super() and letting the filesystem take care of everything to
do with freezing the filesystem. The whole idea that we can suspend
a filesystem safely by running sync() and stopping kernel threads and
workqueues from running is .... broken.

> Sounds like treating it as if we were remounting read-only
> is all we need to do.

Yup, pretty much. XFS shares all the log quiescing code between
->freeze_fs and remount_ro. The only difference is that remount_ro
has to do all the work to write dirty data back to disk before
it quiesces the log....

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-06 22:18                 ` Kent Overstreet
@ 2023-02-09 12:57                   ` Brian Foster
  2023-02-09 14:58                     ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2023-02-09 12:57 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Feb 06, 2023 at 05:18:35PM -0500, Kent Overstreet wrote:
> On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote:
> > So yes, it seems the sync_inode_metadata() path is what bumps the
> > journal seq and the explicit journal flush is what shuts down the fs,
> > but I assume sync_inode_metadata() attempts the inode flush simply
> > because the buffered write that precedes it works perfectly fine. This
> > is what had me thinking that bch2_fsync() is being more aggressive than
> > it really needs to be here. With the proposed logic tweak, the second
> > issue is resolved because bcachefs no longer attempts the inode flush
> > when page writeback has failed. So what happens in the test is that it
> > switches back over to the functional dm table, the vfs inode flush
> > occurs then and succeeds, and so the test sees no (unexpected) errors
> > and the fs has survived without shutting down.
> 
> So, sync_inode_metadata() is just writing out timestamp updates.
> 

Yeah.. I hadn't confirmed, but that's what I figured.

> But, there's no reason we couldn't be doing that as part of the write
> path, if it was plumbed through - the write path has to do an inode
> update anyways. Although that would involve an unpack/repack of the
> inode, so maybe we don't want to do that - we'd want to change the inode
> format to not encode timestamps as varints, and I'm not sure that's
> worth the hassle and the increased inode size.
> 
> I was eyeing changing the way timestamp updates work to make them more
> direct, it would get rid of the need to call sync_inode_metadata() and
> regularize timestamp updates with the way inode updates work for
> everything else - ISTR that's how it works in XFS as well. But it'd make
> atime updates more expensive, and that's not something I've looked into
> yet.
> 
> > All in all I think this test basically builds on some minor assumptions
> > about how fsync error behavior is implemented in Linux, and bcachefs
> > happens to slightly diverge in a way that leads to this shutdown. IOW,
> > I'd expect the same problematic behavior out of XFS if it implemented
> > this sort of fsync logic, but afaict no other fs does that, so the test
> > is functional in practice. I admit that's not the best pure engineering
> > justification for the change in bcachefs (so I understand any
> > hesitation), but IMO it's reasonable in practice and worthwhile enough
> > to improve test coverage. I haven't audited fstests for this or
> > anything, but it wouldn't surprise me much if there are other tests that
> > rely on this sort of "assumed behavior" for testing I/O failures.
> > Thoughts?
> 
> Yeah, maybe we should hold off on making any decisions until we see
> where else this comes up, I hoped this test was going to be easier to
> fix but it doesn't seem like there's any clean obvious solutions. Ah
> well :)
> 

Indeed. While the proposed logic change seems reasonable to me (for
reasons already explained)...

> LSF is coming up and as I recall that's where the errseq stuff was
> originally discussed - I'd like to hear from the people behind those
> changes if they think leaving our journal flush out of the errseq path
> is reasonable (it seems so to me, but it's worth bringing up). I can
> pencil it in to the bcachefs talk... need to start working on that...
> 

... I also think it's reasonable to sit on it, take stock of the scope,
and gain more input wrt errseq and whatnot, should any of that lead
to potentially better solutions.

> We can just leave it failing in the CI for now - maybe make some notes
> as to what our options are when we decide to come back to it, there's
> definitely more straightforward and more impactful bugs to work on in
> the meantime.
> 

Ok. FWIW, I was just trying to pick off something that looked
approachable because that helps get more comfortable with the code,
etc., even if it doesn't result in an immediate fix. I.e., this
investigation/root-cause was helpful to grok journaling a bit more,
uncover (to me, at least) a use case for freeze, etc.

If you have any other open bugs that you think might be useful to look
into, feel free to send them along..

Brian


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-09 12:57                   ` Brian Foster
@ 2023-02-09 14:58                     ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-09 14:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 09, 2023 at 07:57:31AM -0500, Brian Foster wrote:
> If you have any other open bugs that you think might be useful to look
> into, feel free to send them along..

https://evilpiepirate.org/~testdashboard/c/f99e51d956a74e1cd78f88648b27b0dac7c13a02/xfstests.generic.388/log.br

This might be a good one to start with: a file got truncated but still
has extents in it. I'd start by running it in a loop locally in ktest,
and seeing how easily it reproduces and what else turns up.

It's starting to look like there might be a btree ordering issue -
there have been a number of unusual bug reports of filesystem
inconsistencies that are generally repaired fine by fsck but aren't
showing up in our tests either; this is the closest to those bugs that
I've come across in the tests.

truncate calls bch2_fpunch(), which deletes extents using
bch2_extent_update(); it failing to delete some extents shouldn't be
possible, and if i_size/i_sectors somehow got out of sync I'd expect it
to be in the other direction; still, there might be some asserts we can
add.

'bcachefs list_journal -a' should be one of the first things you look
at, you can grep through the output to find the most recent transaction
commit(s) that updated that inode. See if anything looks funny - was
this one of the most recent updates when the test stopped, or was it
further back?

If it's a lower level btree inconsistency, that might show up by
verifying what the journal says we have against what's in the btree -
'bcachefs list' command.

If it's a btree ordering issue, there are two main possibilities:
 - after a crash, the btree has updates it shouldn't have, because
   they're newer than the newest update that made it into the journal
 - the btree failed to persist some updates that the journal didn't
   think it needed to replay anymore

Keys have a version number field: we can use it for distinguishing
between these two possibilities. In the transaction commit path, you'll
see code for a debug mode where we initialize it based on the sequence
number of the journal reservation we obtained. There may still be some
other related code for that debug mode; the idea is to check for keys
newer than the newest journal sequence number we replayed after a crash.

We should also check for keys with blacklisted journal sequence numbers.

I think one of those approaches will turn up something :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-07  0:03                   ` Dave Chinner
@ 2023-02-16 20:04                     ` Eric Wheeler
  2023-02-20 22:19                       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Wheeler @ 2023-02-16 20:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs

On Tue, 7 Feb 2023, Dave Chinner wrote:
> On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > > What do you need to know? The vast majority of the freeze
> > > infrastructure is generic and the filesystem doesn't need to do
> > > anything special. The only thing it needs to implement is
> > > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > > all the data has been written back and all user-driven operations
> > > have either been stalled or drained at either the VFS or transaction
> > > reservation points.
> > > 
> > > This requires the filesystem transaction start point to call
> > > sb_start_intwrite() in a location the transaction start can block
> > > safely forever, and to call sb_end_intwrite() when the transaction
> > > is complete and being torn down. 
> > > 
> > > [Note that bcachefs might also require
> > > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > > handlers for and to protect against ioctl operations triggering
> > > modifications during freezes.]
> > > 
> > > This allows freeze_super() to set a barrier to prevent new
> > > transactions from starting, and to wait on transactions in flight to
> > > drain. Once all transactions have drained, it will then call
> > > ->freeze_fs if it is defined so the filesystem can flush its
> > > journal and dirty in-memory metadata so that it becomes consistent
> > > on disk without requiring journal recovery to be run.
> > > 
> > > This basically means that once ->freeze_fs completes, the filesystem
> > > should be in the same state on-disk as if it were unmounted cleanly,
> > > and the fs will not issue any more IO until the filesystem is
> > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > > tasks so that the filesystem can restart things that may be needed
> > > for normal operation that were stopped during the freeze.
> > > 
> > > It's not all that complex anymore - a few hooks to enable
> > > modification barriers to be placed and running the
> > > writeback part of unmount in ->freeze_fs is the main component
> > > of the work that needs to be done....
> > 
> > Thanks, that is a lot simpler than I was thinking - I guess I was
> > thinking about task freezing for suspend, that definitely had some
> > tricky bits.
> 
> "freezing for suspend" is a mess - it should just be calling
> freeze_super() and letting the filesystem take care of everything to
> do with freezing the filesystem. The whole idea that we can suspend
> a filesystem safely by running sync() and stopping kernel threads and
> workqueues from running is .... broken.
> 
> > Sounds like treating it as if we were remounting read-only
> > is all we need to do.
> 
> Yup, pretty much. XFS shares all the log quiescing code between
> ->freeze_fs and remount_ro. The only difference is that remount_ro
> has to do all the work to write dirty data back to disk before
> it quiesces the log....

Wait, so are you saying that XFS does not commit dirty buffers for 
sleeping, only on remount_ro? In an ideal case I suppose the laptop ram is 
still hot... But sometimes (ahem, far too often) I close my laptop and 
forget about it, in which case the battery dies and of course then any 
dirty pages are lost.  IMHO sleep should always be crash-safe.


--
Eric Wheeler



> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> dchinner@redhat.com
> 
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-16 20:04                     ` Eric Wheeler
@ 2023-02-20 22:19                       ` Dave Chinner
  2023-02-20 23:23                         ` Kent Overstreet
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2023-02-20 22:19 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs

On Thu, Feb 16, 2023 at 12:04:34PM -0800, Eric Wheeler wrote:
> On Tue, 7 Feb 2023, Dave Chinner wrote:
> > On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote:
> > > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote:
> > > > What do you need to know? The vast majority of the freeze
> > > > infrastructure is generic and the filesystem doesn't need to do
> > > > anything special. The only thing it needs to implement is
> > > > ->freeze_fs to essentially quiesce the filesystem - by this stage
> > > > all the data has been written back and all user-driven operations
> > > > have either been stalled or drained at either the VFS or transaction
> > > > reservation points.
> > > > 
> > > > This requires the filesystem transaction start point to call
> > > > sb_start_intwrite() in a location the transaction start can block
> > > > safely forever, and to call sb_end_intwrite() when the transaction
> > > > is complete and being torn down. 
> > > > 
> > > > [Note that bcachefs might also require
> > > > sb_{start/end}_{write/pagefault} calls in paths that it has custom
> > > > handlers for and to protect against ioctl operations triggering
> > > > modifications during freezes.]
> > > > 
> > > > This allows freeze_super() to set a barrier to prevent new
> > > > transactions from starting, and to wait on transactions in flight to
> > > > drain. Once all transactions have drained, it will then call
> > > > ->freeze_fs if it is defined so the filesystem can flush its
> > > > journal and dirty in-memory metadata so that it becomes consistent
> > > > on disk without requiring journal recovery to be run.
> > > > 
> > > > This basically means that once ->freeze_fs completes, the filesystem
> > > > should be in the same state on-disk as if it were unmounted cleanly,
> > > > and the fs will not issue any more IO until the filesystem is
> > > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking
> > > > tasks so that the filesystem can restart things that may be needed
> > > > for normal operation that were stopped during the freeze.
> > > > 
> > > > It's not all that complex anymore - a few hooks to enable
> > > > modification barriers to be placed and running the
> > > > writeback part of unmount in ->freeze_fs is the main component
> > > > of the work that needs to be done....
> > > 
> > > Thanks, that is a lot simpler than I was thinking - I guess I was
> > > thinking about task freezing for suspend, that definitely had some
> > > tricky bits.
> > 
> > "freezing for suspend" is a mess - it should just be calling
> > freeze_super() and letting the filesystem take care of everything to
> > do with freezing the filesystem. The whole idea that we can suspend
> > a filesystem safely by running sync() and stopping kernel threads and
> > workqueues from running is .... broken.
> > 
> > > Sounds like treating it as if we were remounting read-only
> > > is all we need to do.
> > 
> > Yup, pretty much. XFS shares all the log quiescing code between
> > ->freeze_fs and remount_ro. The only difference is that remount_ro
> > has to do all the work to write dirty data back to disk before
> > it quiesces the log....
> 
> Wait, so are you saying that XFS does not commit dirty buffers for 
> sleeping, only on remount_ro?

Yup. But this is not unique to XFS - every journalling filesystem
(ext3, ext4, jfs, etc.) has exactly the same problem:
sync_filesystem() only guarantees that the filesystem is consistent
on disk, not that it is clean in memory.

And by "consistent on disk", that means all dirty metadata has been
written -to the journal- so that if a crash occurs immediately
afterwards, journal recovery on the next mount will ensure that the
filesystem is consistent.

IOWs, after sync_filesystem(), the filesystem is most definitely
*not idle* and *not clean in memory*, and that's where all the
issues with suspend end up coming from - it assumes sync() is all
that is needed to put a filesystem in an idle state....

> In an ideal case I suppose the laptop ram is 
> still hot... But sometimes (ahem, far too often) I close my laptop and 
> forget about it, in which case the battery dies and of course then any 
> dirty pages are lost.  IMHO sleep should always be crash-safe.

suspend is generally considered crash safe. The problems with
suspend stem from inconsistent in-memory vs on-disk filesystem state
in the suspend image - this causes problems on resume of the
suspended image, not on the next cold boot of the system.

-Dave.
-- 
Dave Chinner
dchinner@redhat.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure)
  2023-02-20 22:19                       ` Dave Chinner
@ 2023-02-20 23:23                         ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-20 23:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Wheeler, Brian Foster, linux-bcachefs

On Tue, Feb 21, 2023 at 09:19:37AM +1100, Dave Chinner wrote:
> > Wait, so are you saying that XFS does not commit dirty buffers for 
> > sleeping, only on remount_ro?
> 
> Yup. But this is not unique to XFS - every journalling filesystem
> (ext3, ext4, jfs, etc.) has exactly the same problem:
> sync_filesystem() only guarantees that the filesystem is consistent
> on disk, not that it is clean in memory.
> 
> And by "consistent on disk", that means all dirty metadata has been
> written -to the journal- so that if a crash occurs immediately
> afterwards, journal recovery on the next mount will ensure that the
> filesystem is consistent.
> 
> IOWs, after sync_filesystem(), the filesystem is most definitely
> *not idle* and *not clean in memory*, and that's where all the
> issues with suspend end up coming from - it assumes sync() is all
> that is needed to put a filesystem in an idle state....
> 
> > In an ideal case I suppose the laptop ram is 
> > still hot... But sometimes (ahem, far too often) I close my laptop and 
> > forget about it, in which case the battery dies and of course then any 
> > dirty pages are lost.  IMHO sleep should always be crash-safe.
> 
> suspend is generally considered crash safe. The problems with
> suspend stem from inconsistent in-memory vs on-disk filesystem state
> in the suspend image - this causes problems on resume of the
> suspended image, not on the next cold boot of the system.

Ok, so that means we'll want to do things differently.

I don't think we _quite_ have a straightforward mechanism for quiescing
things without completely flushing the btree (which is actually driven
by journal reclaim).

There is bch2_journal_block()/bch2_journal_unblock() which blocks new
journal reservations - I think that can be our starting point.
bch2_journal_block() also waits for all in flight journal entries to be
written, so that's what we want here.

If we want all dirty metadata to be written _and visible_ in the
journal, we need to make sure the last journal entry written is a flush
entry - in general we do this by setting journal_buf->must_flush for the
entry to be written; see bch2_journal_meta() for a simple example.

We'll also need to prevent new btree node writes from being issued, and
wait for in flight ones to be finished. We don't have a mechanism for
that currently, we can probably create something like
bch2_journal_block() for that.

Other IO sources:
 - superblock writes (rare in normal operation, they happen when e.g. we
   start writing data to a new set of devices... we need that in the
   superblock so that mount knows which devices we need to mount).

 - data reads/writes

 - erasure coding writes (parity blocks when creating stripes,
   reconstruct reads).

I expect we'll have to block reads from being submitted as well? Then
we'll want to block data reads/btree node reads/ec reconstruct reads
after blocking all the sources of writes...

(this is starting to feel more like the old freezer stuff.. blech)
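
Very roughly, as a sketch (only bch2_journal_block()/unblock() and
bch2_journal_meta() are real; the ordering and everything else here is
just placeholder):

  /* assumes user-visible writes are already drained, e.g. by freeze_super() */
  static int bch2_quiesce_sketch(struct bch_fs *c)
  {
          int ret;

          /* make sure the last journal entry written is a flush entry */
          ret = bch2_journal_meta(&c->journal);
          if (ret)
                  return ret;

          /* stop new journal reservations; also waits for in-flight entries */
          bch2_journal_block(&c->journal);

          /* TODO: block new btree node writes, wait for in-flight ones */
          /* TODO: quiesce superblock, data and EC writes, then reads */

          return 0;
  }

  /* thaw would undo this with bch2_journal_unblock(&c->journal) */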

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2023-02-20 23:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
2023-01-26 15:08 ` Kent Overstreet
2023-01-27  7:21   ` Kent Overstreet
2023-01-27 14:50   ` Brian Foster
2023-01-30 17:06     ` Kent Overstreet
2023-01-31 16:04       ` Brian Foster
2023-02-01 14:34         ` Kent Overstreet
2023-02-02 15:50           ` Brian Foster
2023-02-02 17:09             ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
2023-02-02 20:04               ` Brian Foster
2023-02-02 22:39                 ` Kent Overstreet
2023-02-03  0:51               ` Dave Chinner
2023-02-04  0:35                 ` Kent Overstreet
2023-02-07  0:03                   ` Dave Chinner
2023-02-16 20:04                     ` Eric Wheeler
2023-02-20 22:19                       ` Dave Chinner
2023-02-20 23:23                         ` Kent Overstreet
2023-02-02 22:56         ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
2023-02-04 21:33           ` Brian Foster
2023-02-04 22:15             ` Kent Overstreet
2023-02-06 15:33               ` Brian Foster
2023-02-06 22:18                 ` Kent Overstreet
2023-02-09 12:57                   ` Brian Foster
2023-02-09 14:58                     ` Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.