* fstests generic/441 -- occasional bcachefs failure @ 2023-01-25 15:45 Brian Foster 2023-01-26 15:08 ` Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-01-25 15:45 UTC (permalink / raw) To: linux-bcachefs; +Cc: Kent Overstreet Hi Kent, list, I've noticed occasional failures of generic/441 on bcachefs that look something like this: generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad) --- tests/generic/441.out 2022-10-11 14:20:36.986202162 -0400 +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500 @@ -1,3 +1,3 @@ QA output created by 441 Format and mount -Test passed! +First fsync after reopen of fd[0] failed: Input/output error ... (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad' to see the entire diff) This test covers fsync error reporting across multiple fds. What it does in a nutshell is open a bunch of fds, write to them and fsync them, and then repeat that a couple times over after switching to an error table (i.e. dm-error) and back to a working table, verifying error reporting expectations at each step along the way. The error above is associated with an fsync failure after switching from an error table back to a working table, and essentially occurs because the filesystem has shutdown and the bcachefs inode has a journal sequence number ahead of what the journal shows has been flushed to disk. That makes sense in general because even though the vfs error reporting might not expect an error, the filesystem has been shutdown and fsync has work to do. I've been digging into it a bit regardless to try and understand why the behavior seems transient.. What I'm seeing looks like a matter of timing at the point the test switches to an error table and issues the first fsync (where failure is expected). 
IIUC, what I think is happening here in the scenario where the test passes is something like this:

1. With the working table in place, the bch2_vfs_write_inode() -> bch2_write_inode() path commits a transaction with pending changes to the inode. These changes end up in some sort of cache (i.e. key cache?).

2. The bch2_flush_inode() path flushes the journal based on the sequence number in the inode.

3. Sometime shortly after that, journal reclaim kicks in presumably to write out pending btree node data and release the item from the journal. Since the test switches to an error table after fsync() returns, this async (workqueue) journal reclaim context is what typically first encounters I/O failures and essentially shuts down the fs (i.e. we end up in bch2_fs_emergency_read_only()).

4. The test switches back to the working table. The next fsync loop appears to succeed even though the fs is shutdown because the bch2_vfs_write_inode() transaction commit fails with -EROFS in the commit path, which means the inode sequence number is not updated by the bch2_mark_inode() trigger, and thus bch2_flush_inode() appears to have nothing to do and returns 0.

In the case where the test fails, the journal reclaim context in step 3 appears to race with the transaction commit in step 4 such that bch2_mark_inode() is called, updates the inode seq number, and the transaction commit actually succeeds before the fs is shutdown. By the time bch2_flush_inode() is called, however, the journal->err_seq value has been set and the flush returns -EIO back to userspace.

So all in all this doesn't really seem like a bug. The fs is shutdown either way and the error depends on whether fsync needs to flush the log. I was trying to think about whether this could be improved slightly to perhaps be more consistent and avoid spurious test failure.
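[Editor's note: the pass/fail split described above can be reduced to a small toy model of the sequence-number bookkeeping. This is not bcachefs code -- toy_commit()/toy_fsync() and the struct fields are invented for illustration -- but it captures why a failed commit (seq not bumped) makes the next fsync a no-op, while a commit that wins the race leaves the inode seq ahead of the journal's err_seq.]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_journal {
	uint64_t flushed_seq;	/* highest seq known to be on disk */
	uint64_t err_seq;	/* first seq that failed to write; 0 = none */
	bool	 ro;		/* fs went emergency read-only */
};

struct toy_inode {
	uint64_t journal_seq;	/* seq of the last committed update */
};

/* Transaction commit: bumps the inode seq only on success.  Once the
 * fs is read-only the commit fails with -EROFS and the seq is left
 * untouched, mirroring the inode trigger not running on failure. */
static int toy_commit(struct toy_journal *j, struct toy_inode *i, uint64_t seq)
{
	if (j->ro)
		return -EROFS;
	i->journal_seq = seq;
	return 0;
}

/* fsync: only needs a journal flush (and so can only see a journal
 * error) if the inode seq is ahead of what has been flushed. */
static int toy_fsync(struct toy_journal *j, struct toy_inode *i)
{
	if (i->journal_seq <= j->flushed_seq)
		return 0;	/* nothing to do */
	if (j->err_seq && i->journal_seq >= j->err_seq)
		return -EIO;	/* flush runs into err_seq */
	j->flushed_seq = i->journal_seq;
	return 0;
}
```

In the "test passes" ordering, reclaim flags the error first, so the commit fails and fsync has nothing to flush; in the "test fails" ordering the commit lands first and the subsequent flush returns -EIO.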
For example, an artificial delay after the dm-table switchover seems to always prevent the problem because it ensures journal reclaim has processed the error before a subsequent fsync has a chance to bump the inode sequence number. There's still a lot I don't understand here on the bcachefs side, so after poking around a bit I didn't see anything terribly obvious that might prevent this in the fs itself. AFAICS there is no serialization between journal reclaim and the transaction commit, so something like rechecking journal error state after the transaction is locked (but before the mem triggers are run) seems like it could help reduce the race window, but not necessarily prevent the race. Is that accurate? If so, any other thoughts on how we might prevent this particular failure? Sorry for the longish writeup on what likely amounts to a minor behavioral quirk... I dug into it as sort of a proxy to help get more familiar with the code, so if the right approach is basically to leave things as is (and/or potentially tweak the test to keep it quiet), then that seems reasonable to me as well. Brian ^ permalink raw reply [flat|nested] 24+ messages in thread
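[Editor's note: the "recheck journal error state after the transaction is locked" idea in the mail above can be sketched as follows. journal_errored() is a stand-in for bch2_journal_error(); commit_with_recheck() is an invented, simplified stand-in for the commit path, included only to show why the recheck narrows the window but cannot close it.]

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

static atomic_int journal_err;	/* set by journal reclaim on I/O failure */

/* stand-in for bch2_journal_error() */
static int journal_errored(void)
{
	return atomic_load(&journal_err);
}

/* Simplified commit: recheck the journal error flag once "locks" are
 * held, so a commit racing with reclaim usually bails with -EROFS
 * before the inode seq is bumped. */
static int commit_with_recheck(int *inode_seq_bumped)
{
	/* ... btree write locks taken here ... */
	if (journal_errored())
		return -EROFS;
	/* reclaim setting journal_err right here is the remaining
	 * (narrowed, but not closed) race window */
	*inode_seq_bumped = 1;
	return 0;
}
```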
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster @ 2023-01-26 15:08 ` Kent Overstreet 2023-01-27 7:21 ` Kent Overstreet 2023-01-27 14:50 ` Brian Foster 0 siblings, 2 replies; 24+ messages in thread From: Kent Overstreet @ 2023-01-26 15:08 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote: > Hi Kent, list, > > I've noticed occasional failures of generic/441 on bcachefs that look > something like this: > > generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad) > --- tests/generic/441.out 2022-10-11 14:20:36.986202162 -0400 > +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500 > @@ -1,3 +1,3 @@ > QA output created by 441 > Format and mount > -Test passed! > +First fsync after reopen of fd[0] failed: Input/output error > ... > (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad' to see the entire diff) > > This test covers fsync error reporting across multiple fds. What it does > in a nutshell is open a bunch of fds, write to them and fsync them, and > then repeat that a couple times over after switching to an error table > (i.e. dm-error) and back to a working table, verifying error reporting > expectations at each step along the way. > > The error above is associated with an fsync failure after switching from > an error table back to a working table, and essentially occurs because > the filesystem has shutdown and the bcachefs inode has a journal > sequence number ahead of what the journal shows has been flushed to > disk. That makes sense in general because even though the vfs error > reporting might not expect an error, the filesystem has been shutdown > and fsync has work to do. I've been digging into it a bit regardless to > try and understand why the behavior seems transient.. 
> > What I'm seeing looks like a matter of timing at the point the test > switches to an error table and issues the first fsync (where failure is > expected). IIUC, what I think is happening here in the scenario where > the test passes is something like this: Hey Brian - my understanding of this bug was that it was even simpler. As I understand it, the decreed semantics of fsync by the Powers That Be are that it should return a given error _once_ - if a subsequent fsync returns an error, then there must have been another write that errored. So we just need to plumb our errors from the journal through the errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record or check if it's actually doing a journal flush or if the journal flush was already done. So I think we'd need to add another sequence number to bch_inode_info to track the sequence that was flushed, and then only call bch2_journal_flush_seq() if flushed_seq != journal_seq_of_last_update, and then to make this work correctly when an inode has been evicted it might need to go in struct bch_inode, stored in the btree, as well. ^ permalink raw reply [flat|nested] 24+ messages in thread
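[Editor's note: a minimal sketch of the bookkeeping Kent proposes above. The struct and field names are invented for illustration; the real change would involve bch_inode_info (and possibly struct bch_inode for evicted inodes) and would gate the call to bch2_journal_flush_seq().]

```c
#include <assert.h>
#include <stdint.h>

struct sketch_inode {
	uint64_t journal_seq;	/* seq of the last update to this inode */
	uint64_t flushed_seq;	/* seq we last successfully flushed */
};

/* returns 1 if fsync would issue a journal flush, 0 if it can skip */
static int sketch_fsync_needs_flush(const struct sketch_inode *i)
{
	return i->flushed_seq != i->journal_seq;
}

/* record a successful flush so a repeat fsync becomes a no-op */
static void sketch_flush_done(struct sketch_inode *i)
{
	i->flushed_seq = i->journal_seq;
}
```

With this, an fsync that already flushed the inode's last update skips the journal entirely, so a stale journal error is not reported a second time unless another write bumps journal_seq.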
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-26 15:08 ` Kent Overstreet @ 2023-01-27 7:21 ` Kent Overstreet 2023-01-27 14:50 ` Brian Foster 1 sibling, 0 replies; 24+ messages in thread From: Kent Overstreet @ 2023-01-27 7:21 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Thu, Jan 26, 2023 at 10:08:29AM -0500, Kent Overstreet wrote: > Hey Brian - my understanding of this bug was that it was even simpler. > > As I understand it, the decreed semantics of fsync by the Powers That Be > are that it should return a given error _once_ - if a subsequent fsync > returns an error, then there must have been another write that errored. > > So we just need to plumb our errors from the journal through the > errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record > or check if it's actually doing a journal flush or if the journal flush > was already done. > > So I think we'd need to add another sequence number to > bch_inode_info to track the sequence that was flushed, and then only > call bch2_journal_flush_seq() if flushed_seq != > journal_seq_of_last_update, and then to make this work correctly when an > inode has been evicted it might need to go in struct bch_inode, stored > in the btree, as well. BTW, I'm not suggesting we do this right now - I agree with you in that the current behaviour makes sense to me. It might be worth changing at some point for consistency, but I'm not in a hurry here. Thanks for looking at this :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-26 15:08 ` Kent Overstreet 2023-01-27 7:21 ` Kent Overstreet @ 2023-01-27 14:50 ` Brian Foster 2023-01-30 17:06 ` Kent Overstreet 1 sibling, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-01-27 14:50 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Thu, Jan 26, 2023 at 10:08:25AM -0500, Kent Overstreet wrote: > On Wed, Jan 25, 2023 at 10:45:24AM -0500, Brian Foster wrote: > > Hi Kent, list, > > > > I've noticed occasional failures of generic/441 on bcachefs that look > > something like this: > > > > generic/441 11s ... - output mismatch (see /root/xfstests-dev/results//generic/441.out.bad) > > --- tests/generic/441.out 2022-10-11 14:20:36.986202162 -0400 > > +++ /root/xfstests-dev/results//generic/441.out.bad 2023-01-24 13:33:54.977729904 -0500 > > @@ -1,3 +1,3 @@ > > QA output created by 441 > > Format and mount > > -Test passed! > > +First fsync after reopen of fd[0] failed: Input/output error > > ... > > (Run 'diff -u /root/xfstests-dev/tests/generic/441.out /root/xfstests-dev/results//generic/441.out.bad' to see the entire diff) > > > > This test covers fsync error reporting across multiple fds. What it does > > in a nutshell is open a bunch of fds, write to them and fsync them, and > > then repeat that a couple times over after switching to an error table > > (i.e. dm-error) and back to a working table, verifying error reporting > > expectations at each step along the way. > > > > The error above is associated with an fsync failure after switching from > > an error table back to a working table, and essentially occurs because > > the filesystem has shutdown and the bcachefs inode has a journal > > sequence number ahead of what the journal shows has been flushed to > > disk. That makes sense in general because even though the vfs error > > reporting might not expect an error, the filesystem has been shutdown > > and fsync has work to do. 
I've been digging into it a bit regardless to > > try and understand why the behavior seems transient.. > > > > What I'm seeing looks like a matter of timing at the point the test > > switches to an error table and issues the first fsync (where failure is > > expected). IIUC, what I think is happening here in the scenario where > > the test passes is something like this: > > Hey Brian - my understanding of this bug was that it was even simpler. > > As I understand it, the decreed semantics of fsync by the Powers That Be > are that it should return a given error _once_ - if a subsequent fsync > returns an error, then there must have been another write that errored. > Yep, that's my understanding as well. This particular test is just checking that the error reporting spreads across distinct fds for the same underlying file. I.e., so when writeback fails and sets an error on the mapping, each open fd/file should see an instance of that error on fsync (but presumably not repeatedly unless pages are written and failed again). > So we just need to plumb our errors from the journal through the > errseq_t mechanism. The tricky bit is that bch2_fsync() doesn't record > or check if it's actually doing a journal flush or if the journal flush > was already done. > That's an interesting idea. I'd have to poke around the errseq stuff and think about it a bit.. Something else that occurred to me while looking further at this is we can also avoid the error in this case fairly easily by bailing out of bch2_fsync() if page writeback fails, as opposed to the unconditional flush -> sync meta -> flush log sequence that returns the first error anyways. That would prevent marking the inode with a new sequence number when I/Os are obviously failing. The caveat is that the test still fails, now with a "Read-only file system" error instead of EIO, because the filesystem is shutdown by the time the vfs write inode path actually runs. 
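[Editor's note: the "bail out of bch2_fsync() on first error" ordering suggested above can be shown with a small stand-in. Each array element represents the result of one step of the flush -> sync meta -> flush journal sequence; the function and its shape are invented for illustration, not the real bch2_fsync().]

```c
#include <assert.h>
#include <errno.h>

/* Run three fsync "steps"; return on the first failure so that a
 * writeback error prevents the later metadata-sync step (the one that
 * would bump the inode sequence number) from ever running. */
static int fsync_first_error(const int step_ret[3], int *steps_run)
{
	int i, ret;

	*steps_run = 0;
	for (i = 0; i < 3; i++) {
		(*steps_run)++;
		ret = step_ret[i];
		if (ret)
			return ret;	/* bail: later steps never run */
	}
	return 0;
}
```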
All in all ISTM the shutdown is the fundamental issue wrt this test. I'm not sure it's worth complicating the log flush error handling with errseq and whatnot just to filter out certain errors when the fs has seen a catastrophic failure/shutdown (also agree with your followup that this isn't the most critical bug in the world anyways).

Either way, I am trying to grok the commit/writeback/journaling path a bit better here to try and understand why the fs shuts down in this test whereas other fs' don't seem to. One thing I'm noticing is that this whole path is rather asynchronous, making it a bit hard to follow. Any tips on the best way to work through the big picture steps involved in the path from a trans commit, through journaling, to eventual btree writeback?

Thanks for the feedback.. Brian

> So I think we'd need to add another sequence number to > bch_inode_info to track the sequence that was flushed, and then only > call bch2_journal_flush_seq() if flushed_seq != > journal_seq_of_last_update, and then to make this work correctly when an > inode has been evicted it might need to go in struct bch_inode, stored > in the btree, as well. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-27 14:50 ` Brian Foster @ 2023-01-30 17:06 ` Kent Overstreet 2023-01-31 16:04 ` Brian Foster 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-01-30 17:06 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > Something else that occurred to me while looking further at this is we > can also avoid the error in this case fairly easily by bailing out of > bch2_fsync() if page writeback fails, as opposed to the unconditional > flush -> sync meta -> flush log sequence that returns the first error > anyways. That would prevent marking the inode with a new sequence number > when I/Os are obviously failing. The caveat is that the test still > fails, now with a "Read-only file system" error instead of EIO, because > the filesystem is shutdown by the time the vfs write inode path actually > runs.

If some pages did write successfully we don't want to skip the rest of the fsync, though.

The test failing because the filesystem went read-only is because a btree node write fails - I'm assuming without that change the test just fails before any btree node writes are attempted.

> All in all ISTM the shutdown is the fundamental issue wrt to this test. > I'm not sure it's worth complicating the log flush error handling with > errseq and whatnot just to filter out certain errors when the fs has > seen a catastrophic failure/shutdown (also agree with your followup that > isn't the most critical bug in the world anyways).

Well, our default setting on serious error is to go read-only, and a btree node write failing (in a single device, non replicated filesystem) definitely qualifies.

It looks like our behaviour on btree node write error does need to be looked at and improved/fixed: btree_io.c line 1734 has a clearly unfinished case.
It looks like we're missing whatever logic should be there for a replicated filesystem - we definitely need to decide what to do when some but not all btree node writes fail.

More relevant to this test, we also have a setting for error behaviour - errors=(continue|ro|panic), but that explicitly does _not_ apply to write IO errors; if we can't write out metadata we always go read-only.

I'm not in a hurry to change that, because one of the things bcachefs currently does really well is stopping the world if things ever go so wrong that the filesystem would be inconsistent - we haven't had too much in the way of unrecoverable filesystems, and I don't want that to change :)

> Either way, I am trying to grok the commit/writeback/journaling path a > bit better here to try and understand why the fs shuts down in this test > whereas other fs' don't seem to. One thing I'm noticing is that this > whole path is rather asynchronous, making it a bit hard to follow. Any > tips on the best way to work through the big picture steps involved in > the path from a trans commit, through journaling, to eventual btree > writeback?

Transaction commit path is pretty much all straight line synchronous code - that's btree_update_leaf.c.

Basically, the transaction commit path takes a list of updates, and
- runs triggers (which will then generate more updates!)
- ensures all btree paths are fully traversed and have intent locks
- takes write locks on btree nodes
- gets a journal reservation
- does updates to leaf nodes/btree key cache (and soon the btree write buffer as well)

Btree node merging and splitting is also driven by the transaction commit path - before doing the update we check if we need to merge btree nodes (it has to be before the transaction commit in order to happen reliably), and btree node splitting is done in an error path - after taking write locks if we don't have space for the insert we bail out and do the split in bch2_trans_commit_error().
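[Editor's note: the "runs triggers (which will then generate more updates!)" step lends itself to a small toy: the commit machinery walks the update list while triggers append further updates, and newly appended updates get their triggers run too. All names here are invented for illustration; this is not the real btree_update_leaf.c logic.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UPDATES 16

struct toy_trans {
	int	updates[MAX_UPDATES];
	size_t	nr;
};

/* A toy "trigger": an update of kind 1 (think: an inode update)
 * generates one extra update of kind 2 (think: an accounting key);
 * other update kinds generate nothing. */
static void run_trigger(struct toy_trans *t, int update)
{
	if (update == 1 && t->nr < MAX_UPDATES)
		t->updates[t->nr++] = 2;
}

/* Walk the list until no updates remain untriggered; appended updates
 * are visited too, so trigger-generated updates also run triggers. */
static void run_triggers(struct toy_trans *t)
{
	size_t i = 0;

	while (i < t->nr)
		run_trigger(t, t->updates[i++]);
}
```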
When we update a leaf node or btree key cache key, we mark it as dirty and create a journal pin for the journal sequence number we got a journal reservation for. A journal pin has a callback for journal reclaim to invoke - typically, writeback for the btree node cache and btree key cache is driven entirely by journal reclaim, and the journal pin also keeps the relevant journal entry dirty so that it'll be replayed by journal replay.

(That writeback behaviour will be changing at some point: the behaviour up until recently was that journal reclaim monitored the dirty percentage of the btree node cache and key cache and would run to keep those numbers under some threshold, but it turns out that really hurts performance on purely random write benchmarks - we writeback too aggressively, and we need to allow those caches to be 100% dirty if there isn't any memory pressure).

The journalling code is pretty self contained w.r.t. the rest of bcachefs. It is indeed a big pile of tricky asynchronous code (the fast paths need to be lockless - right now I'm working on making it even more lockless - and there's a lot of pipelining it has to do). At a high level it's pretty simple though, and you should be able to treat it as a black box. The main operations are:

- getting a reservation: this gets you a reservation in a specific journal entry - you're supposed to memcpy() whatever you want journalled into the entry to be written and then quickly drop your reservation. Blocking with a reservation will block the entire journal - note that we get our journal reservation after taking btree write locks in the transaction commit path (but we do it in nonblocking mode; if we have to block on the journal we drop btree locks and do the waiting in bch2_trans_commit_error()).
- getting a pre-reservation: trying to get a journal reservation while holding a journal pin would deadlock the journal if it ever completely filled up - a pre-reservation allows reserving space in the journal so that we can get a reservation later without deadlocking. This is for e.g. flushing the btree key cache: when we do a btree key cache update we update the key cache in memory and journal the update, to flush the key cache we have to update the btree which requires (for simplicity) re-journalling the update.

- flush_seq: given a journal sequence number we previously got a journal reservation on, request that it be written out (with a flush/fua write, not all journal entries are written flush/fua).

Then there's btree node writes, which it sounds like you might be getting into soon. Btree node writes are more complicated: btree nodes are log structured, so we have to differentiate between the first write to a new node and subsequent appending writes, and before returning a completion for an appending btree node write we have to update the pointer to that btree node (btree node pointers record the number of sectors currently written; this is important for multi device filesystems, and also for ensuring consistent ordering of updates after a crash - updates can't be visible if they weren't completed in the journal, so we have to ignore btree node writes that are newer than the newest completed journal write.)

Originally this was handled with the journal sequence number blacklist mechanism - so you might see a few references to that still in btree_io.c - but now it's handled by updating parent pointers after every write, up to the root.

Hope that helps - and feel free to ask me more questions on IRC, irc.oftc.net#bcache.

Cheers, Kent

^ permalink raw reply [flat|nested] 24+ messages in thread
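[Editor's note: the journal-pin lifecycle described in this mail -- pins keeping journal entries dirty until the corresponding btree writeback completes, with reclaim advancing past fully unpinned entries -- can be modeled as a small refcounted window. All names are invented for illustration; the real machinery lives in bcachefs's journal reclaim code.]

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_SPAN 8	/* toy: journal holds at most 8 in-flight seqs */

struct pin_journal {
	uint64_t front;			/* oldest still-dirty seq */
	uint64_t seq;			/* current (newest) seq */
	int	 pins[SEQ_SPAN];	/* pin count per in-flight seq */
};

/* taken when a dirty btree node / key cache key references this seq */
static void toy_pin(struct pin_journal *j, uint64_t seq)
{
	j->pins[seq % SEQ_SPAN]++;
}

/* dropped when e.g. the btree node write for that update completes */
static void toy_unpin(struct pin_journal *j, uint64_t seq)
{
	j->pins[seq % SEQ_SPAN]--;
}

/* reclaim: the journal front can only advance past entries with no
 * remaining pins; pinned entries stay dirty and would be replayed */
static void toy_reclaim(struct pin_journal *j)
{
	while (j->front < j->seq && !j->pins[j->front % SEQ_SPAN])
		j->front++;
}
```

This also answers the ordering question raised later in the thread: what matters is not the relative order of journal and btree writes, but that an entry stays pinned (and hence replayable) until its btree writeback is done.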
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-30 17:06 ` Kent Overstreet @ 2023-01-31 16:04 ` Brian Foster 2023-02-01 14:34 ` Kent Overstreet 2023-02-02 22:56 ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet 0 siblings, 2 replies; 24+ messages in thread From: Brian Foster @ 2023-01-31 16:04 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > Something else that occurred to me while looking further at this is we > > can also avoid the error in this case fairly easily by bailing out of > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > flush -> sync meta -> flush log sequence that returns the first error > > anyways. That would prevent marking the inode with a new sequence number > > when I/Os are obviously failing. The caveat is that the test still > > fails, now with a "Read-only file system" error instead of EIO, because > > the filesystem is shutdown by the time the vfs write inode path actually > > runs. > > If some pages did write successfully we don't want to skip the rest of > the fsync, though. > What does it matter if the fsync() has already failed? ISTM this is pretty standard error handling behavior across major fs', but it's not clear to me if there's some bcachefs specific quirk that warrants different handling.. FWIW, I think I've been able to fix this test with a couple small tweaks: 1. Change bch2_fsync() to return on first error. 2. Introduce a bcachefs specific delay in the working -> error table switchover. I've not run a full regression test on the bcachefs change yet, but if you're willing to take a patch along those lines I can do that and then if it survives, follow up with an upstream fstests patch. That would allow this test to pass fairly reliably and also no longer unconditionally shutdown bcachefs. 
> The test failing because the filesystem went read-only is because a > btree node write fails - I'm assuming without that change the test just > fails before any btree node writes are attempted.

Not sure I parse the above. Generally speaking, the test fails because after the last successful fsync, it immediately switches over to the error dm table. The fs still has some reclaim/journal work to do after this point, however, so this essentially always results in emergency shutdown. Whether the test passes or fails at this point depends on a race between this shutdown sequence in the background and an fsync call successfully committing a transaction (calling bch2_mark_inode()). The winner of the race dictates whether the first fsync() after switching back to the working dm table can avoid the journal flush because the seq associated with the inode is already on disk, or whether the fs wants to flush the seq on a filesystem that has been shutdown (and so will always fail).

> > All in all ISTM the shutdown is the fundamental issue wrt to this test. > > I'm not sure it's worth complicating the log flush error handling with > > errseq and whatnot just to filter out certain errors when the fs has > > seen a catastrophic failure/shutdown (also agree with your followup that > > isn't the most critical bug in the world anyways). > > Well, our default setting on serious error is to go read-only, an a > btree node write failing (in a single device, non replicated filesystem) > definitely qualifies. > > It looks like our behaviour on btree node write error does need to be > looked at and improved/fixed: btree_io.c line 1734 has a clearly > unfinished case. It looks like we're missing whatever logic should be > there for a replicated filesystem - we definitely need to decide what to > do when some but not all btree node writes fail.

Interesting. I noticed that hunk and thought it looked odd, but didn't have enough context to grok it.
> More relevant to this test, we also a setting for error behaviour - > errors=(continue|ro|panic), but that explicitly does _not_ apply to > write IO errors; if we can't write out metadata we always go read-only. > > I'm not in a hurry to change that, because one of the things bcachefs > currently does really well is stopping the world if things ever go so > wrong that the filesystem would be inconsistent - we haven't had too > much in the way of unrecoverable filesystems, and I don't want that to > change :) > Indeed, that makes sense. I wonder if it would make sense to issue retries at some point? As another XFS example, it also has an emergency shutdown mechanism for similar purposes. Certain I/O failures result in immediate shutdown of the fs and no further changes make it to disk. Because shutdown itself is a catastrophic failure, metadata write failures are retried at least once before initiating shutdown to filter out transient errors. On top of that, there's a sysfs section for each mount that allows an admin to configure error handling behavior as needed. This allows setting the number of retries before an error is considered permanent (including infinite retries for scenarios where errors are expected to be manually recoverable), or how long to delay retries. I'm not sure a simple retry once sequence would help with this test, but an infinite retry option (with some delay) probably would. OTOH, I wonder if a more elegant solution is to explicitly quiesce the fs before the working -> error table switchover. That might typically be done with a freeze/unfreeze cycle so the test can serialize against whatever background work needs to complete before putting the block device into error mode. However, I notice freeze isn't currently supported on bcachefs. Any thoughts on freeze support? 
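[Editor's note: the XFS-style retry configuration Brian describes above -- a bounded retry count, with -1 meaning "retry forever" -- reduces to a simple policy predicate. This is a generic sketch of that convention, not XFS or bcachefs code; the function name is invented.]

```c
#include <assert.h>

/* Decide whether a failed metadata write should be retried.
 * max_retries < 0 mirrors the "infinite retries" knob: the error is
 * never treated as permanent, so transient dm-error style failures
 * can be ridden out.  Otherwise the error becomes permanent (and the
 * fs would shut down) once the retry budget is exhausted. */
static int should_retry_metadata_write(int max_retries, int retries_so_far)
{
	if (max_retries < 0)
		return 1;	/* retry indefinitely */
	return retries_so_far < max_retries;
}
```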
> > Either way, I am trying to grok the commit/writeback/journaling path a > > bit better here to try and understand why the fs shuts down in this test > > whereas other fs' don't seem to. One thing I'm noticing is that this > > whole path is rather asynchronous, making it a bit hard to follow. Any > > tips on the best way to work through the big picture steps involved in > > the path from a trans commit, through journaling, to eventual btree > > writeback? > > Transaction commit path is pretty much all straight line synchronous > code - that's btree_update_leaf.c. > > Basically, the transaction commit path takes a list of updates, and > - runs triggers (which will then generate more updates!) > - ensures all btree paths are fully traversed and have intent locks > - takes write locks on btree nodes > - gets a journal reservation > - does updates to leaf nodes/btree key cache (and soon the btree write > buffer as well) > Yep, I was able to follow most of this with the exception of some confusion over the key cache bits. IIUC, the initial trans commit might add a key update to the key cache, whereas reclaim might commit another transaction to flush the key cache to the btree (via the journal callback you mention below). > Btree node merging and splitting is also driven by the transaction > commit path - before doing the update we check if we need to merge > btree nodes (it has to be before the transaction commit in order to > happen reliably), and btree node splitting is done in an error path - > after taking write locks if we don't have space for the insert we bail > out and to the split in bch2_trans_commit_error(). > > When we update a leaf node or btree key cache key, we mark it as dirty > and create a journal pin for the journal sequence number we got a > journal reservation for. 
A journal pin has a callback for journal > reclaim to invoke - typically, writeback for the btree node cache and > btree key cache is driven entirely by journal reclaim, and the journal > pin also keeps the relevant journal entry dirty so that it'll be > replayed by journal replay. > > (That writeback behaviour will be changing at some point: the behaviour > up until recently was that journal reclaim monitored the dirty > percentage of the btree node cache and key cache and would run to those > numbers under some threshold, but it turns out that really hurts > performance on purely random write benchmarks - we writeback too > aggressively, and we need to allow those caches to be 100% dirty if > there isn't any memory pressure). > > The journalling code is pretty self contained w.r.t. the rest of > bcachefs. It is indeed a big pile of tricky asynchronous code (the fast > paths need to be lockless - right now I'm working on making it even more > lockless - and there's a lot of pipelining it has to do). At a high > level it's pretty simple though, and you should be able to treat it as a > black box. The main operations are: > > - getting a reservation: this gets you a reservation in a specific > journal entry - you're supposed to memcpy() whatever you want > journalled into the entry to be written and then quickly drop your > reservation. > > Blocking with a reservation will block the entire journal - note that > we get our journal reservation after taking btree write locks in the > transaction commit path (but we do it in nonblocking mode; if we have > to block on the journal we drop btree locks and do the waiting in > bch2_trans_commit_error()). > > - getting a pre-reservation: trying to get a journal reservation while > holding a journal pin would deadlock the journal if it ever > completely filled up - a pre-reservation allows reserving space in > the journal so that we can get a reservation later without > deadlocking. This is for e.g. 
flushing the btree key cache: when we > do a btree key cache update we update the key cache in memory and > journal the update, to flush the key cache we have to update the > btree which requires (for simplicity) re-journalling the update. > Ah.. so does "for simplicity" here essentially mean "for reuse of the transaction code?" > - flush_seq: given a journal sequence number we previously got a > journal reservation on, request that it be written out (with a > flush/fua write, not all journal entries are written flush/fua). > Thanks for this. Something that initially confused me wrt to the journal is that I was seeing journal writes after btree node writes. The docs mention the journal is primarily an optimization and ordering is not important for crash consistency. With that in mind, does that mean a journal "pin" refers to entries held in the journal until written back to the btree (as opposed to the journal pinning keys in memory)? IOW, it doesn't really matter in which order that journal/btree writes occur, but only that entries remain pinned in the journal until btree writes complete..? (I suspect this confused me because XFS uses the same pinning terminology, but in that context it refers to metadata items being pinned by the log, until the log item is flushed and thus unpins the metadata item for final writeback. So ordering is a critical factor in this context.). > Then there's btree node writes, which it sounds like you might be > getting into soon. 
Btree node writes are more complicated: btree nodes > are log structured, so we have to differentiate between the first write > to a new node and subsequent appending writes, and before returning a > completion for an appending btree node write we have to update the > pointer to that btree node (btree node pointers record the number of > sectors currently written; this is important for multi device > filesystems, and also for ensuring consistent ordering of updates after > a crash - updates can't be visible if they weren't completed in the > journal, so we have to ignore btree node writes that are newer than the > newest completed journal write). > > Originally this was handled with the journal sequence number blacklist > mechanism - so you might see a few references to that still in > btree_io.c - but now it's handled by updating parent pointers after > every write, up to the root. > > Hope that helps - and feel free to ask me more questions on IRC, > irc.oftc.net#bcache. > #bcache or #bcachefs? This is all extremely helpful. Thanks again. Brian > Cheers, > Kent > ^ permalink raw reply [flat|nested] 24+ messages in thread
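[Editor's note: the reservation/pin/flush_seq lifecycle Kent describes above can be condensed into a toy model. This is illustrative Python, not actual bcachefs code; the class and method names are hypothetical stand-ins for the C machinery.]

```python
# Toy model of journal reservations, pins, and flush_seq (illustrative
# only; names are hypothetical, not actual bcachefs code).

class Journal:
    def __init__(self):
        self.seq = 0              # sequence number of the open entry
        self.flushed_seq = -1     # newest entry written with flush/fua
        self.pins = {}            # seq -> list of flush callbacks

    def reservation(self):
        """Reserve space in the current entry; the caller memcpy()s its
        keys in and quickly drops the reservation."""
        return self.seq

    def pin(self, seq, flush_fn):
        """Keep entry `seq` dirty until flush_fn writes the cached
        btree/key-cache state back and the pin is dropped."""
        self.pins.setdefault(seq, []).append(flush_fn)

    def flush_seq(self, seq):
        """Request that entries up to `seq` be written with flush/fua."""
        self.flushed_seq = max(self.flushed_seq, seq)
        return 0

    def reclaim(self):
        """Journal reclaim: run pin callbacks (e.g. key cache flush,
        which re-journals the update via a new reservation), then
        release the pinned entries."""
        for seq in sorted(self.pins):
            for fn in self.pins[seq]:
                fn()
        self.pins.clear()

j = Journal()
seq = j.reservation()
j.pin(seq, lambda: None)   # e.g. a key cache entry holding the pin
j.flush_seq(seq)           # fsync path: the entry is durable now...
assert seq in j.pins       # ...but stays pinned until reclaim writes back
j.reclaim()
assert not j.pins
```

The point the model captures is the answer to Brian's question below: flush_seq only makes the entry durable, while the pin keeps it live until the cached state has been written back to the btree, and the ordering of journal vs. btree writes is otherwise unconstrained.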
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-31 16:04 ` Brian Foster @ 2023-02-01 14:34 ` Kent Overstreet 2023-02-02 15:50 ` Brian Foster 2023-02-02 22:56 ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet 1 sibling, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-02-01 14:34 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > Something else that occurred to me while looking further at this is we > > > can also avoid the error in this case fairly easily by bailing out of > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > flush -> sync meta -> flush log sequence that returns the first error > > > anyways. That would prevent marking the inode with a new sequence number > > > when I/Os are obviously failing. The caveat is that the test still > > > fails, now with a "Read-only file system" error instead of EIO, because > > > the filesystem is shutdown by the time the vfs write inode path actually > > > runs. > > > > If some pages did write successfully we don't want to skip the rest of > > the fsync, though. > > > > What does it matter if the fsync() has already failed? ISTM this is > pretty standard error handling behavior across major fs', but it's not > clear to me if there's some bcachefs specific quirk that warrants > different handling.. > > FWIW, I think I've been able to fix this test with a couple small > tweaks: > > 1. Change bch2_fsync() to return on first error. > 2. Introduce a bcachefs specific delay in the working -> error table > switchover. > > I've not run a full regression test on the bcachefs change yet, but if > you're willing to take a patch along those lines I can do that and then > if it survives, follow up with an upstream fstests patch. 
That would > allow this test to pass fairly reliably and also no longer > unconditionally shut down bcachefs. If you link me your git repo I'll add it to the CI so all the automated tests run. > > > The test failing because the filesystem went read-only is because a > > btree node write fails - I'm assuming without that change the test just > > fails before any btree node writes are attempted. > > > > Not sure I parse the above. Generally speaking, the test fails because > after the last successful fsync, it immediately switches over to the > error dm table. The fs still has some reclaim/journal work to do after > this point, however, so this essentially always results in emergency > shutdown. Whether the test passes or fails at this point depends on a > race between this shutdown sequence in the background and an fsync call > successfully committing a transaction (calling bch2_mark_inode()). *nod* That's more accurate. > Indeed, that makes sense. I wonder if it would make sense to issue > retries at some point? Possibly. AFAIK, on typical block devices we shouldn't see write errors unless the device is really not working anymore - both spinning rust and SSDs will remap on write error, and on transport error IMO the block layer is the more correct place to be issuing retries, if so desired. But we do have aspirations of running on raw flash in the not-too-distant future (ZNS SSDs), and I would expect we'll need retry logic at that point. > I'm not sure a simple retry once sequence would help with this test, but > an infinite retry option (with some delay) probably would. I'd think we'd want a timeout, not a specific number of retries to attempt (or perhaps both). > OTOH, I wonder if a more elegant solution is to explicitly quiesce the > fs before the working -> error table switchover. 
That might typically be > done with a freeze/unfreeze cycle so the test can serialize against > whatever background work needs to complete before putting the block > device into error mode. However, I notice freeze isn't currently > supported on bcachefs. Any thoughts on freeze support? Freeze definitely needs to happen. It's been _ages_ since I was looking at it so I couldn't say offhand where we'd need to start, but if you're interested I'd be happy to look at what it'd take. > Ah.. so does "for simplicity" here essentially mean "for reuse of the > transaction code?" Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos() -> bch2_trans_commit(). > > > - flush_seq: given a journal sequence number we previously got a > > journal reservation on, request that it be written out (with a > > flush/fua write, not all journal entries are written flush/fua). > > > > Thanks for this. Something that initially confused me wrt to the journal > is that I was seeing journal writes after btree node writes. The docs > mention the journal is primarily an optimization and ordering is not > important for crash consistency. Wherever you read that must date to well before bcachefs was a thing - that was true in bcache when the journalling code was still new. In bcachefs ordering of updates is important and the journal is indeed our global time ordering. > With that in mind, does that mean a > journal "pin" refers to entries held in the journal until written back > to the btree (as opposed to the journal pinning keys in memory)? IOW, it > doesn't really matter in which order that journal/btree writes occur, > but only that entries remain pinned in the journal until btree writes > complete..? Correct. > (I suspect this confused me because XFS uses the same pinning > terminology, but in that context it refers to metadata items being > pinned by the log, until the log item is flushed and thus unpins the > metadata item for final writeback. 
So ordering is a critical factor in > this context.). I see - in XFS metadata items can't be written until the relevant log entry is written? Yeah we never did it that way in bcachefs, blocking writeback which already blocks memory reclaim sounded like a bad idea :) Originally, before we started updating pointers to btree nodes after every write, we'd note in every btree node write the newest journal sequence number it had updates for - then in recovery we'd blacklist the next N journal sequence numbers and if we ever saw a btree node entry for a blacklisted journal sequence number we just ignored it. > #bcache or #bachefs? #bcache :) ^ permalink raw reply [flat|nested] 24+ messages in thread
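[Editor's note: the transient pass/fail behavior discussed in this thread reduces to a small race, which can be modeled as follows. This is illustrative Python, not bcachefs code; the names are hypothetical, but the logic follows the thread's description of bch2_mark_inode() bumping the inode's journal sequence number and bch2_flush_inode() comparing it against the journal state.]

```python
import errno

# Toy model of the generic/441 race: whether fsync returns 0 or -EIO
# after switching back to the working table depends on whether the
# transaction commit that bumps the inode's journal seq wins the race
# against emergency shutdown. (Illustrative only, not bcachefs code.)

class FS:
    def __init__(self):
        self.shutdown = False
        self.err_seq = None       # first journal seq that failed to write
        self.journal_seq = 10
        self.flushed_seq = 10

    def commit_inode_update(self, inode):
        if self.shutdown:
            return -errno.EROFS   # commit fails; inode seq NOT updated
        self.journal_seq += 1
        inode['seq'] = self.journal_seq   # bch2_mark_inode()-style bump
        return 0

    def flush_inode(self, inode):
        if inode['seq'] <= self.flushed_seq:
            return 0              # nothing to flush
        if self.err_seq is not None and inode['seq'] >= self.err_seq:
            return -errno.EIO     # entry can never reach disk
        self.flushed_seq = inode['seq']
        return 0

def fsync_after_reopen(commit_wins_race):
    fs, inode = FS(), {'seq': 10}
    if commit_wins_race:
        fs.commit_inode_update(inode)   # succeeds just before shutdown
    fs.shutdown = True
    fs.err_seq = fs.flushed_seq + 1     # journal reclaim hit the error table
    fs.commit_inode_update(inode)       # -EROFS either way at this point
    return fs.flush_inode(inode)

assert fsync_after_reopen(False) == 0            # test passes: nothing to flush
assert fsync_after_reopen(True) == -errno.EIO    # test fails: spurious EIO
```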
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-01 14:34 ` Kent Overstreet @ 2023-02-02 15:50 ` Brian Foster 2023-02-02 17:09 ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-02-02 15:50 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Wed, Feb 01, 2023 at 09:34:50AM -0500, Kent Overstreet wrote: > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > > Something else that occurred to me while looking further at this is we > > > > can also avoid the error in this case fairly easily by bailing out of > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > > flush -> sync meta -> flush log sequence that returns the first error > > > > anyways. That would prevent marking the inode with a new sequence number > > > > when I/Os are obviously failing. The caveat is that the test still > > > > fails, now with a "Read-only file system" error instead of EIO, because > > > > the filesystem is shutdown by the time the vfs write inode path actually > > > > runs. > > > > > > If some pages did write successfully we don't want to skip the rest of > > > the fsync, though. > > > > > > > What does it matter if the fsync() has already failed? ISTM this is > > pretty standard error handling behavior across major fs', but it's not > > clear to me if there's some bcachefs specific quirk that warrants > > different handling.. > > > > FWIW, I think I've been able to fix this test with a couple small > > tweaks: > > > > 1. Change bch2_fsync() to return on first error. > > 2. Introduce a bcachefs specific delay in the working -> error table > > switchover. 
> > > > I've not run a full regression test on the bcachefs change yet, but if > > you're willing to take a patch along those lines I can do that and then > > if it survives, follow up with an upstream fstests patch. That would > > allow this test to pass fairly reliably and also no longer > > unconditionally shut down bcachefs. > > If you link me your git repo I'll add it to the CI so all the automated > tests run. > I don't have a public repo atm but I've posted the patch if you have somewhere to land it for CI testing..? It survived my regression tests so far, FWIW. (I also had posted that random cleanup patch a bit ago if you hadn't noticed..). Is there a reporting dashboard or something available for the test infrastructure for bcachefs? > > > > > The test failing because the filesystem went read-only is because a > > > btree node write fails - I'm assuming without that change the test just > > > fails before any btree node writes are attempted. > > > > > > > Not sure I parse the above. Generally speaking, the test fails because > > after the last successful fsync, it immediately switches over to the > > error dm table. The fs still has some reclaim/journal work to do after > > this point, however, so this essentially always results in emergency > > shutdown. Whether the test passes or fails at this point depends on a > > race between this shutdown sequence in the background and an fsync call > > successfully committing a transaction (calling bch2_mark_inode()). > > *nod* That's more accurate. > > > Indeed, that makes sense. I wonder if it would make sense to issue > > retries at some point? > > Possibly. AFAIK, on typical block devices we shouldn't see write errors > unless the device is really not working anymore - both spinning rust and > SSDs will remap on write error, and on transport error IMO the block > layer is the more correct place to be issuing retries, if so desired. 
> > But we do have aspirations of running on raw flash in the > not-too-distant future (ZNS SSDs), and I would expect we'll need retry > logic at that point. > > > I'm not sure a simple retry once sequence would help with this test, but > > an infinite retry option (with some delay) probably would. > > I'd think we'd want a timeout, not a specific number of retries to > attempt (or perhaps both). > > > OTOH, I wonder if a more elegant solution is to explicitly quiesce the > > fs before the working -> error table switchover. That might typically be > > done with a freeze/unfreeze cycle so the test can serialize against > > whatever background work needs to complete before putting the block > > device into error mode. However, I notice freeze isn't currently > > supported on bcachefs. Any thoughts on freeze support? > > Freeze definitely needs to happen. It's been _ages_ since I was looking > at it so I couldn't say offhand where we'd need to start, but if you're > interested I'd be happy to look at what it'd take. > Yeah, that would be interesting. Thanks. > > Ah.. so does "for simplicity" here essentially mean "for reuse of the > > transaction code?" > > Yes - bch2_btree_key_cache_journal_flush() -> btree_key_cache_flush_pos() > -> bch2_trans_commit(). > > > > > > - flush_seq: given a journal sequence number we previously got a > > > journal reservation on, request that it be written out (with a > > > flush/fua write, not all journal entries are written flush/fua). > > > > > > > Thanks for this. Something that initially confused me wrt to the journal > > is that I was seeing journal writes after btree node writes. The docs > > mention the journal is primarily an optimization and ordering is not > > important for crash consistency. > > Wherever you read that must date to well before bcachefs was a thing - > that was true in bcache when the journalling code was still new. 
In > bcachefs ordering of updates is important and the journal is indeed our > global time ordering. > FWIW, I was reading through the "Journaling" and "Sequential consistency" sections of the architecture documentation [1]. It's possible I just misinterpreted some wording. [1] https://bcachefs.org/Architecture/#Journaling > > With that in mind, does that mean a > > journal "pin" refers to entries held in the journal until written back > > to the btree (as opposed to the journal pinning keys in memory)? IOW, it > > doesn't really matter in which order that journal/btree writes occur, > > but only that entries remain pinned in the journal until btree writes > > complete..? > > Correct. > Ok, I think I get the general idea then. I'll have to dig more into it at some point. > > (I suspect this confused me because XFS uses the same pinning > > terminology, but in that context it refers to metadata items being > > pinned by the log, until the log item is flushed and thus unpins the > > metadata item for final writeback. So ordering is a critical factor in > > this context.). > > I see - in XFS metadata items can't be written until the relevant log > entry is written? Yeah we never did it that way in bcachefs, blocking > writeback which already blocks memory reclaim sounded like a bad idea :) > This is all a completely different design and architecture of course. XFS has its own buffer cache for metadata, with aging heuristics, and shrinker hooks and whatnot with feedback into log and metadata flushing to manage all this and keep things moving along. I was mainly just trying to wrap my head around the disk write ordering rules for bcachefs journaling wrt to fs consistency, using "traditional logging" (i.e. XFS) as a point of reference. 
Brian > Originally, before we started updating pointers to btree nodes after > every write, we'd note in every btree node write the newest journal > sequence number it had updates for - then in recovery we'd blacklist the > next N journal sequence numbers and if we ever saw a btree node entry > for a blacklisted journal sequence number we just ignored it. > > > #bcache or #bachefs? > > #bcache :) > ^ permalink raw reply [flat|nested] 24+ messages in thread
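[Editor's note: the retry policy floated above (a timeout rather than, or in addition to, a fixed retry count) might be sketched like this. The helper is hypothetical, not existing bcachefs code.]

```python
import time
import errno

# Sketch of a write-retry policy: retry failed I/O with a delay,
# bounded by both a total timeout and a maximum attempt count.
# (Hypothetical helper, not existing bcachefs code.)

def submit_with_retry(submit, timeout_s=30.0, max_attempts=10, delay_s=0.1):
    deadline = time.monotonic() + timeout_s
    err = -errno.EIO
    for _attempt in range(max_attempts):
        err = submit()
        if err == 0:
            return 0
        if time.monotonic() + delay_s > deadline:
            break                 # sleeping again would blow the timeout
        time.sleep(delay_s)
    return err

# A device that fails twice and then recovers:
results = iter([-errno.EIO, -errno.EIO, 0])
assert submit_with_retry(lambda: next(results), delay_s=0.0) == 0
```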
* Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-02 15:50 ` Brian Foster @ 2023-02-02 17:09 ` Kent Overstreet 2023-02-02 20:04 ` Brian Foster 2023-02-03 0:51 ` Dave Chinner 0 siblings, 2 replies; 24+ messages in thread From: Kent Overstreet @ 2023-02-02 17:09 UTC (permalink / raw) To: Brian Foster, dchinner; +Cc: linux-bcachefs On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote: > I don't have a public repo atm but I've posted the patch if you have > somewhere to land it for CI testing..? It survived my regression tests > so far, FWIW. (I also had posted that random cleanup patch a bit ago if > you hadn't noticed..). Must have missed it, sorry. I can host a git repo for you on my server, or github works fine - I generally prefer git repo links, git am is always a bit of a hassle. > Is there a reporting dashboard or something available for the test > infrastruture for bcachefs? https://evilpiepirate.org/~testdashboard/ci I've got a small server farm that watches git branches and runs the entire test suite on every commit starting from the recent - once you've got a git branch up I'll add yours to the list it watches. You'll probably want to get acquainted with ktest, it's what both the CI uses for running tests, and what we use for local development: https://evilpiepirate.org/git/ktest.git > > Freeze definitely needs to happen. It's been _ages_ since I was looking > > at it so I couldn't say offhand where we'd need to start, but if you're > > interested I'd be happy to look at what it'd take. > > > > Yeah, that would be interesting. Thanks. Maybe we could get Dave to give us a brief rundown of freezing? It's been ages since I was thinking about that and it's all fallen out of my brain, but Dave was the one who was able to explain it to me before :) I think getting freezing done would up our test coverage by a good bit, so it's worth prioritizing. 
> FWIW, I was reading through the "Journaling" and "Sequential > consistency" sections of the architecture documentation [1]. It's > possible I just misinterpreted some wording. > > [1] https://bcachefs.org/Architecture/#Journaling *nod* Thanks for the link, that's old but not ancient - that's from when the journal seq blacklist mechanism was how we did sequential consistency. Whenever I find the time to work on documentation again, I want to move more stuff (including that sort of high level developer documentation) to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf Right now it's more end user documentation, but last I was working on that I was starting to add documentation centered around on disk data structures - i.e. btrees and key types, and it seemed to be going in a useful direction. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-02 17:09 ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet @ 2023-02-02 20:04 ` Brian Foster 2023-02-02 22:39 ` Kent Overstreet 2023-02-03 0:51 ` Dave Chinner 1 sibling, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-02-02 20:04 UTC (permalink / raw) To: Kent Overstreet; +Cc: dchinner, linux-bcachefs On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote: > On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote: > > I don't have a public repo atm but I've posted the patch if you have > > somewhere to land it for CI testing..? It survived my regression tests > > so far, FWIW. (I also had posted that random cleanup patch a bit ago if > > you hadn't noticed..). > > Must have missed it, sorry. I can host a git repo for you on my server, > or github works fine - I generally prefer git repo links, git am is > always a bit of a hassle. > Sounds good. > > Is there a reporting dashboard or something available for the test > > infrastruture for bcachefs? > > https://evilpiepirate.org/~testdashboard/ci > > I've got a small server farm that watches git branches and runs the > entire test suite on every commit starting from the recent - once you've > got a git branch up I'll add yours to the list it watches. > Cool. I see that it reproduces the generic/441 failure. I need to use GitLab for other things so I managed to create a project cloning your repo there and pushed a testing branch with both of those commits. I think it should be accessible here: https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing ... but let me know if it doesn't work. > You'll probably want to get acquainted with ktest, it's what both the CI > uses for running tests, and what we use for local development: > > https://evilpiepirate.org/git/ktest.git > I'll check it out, thanks! Brian > > > Freeze definitely needs to happen. 
It's been _ages_ since I was looking > > > at it so I couldn't say offhand where we'd need to start, but if you're > > > interested I'd be happy to look at what it'd take. > > > > > > > Yeah, that would be interesting. Thanks. > > Maybe we could get Dave to give us a brief rundown of freezing? It's > been ages since I was thinking about that and it's all fallen out of my > brain, but Dave was the one who was able to explain it to me before :) > > I think getting freezing done would up our test coverage by a good bit, > so it's worth prioritizing. > > > FWIW, I was reading through the "Journaling" and "Sequential > > consistency" sections of the architecture documentation [1]. It's > > possible I just misinterpreted some wording. > > > > [1] https://bcachefs.org/Architecture/#Journaling > > *nod* Thanks for the link, that's old but not ancient - that's from when > the journal seq blacklist mechanism was how we did sequential > consistency. > > Whenever I find the time to work on documentation again, I want to move > more stuff (including that sort of high level developer documentation) > to the principles of operation: https://bcachefs.org/bcachefs-principles-of-operation.pdf > > Right now it's more end user documentation, but last I was working on > that I was starting to add documentation centered around on disk data > structures - i.e. btrees and key types, and it seemed to be going in a > useful direction. > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-02 20:04 ` Brian Foster @ 2023-02-02 22:39 ` Kent Overstreet 0 siblings, 0 replies; 24+ messages in thread From: Kent Overstreet @ 2023-02-02 22:39 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Thu, Feb 02, 2023 at 03:04:22PM -0500, Brian Foster wrote: > I need to use GitLab for other things so I managed to create a project > cloning your repo there and pushed a testing branch with both of those > commits. I think it should be accessible here: > > https://gitlab.com/bfoster2/bcachefs.git bf-bcachefs-testing You've been added: https://evilpiepirate.org/~testdashboard/ci?branch=bfoster If results don't show up right away it's because I just rewrote the CI's configuration file handling - generally only takes ~20 minutes though. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-02 17:09 ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet 2023-02-02 20:04 ` Brian Foster @ 2023-02-03 0:51 ` Dave Chinner 2023-02-04 0:35 ` Kent Overstreet 1 sibling, 1 reply; 24+ messages in thread From: Dave Chinner @ 2023-02-03 0:51 UTC (permalink / raw) To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs On Thu, Feb 02, 2023 at 12:09:20PM -0500, Kent Overstreet wrote: > On Thu, Feb 02, 2023 at 10:50:23AM -0500, Brian Foster wrote: > > I don't have a public repo atm but I've posted the patch if you have > > somewhere to land it for CI testing..? It survived my regression tests > > so far, FWIW. (I also had posted that random cleanup patch a bit ago if > > you hadn't noticed..). > > Must have missed it, sorry. I can host a git repo for you on my server, > or github works fine - I generally prefer git repo links, git am is > always a bit of a hassle. > > > Is there a reporting dashboard or something available for the test > > infrastruture for bcachefs? > > https://evilpiepirate.org/~testdashboard/ci > > I've got a small server farm that watches git branches and runs the > entire test suite on every commit starting from the recent - once you've > got a git branch up I'll add yours to the list it watches. > > You'll probably want to get acquainted with ktest, it's what both the CI > uses for running tests, and what we use for local development: > > https://evilpiepirate.org/git/ktest.git > > > > Freeze definitely needs to happen. It's been _ages_ since I was looking > > > at it so I couldn't say offhand where we'd need to start, but if you're > > > interested I'd be happy to look at what it'd take. > > > > > > > Yeah, that would be interesting. Thanks. > > Maybe we could get Dave to give us a brief rundown of freezing? 
It's > been ages since I was thinking about that and it's all fallen out of my > brain, but Dave was the one who was able to explain it to me before :) What do you need to know? The vast majority of the freeze infrastructure is generic and the filesystem doesn't need to do anything special. The only thing it needs to implement is ->freeze_fs to essentially quiesce the filesystem - by this stage all the data has been written back and all user-driven operations have either been stalled or drained at either the VFS or transaction reservation points. This requires the filesystem transaction start point to call sb_start_intwrite() in a location the transaction start can block safely forever, and to call sb_end_intwrite() when the transaction is complete and being torn down. [Note that bcachefs might also require sb_{start/end}_{write/pagefault} calls in paths that it has custom handlers for and to protect against ioctl operations triggering modifications during freezes.] This allows freeze_super() to set a barrier to prevent new transactions from starting, and to wait on transactions in flight to drain. Once all transactions have drained, it will then call ->freeze_fs if it is defined so the filesystem can flush its journal and dirty in-memory metadata so that it becomes consistent on disk without requiring journal recovery to be run. This basically means that once ->freeze_fs completes, the filesystem should be in the same state on-disk as if it were unmounted cleanly, and the fs will not issue any more IO until the filesystem is thawed. Thaw will call ->unfreeze_fs if defined before unblocking tasks so that the filesystem can restart things that may be needed for normal operation that were stopped during the freeze. It's not all that complex anymore - a few hooks to enable modification barriers to be placed and running the writeback part of unmount in ->freeze_fs is the main component of the work that needs to be done.... Cheers, Dave. 
-- Dave Chinner dchinner@redhat.com ^ permalink raw reply [flat|nested] 24+ messages in thread
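[Editor's note: Dave's freeze rundown above can be condensed into a toy model. In the kernel the barrier is the real sb_start_intwrite()/sb_end_intwrite() pair and the quiesce step is the ->freeze_fs hook; here both are modeled in illustrative Python with a plain counter and flags, not actual kernel code.]

```python
import errno

# Toy model of the freeze flow: a VFS-level barrier blocks new
# transactions and drains in-flight ones before ->freeze_fs quiesces
# the journal. (Illustrative only; the real barrier blocks rather
# than returning an error.)

class FreezeModel:
    def __init__(self):
        self.frozen = False
        self.in_flight = 0        # transactions between start/end_intwrite
        self.journal_quiesced = False

    def trans_start(self):        # models sb_start_intwrite()
        if self.frozen:
            return -errno.EBUSY   # stand-in for "blocks until thaw"
        self.in_flight += 1
        return 0

    def trans_end(self):          # models sb_end_intwrite()
        self.in_flight -= 1

    def freeze(self):             # models freeze_super() -> ->freeze_fs
        self.frozen = True        # barrier: no new transactions
        assert self.in_flight == 0, "in-flight transactions must drain first"
        self.journal_quiesced = True   # flush journal + dirty metadata
        return 0

    def thaw(self):               # models thaw -> ->unfreeze_fs
        self.journal_quiesced = False
        self.frozen = False

m = FreezeModel()
assert m.trans_start() == 0
m.trans_end()
m.freeze()
assert m.trans_start() == -errno.EBUSY   # writers blocked while frozen
m.thaw()
assert m.trans_start() == 0
```

After freeze() the model is in the "clean unmount" state Dave describes: no transactions running, journal quiesced, no further IO until thaw.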
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-03 0:51 ` Dave Chinner @ 2023-02-04 0:35 ` Kent Overstreet 2023-02-07 0:03 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-02-04 0:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-bcachefs On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote: > What do you need to know? The vast majority of the freeze > infrastructure is generic and the filesystem doesn't need to do > anything special. The only thing it needs to implement is > ->freeze_fs to essentially quiesce the filesystem - by this stage > all the data has been written back and all user-driven operations > have either been stalled or drained at either the VFS or transaction > reservation points. > > This requires the filesystem transaction start point to call > sb_start_intwrite() in a location the transaction start can block > safely forever, and to call sb_end_intwrite() when the transaction > is complete and being torn down. > > [Note that bcachefs might also require > sb_{start/end}_{write/pagefault} calls in paths that it has custom > handlers for and to protect against ioctl operations triggering > modifications during freezes.] > > This allows freeze_super() to set a barrier to prevent new > transactions from starting, and to wait on transactions in flight to > drain. Once all transactions have drained, it will then call > ->freeze_fs if it is defined so the filesystem can flush it's > journal and dirty in-memory metadata so that it becomes consistent > on disk without requiring journal recovery to be run. > > This basically means that once ->fs_freeze completes, the filesystem > should be in the same state on-disk as if it were unmounted cleanly, > and the fs will not issue any more IO until the filesystem is > thawed. 
Thaw will call ->unfreeze_fs if defined before unblocking > tasks so that the filesystem can restart things that may be needed > for normal operation that were stopped during the freeze. > > It's not all that complex anymore - a few hooks to enable > modification barriers to be placed and running the > writeback part of unmount in ->freeze_fs is the main component > of the work that needs to be done.... Thanks, that is a lot simpler than I was thinking - I guess I was thinking about task freezing for suspend, that definitely had some tricky bits. Sounds like treating it as if we were remounting read-only is all we need to do. I'll get on this when I get done with backpointers, unless someone else is willing to work on it :) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-04 0:35 ` Kent Overstreet @ 2023-02-07 0:03 ` Dave Chinner 2023-02-16 20:04 ` Eric Wheeler 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2023-02-07 0:03 UTC (permalink / raw) To: Kent Overstreet; +Cc: Brian Foster, linux-bcachefs On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote: > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote: > > What do you need to know? The vast majority of the freeze > > infrastructure is generic and the filesystem doesn't need to do > > anything special. The only thing it needs to implement is > > ->freeze_fs to essentially quiesce the filesystem - by this stage > > all the data has been written back and all user-driven operations > > have either been stalled or drained at either the VFS or transaction > > reservation points. > > > > This requires the filesystem transaction start point to call > > sb_start_intwrite() in a location the transaction start can block > > safely forever, and to call sb_end_intwrite() when the transaction > > is complete and being torn down. > > > > [Note that bcachefs might also require > > sb_{start/end}_{write/pagefault} calls in paths that it has custom > > handlers for and to protect against ioctl operations triggering > > modifications during freezes.] > > > > This allows freeze_super() to set a barrier to prevent new > > transactions from starting, and to wait on transactions in flight to > > drain. Once all transactions have drained, it will then call > > ->freeze_fs if it is defined so the filesystem can flush it's > > journal and dirty in-memory metadata so that it becomes consistent > > on disk without requiring journal recovery to be run. > > > > This basically means that once ->fs_freeze completes, the filesystem > > should be in the same state on-disk as if it were unmounted cleanly, > > and the fs will not issue any more IO until the filesystem is > > thawed. 
Thaw will call ->unfreeze_fs if defined before unblocking > > tasks so that the filesystem can restart things that may be needed > > for normal operation that were stopped during the freeze. > > > > It's not all that complex anymore - a few hooks to enable > > modification barriers to be placed and running the > > writeback part of unmount in ->freeze_fs is the main component > > of the work that needs to be done.... > > Thanks, that is a lot simpler than I was thinking - I guess I was > thinking about task freezing for suspend, that definitely had some > tricky bits. "freezing for suspend" is a mess - it should just be calling freeze_super() and letting the filesystem take care of everything to do with freezing the filesystem. The whole idea that we can suspend a filesystem safely by running sync() and stopping kernel threads and workqueues from running is .... broken. > Sounds like treating it as if we were remounting read-only > is all we need to do. Yup, pretty much. XFS shares all the log quiescing code between ->freeze_fs and remount_ro. The only difference is that remount_ro has to do all the work to write dirty data back to disk before it quiesces the log.... Cheers, Dave. -- Dave Chinner dchinner@redhat.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-07 0:03 ` Dave Chinner @ 2023-02-16 20:04 ` Eric Wheeler 2023-02-20 22:19 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Eric Wheeler @ 2023-02-16 20:04 UTC (permalink / raw) To: Dave Chinner; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs On Tue, 7 Feb 2023, Dave Chinner wrote: > On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote: > > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote: > > > What do you need to know? The vast majority of the freeze > > > infrastructure is generic and the filesystem doesn't need to do > > > anything special. The only thing it needs to implement is > > > ->freeze_fs to essentially quiesce the filesystem - by this stage > > > all the data has been written back and all user-driven operations > > > have either been stalled or drained at either the VFS or transaction > > > reservation points. > > > > > > This requires the filesystem transaction start point to call > > > sb_start_intwrite() in a location the transaction start can block > > > safely forever, and to call sb_end_intwrite() when the transaction > > > is complete and being torn down. > > > > > > [Note that bcachefs might also require > > > sb_{start/end}_{write/pagefault} calls in paths that it has custom > > > handlers for and to protect against ioctl operations triggering > > > modifications during freezes.] > > > > > > This allows freeze_super() to set a barrier to prevent new > > > transactions from starting, and to wait on transactions in flight to > > > drain. Once all transactions have drained, it will then call > > > ->freeze_fs if it is defined so the filesystem can flush its > > > journal and dirty in-memory metadata so that it becomes consistent > > > on disk without requiring journal recovery to be run. 
> > > > > > This basically means that once ->freeze_fs completes, the filesystem > > > should be in the same state on-disk as if it were unmounted cleanly, > > > and the fs will not issue any more IO until the filesystem is > > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking > > > tasks so that the filesystem can restart things that may be needed > > > for normal operation that were stopped during the freeze. > > > > > > It's not all that complex anymore - a few hooks to enable > > > modification barriers to be placed and running the > > > writeback part of unmount in ->freeze_fs is the main component > > > of the work that needs to be done.... > > > > Thanks, that is a lot simpler than I was thinking - I guess I was > > thinking about task freezing for suspend, that definitely had some > > tricky bits. > > "freezing for suspend" is a mess - it should just be calling > freeze_super() and letting the filesystem take care of everything to > do with freezing the filesystem. The whole idea that we can suspend > a filesystem safely by running sync() and stopping kernel threads and > workqueues from running is .... broken. > > > Sounds like treating it as if we were remounting read-only > > > is all we need to do. > > > > Yup, pretty much. XFS shares all the log quiescing code between > > ->freeze_fs and remount_ro. The only difference is that remount_ro > > has to do all the work to write dirty data back to disk before > > it quiesces the log.... Wait, so are you saying that XFS does not commit dirty buffers for sleeping, only on remount_ro? In an ideal case I suppose the laptop ram is still hot... But sometimes (ahem, far too often) I close my laptop and forget about it, in which case the battery dies and of course then any dirty pages are lost. IMHO sleep should always be crash-safe. -- Eric Wheeler > Cheers, > > Dave. > -- > Dave Chinner > dchinner@redhat.com > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-16 20:04 ` Eric Wheeler @ 2023-02-20 22:19 ` Dave Chinner 2023-02-20 23:23 ` Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Dave Chinner @ 2023-02-20 22:19 UTC (permalink / raw) To: Eric Wheeler; +Cc: Kent Overstreet, Brian Foster, linux-bcachefs On Thu, Feb 16, 2023 at 12:04:34PM -0800, Eric Wheeler wrote: > On Tue, 7 Feb 2023, Dave Chinner wrote: > > On Fri, Feb 03, 2023 at 07:35:15PM -0500, Kent Overstreet wrote: > > > On Fri, Feb 03, 2023 at 11:51:12AM +1100, Dave Chinner wrote: > > > > What do you need to know? The vast majority of the freeze > > > > infrastructure is generic and the filesystem doesn't need to do > > > > anything special. The only thing it needs to implement is > > > > ->freeze_fs to essentially quiesce the filesystem - by this stage > > > > all the data has been written back and all user-driven operations > > > > have either been stalled or drained at either the VFS or transaction > > > > reservation points. > > > > > > > > This requires the filesystem transaction start point to call > > > > sb_start_intwrite() in a location the transaction start can block > > > > safely forever, and to call sb_end_intwrite() when the transaction > > > > is complete and being torn down. > > > > > > > > [Note that bcachefs might also require > > > > sb_{start/end}_{write/pagefault} calls in paths that it has custom > > > > handlers for and to protect against ioctl operations triggering > > > > modifications during freezes.] > > > > > > > > This allows freeze_super() to set a barrier to prevent new > > > > transactions from starting, and to wait on transactions in flight to > > > > drain. Once all transactions have drained, it will then call > > > > ->freeze_fs if it is defined so the filesystem can flush its > > > > journal and dirty in-memory metadata so that it becomes consistent > > > > on disk without requiring journal recovery to be run. 
> > > > > > > > This basically means that once ->freeze_fs completes, the filesystem > > > > should be in the same state on-disk as if it were unmounted cleanly, > > > > and the fs will not issue any more IO until the filesystem is > > > > thawed. Thaw will call ->unfreeze_fs if defined before unblocking > > > > tasks so that the filesystem can restart things that may be needed > > > > for normal operation that were stopped during the freeze. > > > > > > > > It's not all that complex anymore - a few hooks to enable > > > > modification barriers to be placed and running the > > > > writeback part of unmount in ->freeze_fs is the main component > > > > of the work that needs to be done.... > > > > > > Thanks, that is a lot simpler than I was thinking - I guess I was > > > thinking about task freezing for suspend, that definitely had some > > > tricky bits. > > > > "freezing for suspend" is a mess - it should just be calling > > freeze_super() and letting the filesystem take care of everything to > > do with freezing the filesystem. The whole idea that we can suspend > > a filesystem safely by running sync() and stopping kernel threads and > > workqueues from running is .... broken. > > > > > Sounds like treating it as if we were remounting read-only > > > > is all we need to do. > > > > Yup, pretty much. XFS shares all the log quiescing code between > > ->freeze_fs and remount_ro. The only difference is that remount_ro > > has to do all the work to write dirty data back to disk before > > it quiesces the log.... > > Wait, so are you saying that XFS does not commit dirty buffers for > sleeping, only on remount_ro? Yup. But this is not unique to XFS - every journalling filesystem (ext3, ext4, jfs, etc) has exactly the same problem: sync_filesystem() only guarantees that the filesystem is consistent on disk, not that it is clean in memory. 
And by "consistent on disk", that means all dirty metadata has been written -to the journal- so that if a crash occurs immediately afterwards, journal recovery on the next mount will ensure that the filesystem is consistent. IOWs, after sync_filesystem(), the filesystem is most definitely *not idle* and *not clean in memory*, and that's where all the issues with suspend end up coming from - it assumes sync() is all that is needed to put a filesystem in an idle state.... > In an ideal case I suppose the laptop ram is > still hot... But sometimes (ahem, far too often) I close my laptop and > forget about it, in which case the battery dies and of course then any > dirty pages are lost. IMHO sleep should always be crash-safe. suspend is generally considered crash safe. The problems with suspend stem from inconsistent in-memory vs on-disk filesystem state in the suspend image - this causes problems on resume of the suspended image, not on the next cold boot of the system. -Dave. -- Dave Chinner dchinner@redhat.com ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) 2023-02-20 22:19 ` Dave Chinner @ 2023-02-20 23:23 ` Kent Overstreet 0 siblings, 0 replies; 24+ messages in thread From: Kent Overstreet @ 2023-02-20 23:23 UTC (permalink / raw) To: Dave Chinner; +Cc: Eric Wheeler, Brian Foster, linux-bcachefs On Tue, Feb 21, 2023 at 09:19:37AM +1100, Dave Chinner wrote: > > Wait, so are you saying that XFS does not commit dirty buffers for > > sleeping, only on remount_ro? > > Yup. But this is not unique to XFS - every journalling filesystem > (ext3, ext4, jfs, etc) has exactly the same problem: > sync_filesystem() only guarantees that the filesystem is consistent > on disk, not that it is clean in memory. > > And by "consistent on disk", that means all dirty metadata has been > written -to the journal- so that if a crash occurs immediately > afterwards, journal recovery on the next mount will ensure that the > filesystem is consistent. > > IOWs, after sync_filesystem(), the filesystem is most definitely > *not idle* and *not clean in memory*, and that's where all the > issues with suspend end up coming from - it assumes sync() is all > that is needed to put a filesystem in an idle state.... > > > In an ideal case I suppose the laptop ram is > > still hot... But sometimes (ahem, far too often) I close my laptop and > > forget about it, in which case the battery dies and of course then any > > dirty pages are lost. IMHO sleep should always be crash-safe. > > suspend is generally considered crash safe. The problems with > suspend stem from inconsistent in-memory vs on-disk filesystem state > in the suspend image - this causes problems on resume of the > suspended image, not on the next cold boot of the system. Ok, so that means we'll want to do things differently. I don't think we _quite_ have a straightforward mechanism for quiescing things without completely flushing the btree (which is actually driven by journal reclaim). 
There is bch2_journal_block()/bch2_journal_unblock() which blocks new journal reservations - I think that can be our starting point. bch2_journal_block() also waits for all in flight journal entries to be written, so that's what we want here. If we want all dirty metadata to be written _and visible_ in the journal, we need to make sure the last journal entry written is a flush entry - in general we do this by setting journal_buf->must_flush for the entry to be written; see bch2_journal_meta() for a simple example. We'll also need to prevent new btree node writes from being issued, and wait for in flight ones to be finished. We don't have a mechanism for that currently, we can probably create something like bch2_journal_block() for that. Other IO sources: - superblock writes (rare in normal operation, they happen when e.g. we start writing data to a new set of devices... we need that in the superblock so that mount knows which devices we need to mount). - data reads/writes - erasure coding writes (parity blocks when creating stripes, reconstruct reads). I expect we'll have to block reads from being submitted as well? Then we'll want to block data reads/btree node reads/ec reconstruct reads after blocking all the sources of writes... (this is starting to feel more like the old freezer stuff.. blech) ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-01-31 16:04 ` Brian Foster 2023-02-01 14:34 ` Kent Overstreet @ 2023-02-02 22:56 ` Kent Overstreet 2023-02-04 21:33 ` Brian Foster 1 sibling, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-02-02 22:56 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > Something else that occurred to me while looking further at this is we > > > can also avoid the error in this case fairly easily by bailing out of > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > flush -> sync meta -> flush log sequence that returns the first error > > > anyways. That would prevent marking the inode with a new sequence number > > > when I/Os are obviously failing. The caveat is that the test still > > > fails, now with a "Read-only file system" error instead of EIO, because > > > the filesystem is shutdown by the time the vfs write inode path actually > > > runs. > > > > If some pages did write successfully we don't want to skip the rest of > > the fsync, though. > > > > What does it matter if the fsync() has already failed? ISTM this is > pretty standard error handling behavior across major fs', but it's not > clear to me if there's some bcachefs specific quirk that warrants > different handling.. > > FWIW, I think I've been able to fix this test with a couple small > tweaks: > > 1. Change bch2_fsync() to return on first error. I suppose it doesn't make sense to flush the journal if file_write_and_wait_range() didn't successfully do anything - but I would prefer to keep the behaviour where we do flush the journal on partial writeback. What if we plumbed a did_work parameter through? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-02 22:56 ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet @ 2023-02-04 21:33 ` Brian Foster 2023-02-04 22:15 ` Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-02-04 21:33 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote: > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > > Something else that occurred to me while looking further at this is we > > > > can also avoid the error in this case fairly easily by bailing out of > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > > flush -> sync meta -> flush log sequence that returns the first error > > > > anyways. That would prevent marking the inode with a new sequence number > > > > when I/Os are obviously failing. The caveat is that the test still > > > > fails, now with a "Read-only file system" error instead of EIO, because > > > > the filesystem is shutdown by the time the vfs write inode path actually > > > > runs. > > > > > > If some pages did write successfully we don't want to skip the rest of > > > the fsync, though. > > > > > > > What does it matter if the fsync() has already failed? ISTM this is > > pretty standard error handling behavior across major fs', but it's not > > clear to me if there's some bcachefs specific quirk that warrants > > different handling.. > > > > FWIW, I think I've been able to fix this test with a couple small > > tweaks: > > > > 1. Change bch2_fsync() to return on first error. > > I suppose it doesn't make sense to flush the journal if > file_write_and_wait_range() didn't successfully do anything - but I > would prefer to keep the behaviour where we do flush the journal on > partial writeback. 
> That seems reasonable to me in principle... > What if we plumbed a did_work parameter through? > ... but I'm not sure how that is supposed to work..? Tracking submits from the current flush wouldn't be hard, but wouldn't tell us about writeback that might have occurred in the background before fsync was called. It also doesn't give any information about what I/Os succeeded or failed. I was thinking about whether we could look at PageError pages or perhaps sample errseq or something in the error case, but the fdatawait can clear page errors and errseq only bumps the sequence value if it's been sampled since the last update (so no way to distinguish between "one failure" and "everything failed"). The only other thing that comes to mind is explicit submit/completion/error counting and tracking somewhere in the mapping or inode or some such, but that seems like overkill given fsync error semantics. I.e. when fsync returns an error, we're basically telling the user "data loss somewhere in the file," without indication of where or how much, so the user basically has to recreate the file or toss it anyways. It would be nice to improve on that, but that's a much bigger problem than I'm trying to address here. ;P So I dunno.. I'll try to think a bit more about it but atm I'm not seeing any sort of elegant way to fix this test and also preserve the inode flush in the event of writeback failure. Thoughts? ISTM that it's reasonable to skip the flush on wb error given fsync behavior and error semantics seem common across all major fs', and as background writeback should still try to handle it in short order anyways, but it's your tree.. :) Brian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-04 21:33 ` Brian Foster @ 2023-02-04 22:15 ` Kent Overstreet 2023-02-06 15:33 ` Brian Foster 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-02-04 22:15 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote: > On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote: > > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > > > Something else that occurred to me while looking further at this is we > > > > > can also avoid the error in this case fairly easily by bailing out of > > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > > > flush -> sync meta -> flush log sequence that returns the first error > > > > > anyways. That would prevent marking the inode with a new sequence number > > > > > when I/Os are obviously failing. The caveat is that the test still > > > > > fails, now with a "Read-only file system" error instead of EIO, because > > > > > the filesystem is shutdown by the time the vfs write inode path actually > > > > > runs. > > > > > > > > If some pages did write successfully we don't want to skip the rest of > > > > the fsync, though. > > > > > > > > > > What does it matter if the fsync() has already failed? ISTM this is > > > pretty standard error handling behavior across major fs', but it's not > > > clear to me if there's some bcachefs specific quirk that warrants > > > different handling.. > > > > > > FWIW, I think I've been able to fix this test with a couple small > > > tweaks: > > > > > > 1. Change bch2_fsync() to return on first error. 
> > > > I suppose it doesn't make sense to flush the journal if > > file_write_and_wait_range() didn't successfully do anything - but I > > would prefer to keep the behaviour where we do flush the journal on > > partial writeback. > > > > That seems reasonable to me in principle... > > > What if we plumbed a did_work parameter through? > > > > ... but I'm not sure how that is supposed to work..? Tracking submits > from the current flush wouldn't be hard, but wouldn't tell us about > writeback that might have occurred in the background before fsync was > called. It also doesn't give any information about what I/Os succeeded > or failed. Good point. Now that I think about it, inode->bi_journal_seq won't be getting updated unless a write actually did successfully complete - meaning bch2_flush_inode() won't do anything unless there was a successful transaction commit; either data was written or the inode was updated for some other reason. Maybe it's the sync_inode_metadata() call that's causing the journal flush to happen? ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-04 22:15 ` Kent Overstreet @ 2023-02-06 15:33 ` Brian Foster 2023-02-06 22:18 ` Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-02-06 15:33 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Sat, Feb 04, 2023 at 05:15:11PM -0500, Kent Overstreet wrote: > On Sat, Feb 04, 2023 at 04:33:41PM -0500, Brian Foster wrote: > > On Thu, Feb 02, 2023 at 05:56:32PM -0500, Kent Overstreet wrote: > > > On Tue, Jan 31, 2023 at 11:04:11AM -0500, Brian Foster wrote: > > > > On Mon, Jan 30, 2023 at 12:06:34PM -0500, Kent Overstreet wrote: > > > > > On Fri, Jan 27, 2023 at 09:50:05AM -0500, Brian Foster wrote: > > > > > > Something else that occurred to me while looking further at this is we > > > > > > can also avoid the error in this case fairly easily by bailing out of > > > > > > bch2_fsync() if page writeback fails, as opposed to the unconditional > > > > > > flush -> sync meta -> flush log sequence that returns the first error > > > > > > anyways. That would prevent marking the inode with a new sequence number > > > > > > when I/Os are obviously failing. The caveat is that the test still > > > > > > fails, now with a "Read-only file system" error instead of EIO, because > > > > > > the filesystem is shutdown by the time the vfs write inode path actually > > > > > > runs. > > > > > > > > > > If some pages did write successfully we don't want to skip the rest of > > > > > the fsync, though. > > > > > > > > > > > > > What does it matter if the fsync() has already failed? ISTM this is > > > > pretty standard error handling behavior across major fs', but it's not > > > > clear to me if there's some bcachefs specific quirk that warrants > > > > different handling.. > > > > > > > > FWIW, I think I've been able to fix this test with a couple small > > > > tweaks: > > > > > > > > 1. Change bch2_fsync() to return on first error. 
> > > > > > I suppose it doesn't make sense to flush the journal if > > > file_write_and_wait_range() didn't successfully do anything - but I > > > would prefer to keep the behaviour where we do flush the journal on > > > partial writeback. > > > > > > > That seems reasonable to me in principle... > > > > > What if we plumbed a did_work parameter through? > > > > > > > ... but I'm not sure how that is supposed to work..? Tracking submits > > from the current flush wouldn't be hard, but wouldn't tell us about > > writeback that might have occurred in the background before fsync was > > called. It also doesn't give any information about what I/Os succeeded > > or failed. > > Good point. > > Now that I think about it, inode->bi_journal_seq won't be getting > updated unless a write actually did successfully complete - meaning > bch2_flush_inode() won't do anything unless there was a successful > transaction commit; either data was written or the inode was updated for > some other reason. > > Maybe it's the sync_inode_metadata() call that's causing the journal > flush to happen? > Yeah, the sync_inode_metadata() -> bch2_vfs_write_inode() -> bch2_write_inode() path is what commits a transaction and bumps the ->bi_journal_seq on the inode. Essentially I think there are a couple issues related to this test and perhaps easier to reason about them separately: The first issue is that a shutdown occurs because the test basically does a pwrite() -> fsync() -> dm-error error table switch sequence, and the error table switch races with journal reclaim I/O that kicks in after the fsync completes. This problem is resolved by the bcachefs specific sleep during the dm-table switch (hopefully to be replaced by a freeze cycle in the future), so no bcachefs changes are required for that one. The second issue is that bcachefs still shuts down with the aforementioned delay in place for a slightly different reason. 
This occurs because after the switch to an error table, the test does another pwrite() -> fsync() sequence to test the "fsync should return error" case. What I see happen in bcachefs here is that writepages occurs (because the buffered write is able to dirty the pages and inode and whatnot since everything is buffered) and fails, bch2_vfs_write_inode() commits a transaction (again successful, and marks the inode with the latest journal seq), and then bch2_flush_inode() grabs the updated inode seq, attempts to flush the journal, fails due to I/O error, and then shuts down the fs. The shutdown leads to general test failure because the test eventually switches back to a functional dm table, but bcachefs reports errors because it has shutdown. So yes, it seems the sync_inode_metadata() path is what bumps the journal seq and the explicit journal flush is what shuts down the fs, but I assume sync_inode_metadata() attempts the inode flush simply because the buffered write that precedes it works perfectly fine. This is what had me thinking that bch2_fsync() is being more aggressive than it really needs to be here. With the proposed logic tweak, the second issue is resolved because bcachefs no longer attempts the inode flush when page writeback has failed. So what happens in the test is that it switches back over to the functional dm table, the vfs inode flush occurs then and succeeds, and so the test sees no (unexpected) errors and the fs has survived without shutting down. All in all I think this test basically builds on some minor assumptions about how fsync error behavior is implemented in Linux, and bcachefs happens to slightly diverge in a way that leads to this shutdown. IOW, I'd expect the same problematic behavior out of XFS if it implemented this sort of fsync logic, but afaict no other fs does that, so the test is functional in practice. 
I admit that's not the best pure engineering justification for the change in bcachefs (so I understand any hesitation), but IMO it's reasonable in practice and worthwhile enough to improve test coverage. I haven't audited fstests for this or anything, but it wouldn't surprise me much if there are other tests that rely on this sort of "assumed behavior" for testing I/O failures. Thoughts? Brian ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-06 15:33 ` Brian Foster @ 2023-02-06 22:18 ` Kent Overstreet 2023-02-09 12:57 ` Brian Foster 0 siblings, 1 reply; 24+ messages in thread From: Kent Overstreet @ 2023-02-06 22:18 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote: > So yes, it seems the sync_inode_metadata() path is what bumps the > journal seq and the explicit journal flush is what shuts down the fs, > but I assume sync_inode_metadata() attempts the inode flush simply > because the buffered write that precedes it works perfectly fine. This > is what had me thinking that bch2_fsync() is being more aggressive than > it really needs to be here. With the proposed logic tweak, the second > issue is resolved because bcachefs no longer attempts the inode flush > when page writeback has failed. So what happens in the test is that it > switches back over to the functional dm table, the vfs inode flush > occurs then and succeeds, and so the test sees no (unexpected) errors > and the fs has survived without shutting down. So, sync_inode_metadata() is just writing out timestamp updates. But, there's no reason we couldn't be doing that as part of the write path, if it was plumbed through - the write path has to do an inode update anyways. Although that would involve an unpack/repack of the inode, so maybe we don't want to do that - we'd want to change the inode format to not encode timestamps as varints, and I'm not sure that's worth the hassle and the increased inode size. I was eyeing changing the way timestamp updates work to make them more direct, it would get rid of the need to call sync_inode_metadata() and regularize timestamp updates with the way inode updates work for everything else - ISTR that's how it works in XFS as well. But it'd make atime updates more expensive, and that's not something I've looked into yet. 
> All in all I think this test basically builds on some minor assumptions > about how fsync error behavior is implemented in Linux, and bcachefs > happens to slightly diverge in a way that leads to this shutdown. IOW, > I'd expect the same problematic behavior out of XFS if it implemented > this sort of fsync logic, but afaict no other fs does that, so the test > is functional in practice. I admit that's not the best pure engineering > justification for the change in bcachefs (so I understand any > hesitation), but IMO it's reasonable in practice and worthwhile enough > to improve test coverage. I haven't audited fstests for this or > anything, but it wouldn't surprise me much if there are other tests that > rely on this sort of "assumed behavior" for testing I/O failures. > Thoughts? Yeah, maybe we should hold off on making any decisions until we see where else this comes up, I hoped this test was going to be easier to fix but it doesn't seem like there's any clean obvious solutions. Ah well :) LSF is coming up and as I recall that's where the errseq stuff was originally discussed - I'd like to hear from the people behind those changes if they think leaving our journal flush out of the errseq path is reasonable (it seems so to me, but it's worth bringing up). I can pencil it in to the bcachefs talk... need to start working on that... We can just leave it failing in the CI for now - maybe make some notes as to what our options are when we decide to come back to it, there's definitely more straightforward and more impactful bugs to work on in the meantime. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: fstests generic/441 -- occasional bcachefs failure 2023-02-06 22:18 ` Kent Overstreet @ 2023-02-09 12:57 ` Brian Foster 2023-02-09 14:58 ` Kent Overstreet 0 siblings, 1 reply; 24+ messages in thread From: Brian Foster @ 2023-02-09 12:57 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Mon, Feb 06, 2023 at 05:18:35PM -0500, Kent Overstreet wrote: > On Mon, Feb 06, 2023 at 10:33:14AM -0500, Brian Foster wrote: > > So yes, it seems the sync_inode_metadata() path is what bumps the > > journal seq and the explicit journal flush is what shuts down the fs, > > but I assume sync_inode_metadata() attempts the inode flush simply > > because the buffered write that precedes it works perfectly fine. This > > is what had me thinking that bch2_fsync() is being more aggressive than > > it really needs to be here. With the proposed logic tweak, the second > > issue is resolved because bcachefs no longer attempts the inode flush > > when page writeback has failed. So what happens in the test is that it > > switches back over to the functional dm table, the vfs inode flush > > occurs then and succeeds, and so the test sees no (unexpected) errors > > and the fs has survived without shutting down. > > So, sync_inode_metadata() is just writing out timestamp updates. > Yeah.. I hadn't confirmed, but that's what I figured. > But, there's no reason we couldn't be doing that as part of the write > path, if it was plumbed through - the write path has to do an inode > update anyways. Although that would involve an unpack/repack of the > inode, so maybe we don't want to do that - we'd want to change the inode > format to not encode timestamps as varints, and I'm not sure that's > worth the hassle and the increased inode size. 
>
> I was eyeing changing the way timestamp updates work to make them more
> direct; it would get rid of the need to call sync_inode_metadata() and
> regularize timestamp updates with the way inode updates work for
> everything else - ISTR that's how it works in XFS as well. But it'd
> make atime updates more expensive, and that's not something I've
> looked into yet.
>
> > All in all I think this test basically builds on some minor
> > assumptions about how fsync error behavior is implemented in Linux,
> > and bcachefs happens to slightly diverge in a way that leads to this
> > shutdown. IOW, I'd expect the same problematic behavior out of XFS
> > if it implemented this sort of fsync logic, but afaict no other fs
> > does that, so the test is functional in practice. I admit that's not
> > the best pure engineering justification for the change in bcachefs
> > (so I understand any hesitation), but IMO it's reasonable in
> > practice and worthwhile enough to improve test coverage. I haven't
> > audited fstests for this or anything, but it wouldn't surprise me
> > much if there are other tests that rely on this sort of "assumed
> > behavior" for testing I/O failures. Thoughts?
>
> Yeah, maybe we should hold off on making any decisions until we see
> where else this comes up. I hoped this test was going to be easier to
> fix, but it doesn't seem like there are any clean, obvious solutions.
> Ah well :)
>

Indeed. While the proposed logic change seems reasonable to me (for
reasons already explained)...

> LSF is coming up, and as I recall that's where the errseq stuff was
> originally discussed - I'd like to hear from the people behind those
> changes whether they think leaving our journal flush out of the errseq
> path is reasonable (it seems so to me, but it's worth bringing up). I
> can pencil it in to the bcachefs talk... need to start working on
> that...
>

...
I also think it's reasonable to sit on it, take stock of the scope, and
gain more input wrt errseq and whatnot, should any of that lead to
potentially better solutions.

> We can just leave it failing in the CI for now - maybe make some notes
> as to what our options are when we decide to come back to it, there's
> definitely more straightforward and more impactful bugs to work on in
> the meantime.
>

Ok. FWIW, I was just trying to pick off something that looked
approachable because that helps get more comfortable with the code,
etc., even if it doesn't result in an immediate fix. I.e., this
investigation/root-cause was helpful to grok journaling a bit more,
uncover (to me, at least) a use case for freeze, etc. If you have any
other open bugs that you think might be useful to look into, feel free
to send them along..

Brian
* Re: fstests generic/441 -- occasional bcachefs failure
  2023-02-09 12:57 ` Brian Foster
@ 2023-02-09 14:58 ` Kent Overstreet
  0 siblings, 0 replies; 24+ messages in thread
From: Kent Overstreet @ 2023-02-09 14:58 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs

On Thu, Feb 09, 2023 at 07:57:31AM -0500, Brian Foster wrote:
> If you have any other open bugs that you think might be useful to look
> into, feel free to send them along..

https://evilpiepirate.org/~testdashboard/c/f99e51d956a74e1cd78f88648b27b0dac7c13a02/xfstests.generic.388/log.br

This might be a good one to start with: a file got truncated but still
has extents in it. I'd start by running it in a loop locally in ktest,
and seeing how easily it reproduces and what else turns up.

It's starting to look like there might be a btree ordering issue -
there have been a number of unusual bug reports of filesystem
inconsistencies that generally are repaired fine by fsck but aren't
showing up in our tests either; this is the closest to those bugs that
I've come across in the tests.

truncate calls bch2_fpunch(), which deletes extents using
bch2_extent_update(); it failing to delete some extents shouldn't be
possible, and if i_size/i_sectors somehow got out of sync I'd expect it
to be in the other direction; still, there might be some asserts we can
add.

'bcachefs list_journal -a' should be one of the first things you look
at; you can grep through the output to find the most recent transaction
commit(s) that updated that inode. See if anything looks funny - was
this one of the most recent updates when the test stopped, or was it
further back? If it's a lower level btree inconsistency, that might
show up by verifying what the journal says we have against what's in
the btree - 'bcachefs list' command.
If it's a btree ordering issue, there are two main possibilities:

 - after a crash, the btree has updates it shouldn't have, because
   they're newer than the newest update that made it into the journal

 - the btree failed to persist some updates that the journal didn't
   think it needed to replay anymore

Keys have a version number field: we can use it for distinguishing
between these two possibilities. In the transaction commit path, you'll
see code for a debug mode where we initialize it based on the sequence
number of the journal reservation we obtained. There may still be some
other related code for that debug mode; the idea is to check for keys
newer than the newest journal sequence number we replayed after a
crash. We should also check for keys with blacklisted journal sequence
numbers.

I think one of those approaches will turn up something :)
end of thread, other threads:[~2023-02-20 23:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-25 15:45 fstests generic/441 -- occasional bcachefs failure Brian Foster
2023-01-26 15:08 ` Kent Overstreet
2023-01-27  7:21   ` Kent Overstreet
2023-01-27 14:50     ` Brian Foster
2023-01-30 17:06       ` Kent Overstreet
2023-01-31 16:04         ` Brian Foster
2023-02-01 14:34           ` Kent Overstreet
2023-02-02 15:50             ` Brian Foster
2023-02-02 17:09               ` Freezing (was: Re: fstests generic/441 -- occasional bcachefs failure) Kent Overstreet
2023-02-02 20:04                 ` Brian Foster
2023-02-02 22:39                   ` Kent Overstreet
2023-02-03  0:51                     ` Dave Chinner
2023-02-04  0:35                       ` Kent Overstreet
2023-02-07  0:03                         ` Dave Chinner
2023-02-16 20:04                           ` Eric Wheeler
2023-02-20 22:19                             ` Dave Chinner
2023-02-20 23:23                               ` Kent Overstreet
2023-02-02 22:56               ` fstests generic/441 -- occasional bcachefs failure Kent Overstreet
2023-02-04 21:33                 ` Brian Foster
2023-02-04 22:15                   ` Kent Overstreet
2023-02-06 15:33                     ` Brian Foster
2023-02-06 22:18                       ` Kent Overstreet
2023-02-09 12:57                         ` Brian Foster
2023-02-09 14:58                           ` Kent Overstreet