* [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits @ 2023-03-02 19:51 Brian Foster 2023-03-04 2:44 ` Kent Overstreet 0 siblings, 1 reply; 5+ messages in thread From: Brian Foster @ 2023-03-02 19:51 UTC (permalink / raw) To: linux-bcachefs; +Cc: Kent Overstreet fstest generic/388 occasionally reproduces corruptions where an inode has extents beyond i_size. This is a deliberate crash and recovery test, and the post crash+recovery characteristics are usually the same: the inode exists on disk in an early (i.e. just allocated) state based on the journal sequence number associated with the inode. Subsequent inode updates exist in the journal at higher sequence numbers, but the inode hadn't been written back before the associated crash and the post-crash recovery processes a set of journal sequence numbers that doesn't include updates to the inode. In fact, the sequence with the most recent inode key update always happens to be the sequence just before the front of the journal processed by recovery. This last bit is a significant hint that the problem relates to an on-disk journal update of the front of the journal. The root cause of this problem is basically that the inode is updated (multiple times) in-core and in the key cache, each time bumping the key cache sequence number used to control the cache flush. The cache flush skips one or more times, bumping the associated key cache journal pin to the key cache seq value. This has a side effect of holding the inode in memory a bit longer than normal, which helps exacerbate this problem, but is also unsafe in certain cases where the key cache seq may have been updated by a transaction commit that didn't journal the associated key. For example, consider an inode that has been allocated, updated several times in the key cache, journaled, but not yet written back. At this stage, everything should be consistent if the fs happens to crash because the latest update has been journal. Now consider a key update via bch2_extent_update_i_size_sectors() that uses the BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode state, it can have the side effect of bumping ck->seq in bch2_btree_insert_key_cached(). In turn, if a subsequent key cache flush skips due to seq not matching the former, the ck->journal pin is updated to ck->seq even though the most recent key update was not journaled. If this pin happens to reside at the front (tail) of the journal, this means a subsequent journal write can update last_seq to a value beyond that which includes the most recent update to the inode. If this occurs and the fs happens to crash before the inode happens to flush, recovery will see the latest last_seq, fail to recover the inode and leave the inode in the inconsistent state described above. To avoid this problem, skip the key cache seq update on NOJOURNAL commits. Pass the insert entry directly to bch2_btree_insert_key_cached() to make the associated flag available and be consistent with btree_insert_key_leaf(). Not-Signed-off-by: Brian Foster <bfoster@redhat.com> --- Hi Kent, As noted on IRC, this patch is RFC because it's definitely broken. I reproduce a generic/036 unmount hang related to pin flushing. I tweaked the logic to also skip the bch2_journal_pin_add() call, which seems to fix generic/036 but then results in a generic/127 panic where atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells like I'm breaking the journal pin/key cache flushing logic here and so need to grok this a bit better to see what this is getting wrong. FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from the inode update in the extent update path. That seems to also address the generic/388 problem, but I haven't tested that one more broadly. One other thing is I know we proved the commit that introduced ck->seq was not the root cause of this problem a week or so ago. I suspect this still occurs prior to that commit because that one also relocated the pin update from the commit path to the journal reclaim (i.e. cache flush) path. IOW, the problem still occurs, it just depends on the NOJOURNAL commit itself to bump the pin and a journal write to race with a cache flush. Anyways, I still want to play around with this some more but it might take me some time and I wanted to at least post the problem description in the meantime. I'm curious if you have any thoughts on the analysis above and/or the initial attempt to fix.. Brian fs/bcachefs/btree_key_cache.c | 9 +++++---- fs/bcachefs/btree_key_cache.h | 2 +- fs/bcachefs/btree_update_leaf.c | 2 +- 3 files changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c index 13df0d408634..272e66437ffc 100644 --- a/fs/bcachefs/btree_key_cache.c +++ b/fs/bcachefs/btree_key_cache.c @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans, bool bch2_btree_insert_key_cached(struct btree_trans *trans, unsigned flags, - struct btree_path *path, - struct bkey_i *insert) + struct btree_insert_entry *insert_entry) { struct bch_fs *c = trans->c; - struct bkey_cached *ck = (void *) path->l[0].b; + struct bkey_cached *ck = (void *) insert_entry->path->l[0].b; + struct bkey_i *insert = insert_entry->k; bool kick_reclaim = false; BUG_ON(insert->u64s > ck->u64s); @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans, bch2_journal_pin_add(&c->journal, trans->journal_res.seq, &ck->journal, bch2_btree_key_cache_journal_flush); - ck->seq = trans->journal_res.seq; + if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL)) + ck->seq = trans->journal_res.seq; if (kick_reclaim) journal_reclaim_kick(&c->journal); diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h index c86d5e48f6e3..be3acde2caa0 100644 --- a/fs/bcachefs/btree_key_cache.h +++ b/fs/bcachefs/btree_key_cache.h @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *, unsigned); bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned, - struct btree_path *, struct bkey_i *); + struct btree_insert_entry *); int bch2_btree_key_cache_flush(struct btree_trans *, enum btree_id, struct bpos); void bch2_btree_key_cache_drop(struct btree_trans *, diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c index c93c132dd815..456375fed276 100644 --- a/fs/bcachefs/btree_update_leaf.c +++ b/fs/bcachefs/btree_update_leaf.c @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags, if (!i->cached) btree_insert_key_leaf(trans, i); else if (!i->key_cache_already_flushed) - bch2_btree_insert_key_cached(trans, flags, i->path, i->k); + bch2_btree_insert_key_cached(trans, flags, i); else { bch2_btree_key_cache_drop(trans, i->path); btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE); -- 2.39.2 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits 2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster @ 2023-03-04 2:44 ` Kent Overstreet 2023-03-06 13:18 ` Brian Foster 0 siblings, 1 reply; 5+ messages in thread From: Kent Overstreet @ 2023-03-04 2:44 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Thu, Mar 02, 2023 at 02:51:10PM -0500, Brian Foster wrote: > As noted on IRC, this patch is RFC because it's definitely broken. I > reproduce a generic/036 unmount hang related to pin flushing. I tweaked > the logic to also skip the bch2_journal_pin_add() call, which seems to > fix generic/036 but then results in a generic/127 panic where > atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells > like I'm breaking the journal pin/key cache flushing logic here and so > need to grok this a bit better to see what this is getting wrong. > > FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from > the inode update in the extent update path. That seems to also address > the generic/388 problem, but I haven't tested that one more broadly. > > One other thing is I know we proved the commit that introduced ck->seq > was not the root cause of this problem a week or so ago. I suspect this > still occurs prior to that commit because that one also relocated the > pin update from the commit path to the journal reclaim (i.e. cache > flush) path. IOW, the problem still occurs, it just depends on the > NOJOURNAL commit itself to bump the pin and a journal write to race with > a cache flush. So continuing the IRC conversation, it seems like this is (almost) exactly the right fix - and I think I might know why you're seeing the generic/036 hang too. Without your fix, the bug is that we're setting ck->seq to a newer sequence number than the last time this key was updated in the journal. That means that when we later update the journal pin, this key cache key is pinning a newer journal entry than the last time this key was updated in the journal - meaning there's a window in the journal where we've lost the inode update. Specifically, the inode update _previous_ to this inode update (that was not marked as NOJOURNAL) gets lost - the bch2_journal_pin_update() call tells the journal we don't need that update anymore (which would be correct if we had journalled a newer update) - but we didn't journal a newer update, so we did need the older update. It looks like the generic/036 hang is beacuse with your fix, you're calling bch2_journal_pin_add() without updating ck->seq - then we end up holding a journal pin while ck->seq is garbage, so bch2_btree_key_cache_journal_flush() gets confused - haven't worked out what exactly happens then. Try moving the bch2_journal_pin_add() inside your new BTREE_UPDATE_NOJOURNAL check, I bet that fixes it. > > fs/bcachefs/btree_key_cache.c | 9 +++++---- > fs/bcachefs/btree_key_cache.h | 2 +- > fs/bcachefs/btree_update_leaf.c | 2 +- > 3 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c > index 13df0d408634..272e66437ffc 100644 > --- a/fs/bcachefs/btree_key_cache.c > +++ b/fs/bcachefs/btree_key_cache.c > @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans, > > bool bch2_btree_insert_key_cached(struct btree_trans *trans, > unsigned flags, > - struct btree_path *path, > - struct bkey_i *insert) > + struct btree_insert_entry *insert_entry) > { > struct bch_fs *c = trans->c; > - struct bkey_cached *ck = (void *) path->l[0].b; > + struct bkey_cached *ck = (void *) insert_entry->path->l[0].b; > + struct bkey_i *insert = insert_entry->k; > bool kick_reclaim = false; > > BUG_ON(insert->u64s > ck->u64s); > @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans, > > bch2_journal_pin_add(&c->journal, trans->journal_res.seq, > &ck->journal, bch2_btree_key_cache_journal_flush); > - ck->seq = trans->journal_res.seq; > + if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL)) > + ck->seq = trans->journal_res.seq; > > if (kick_reclaim) > journal_reclaim_kick(&c->journal); > diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h > index c86d5e48f6e3..be3acde2caa0 100644 > --- a/fs/bcachefs/btree_key_cache.h > +++ b/fs/bcachefs/btree_key_cache.h > @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *, > unsigned); > > bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned, > - struct btree_path *, struct bkey_i *); > + struct btree_insert_entry *); > int bch2_btree_key_cache_flush(struct btree_trans *, > enum btree_id, struct bpos); > void bch2_btree_key_cache_drop(struct btree_trans *, > diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c > index c93c132dd815..456375fed276 100644 > --- a/fs/bcachefs/btree_update_leaf.c > +++ b/fs/bcachefs/btree_update_leaf.c > @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags, > if (!i->cached) > btree_insert_key_leaf(trans, i); > else if (!i->key_cache_already_flushed) > - bch2_btree_insert_key_cached(trans, flags, i->path, i->k); > + bch2_btree_insert_key_cached(trans, flags, i); > else { > bch2_btree_key_cache_drop(trans, i->path); > btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE); > -- > 2.39.2 > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits 2023-03-04 2:44 ` Kent Overstreet @ 2023-03-06 13:18 ` Brian Foster 2023-03-07 10:32 ` Kent Overstreet 0 siblings, 1 reply; 5+ messages in thread From: Brian Foster @ 2023-03-06 13:18 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Fri, Mar 03, 2023 at 09:44:55PM -0500, Kent Overstreet wrote: > On Thu, Mar 02, 2023 at 02:51:10PM -0500, Brian Foster wrote: > > As noted on IRC, this patch is RFC because it's definitely broken. I > > reproduce a generic/036 unmount hang related to pin flushing. I tweaked > > the logic to also skip the bch2_journal_pin_add() call, which seems to > > fix generic/036 but then results in a generic/127 panic where > > atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells > > like I'm breaking the journal pin/key cache flushing logic here and so > > need to grok this a bit better to see what this is getting wrong. > > > > FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from > > the inode update in the extent update path. That seems to also address > > the generic/388 problem, but I haven't tested that one more broadly. > > > > One other thing is I know we proved the commit that introduced ck->seq > > was not the root cause of this problem a week or so ago. I suspect this > > still occurs prior to that commit because that one also relocated the > > pin update from the commit path to the journal reclaim (i.e. cache > > flush) path. IOW, the problem still occurs, it just depends on the > > NOJOURNAL commit itself to bump the pin and a journal write to race with > > a cache flush. > > So continuing the IRC conversation, it seems like this is (almost) > exactly the right fix - and I think I might know why you're seeing the > generic/036 hang too. > > Without your fix, the bug is that we're setting ck->seq to a newer > sequence number than the last time this key was updated in the journal. > That means that when we later update the journal pin, this key cache key > is pinning a newer journal entry than the last time this key was updated > in the journal - meaning there's a window in the journal where we've > lost the inode update. > > Specifically, the inode update _previous_ to this inode update (that was > not marked as NOJOURNAL) gets lost - the bch2_journal_pin_update() call > tells the journal we don't need that update anymore (which would be > correct if we had journalled a newer update) - but we didn't journal a > newer update, so we did need the older update. > Yep.. > It looks like the generic/036 hang is beacuse with your fix, you're > calling bch2_journal_pin_add() without updating ck->seq - then we end up > holding a journal pin while ck->seq is garbage, so > bch2_btree_key_cache_journal_flush() gets confused - haven't worked out > what exactly happens then. > Yeah, I tracked the hang problem down to essentially being due to a "stale" ck->seq value. I.e., a normal commit occurs, updates ck->seq, the key cache flushes, pin drops, etc. Subsequently a NOJOURNAL commit occurs and happens to be the only key cache update at that point. It adds a new pin but does not update ck->seq. The key flush callback sees an old ck->seq and doesn't flush, which essentially leaves around the following: 96611: count 1 flushed: ffff99c313fb64d8 bch2_btree_key_cache_journal_flush [bcachefs] ... ... and we're stuck because the callback didn't flush, but the pin wasn't updated either. > Try moving the bch2_journal_pin_add() inside your new > BTREE_UPDATE_NOJOURNAL check, I bet that fixes it. > Hmm.. I want to say I tried that and I ran into another problem with generic/127 where the key cache wouldn't flush (see the comments in my initial writeup), I _think_ because we hit a case where we just failed to add a pin and that is required to tie into journal reclaim and actually have the cache flush (even if nothing was logged). I ran with the following logic over the weekend and it seemed to avoid any problems: if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) || !journal_pin_active(&ck->journal)) { ck->seq = trans->journal_res.seq; } bch2_journal_pin_add(&c->journal, trans->journal_res.seq, &ck->journal, bch2_btree_key_cache_journal_flush); This addresses the scenario above by updating ck->seq when the key cache pin happens to be inactive, which is presumably (?) safe because that means the cache doesn't hold any previous updates. FWIW, I suspect there are probably multiple ways to fix this, such as allowing pin_add() to return whether it actually added a pin, or perhaps resetting ck->seq somewhere on key cache flush so the callback doesn't think it's valid, etc. Thoughts? I have a bigger picture question, however. The key cache is intended to hold updates to multiple different keys that land in the same btree node, right? If so, is it safe to update the key cache pin on later updates like this in general? Specifically I was thinking about the example of two inodes that happen to land in the same key cache (inode A, A+1) but under different journal seqs. For example, suppose inode A commits, logs, adds the ck pin, and sets ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq again. At this point if journal reclaim flushes the cache with the original pin, we update the pin to the A+1 ck->seq value and return. If this moves the pin from the front of the journal, what prevents last_seq from being updated to A+1 on disk and introducing the same sort of window where in-core updates aren't covered by the active range of the journal? I.e., if last_seq updates and the filesystem crashes, ISTM we potentially lose the inode A key update, even if it was journaled. Unless I'm missing something (entirely possible), I'm wondering if it makes sense to update ck->seq like this at all (or if something else is required to allow a ck to hold reference to multiple pins, etc.). Hm? Brian > > > > fs/bcachefs/btree_key_cache.c | 9 +++++---- > > fs/bcachefs/btree_key_cache.h | 2 +- > > fs/bcachefs/btree_update_leaf.c | 2 +- > > 3 files changed, 7 insertions(+), 6 deletions(-) > > > > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c > > index 13df0d408634..272e66437ffc 100644 > > --- a/fs/bcachefs/btree_key_cache.c > > +++ b/fs/bcachefs/btree_key_cache.c > > @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans, > > > > bool bch2_btree_insert_key_cached(struct btree_trans *trans, > > unsigned flags, > > - struct btree_path *path, > > - struct bkey_i *insert) > > + struct btree_insert_entry *insert_entry) > > { > > struct bch_fs *c = trans->c; > > - struct bkey_cached *ck = (void *) path->l[0].b; > > + struct bkey_cached *ck = (void *) insert_entry->path->l[0].b; > > + struct bkey_i *insert = insert_entry->k; > > bool kick_reclaim = false; > > > > BUG_ON(insert->u64s > ck->u64s); > > @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans, > > > > bch2_journal_pin_add(&c->journal, trans->journal_res.seq, > > &ck->journal, bch2_btree_key_cache_journal_flush); > > - ck->seq = trans->journal_res.seq; > > + if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL)) > > + ck->seq = trans->journal_res.seq; > > > > if (kick_reclaim) > > journal_reclaim_kick(&c->journal); > > diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h > > index c86d5e48f6e3..be3acde2caa0 100644 > > --- a/fs/bcachefs/btree_key_cache.h > > +++ b/fs/bcachefs/btree_key_cache.h > > @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *, > > unsigned); > > > > bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned, > > - struct btree_path *, struct bkey_i *); > > + struct btree_insert_entry *); > > int bch2_btree_key_cache_flush(struct btree_trans *, > > enum btree_id, struct bpos); > > void bch2_btree_key_cache_drop(struct btree_trans *, > > diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c > > index c93c132dd815..456375fed276 100644 > > --- a/fs/bcachefs/btree_update_leaf.c > > +++ b/fs/bcachefs/btree_update_leaf.c > > @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags, > > if (!i->cached) > > btree_insert_key_leaf(trans, i); > > else if (!i->key_cache_already_flushed) > > - bch2_btree_insert_key_cached(trans, flags, i->path, i->k); > > + bch2_btree_insert_key_cached(trans, flags, i); > > else { > > bch2_btree_key_cache_drop(trans, i->path); > > btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE); > > -- > > 2.39.2 > > > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits 2023-03-06 13:18 ` Brian Foster @ 2023-03-07 10:32 ` Kent Overstreet 2023-03-07 14:22 ` Brian Foster 0 siblings, 1 reply; 5+ messages in thread From: Kent Overstreet @ 2023-03-07 10:32 UTC (permalink / raw) To: Brian Foster; +Cc: linux-bcachefs On Mon, Mar 06, 2023 at 08:18:55AM -0500, Brian Foster wrote: > I ran with the following logic over the weekend and it seemed to avoid > any problems: > > if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) || > !journal_pin_active(&ck->journal)) { > ck->seq = trans->journal_res.seq; > } > bch2_journal_pin_add(&c->journal, trans->journal_res.seq, > &ck->journal, bch2_btree_key_cache_journal_flush); > > This addresses the scenario above by updating ck->seq when the key cache > pin happens to be inactive, which is presumably (?) safe because that > means the cache doesn't hold any previous updates. FWIW, I suspect there > are probably multiple ways to fix this, such as allowing pin_add() to > return whether it actually added a pin, or perhaps resetting ck->seq > somewhere on key cache flush so the callback doesn't think it's valid, > etc. Thoughts? It took me a bit to work out why this is the correct fix, so it'll definitely need some documenting (and the blame is on me; BTREE_UPDATE_NOJOURNAL was clearly more subtle than I recognized at the time) - but yeah, looks good. To recap what we discussed on IRC, for anyone else who's reading: naively one might assume that in BTREE_UPDATE_NOJOURNAL mode we shouldn't do anything here and skip adding the journal pin, except that NOJOURNAL updates _do_ need to be written back to the btree and it's journal reclaim that is also responsible for that. The reason for BTREE_UPDATE_NOJOURNAL is that sometimes we're doing inode updates that are only updating bi_journal_seq, nothing else: these updates are only needed so that fsync() later knows which journal sequence number has to be flushed. We have to record this in the btree, because if it was stored in the VFS inode that's just a cache and reclaim would break fsync - but we're not going to need it after a crash, hence no need to journal it. So BTREE_UPDATE_NOJOURNAL updates are really this super rare, special purpose performance hack that knowingly break our sequential consistency model and should be regarded with suspicion :) > I have a bigger picture question, however. The key cache is intended to > hold updates to multiple different keys that land in the same btree > node, right? If so, is it safe to update the key cache pin on later > updates like this in general? Specifically I was thinking about the > example of two inodes that happen to land in the same key cache (inode > A, A+1) but under different journal seqs. Yes. Originally, on every key cache update we'd call bch2_journal_pin_update() - dropping our old journal pin and pinning the journal entry of the new update. See the patch "bcachefs: Btree key cache optimization" where we switched to the new behaviour - it's an optimization to avoid hitting the journal lock. > For example, suppose inode A commits, logs, adds the ck pin, and sets > ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq > again. At this point if journal reclaim flushes the cache with the > original pin, we update the pin to the A+1 ck->seq value and return. If > this moves the pin from the front of the journal, what prevents last_seq > from being updated to A+1 on disk and introducing the same sort of > window where in-core updates aren't covered by the active range of the > journal? I.e., if last_seq updates and the filesystem crashes, ISTM we > potentially lose the inode A key update, even if it was journaled. That's completely fine, provided inode A1 was journalled and is still pinned in the journal - in your example, since the pin is updated to the sequence number for A1 it will be. That is, the journal is very much allowed to drop redundant updates. The journal exists to guarantee time ordering, but that guarantee only applies after we've finished replaying everything that was in the journal. If we're only partway through journal replay (say, after inode A but before A1), then the btree is going to be complete nonsense in terms of consistency - not just because stuff may have been dropped from the journal, but because things in the journal maybe be flushed in any order, so if we're only halfway through journal replay we fully expect to find updates in the btree that are newer than where we're at in the journal. This is touching on the topic of how we guarantee sequential consistency after crashes - which is an interesting and lengthy topic, but fortunately we don't need to understand it for the problem at hand. The only relevant thing for now is that yes, the journal can drop redundant updates and it's completely fine. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits 2023-03-07 10:32 ` Kent Overstreet @ 2023-03-07 14:22 ` Brian Foster 0 siblings, 0 replies; 5+ messages in thread From: Brian Foster @ 2023-03-07 14:22 UTC (permalink / raw) To: Kent Overstreet; +Cc: linux-bcachefs On Tue, Mar 07, 2023 at 05:32:05AM -0500, Kent Overstreet wrote: > On Mon, Mar 06, 2023 at 08:18:55AM -0500, Brian Foster wrote: > > I ran with the following logic over the weekend and it seemed to avoid > > any problems: > > > > if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) || > > !journal_pin_active(&ck->journal)) { > > ck->seq = trans->journal_res.seq; > > } > > bch2_journal_pin_add(&c->journal, trans->journal_res.seq, > > &ck->journal, bch2_btree_key_cache_journal_flush); > > > > This addresses the scenario above by updating ck->seq when the key cache > > pin happens to be inactive, which is presumably (?) safe because that > > means the cache doesn't hold any previous updates. FWIW, I suspect there > > are probably multiple ways to fix this, such as allowing pin_add() to > > return whether it actually added a pin, or perhaps resetting ck->seq > > somewhere on key cache flush so the callback doesn't think it's valid, > > etc. Thoughts? > > It took me a bit to work out why this is the correct fix, so it'll > definitely need some documenting (and the blame is on me; > BTREE_UPDATE_NOJOURNAL was clearly more subtle than I recognized at the > time) - but yeah, looks good. > > To recap what we discussed on IRC, for anyone else who's reading: > naively one might assume that in BTREE_UPDATE_NOJOURNAL mode we > shouldn't do anything here and skip adding the journal pin, except that > NOJOURNAL updates _do_ need to be written back to the btree and it's > journal reclaim that is also responsible for that. > > The reason for BTREE_UPDATE_NOJOURNAL is that sometimes we're doing > inode updates that are only updating bi_journal_seq, nothing else: these > updates are only needed so that fsync() later knows which journal > sequence number has to be flushed. > > We have to record this in the btree, because if it was stored in the VFS > inode that's just a cache and reclaim would break fsync - but we're not > going to need it after a crash, hence no need to journal it. > > So BTREE_UPDATE_NOJOURNAL updates are really this super rare, special > purpose performance hack that knowingly break our sequential consistency > model and should be regarded with suspicion :) > Ack, makes sense. I added some comments around the updated ck->seq logic and pushed the patch to my testing branch. It looks like the CI sees it, but hasn't started on it yet.. > > I have a bigger picture question, however. The key cache is intended to > > hold updates to multiple different keys that land in the same btree > > node, right? If so, is it safe to update the key cache pin on later > > updates like this in general? Specifically I was thinking about the > > example of two inodes that happen to land in the same key cache (inode > > A, A+1) but under different journal seqs. > > Yes. > > Originally, on every key cache update we'd call > bch2_journal_pin_update() - dropping our old journal pin and pinning the > journal entry of the new update. > > See the patch "bcachefs: Btree key cache optimization" where we switched > to the new behaviour - it's an optimization to avoid hitting the journal > lock. > Yeah, I got my wires crossed wrt to the relevant data structures and was thinking of bkey_cached as "bkey cache," saw the seq associated with that, and then naturally started to wonder how we manage a set of cached keys with a single sequence value. Sometime after writing this up I realized that bkey_cached is a single cached key, and there's a higher level btree_key_cache data stucture. I still need to poke through that code a bit more to grok it, but that explains my confusion. :) > > For example, suppose inode A commits, logs, adds the ck pin, and sets > > ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq > > again. At this point if journal reclaim flushes the cache with the > > original pin, we update the pin to the A+1 ck->seq value and return. If > > this moves the pin from the front of the journal, what prevents last_seq > > from being updated to A+1 on disk and introducing the same sort of > > window where in-core updates aren't covered by the active range of the > > journal? I.e., if last_seq updates and the filesystem crashes, ISTM we > > potentially lose the inode A key update, even if it was journaled. > > That's completely fine, provided inode A1 was journalled and is still > pinned in the journal - in your example, since the pin is updated to the > sequence number for A1 it will be. > > That is, the journal is very much allowed to drop redundant updates. The > journal exists to guarantee time ordering, but that guarantee only > applies after we've finished replaying everything that was in the > journal. > > If we're only partway through journal replay (say, after inode A but > before A1), then the btree is going to be complete nonsense in terms of > consistency - not just because stuff may have been dropped from the > journal, but because things in the journal maybe be flushed in any > order, so if we're only halfway through journal replay we fully expect > to find updates in the btree that are newer than where we're at in the > journal. > Ok. > This is touching on the topic of how we guarantee sequential consistency > after crashes - which is an interesting and lengthy topic, but > fortunately we don't need to understand it for the problem at hand. The > only relevant thing for now is that yes, the journal can drop redundant > updates and it's completely fine. > Indeed. Thanks for the explanation. Brian ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-03-07 14:26 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster 2023-03-04 2:44 ` Kent Overstreet 2023-03-06 13:18 ` Brian Foster 2023-03-07 10:32 ` Kent Overstreet 2023-03-07 14:22 ` Brian Foster
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.