All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
@ 2023-03-02 19:51 Brian Foster
  2023-03-04  2:44 ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2023-03-02 19:51 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journal. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits. Pass the insert entry directly to
bch2_btree_insert_key_cached() to make the associated flag available
and be consistent with btree_insert_key_leaf().

Not-Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi Kent,

As noted on IRC, this patch is RFC because it's definitely broken. I
reproduce a generic/036 unmount hang related to pin flushing. I tweaked
the logic to also skip the bch2_journal_pin_add() call, which seems to
fix generic/036 but then results in a generic/127 panic where
atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells
like I'm breaking the journal pin/key cache flushing logic here and so
need to grok this a bit better to see what this is getting wrong.

FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from
the inode update in the extent update path. That seems to also address
the generic/388 problem, but I haven't tested that one more broadly.

One other thing is I know we proved the commit that introduced ck->seq
was not the root cause of this problem a week or so ago. I suspect this
still occurs prior to that commit because that one also relocated the
pin update from the commit path to the journal reclaim (i.e. cache
flush) path. IOW, the problem still occurs, it just depends on the
NOJOURNAL commit itself to bump the pin and a journal write to race with
a cache flush.

Anyways, I still want to play around with this some more but it might
take me some time and I wanted to at least post the problem description
in the meantime. I'm curious if you have any thoughts on the analysis
above and/or the initial attempt to fix..

Brian

 fs/bcachefs/btree_key_cache.c   | 9 +++++----
 fs/bcachefs/btree_key_cache.h   | 2 +-
 fs/bcachefs/btree_update_leaf.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
index 13df0d408634..272e66437ffc 100644
--- a/fs/bcachefs/btree_key_cache.c
+++ b/fs/bcachefs/btree_key_cache.c
@@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans,
 
 bool bch2_btree_insert_key_cached(struct btree_trans *trans,
 				  unsigned flags,
-				  struct btree_path *path,
-				  struct bkey_i *insert)
+				  struct btree_insert_entry *insert_entry)
 {
 	struct bch_fs *c = trans->c;
-	struct bkey_cached *ck = (void *) path->l[0].b;
+	struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
+	struct bkey_i *insert = insert_entry->k;
 	bool kick_reclaim = false;
 
 	BUG_ON(insert->u64s > ck->u64s);
@@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
 
 	bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
 			     &ck->journal, bch2_btree_key_cache_journal_flush);
-	ck->seq = trans->journal_res.seq;
+	if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL))
+		ck->seq = trans->journal_res.seq;
 
 	if (kick_reclaim)
 		journal_reclaim_kick(&c->journal);
diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h
index c86d5e48f6e3..be3acde2caa0 100644
--- a/fs/bcachefs/btree_key_cache.h
+++ b/fs/bcachefs/btree_key_cache.h
@@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *,
 				    unsigned);
 
 bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
-			struct btree_path *, struct bkey_i *);
+			struct btree_insert_entry *);
 int bch2_btree_key_cache_flush(struct btree_trans *,
 			       enum btree_id, struct bpos);
 void bch2_btree_key_cache_drop(struct btree_trans *,
diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index c93c132dd815..456375fed276 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 		if (!i->cached)
 			btree_insert_key_leaf(trans, i);
 		else if (!i->key_cache_already_flushed)
-			bch2_btree_insert_key_cached(trans, flags, i->path, i->k);
+			bch2_btree_insert_key_cached(trans, flags, i);
 		else {
 			bch2_btree_key_cache_drop(trans, i->path);
 			btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
  2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster
@ 2023-03-04  2:44 ` Kent Overstreet
  2023-03-06 13:18   ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2023-03-04  2:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Thu, Mar 02, 2023 at 02:51:10PM -0500, Brian Foster wrote:
> As noted on IRC, this patch is RFC because it's definitely broken. I
> reproduce a generic/036 unmount hang related to pin flushing. I tweaked
> the logic to also skip the bch2_journal_pin_add() call, which seems to
> fix generic/036 but then results in a generic/127 panic where
> atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells
> like I'm breaking the journal pin/key cache flushing logic here and so
> need to grok this a bit better to see what this is getting wrong.
> 
> FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from
> the inode update in the extent update path. That seems to also address
> the generic/388 problem, but I haven't tested that one more broadly.
> 
> One other thing is I know we proved the commit that introduced ck->seq
> was not the root cause of this problem a week or so ago. I suspect this
> still occurs prior to that commit because that one also relocated the
> pin update from the commit path to the journal reclaim (i.e. cache
> flush) path. IOW, the problem still occurs, it just depends on the
> NOJOURNAL commit itself to bump the pin and a journal write to race with
> a cache flush.

So continuing the IRC conversation, it seems like this is (almost)
exactly the right fix - and I think I might know why you're seeing the
generic/036 hang too.

Without your fix, the bug is that we're setting ck->seq to a newer
sequence number than the last time this key was updated in the journal.
That means that when we later update the journal pin, this key cache key
is pinning a newer journal entry than the last time this key was updated
in the journal - meaning there's a window in the journal where we've
lost the inode update.

Specifically, the inode update _previous_ to this inode update (that was
not marked as NOJOURNAL) gets lost - the bch2_journal_pin_update() call
tells the journal we don't need that update anymore (which would be
correct if we had journalled a newer update) - but we didn't journal a
newer update, so we did need the older update.

It looks like the generic/036 hang is beacuse with your fix, you're
calling bch2_journal_pin_add() without updating ck->seq - then we end up
holding a journal pin while ck->seq is garbage, so
bch2_btree_key_cache_journal_flush() gets confused - haven't worked out
what exactly happens then.

Try moving the bch2_journal_pin_add() inside your new
BTREE_UPDATE_NOJOURNAL check, I bet that fixes it.

> 
>  fs/bcachefs/btree_key_cache.c   | 9 +++++----
>  fs/bcachefs/btree_key_cache.h   | 2 +-
>  fs/bcachefs/btree_update_leaf.c | 2 +-
>  3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> index 13df0d408634..272e66437ffc 100644
> --- a/fs/bcachefs/btree_key_cache.c
> +++ b/fs/bcachefs/btree_key_cache.c
> @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans,
>  
>  bool bch2_btree_insert_key_cached(struct btree_trans *trans,
>  				  unsigned flags,
> -				  struct btree_path *path,
> -				  struct bkey_i *insert)
> +				  struct btree_insert_entry *insert_entry)
>  {
>  	struct bch_fs *c = trans->c;
> -	struct bkey_cached *ck = (void *) path->l[0].b;
> +	struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
> +	struct bkey_i *insert = insert_entry->k;
>  	bool kick_reclaim = false;
>  
>  	BUG_ON(insert->u64s > ck->u64s);
> @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
>  
>  	bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
>  			     &ck->journal, bch2_btree_key_cache_journal_flush);
> -	ck->seq = trans->journal_res.seq;
> +	if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL))
> +		ck->seq = trans->journal_res.seq;
>  
>  	if (kick_reclaim)
>  		journal_reclaim_kick(&c->journal);
> diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h
> index c86d5e48f6e3..be3acde2caa0 100644
> --- a/fs/bcachefs/btree_key_cache.h
> +++ b/fs/bcachefs/btree_key_cache.h
> @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *,
>  				    unsigned);
>  
>  bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
> -			struct btree_path *, struct bkey_i *);
> +			struct btree_insert_entry *);
>  int bch2_btree_key_cache_flush(struct btree_trans *,
>  			       enum btree_id, struct bpos);
>  void bch2_btree_key_cache_drop(struct btree_trans *,
> diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
> index c93c132dd815..456375fed276 100644
> --- a/fs/bcachefs/btree_update_leaf.c
> +++ b/fs/bcachefs/btree_update_leaf.c
> @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
>  		if (!i->cached)
>  			btree_insert_key_leaf(trans, i);
>  		else if (!i->key_cache_already_flushed)
> -			bch2_btree_insert_key_cached(trans, flags, i->path, i->k);
> +			bch2_btree_insert_key_cached(trans, flags, i);
>  		else {
>  			bch2_btree_key_cache_drop(trans, i->path);
>  			btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE);
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
  2023-03-04  2:44 ` Kent Overstreet
@ 2023-03-06 13:18   ` Brian Foster
  2023-03-07 10:32     ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2023-03-06 13:18 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Mar 03, 2023 at 09:44:55PM -0500, Kent Overstreet wrote:
> On Thu, Mar 02, 2023 at 02:51:10PM -0500, Brian Foster wrote:
> > As noted on IRC, this patch is RFC because it's definitely broken. I
> > reproduce a generic/036 unmount hang related to pin flushing. I tweaked
> > the logic to also skip the bch2_journal_pin_add() call, which seems to
> > fix generic/036 but then results in a generic/127 panic where
> > atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells
> > like I'm breaking the journal pin/key cache flushing logic here and so
> > need to grok this a bit better to see what this is getting wrong.
> > 
> > FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from
> > the inode update in the extent update path. That seems to also address
> > the generic/388 problem, but I haven't tested that one more broadly.
> > 
> > One other thing is I know we proved the commit that introduced ck->seq
> > was not the root cause of this problem a week or so ago. I suspect this
> > still occurs prior to that commit because that one also relocated the
> > pin update from the commit path to the journal reclaim (i.e. cache
> > flush) path. IOW, the problem still occurs, it just depends on the
> > NOJOURNAL commit itself to bump the pin and a journal write to race with
> > a cache flush.
> 
> So continuing the IRC conversation, it seems like this is (almost)
> exactly the right fix - and I think I might know why you're seeing the
> generic/036 hang too.
> 
> Without your fix, the bug is that we're setting ck->seq to a newer
> sequence number than the last time this key was updated in the journal.
> That means that when we later update the journal pin, this key cache key
> is pinning a newer journal entry than the last time this key was updated
> in the journal - meaning there's a window in the journal where we've
> lost the inode update.
> 
> Specifically, the inode update _previous_ to this inode update (that was
> not marked as NOJOURNAL) gets lost - the bch2_journal_pin_update() call
> tells the journal we don't need that update anymore (which would be
> correct if we had journalled a newer update) - but we didn't journal a
> newer update, so we did need the older update.
> 

Yep..

> It looks like the generic/036 hang is beacuse with your fix, you're
> calling bch2_journal_pin_add() without updating ck->seq - then we end up
> holding a journal pin while ck->seq is garbage, so
> bch2_btree_key_cache_journal_flush() gets confused - haven't worked out
> what exactly happens then.
> 

Yeah, I tracked the hang problem down to essentially being due to a
"stale" ck->seq value. I.e., a normal commit occurs, updates ck->seq,
the key cache flushes, pin drops, etc. Subsequently a NOJOURNAL commit
occurs and happens to be the only key cache update at that point. It
adds a new pin but does not update ck->seq. The key flush callback sees
an old ck->seq and doesn't flush, which essentially leaves around the
following:

96611: count 1
  flushed:
        ffff99c313fb64d8 bch2_btree_key_cache_journal_flush [bcachefs]
...

... and we're stuck because the callback didn't flush, but the pin
wasn't updated either.

> Try moving the bch2_journal_pin_add() inside your new
> BTREE_UPDATE_NOJOURNAL check, I bet that fixes it.
> 

Hmm.. I want to say I tried that and I ran into another problem with
generic/127 where the key cache wouldn't flush (see the comments in my
initial writeup), I _think_ because we hit a case where we just failed
to add a pin and that is required to tie into journal reclaim and
actually have the cache flush (even if nothing was logged).

I ran with the following logic over the weekend and it seemed to avoid
any problems:

        if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
            !journal_pin_active(&ck->journal)) {
                ck->seq = trans->journal_res.seq;
        }
        bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
                             &ck->journal, bch2_btree_key_cache_journal_flush);

This addresses the scenario above by updating ck->seq when the key cache
pin happens to be inactive, which is presumably (?) safe because that
means the cache doesn't hold any previous updates. FWIW, I suspect there
are probably multiple ways to fix this, such as allowing pin_add() to
return whether it actually added a pin, or perhaps resetting ck->seq
somewhere on key cache flush so the callback doesn't think it's valid,
etc. Thoughts?

I have a bigger picture question, however. The key cache is intended to
hold updates to multiple different keys that land in the same btree
node, right? If so, is it safe to update the key cache pin on later
updates like this in general? Specifically I was thinking about the
example of two inodes that happen to land in the same key cache (inode
A, A+1) but under different journal seqs.

For example, suppose inode A commits, logs, adds the ck pin, and sets
ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq
again. At this point if journal reclaim flushes the cache with the
original pin, we update the pin to the A+1 ck->seq value and return.  If
this moves the pin from the front of the journal, what prevents last_seq
from being updated to A+1 on disk and introducing the same sort of
window where in-core updates aren't covered by the active range of the
journal? I.e., if last_seq updates and the filesystem crashes, ISTM we
potentially lose the inode A key update, even if it was journaled.

Unless I'm missing something (entirely possible), I'm wondering if it
makes sense to update ck->seq like this at all (or if something else is
required to allow a ck to hold reference to multiple pins, etc.). Hm?

Brian

> > 
> >  fs/bcachefs/btree_key_cache.c   | 9 +++++----
> >  fs/bcachefs/btree_key_cache.h   | 2 +-
> >  fs/bcachefs/btree_update_leaf.c | 2 +-
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > index 13df0d408634..272e66437ffc 100644
> > --- a/fs/bcachefs/btree_key_cache.c
> > +++ b/fs/bcachefs/btree_key_cache.c
> > @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans,
> >  
> >  bool bch2_btree_insert_key_cached(struct btree_trans *trans,
> >  				  unsigned flags,
> > -				  struct btree_path *path,
> > -				  struct bkey_i *insert)
> > +				  struct btree_insert_entry *insert_entry)
> >  {
> >  	struct bch_fs *c = trans->c;
> > -	struct bkey_cached *ck = (void *) path->l[0].b;
> > +	struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
> > +	struct bkey_i *insert = insert_entry->k;
> >  	bool kick_reclaim = false;
> >  
> >  	BUG_ON(insert->u64s > ck->u64s);
> > @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
> >  
> >  	bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
> >  			     &ck->journal, bch2_btree_key_cache_journal_flush);
> > -	ck->seq = trans->journal_res.seq;
> > +	if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL))
> > +		ck->seq = trans->journal_res.seq;
> >  
> >  	if (kick_reclaim)
> >  		journal_reclaim_kick(&c->journal);
> > diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h
> > index c86d5e48f6e3..be3acde2caa0 100644
> > --- a/fs/bcachefs/btree_key_cache.h
> > +++ b/fs/bcachefs/btree_key_cache.h
> > @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *,
> >  				    unsigned);
> >  
> >  bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
> > -			struct btree_path *, struct bkey_i *);
> > +			struct btree_insert_entry *);
> >  int bch2_btree_key_cache_flush(struct btree_trans *,
> >  			       enum btree_id, struct bpos);
> >  void bch2_btree_key_cache_drop(struct btree_trans *,
> > diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
> > index c93c132dd815..456375fed276 100644
> > --- a/fs/bcachefs/btree_update_leaf.c
> > +++ b/fs/bcachefs/btree_update_leaf.c
> > @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> >  		if (!i->cached)
> >  			btree_insert_key_leaf(trans, i);
> >  		else if (!i->key_cache_already_flushed)
> > -			bch2_btree_insert_key_cached(trans, flags, i->path, i->k);
> > +			bch2_btree_insert_key_cached(trans, flags, i);
> >  		else {
> >  			bch2_btree_key_cache_drop(trans, i->path);
> >  			btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE);
> > -- 
> > 2.39.2
> > 
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
  2023-03-06 13:18   ` Brian Foster
@ 2023-03-07 10:32     ` Kent Overstreet
  2023-03-07 14:22       ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2023-03-07 10:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Mar 06, 2023 at 08:18:55AM -0500, Brian Foster wrote:
> I ran with the following logic over the weekend and it seemed to avoid
> any problems:
> 
>         if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
>             !journal_pin_active(&ck->journal)) {
>                 ck->seq = trans->journal_res.seq;
>         }
>         bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
>                              &ck->journal, bch2_btree_key_cache_journal_flush);
> 
> This addresses the scenario above by updating ck->seq when the key cache
> pin happens to be inactive, which is presumably (?) safe because that
> means the cache doesn't hold any previous updates. FWIW, I suspect there
> are probably multiple ways to fix this, such as allowing pin_add() to
> return whether it actually added a pin, or perhaps resetting ck->seq
> somewhere on key cache flush so the callback doesn't think it's valid,
> etc. Thoughts?

It took me a bit to work out why this is the correct fix, so it'll
definitely need some documenting (and the blame is on me;
BTREE_UPDATE_NOJOURNAL was clearly more subtle than I recognized at the
time) - but yeah, looks good.

To recap what we discussed on IRC, for anyone else who's reading:
naively one might assume that in BTREE_UPDATE_NOJOURNAL mode we
shouldn't do anything here and skip adding the journal pin, except that
NOJOURNAL updates _do_ need to be written back to the btree and it's
journal reclaim that is also responsible for that.

The reason for BTREE_UPDATE_NOJOURNAL is that sometimes we're doing
inode updates that are only updating bi_journal_seq, nothing else: these
updates are only needed so that fsync() later knows which journal
sequence number has to be flushed.

We have to record this in the btree, because if it was stored in the VFS
inode that's just a cache and reclaim would break fsync - but we're not
going to need it after a crash, hence no need to journal it.

So BTREE_UPDATE_NOJOURNAL updates are really this super rare, special
purpose performance hack that knowingly break our sequential consistency
model and should be regarded with suspicion :)

> I have a bigger picture question, however. The key cache is intended to
> hold updates to multiple different keys that land in the same btree
> node, right? If so, is it safe to update the key cache pin on later
> updates like this in general? Specifically I was thinking about the
> example of two inodes that happen to land in the same key cache (inode
> A, A+1) but under different journal seqs.

Yes.

Originally, on every key cache update we'd call
bch2_journal_pin_update() - dropping our old journal pin and pinning the
journal entry of the new update.

See the patch "bcachefs: Btree key cache optimization" where we switched
to the new behaviour - it's an optimization to avoid hitting the journal
lock.

> For example, suppose inode A commits, logs, adds the ck pin, and sets
> ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq
> again. At this point if journal reclaim flushes the cache with the
> original pin, we update the pin to the A+1 ck->seq value and return.  If
> this moves the pin from the front of the journal, what prevents last_seq
> from being updated to A+1 on disk and introducing the same sort of
> window where in-core updates aren't covered by the active range of the
> journal? I.e., if last_seq updates and the filesystem crashes, ISTM we
> potentially lose the inode A key update, even if it was journaled.

That's completely fine, provided inode A1 was journalled and is still
pinned in the journal - in your example, since the pin is updated to the
sequence number for A1 it will be.

That is, the journal is very much allowed to drop redundant updates. The
journal exists to guarantee time ordering, but that guarantee only
applies after we've finished replaying everything that was in the
journal.

If we're only partway through journal replay (say, after inode A but
before A1), then the btree is going to be complete nonsense in terms of
consistency - not just because stuff may have been dropped from the
journal, but because things in the journal maybe be flushed in any
order, so if we're only halfway through journal replay we fully expect
to find updates in the btree that are newer than where we're at in the
journal.

This is touching on the topic of how we guarantee sequential consistency
after crashes - which is an interesting and lengthy topic, but
fortunately we don't need to understand it for the problem at hand. The
only relevant thing for now is that yes, the journal can drop redundant
updates and it's completely fine.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
  2023-03-07 10:32     ` Kent Overstreet
@ 2023-03-07 14:22       ` Brian Foster
  0 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2023-03-07 14:22 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Tue, Mar 07, 2023 at 05:32:05AM -0500, Kent Overstreet wrote:
> On Mon, Mar 06, 2023 at 08:18:55AM -0500, Brian Foster wrote:
> > I ran with the following logic over the weekend and it seemed to avoid
> > any problems:
> > 
> >         if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
> >             !journal_pin_active(&ck->journal)) {
> >                 ck->seq = trans->journal_res.seq;
> >         }
> >         bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
> >                              &ck->journal, bch2_btree_key_cache_journal_flush);
> > 
> > This addresses the scenario above by updating ck->seq when the key cache
> > pin happens to be inactive, which is presumably (?) safe because that
> > means the cache doesn't hold any previous updates. FWIW, I suspect there
> > are probably multiple ways to fix this, such as allowing pin_add() to
> > return whether it actually added a pin, or perhaps resetting ck->seq
> > somewhere on key cache flush so the callback doesn't think it's valid,
> > etc. Thoughts?
> 
> It took me a bit to work out why this is the correct fix, so it'll
> definitely need some documenting (and the blame is on me;
> BTREE_UPDATE_NOJOURNAL was clearly more subtle than I recognized at the
> time) - but yeah, looks good.
> 
> To recap what we discussed on IRC, for anyone else who's reading:
> naively one might assume that in BTREE_UPDATE_NOJOURNAL mode we
> shouldn't do anything here and skip adding the journal pin, except that
> NOJOURNAL updates _do_ need to be written back to the btree and it's
> journal reclaim that is also responsible for that.
> 
> The reason for BTREE_UPDATE_NOJOURNAL is that sometimes we're doing
> inode updates that are only updating bi_journal_seq, nothing else: these
> updates are only needed so that fsync() later knows which journal
> sequence number has to be flushed.
> 
> We have to record this in the btree, because if it was stored in the VFS
> inode that's just a cache and reclaim would break fsync - but we're not
> going to need it after a crash, hence no need to journal it.
> 
> So BTREE_UPDATE_NOJOURNAL updates are really this super rare, special
> purpose performance hack that knowingly break our sequential consistency
> model and should be regarded with suspicion :)
> 

Ack, makes sense.

I added some comments around the updated ck->seq logic and pushed the
patch to my testing branch. It looks like the CI sees it, but hasn't
started on it yet..

> > I have a bigger picture question, however. The key cache is intended to
> > hold updates to multiple different keys that land in the same btree
> > node, right? If so, is it safe to update the key cache pin on later
> > updates like this in general? Specifically I was thinking about the
> > example of two inodes that happen to land in the same key cache (inode
> > A, A+1) but under different journal seqs.
> 
> Yes.
> 
> Originally, on every key cache update we'd call
> bch2_journal_pin_update() - dropping our old journal pin and pinning the
> journal entry of the new update.
> 
> See the patch "bcachefs: Btree key cache optimization" where we switched
> to the new behaviour - it's an optimization to avoid hitting the journal
> lock.
> 

Yeah, I got my wires crossed wrt to the relevant data structures and was
thinking of bkey_cached as "bkey cache," saw the seq associated with
that, and then naturally started to wonder how we manage a set of cached
keys with a single sequence value. Sometime after writing this up I
realized that bkey_cached is a single cached key, and there's a higher
level btree_key_cache data stucture. I still need to poke through that
code a bit more to grok it, but that explains my confusion. :)

> > For example, suppose inode A commits, logs, adds the ck pin, and sets
> > ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq
> > again. At this point if journal reclaim flushes the cache with the
> > original pin, we update the pin to the A+1 ck->seq value and return.  If
> > this moves the pin from the front of the journal, what prevents last_seq
> > from being updated to A+1 on disk and introducing the same sort of
> > window where in-core updates aren't covered by the active range of the
> > journal? I.e., if last_seq updates and the filesystem crashes, ISTM we
> > potentially lose the inode A key update, even if it was journaled.
> 
> That's completely fine, provided inode A1 was journalled and is still
> pinned in the journal - in your example, since the pin is updated to the
> sequence number for A1 it will be.
> 
> That is, the journal is very much allowed to drop redundant updates. The
> journal exists to guarantee time ordering, but that guarantee only
> applies after we've finished replaying everything that was in the
> journal.
> 
> If we're only partway through journal replay (say, after inode A but
> before A1), then the btree is going to be complete nonsense in terms of
> consistency - not just because stuff may have been dropped from the
> journal, but because things in the journal maybe be flushed in any
> order, so if we're only halfway through journal replay we fully expect
> to find updates in the btree that are newer than where we're at in the
> journal.
> 

Ok.

> This is touching on the topic of how we guarantee sequential consistency
> after crashes - which is an interesting and lengthy topic, but
> fortunately we don't need to understand it for the problem at hand. The
> only relevant thing for now is that yes, the journal can drop redundant
> updates and it's completely fine.
> 

Indeed. Thanks for the explanation.

Brian


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-03-07 14:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster
2023-03-04  2:44 ` Kent Overstreet
2023-03-06 13:18   ` Brian Foster
2023-03-07 10:32     ` Kent Overstreet
2023-03-07 14:22       ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.