All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
Date: Mon, 6 Mar 2023 08:18:55 -0500	[thread overview]
Message-ID: <ZAXoP29b/QVq3Onn@bfoster> (raw)
In-Reply-To: <ZAKwp52iF2v8Okqz@moria.home.lan>

On Fri, Mar 03, 2023 at 09:44:55PM -0500, Kent Overstreet wrote:
> On Thu, Mar 02, 2023 at 02:51:10PM -0500, Brian Foster wrote:
> > As noted on IRC, this patch is RFC because it's definitely broken. I
> > reproduce a generic/036 unmount hang related to pin flushing. I tweaked
> > the logic to also skip the bch2_journal_pin_add() call, which seems to
> > fix generic/036 but then results in a generic/127 panic where
> > atomic_long_read(&bc->nr_keys) == 1 on exit. So this all kind of smells
> > like I'm breaking the journal pin/key cache flushing logic here and so
> > need to grok this a bit better to see what this is getting wrong.
> > 
> > FWIW, I also ran a quick test to just drop the usage of NOJOURNAL from
> > the inode update in the extent update path. That seems to also address
> > the generic/388 problem, but I haven't tested that one more broadly.
> > 
> > One other thing is I know we proved the commit that introduced ck->seq
> > was not the root cause of this problem a week or so ago. I suspect this
> > still occurs prior to that commit because that one also relocated the
> > pin update from the commit path to the journal reclaim (i.e. cache
> > flush) path. IOW, the problem still occurs, it just depends on the
> > NOJOURNAL commit itself to bump the pin and a journal write to race with
> > a cache flush.
> 
> So continuing the IRC conversation, it seems like this is (almost)
> exactly the right fix - and I think I might know why you're seeing the
> generic/036 hang too.
> 
> Without your fix, the bug is that we're setting ck->seq to a newer
> sequence number than the last time this key was updated in the journal.
> That means that when we later update the journal pin, this key cache key
> is pinning a newer journal entry than the last time this key was updated
> in the journal - meaning there's a window in the journal where we've
> lost the inode update.
> 
> Specifically, the inode update _previous_ to this inode update (that was
> not marked as NOJOURNAL) gets lost - the bch2_journal_pin_update() call
> tells the journal we don't need that update anymore (which would be
> correct if we had journalled a newer update) - but we didn't journal a
> newer update, so we did need the older update.
> 

Yep..

> It looks like the generic/036 hang is beacuse with your fix, you're
> calling bch2_journal_pin_add() without updating ck->seq - then we end up
> holding a journal pin while ck->seq is garbage, so
> bch2_btree_key_cache_journal_flush() gets confused - haven't worked out
> what exactly happens then.
> 

Yeah, I tracked the hang problem down to essentially being due to a
"stale" ck->seq value. I.e., a normal commit occurs, updates ck->seq,
the key cache flushes, pin drops, etc. Subsequently a NOJOURNAL commit
occurs and happens to be the only key cache update at that point. It
adds a new pin but does not update ck->seq. The key flush callback sees
an old ck->seq and doesn't flush, which essentially leaves around the
following:

96611: count 1
  flushed:
        ffff99c313fb64d8 bch2_btree_key_cache_journal_flush [bcachefs]
...

... and we're stuck because the callback didn't flush, but the pin
wasn't updated either.

> Try moving the bch2_journal_pin_add() inside your new
> BTREE_UPDATE_NOJOURNAL check, I bet that fixes it.
> 

Hmm.. I want to say I tried that and I ran into another problem with
generic/127 where the key cache wouldn't flush (see the comments in my
initial writeup), I _think_ because we hit a case where we just failed
to add a pin and that is required to tie into journal reclaim and
actually have the cache flush (even if nothing was logged).

I ran with the following logic over the weekend and it seemed to avoid
any problems:

        if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
            !journal_pin_active(&ck->journal)) {
                ck->seq = trans->journal_res.seq;
        }
        bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
                             &ck->journal, bch2_btree_key_cache_journal_flush);

This addresses the scenario above by updating ck->seq when the key cache
pin happens to be inactive, which is presumably (?) safe because that
means the cache doesn't hold any previous updates. FWIW, I suspect there
are probably multiple ways to fix this, such as allowing pin_add() to
return whether it actually added a pin, or perhaps resetting ck->seq
somewhere on key cache flush so the callback doesn't think it's valid,
etc. Thoughts?

I have a bigger picture question, however. The key cache is intended to
hold updates to multiple different keys that land in the same btree
node, right? If so, is it safe to update the key cache pin on later
updates like this in general? Specifically I was thinking about the
example of two inodes that happen to land in the same key cache (inode
A, A+1) but under different journal seqs.

For example, suppose inode A commits, logs, adds the ck pin, and sets
ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq
again. At this point if journal reclaim flushes the cache with the
original pin, we update the pin to the A+1 ck->seq value and return.  If
this moves the pin from the front of the journal, what prevents last_seq
from being updated to A+1 on disk and introducing the same sort of
window where in-core updates aren't covered by the active range of the
journal? I.e., if last_seq updates and the filesystem crashes, ISTM we
potentially lose the inode A key update, even if it was journaled.

Unless I'm missing something (entirely possible), I'm wondering if it
makes sense to update ck->seq like this at all (or if something else is
required to allow a ck to hold reference to multiple pins, etc.). Hm?

Brian

> > 
> >  fs/bcachefs/btree_key_cache.c   | 9 +++++----
> >  fs/bcachefs/btree_key_cache.h   | 2 +-
> >  fs/bcachefs/btree_update_leaf.c | 2 +-
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
> > index 13df0d408634..272e66437ffc 100644
> > --- a/fs/bcachefs/btree_key_cache.c
> > +++ b/fs/bcachefs/btree_key_cache.c
> > @@ -770,11 +770,11 @@ int bch2_btree_key_cache_flush(struct btree_trans *trans,
> >  
> >  bool bch2_btree_insert_key_cached(struct btree_trans *trans,
> >  				  unsigned flags,
> > -				  struct btree_path *path,
> > -				  struct bkey_i *insert)
> > +				  struct btree_insert_entry *insert_entry)
> >  {
> >  	struct bch_fs *c = trans->c;
> > -	struct bkey_cached *ck = (void *) path->l[0].b;
> > +	struct bkey_cached *ck = (void *) insert_entry->path->l[0].b;
> > +	struct bkey_i *insert = insert_entry->k;
> >  	bool kick_reclaim = false;
> >  
> >  	BUG_ON(insert->u64s > ck->u64s);
> > @@ -804,7 +804,8 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
> >  
> >  	bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
> >  			     &ck->journal, bch2_btree_key_cache_journal_flush);
> > -	ck->seq = trans->journal_res.seq;
> > +	if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL))
> > +		ck->seq = trans->journal_res.seq;
> >  
> >  	if (kick_reclaim)
> >  		journal_reclaim_kick(&c->journal);
> > diff --git a/fs/bcachefs/btree_key_cache.h b/fs/bcachefs/btree_key_cache.h
> > index c86d5e48f6e3..be3acde2caa0 100644
> > --- a/fs/bcachefs/btree_key_cache.h
> > +++ b/fs/bcachefs/btree_key_cache.h
> > @@ -30,7 +30,7 @@ int bch2_btree_path_traverse_cached(struct btree_trans *, struct btree_path *,
> >  				    unsigned);
> >  
> >  bool bch2_btree_insert_key_cached(struct btree_trans *, unsigned,
> > -			struct btree_path *, struct bkey_i *);
> > +			struct btree_insert_entry *);
> >  int bch2_btree_key_cache_flush(struct btree_trans *,
> >  			       enum btree_id, struct bpos);
> >  void bch2_btree_key_cache_drop(struct btree_trans *,
> > diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
> > index c93c132dd815..456375fed276 100644
> > --- a/fs/bcachefs/btree_update_leaf.c
> > +++ b/fs/bcachefs/btree_update_leaf.c
> > @@ -765,7 +765,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> >  		if (!i->cached)
> >  			btree_insert_key_leaf(trans, i);
> >  		else if (!i->key_cache_already_flushed)
> > -			bch2_btree_insert_key_cached(trans, flags, i->path, i->k);
> > +			bch2_btree_insert_key_cached(trans, flags, i);
> >  		else {
> >  			bch2_btree_key_cache_drop(trans, i->path);
> >  			btree_path_set_dirty(i->path, BTREE_ITER_NEED_TRAVERSE);
> > -- 
> > 2.39.2
> > 
> 


  reply	other threads:[~2023-03-06 13:18 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster
2023-03-04  2:44 ` Kent Overstreet
2023-03-06 13:18   ` Brian Foster [this message]
2023-03-07 10:32     ` Kent Overstreet
2023-03-07 14:22       ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZAXoP29b/QVq3Onn@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.