All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits
Date: Tue, 7 Mar 2023 09:22:07 -0500	[thread overview]
Message-ID: <ZAdIj7loB1hdsKCz@bfoster> (raw)
In-Reply-To: <ZAcSpXjwUeAchhMn@moria.home.lan>

On Tue, Mar 07, 2023 at 05:32:05AM -0500, Kent Overstreet wrote:
> On Mon, Mar 06, 2023 at 08:18:55AM -0500, Brian Foster wrote:
> > I ran with the following logic over the weekend and it seemed to avoid
> > any problems:
> > 
> >         if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL) ||
> >             !journal_pin_active(&ck->journal)) {
> >                 ck->seq = trans->journal_res.seq;
> >         }
> >         bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
> >                              &ck->journal, bch2_btree_key_cache_journal_flush);
> > 
> > This addresses the scenario above by updating ck->seq when the key cache
> > pin happens to be inactive, which is presumably (?) safe because that
> > means the cache doesn't hold any previous updates. FWIW, I suspect there
> > are probably multiple ways to fix this, such as allowing pin_add() to
> > return whether it actually added a pin, or perhaps resetting ck->seq
> > somewhere on key cache flush so the callback doesn't think it's valid,
> > etc. Thoughts?
> 
> It took me a bit to work out why this is the correct fix, so it'll
> definitely need some documenting (and the blame is on me;
> BTREE_UPDATE_NOJOURNAL was clearly more subtle than I recognized at the
> time) - but yeah, looks good.
> 
> To recap what we discussed on IRC, for anyone else who's reading:
> naively one might assume that in BTREE_UPDATE_NOJOURNAL mode we
> shouldn't do anything here and skip adding the journal pin, except that
> NOJOURNAL updates _do_ need to be written back to the btree and it's
> journal reclaim that is also responsible for that.
> 
> The reason for BTREE_UPDATE_NOJOURNAL is that sometimes we're doing
> inode updates that are only updating bi_journal_seq, nothing else: these
> updates are only needed so that fsync() later knows which journal
> sequence number has to be flushed.
> 
> We have to record this in the btree, because if it was stored in the VFS
> inode that's just a cache and reclaim would break fsync - but we're not
> going to need it after a crash, hence no need to journal it.
> 
> So BTREE_UPDATE_NOJOURNAL updates are really this super rare, special
> purpose performance hack that knowingly break our sequential consistency
> model and should be regarded with suspicion :)
> 

Ack, makes sense.

I added some comments around the updated ck->seq logic and pushed the
patch to my testing branch. It looks like the CI sees it, but hasn't
started on it yet..

> > I have a bigger picture question, however. The key cache is intended to
> > hold updates to multiple different keys that land in the same btree
> > node, right? If so, is it safe to update the key cache pin on later
> > updates like this in general? Specifically I was thinking about the
> > example of two inodes that happen to land in the same key cache (inode
> > A, A+1) but under different journal seqs.
> 
> Yes.
> 
> Originally, on every key cache update we'd call
> bch2_journal_pin_update() - dropping our old journal pin and pinning the
> journal entry of the new update.
> 
> See the patch "bcachefs: Btree key cache optimization" where we switched
> to the new behaviour - it's an optimization to avoid hitting the journal
> lock.
> 

Yeah, I got my wires crossed wrt to the relevant data structures and was
thinking of bkey_cached as "bkey cache," saw the seq associated with
that, and then naturally started to wonder how we manage a set of cached
keys with a single sequence value. Sometime after writing this up I
realized that bkey_cached is a single cached key, and there's a higher
level btree_key_cache data stucture. I still need to poke through that
code a bit more to grok it, but that explains my confusion. :)

> > For example, suppose inode A commits, logs, adds the ck pin, and sets
> > ck->seq (A). Then inode A+1 commits, logs to seq+1, and updates ck->seq
> > again. At this point if journal reclaim flushes the cache with the
> > original pin, we update the pin to the A+1 ck->seq value and return.  If
> > this moves the pin from the front of the journal, what prevents last_seq
> > from being updated to A+1 on disk and introducing the same sort of
> > window where in-core updates aren't covered by the active range of the
> > journal? I.e., if last_seq updates and the filesystem crashes, ISTM we
> > potentially lose the inode A key update, even if it was journaled.
> 
> That's completely fine, provided inode A1 was journalled and is still
> pinned in the journal - in your example, since the pin is updated to the
> sequence number for A1 it will be.
> 
> That is, the journal is very much allowed to drop redundant updates. The
> journal exists to guarantee time ordering, but that guarantee only
> applies after we've finished replaying everything that was in the
> journal.
> 
> If we're only partway through journal replay (say, after inode A but
> before A1), then the btree is going to be complete nonsense in terms of
> consistency - not just because stuff may have been dropped from the
> journal, but because things in the journal maybe be flushed in any
> order, so if we're only halfway through journal replay we fully expect
> to find updates in the btree that are newer than where we're at in the
> journal.
> 

Ok.

> This is touching on the topic of how we guarantee sequential consistency
> after crashes - which is an interesting and lengthy topic, but
> fortunately we don't need to understand it for the problem at hand. The
> only relevant thing for now is that yes, the journal can drop redundant
> updates and it's completely fine.
> 

Indeed. Thanks for the explanation.

Brian


      reply	other threads:[~2023-03-07 14:26 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-02 19:51 [PATCH RFC] bcachefs: don't bump key cache journal seq on nojournal commits Brian Foster
2023-03-04  2:44 ` Kent Overstreet
2023-03-06 13:18   ` Brian Foster
2023-03-07 10:32     ` Kent Overstreet
2023-03-07 14:22       ` Brian Foster [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZAdIj7loB1hdsKCz@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.