All of lore.kernel.org
 help / color / mirror / Atom feed
* bcachefs backpointers errors in no write buf mode
@ 2023-07-07 17:32 Brian Foster
  2023-07-07 17:58 ` Kent Overstreet
  2023-07-07 20:37 ` Kent Overstreet
  0 siblings, 2 replies; 9+ messages in thread
From: Brian Foster @ 2023-07-07 17:32 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Hi Kent,

I'm doing some analysis of these "existing backpointer found when
inserting" errors observed when running generic/388 in
backpointers_no_use_write_buffer=1 mode. To start with context, here's
what I see in one particular variant of the issue...

We're in read bio completion path doing a CRC narrow. The commit for
that transaction runs trans triggers on the extent key, which then falls
into bch2_trans_mark_extent() and attempts backpointer updates from
there. I stuck a warning in the backpointer error checking code just to
get a stack trace for the error context, which is appended below for
reference.

On IRC Kent mentioned this was likely a trigger ordering issue and that
something similar was addressed in the normal backpointer write buffer
update path. I think I follow what's going on there from commit
0a1b82456400 ("bcachefs: btree write buffer update ordering"), but the
context for the no write buf path seems a bit different..

WRT ordering, if I follow the code correctly, run_btree_triggers() in
the commit path basically implements an overwrite followed by !overwrite
pass, except if the old and new key type/triggers match we call into
bch2_trans_mark_key() with both the old and new keys. This calls into
bch2_trans_mark_extent(), which then implements an explicit insert ->
overwrite order. If that falls into the backpointer write buffer path,
that then has to reorder the updates in the write buffer by putting the
overwrite update at the head, presumably so it won't remove the
backpointer insert that is about to be added. (I assume there are
legitimate design reasons for the update ordering at each level here,
but the reordering back and forth seems like unfortunate complexity.)

This raises a few questions wrt the non write buffer path. First, the
existing bp error check lives in
bch2_bucket_backpointer_mod_nowritebuffer() before we even call into
bch2_trans_update(), so changing the order of trans updates doesn't seem
like it would avoid this particular error. I don't think we want to
change the high level order of the extent key trigger, so that means we
probably need to do something with the error check itself.

It also seems like this is possibly different from whatever issue lead
to the ordering fix in the write buf path, though I'm not quite sure how
that one manifests or is reproduced. A missing backpointer due to
unexpected bp removal, perhaps? If so, I suppose I could see why the
updates should be reordered in the non-buf trans, assuming the overwrite
would delete the just inserted bp key from the trans (and then btree on
commit), and perhaps I'm just not seeing that because I hit the existing
bp error check first.

So I'm still trying to piece high level things together such that an
appropriate fix is not yet clear to me. I'm wondering if perhaps the
high level (narrow crc) trans needs a flag to modify the bp checking
behavior to indicate we aren't actually inserting a new bp here (so
seeing an existing one is fine/expected), and then subsequently
bch2_bucket_backpointer_mod_nowritebuffer() needs to do something to
turn the "insert" into a bp key overwrite, and perhaps ignore the higher
level overwrite, or something *handwavy* along those lines.

I need to think about this some more, but wanted to put some thoughts
down in case I'm misunderstanding things or there's an obvious fix I'm
just not seeing yet..

Brian

--- 8< ---

WARNING: CPU: 1 PID: 31479 at fs/bcachefs/backpointers.c:128 backpointer_mod_err.isra.0+0x193/0x1b0 [bcachefs]
...
CPU: 1 PID: 31479 Comm: kworker/u16:4 Tainted: G        W  OE      6.4.0+ #68
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
Workqueue: events_unbound __bch2_read_endio [bcachefs]

RIP: 0010:backpointer_mod_err.isra.0+0x193/0x1b0 [bcachefs]
...
Call Trace:
 <TASK>
 ? backpointer_mod_err.isra.0+0x193/0x1b0 [bcachefs]
 ? __warn+0x7d/0x130
 ? backpointer_mod_err.isra.0+0x193/0x1b0 [bcachefs]
 ? report_bug+0x18d/0x1c0
 ? handle_bug+0x41/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? backpointer_mod_err.isra.0+0x193/0x1b0 [bcachefs]
 bch2_bucket_backpointer_mod_nowritebuffer+0x15b/0x210 [bcachefs]
 ? bch2_bucket_backpointer_mod_nowritebuffer+0x5/0x210 [bcachefs]
 bch2_bucket_backpointer_mod+0x2ad/0x2c0 [bcachefs]
 bch2_trans_mark_pointer.constprop.0+0x34d/0x360 [bcachefs]
 ? bch2_trans_start_alloc_update+0x5/0x140 [bcachefs]
 __trans_mark_extent+0x1c5/0x3e0 [bcachefs]
 ? __trace_bprintk+0x54/0x60
 bch2_trans_mark_extent+0x9f/0xe0 [bcachefs]
 run_btree_triggers+0x1ae/0x330 [bcachefs]
 __bch2_trans_commit+0x92/0x1e70 [bcachefs]
 ? __bch2_rbio_narrow_crcs+0xbb/0x380 [bcachefs]
 bch2_rbio_narrow_crcs+0xbf/0x100 [bcachefs]
 ? bch2_rbio_narrow_crcs+0x45/0x100 [bcachefs]
 __bch2_read_endio+0x7c7/0x8f0 [bcachefs]
 ? process_one_work+0x1c8/0x3c0
 process_one_work+0x1c8/0x3c0
 worker_thread+0x4d/0x380
 ? _raw_spin_lock_irqsave+0x23/0x50
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xf3/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 </TASK>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-07 17:32 bcachefs backpointers errors in no write buf mode Brian Foster
@ 2023-07-07 17:58 ` Kent Overstreet
  2023-07-10 13:37   ` Brian Foster
  2023-07-07 20:37 ` Kent Overstreet
  1 sibling, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2023-07-07 17:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jul 07, 2023 at 01:32:09PM -0400, Brian Foster wrote:
> Hi Kent,
> 
> I'm doing some analysis of these "existing backpointer found when
> inserting" errors observed when running generic/388 in
> backpointers_no_use_write_buffer=1 mode. To start with context, here's
> what I see in one particular variant of the issue...
> 
> We're in read bio completion path doing a CRC narrow. The commit for
> that transaction runs trans triggers on the extent key, which then falls
> into bch2_trans_mark_extent() and attempts backpointer updates from
> there. I stuck a warning in the backpointer error checking code just to
> get a stack trace for the error context, which is appended below for
> reference.
> 
> On IRC Kent mentioned this was likely a trigger ordering issue and that
> something similar was addressed in the normal backpointer write buffer
> update path. I think I follow what's going on there from commit
> 0a1b82456400 ("bcachefs: btree write buffer update ordering"), but the
> context for the no write buf path seems a bit different..
> 
> WRT ordering, if I follow the code correctly, run_btree_triggers() in
> the commit path basically implements an overwrite followed by !overwrite
> pass, except if the old and new key type/triggers match we call into
> bch2_trans_mark_key() with both the old and new keys. This calls into
> bch2_trans_mark_extent(), which then implements an explicit insert ->
> overwrite order. If that falls into the backpointer write buffer path,
> that then has to reorder the updates in the write buffer by putting the
> overwrite update at the head, presumably so it won't remove the
> backpointer insert that is about to be added. (I assume there are
> legitimate design reasons for the update ordering at each level here,
> but the reordering back and forth seems like unfortunate complexity.)

You've pretty much got it, except it's !overwrite then overwrite.

This definitely merits some documentation, how does this look to you?

diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index f7ffd68f56..a9c6bbeebe 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -401,6 +401,43 @@ static int btree_key_can_insert_cached(struct btree_trans *trans, unsigned flags
 
 /* Triggers: */
 
+/*
+ * Trigger ordering has some subtleties:
+ *
+ * Extent triggers will add a reference to an alloc key (for a bucket) on
+ * insert, and delete that reference when the extent is removed.
+ *
+ * When the last reference to a bucket or indirect extent is deleted, the bucket
+ * will transition states to need_discard or the indirect extent will be
+ * deleted. This means that when an extent is being moved (by e.g. the
+ * finsert/fcollapse paths, or by the reflink code turning an extent into an
+ * indirect extent), if we process the trigger for the delete before the trigger
+ * for the insert Bad Things (tm) will happen.
+ *
+ * To avoid this, we always run triggers for inserts before triggers for
+ * overwrites.
+ *
+ * There's a complication: various triggers need to see both the old and the new
+ * key at the same time, if they're keys that have the same trigger - this
+ * breaks our strict insert-before-overwrite ordering, but not in a way that
+ * breaks moving extents (XXX: why?)
+ *
+ * Additionally, we run alloc triggers last - XXX, also why?
+ *
+ * Another, related trigger ordering issue not handled here: backpointers (as
+ * well as LRU entries) are inserted/removed by triggers that do simple
+ * inserts/updates - i.e. they don't increment/decrement usage counts, like
+ * alloc key/reflink btree updates.
+ *
+ * This means they have to run in the reverse order, overwrites before inserts:
+ * but the backpointers and LRU btrees do not themselves have triggers so the
+ * delete-then-recreate issue for alloc btree references does not happen here.
+ *
+ * To get the correct ordering for LRU/backpointer updates, we explicitly put
+ * the update on either the head or the tail of the list of pending updates, if
+ * it was an overwrite or an insert, respectively.
+ */
+
 static int run_one_mem_trigger(struct btree_trans *trans,
 			       struct btree_insert_entry *i,
 			       unsigned flags)

> This raises a few questions wrt the non write buffer path. First, the
> existing bp error check lives in
> bch2_bucket_backpointer_mod_nowritebuffer() before we even call into
> bch2_trans_update(), so changing the order of trans updates doesn't seem
> like it would avoid this particular error. I don't think we want to
> change the high level order of the extent key trigger, so that means we
> probably need to do something with the error check itself.

Yeah, I'm not sure offhand how best to fix that; it's seeing an existing
backpointer because the update deleting it hasn't happened yet.

We can consider deleting the check - it's redundant, since we also have
fsck. If the check worked it would be preferable to keep it, since it
would let us catch backpointer errors when they first occured, but we
can still look at the journal to see what tranaction commit caused the
error even if it's not caught until fsck.

> It also seems like this is possibly different from whatever issue lead
> to the ordering fix in the write buf path, though I'm not quite sure how
> that one manifests or is reproduced. A missing backpointer due to
> unexpected bp removal, perhaps? If so, I suppose I could see why the
> updates should be reordered in the non-buf trans, assuming the overwrite
> would delete the just inserted bp key from the trans (and then btree on
> commit), and perhaps I'm just not seeing that because I hit the existing
> bp error check first.

Yeah I tend to think adding a bch2_trans_update_ordered() that matches
the behaviour of bch2_trans_update_buffered() is the way to go - the
whole point of the nowritebuffer path is to give us something simpler
but with matching behaviour of the write buffer path, to aid in
isolating bugs, so that only makes sense if they actually do behave the
same way.

> So I'm still trying to piece high level things together such that an
> appropriate fix is not yet clear to me. I'm wondering if perhaps the
> high level (narrow crc) trans needs a flag to modify the bp checking
> behavior to indicate we aren't actually inserting a new bp here (so
> seeing an existing one is fine/expected), and then subsequently
> bch2_bucket_backpointer_mod_nowritebuffer() needs to do something to
> turn the "insert" into a bp key overwrite, and perhaps ignore the higher
> level overwrite, or something *handwavy* along those lines.
> 
> I need to think about this some more, but wanted to put some thoughts
> down in case I'm misunderstanding things or there's an obvious fix I'm
> just not seeing yet..

I think we also need to check if my recent trigger ordering patches were
at fault here - the rebalance_work btree trigger is going to require
_extent_ triggers to now be "wants old and new" triggers - so I
recently switched _all_ triggers to "wants old and new", which then
required the write buffer ordering parameter.

I'm going to loop merge_torture_flakey on a version prior to that
change, without the write buffer, and see what comes up.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-07 17:32 bcachefs backpointers errors in no write buf mode Brian Foster
  2023-07-07 17:58 ` Kent Overstreet
@ 2023-07-07 20:37 ` Kent Overstreet
  1 sibling, 0 replies; 9+ messages in thread
From: Kent Overstreet @ 2023-07-07 20:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Fri, Jul 07, 2023 at 01:32:09PM -0400, Brian Foster wrote:
> Hi Kent,
> 
> I'm doing some analysis of these "existing backpointer found when
> inserting" errors observed when running generic/388 in
> backpointers_no_use_write_buffer=1 mode. To start with context, here's
> what I see in one particular variant of the issue...
> 
> We're in read bio completion path doing a CRC narrow. The commit for
> that transaction runs trans triggers on the extent key, which then falls
> into bch2_trans_mark_extent() and attempts backpointer updates from
> there. I stuck a warning in the backpointer error checking code just to
> get a stack trace for the error context, which is appended below for
> reference.
> 
> On IRC Kent mentioned this was likely a trigger ordering issue and that
> something similar was addressed in the normal backpointer write buffer
> update path. I think I follow what's going on there from commit
> 0a1b82456400 ("bcachefs: btree write buffer update ordering"), but the
> context for the no write buf path seems a bit different..

So turns out - those patches were bogus. I was misremembering the
trigger ordering from how it was in the past; we shouldn't need to do
the weird dual different update ordering thing.

So I've backed out those patches for now (they were prep work for the
rebalance_work btree, but that's not ready yet), and with that the
merge_torture_flakey test is passing - more. It still fails if I loop
it, and interestingly it's _not_ a btree write buffer issue, it still
fails with the write buffer off.

backpointer for missing extent
  u64s 9 type backpointer 0:1966604288:0 len 0 ver 0: bucket=0:3751:0 btree=xattrs l=1 offset=0:0 len=64 pos=SPOS_MAX, exiting

We should improve the error message, it'd be useful to know if that's a
btree root or not.

Now that I have io_uring disabled in ktest/xfstests, I can start looping
generic/388 as well and see if that turns up any patterns...

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-07 17:58 ` Kent Overstreet
@ 2023-07-10 13:37   ` Brian Foster
  2023-07-10 14:56     ` Kent Overstreet
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2023-07-10 13:37 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Jul 07, 2023 at 01:58:23PM -0400, Kent Overstreet wrote:
> On Fri, Jul 07, 2023 at 01:32:09PM -0400, Brian Foster wrote:
> > Hi Kent,
> > 
> > I'm doing some analysis of these "existing backpointer found when
> > inserting" errors observed when running generic/388 in
> > backpointers_no_use_write_buffer=1 mode. To start with context, here's
> > what I see in one particular variant of the issue...
> > 
> > We're in read bio completion path doing a CRC narrow. The commit for
> > that transaction runs trans triggers on the extent key, which then falls
> > into bch2_trans_mark_extent() and attempts backpointer updates from
> > there. I stuck a warning in the backpointer error checking code just to
> > get a stack trace for the error context, which is appended below for
> > reference.
> > 
> > On IRC Kent mentioned this was likely a trigger ordering issue and that
> > something similar was addressed in the normal backpointer write buffer
> > update path. I think I follow what's going on there from commit
> > 0a1b82456400 ("bcachefs: btree write buffer update ordering"), but the
> > context for the no write buf path seems a bit different..
> > 
> > WRT ordering, if I follow the code correctly, run_btree_triggers() in
> > the commit path basically implements an overwrite followed by !overwrite
> > pass, except if the old and new key type/triggers match we call into
> > bch2_trans_mark_key() with both the old and new keys. This calls into
> > bch2_trans_mark_extent(), which then implements an explicit insert ->
> > overwrite order. If that falls into the backpointer write buffer path,
> > that then has to reorder the updates in the write buffer by putting the
> > overwrite update at the head, presumably so it won't remove the
> > backpointer insert that is about to be added. (I assume there are
> > legitimate design reasons for the update ordering at each level here,
> > but the reordering back and forth seems like unfortunate complexity.)
> 
> You've pretty much got it, except it's !overwrite then overwrite.
> 

Hmm.. for which case? run_btree_triggers() does this:

	for (overwrite = 1; overwrite >= 0; --overwrite)

... whereas bch2_trans_mark_key() sees both keys, but uses
trigger_run_insert_then_overwrite().

> This definitely merits some documentation, how does this look to you?
> 

Agree. :)

> diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
> index f7ffd68f56..a9c6bbeebe 100644
> --- a/fs/bcachefs/btree_update_leaf.c
> +++ b/fs/bcachefs/btree_update_leaf.c
> @@ -401,6 +401,43 @@ static int btree_key_can_insert_cached(struct btree_trans *trans, unsigned flags
>  
>  /* Triggers: */
>  
> +/*
> + * Trigger ordering has some subtleties:
> + *
> + * Extent triggers will add a reference to an alloc key (for a bucket) on
> + * insert, and delete that reference when the extent is removed.
> + *
> + * When the last reference to a bucket or indirect extent is deleted, the bucket
> + * will transition states to need_discard or the indirect extent will be
> + * deleted. This means that when an extent is being moved (by e.g. the
> + * finsert/fcollapse paths, or by the reflink code turning an extent into an
> + * indirect extent), if we process the trigger for the delete before the trigger
> + * for the insert Bad Things (tm) will happen.
> + *
> + * To avoid this, we always run triggers for inserts before triggers for
> + * overwrites.
> + *

So I think you mentioned offline that this was stale reasoning (or
something). So does the above essentially refer to why extent key
triggers are ordered insert -> overwrite? If so, then I think this is
still useful content, perhaps the comment just needs to be swizzled
around a bit.

> + * There's a complication: various triggers need to see both the old and the new
> + * key at the same time, if they're keys that have the same trigger - this
> + * breaks our strict insert-before-overwrite ordering, but not in a way that
> + * breaks moving extents (XXX: why?)
> + *
> + * Additionally, we run alloc triggers last - XXX, also why?
> + *
> + * Another, related trigger ordering issue not handled here: backpointers (as
> + * well as LRU entries) are inserted/removed by triggers that do simple
> + * inserts/updates - i.e. they don't increment/decrement usage counts, like
> + * alloc key/reflink btree updates.
> + *
> + * This means they have to run in the reverse order, overwrites before inserts:
> + * but the backpointers and LRU btrees do not themselves have triggers so the
> + * delete-then-recreate issue for alloc btree references does not happen here.
> + *

This wording is a little confusing.. it's not clear to me why the lack
of ref counts directly means they have to run in reverse order. What
happens if such triggers are not run in reverse order? From the old (now
removed) ordering fix patch, I take it the issue is that a backpointer
(for example) key would be added and then immediately removed, right?

I wonder if this bit of docs would be more clear to do something like 1.
explain what specific key types have particular ordering rules (and why,
such as the reference count example given above), and then 2.  explain
why the high level trigger invocation code has the current ordering (if
it matters). I suspect the part below just goes away with that write buf
ordering patch, at least for the time being..

> + * To get the correct ordering for LRU/backpointer updates, we explicitly put
> + * the update on either the head or the tail of the list of pending updates, if
> + * it was an overwrite or an insert, respectively.
> + */
> +
>  static int run_one_mem_trigger(struct btree_trans *trans,
>  			       struct btree_insert_entry *i,
>  			       unsigned flags)
> 
> > This raises a few questions wrt the non write buffer path. First, the
> > existing bp error check lives in
> > bch2_bucket_backpointer_mod_nowritebuffer() before we even call into
> > bch2_trans_update(), so changing the order of trans updates doesn't seem
> > like it would avoid this particular error. I don't think we want to
> > change the high level order of the extent key trigger, so that means we
> > probably need to do something with the error check itself.
> 
> Yeah, I'm not sure offhand how best to fix that; it's seeing an existing
> backpointer because the update deleting it hasn't happened yet.
> 
> We can consider deleting the check - it's redundant, since we also have
> fsck. If the check worked it would be preferable to keep it, since it
> would let us catch backpointer errors when they first occured, but we
> can still look at the journal to see what tranaction commit caused the
> error even if it's not caught until fsck.
> 

Ok, that is good to know. Initially I was inundated with these errors
and so wanted to work past this first. If these are more like
overzealous checks than coherency issues, then that might be the
simplest option..

> > It also seems like this is possibly different from whatever issue lead
> > to the ordering fix in the write buf path, though I'm not quite sure how
> > that one manifests or is reproduced. A missing backpointer due to
> > unexpected bp removal, perhaps? If so, I suppose I could see why the
> > updates should be reordered in the non-buf trans, assuming the overwrite
> > would delete the just inserted bp key from the trans (and then btree on
> > commit), and perhaps I'm just not seeing that because I hit the existing
> > bp error check first.
> 
> Yeah I tend to think adding a bch2_trans_update_ordered() that matches
> the behaviour of bch2_trans_update_buffered() is the way to go - the
> whole point of the nowritebuffer path is to give us something simpler
> but with matching behaviour of the write buffer path, to aid in
> isolating bugs, so that only makes sense if they actually do behave the
> same way.
> 

I take it your follow up reply (re: the ordering patch) invalidates the
thought around bch2_trans_update_ordered(), but this is useful context
on !writebuffer nonetheless. I'll see if I can just bypass these
existing backpointer errors and narrow further in on the actual missing
bp issue...

Brian

> > So I'm still trying to piece high level things together such that an
> > appropriate fix is not yet clear to me. I'm wondering if perhaps the
> > high level (narrow crc) trans needs a flag to modify the bp checking
> > behavior to indicate we aren't actually inserting a new bp here (so
> > seeing an existing one is fine/expected), and then subsequently
> > bch2_bucket_backpointer_mod_nowritebuffer() needs to do something to
> > turn the "insert" into a bp key overwrite, and perhaps ignore the higher
> > level overwrite, or something *handwavy* along those lines.
> > 
> > I need to think about this some more, but wanted to put some thoughts
> > down in case I'm misunderstanding things or there's an obvious fix I'm
> > just not seeing yet..
> 
> I think we also need to check if my recent trigger ordering patches were
> at fault here - the rebalance_work btree trigger is going to require
> _extent_ triggers to now be "wants old and new" triggers - so I
> recently switched _all_ triggers to "wants old and new", which then
> required the write buffer ordering parameter.
> 
> I'm going to loop merge_torture_flakey on a version prior to that
> change, without the write buffer, and see what comes up.
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-10 13:37   ` Brian Foster
@ 2023-07-10 14:56     ` Kent Overstreet
  2023-07-12 15:01       ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2023-07-10 14:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Jul 10, 2023 at 09:37:23AM -0400, Brian Foster wrote:
> This wording is a little confusing.. it's not clear to me why the lack
> of ref counts directly means they have to run in reverse order. What
> happens if such triggers are not run in reverse order? From the old (now
> removed) ordering fix patch, I take it the issue is that a backpointer
> (for example) key would be added and then immediately removed, right?
> 
> I wonder if this bit of docs would be more clear to do something like 1.
> explain what specific key types have particular ordering rules (and why,
> such as the reference count example given above), and then 2.  explain
> why the high level trigger invocation code has the current ordering (if
> it matters). I suspect the part below just goes away with that write buf
> ordering patch, at least for the time being..

Yeah, that sounds like a better documentation approach.

I'll work up a new patch (and try to do some git spelunking to make sure
I remember all the context).

> > Yeah I tend to think adding a bch2_trans_update_ordered() that matches
> > the behaviour of bch2_trans_update_buffered() is the way to go - the
> > whole point of the nowritebuffer path is to give us something simpler
> > but with matching behaviour of the write buffer path, to aid in
> > isolating bugs, so that only makes sense if they actually do behave the
> > same way.
> > 
> 
> I take it your follow up reply (re: the ordering patch) invalidates the
> thought around bch2_trans_update_ordered(), but this is useful context
> on !writebuffer nonetheless. I'll see if I can just bypass these
> existing backpointer errors and narrow further in on the actual missing
> bp issue...

Yeah, it turned out the recent trigger ordering patches were definitely
to blame. With those patches yanked, I was able to run the
merge_torture_flakey test for ~24 hours in
backpointers_no_use_write_buffer mode - and with the write buffer on,
we're not hitting the "backpointer not found" errors as often as we were
before.

So we probably need to think a bit more about what happens differently
when the write buffer is in use; we're doing exactly the same updates,
but flushing them to the btree is a bit delayed and the ordering is
different. I think we need to hone in on that.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-10 14:56     ` Kent Overstreet
@ 2023-07-12 15:01       ` Brian Foster
  2023-07-12 15:40         ` Kent Overstreet
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2023-07-12 15:01 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Jul 10, 2023 at 10:56:46AM -0400, Kent Overstreet wrote:
> On Mon, Jul 10, 2023 at 09:37:23AM -0400, Brian Foster wrote:
> > This wording is a little confusing.. it's not clear to me why the lack
> > of ref counts directly means they have to run in reverse order. What
> > happens if such triggers are not run in reverse order? From the old (now
> > removed) ordering fix patch, I take it the issue is that a backpointer
> > (for example) key would be added and then immediately removed, right?
> > 
> > I wonder if this bit of docs would be more clear to do something like 1.
> > explain what specific key types have particular ordering rules (and why,
> > such as the reference count example given above), and then 2.  explain
> > why the high level trigger invocation code has the current ordering (if
> > it matters). I suspect the part below just goes away with that write buf
> > ordering patch, at least for the time being..
> 
> Yeah, that sounds like a better documentation approach.
> 
> I'll work up a new patch (and try to do some git spelunking to make sure
> I remember all the context).
> 
> > > Yeah I tend to think adding a bch2_trans_update_ordered() that matches
> > > the behaviour of bch2_trans_update_buffered() is the way to go - the
> > > whole point of the nowritebuffer path is to give us something simpler
> > > but with matching behaviour of the write buffer path, to aid in
> > > isolating bugs, so that only makes sense if they actually do behave the
> > > same way.
> > > 
> > 
> > I take it your follow up reply (re: the ordering patch) invalidates the
> > thought around bch2_trans_update_ordered(), but this is useful context
> > on !writebuffer nonetheless. I'll see if I can just bypass these
> > existing backpointer errors and narrow further in on the actual missing
> > bp issue...
> 
> Yeah, it turned out the recent trigger ordering patches were definitely
> to blame. With those patches yanked, I was able to run the
> merge_torture_flakey test for ~24 hours in
> backpointers_no_use_write_buffer mode - and with the write buffer on,
> we're not hitting the "backpointer not found" errors as often as we were
> before.
> 

Ok, I realized after the fact that those patches caused the runtime
bp errors. I've not seen them since those patches were removed.

> So we probably need to think a bit more about what happens differently
> when the write buffer is in use; we're doing exactly the same updates,
> but flushing them to the btree is a bit delayed and the ordering is
> different. I think we need to hone in on that.
> 

Yep. I think I have a general idea of what's going on here. The first
thing I realized after looking through some tracing was that this seems
to correlate with the non-fast fallback paths for write buffer flushing.
For example, if I unconditionally redirect wb key flushes to the
trans_commit path in flush_one(), that 1. dramatically increases the
reproduction rate and 2. also reproduces a similar issue for lru
entries. I think the only reason we don't reproduce the lru variant of
the problem with generic/388 alone is that I don't see nearly as many
lru updates as backpointer updates, and so the workload just doesn't
happen to facilitate it.

With that, I tweaked my kernel to unconditionally redirect backpointer
btree updates to the non-fast flush paths for debug purposes. From
there, I've seen a few different variants of the problem that I think
just happen to relate to the different extent modifying operations that
fsstress happens to be running. The most simple example that describes
the fundamental problem is as follows:

1. Extent allocated and associated bp inserted to write buffer index 0.
2. Flush begins on idx 0, active idx switches to 1.
3. Previously alloc'd extent is deleted. Associated deleted bp key
inserted to buffer idx 1.
4. Flush of idx 0 lands on the initial bp key for the extent. The
trans_commit path rejournals the key for the backpointer btree update.
5. The fs shuts down. Recovery finds the key updates in the order they
were journaled by this sequence (update->delete->update), which doesn't
match the actual runtime changes of the extent btree (update->delete).
This puts an entry in the backpointers btree that doesn't reflect
reality.

So essentially the current rejournaling by the non-fast write buffer
flush paths creates an on-disk journal ordering inconsistency wrt
changes in associated btrees.

I tried to skip journaling in the write buffer flush commit path as a
quick experiment, but that doesn't work on its own because we can then
lose key updates in the journal held by the write buffer pin. I.e., the
tail of the journal can then move past the key before it's been updated
in the btree because the latter update hasn't journaled it (this results
in similar backpointer errors and I can see the key getting lost by
looking at list_journal vs. list_journal -a).

OTOH, it looks like the fast flush path gets this right without
rejournaling because it adds a pin to the btree buffer based on the seq
of the already journaled key. IIUC, this essentially allows the btree
update to refer to the journal entry created when the key was inserted
into the write buffer. I'm wondering if the right way to fix this is to
allow the trans update/commit path to basically do the same thing. I.e.,
somehow or another allow a trans update of an already journaled key to
inherit the seq of the key for a btree pin rather than explicitly
rejournal it at the current seq of the transaction. Thoughts?

Brian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-12 15:01       ` Brian Foster
@ 2023-07-12 15:40         ` Kent Overstreet
  2023-07-12 16:52           ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Kent Overstreet @ 2023-07-12 15:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Jul 12, 2023 at 11:01:06AM -0400, Brian Foster wrote:
> Yep. I think I have a general idea of what's going on here. The first
> thing I realized after looking through some tracing was that this seems
> to correlate with the non-fast fallback paths for write buffer flushing.
> For example, if I unconditionally redirect wb key flushes to the
> trans_commit path in flush_one(), that 1. dramatically increases the
> reproduction rate and 2. also reproduces a similar issue for lru
> entries. I think the only reason we don't reproduce the lru variant of
> the problem with generic/388 alone is that I don't see nearly as many
> lru updates as backpointer updates, and so the workload just doesn't
> happen to facilitate it.

Fantastic :)

I just pulled dynamic fault injection into the bcachefs tree - to make
sure this path gets tested better in the future we can add a
dynamic_fault("race") here.

(first, I have to dust off the fault injection test integration...)

> With that, I tweaked my kernel to unconditionally redirect backpointer
> btree updates to the non-fast flush paths for debug purposes. From
> there, I've seen a few different variants of the problem that I think
> just happen to relate to the different extent modifying operations that
> fsstress happens to be running. The most simple example that describes
> the fundamental problem is as follows:
> 
> 1. Extent allocated and associated bp inserted to write buffer index 0.
> 2. Flush begins on idx 0, active idx switches to 1.
> 3. Previously alloc'd extent is deleted. Associated deleted bp key
> inserted to buffer idx 1.
> 4. Flush of idx 0 lands on the initial bp key for the extent. The
> trans_commit path rejournals the key for the backpointer btree update.
> 5. The fs shuts down. Recovery finds the key updates in the order they
> were journaled by this sequence (update->delete->update), which doesn't
> match the actual runtime changes of the extent btree (update->delete).
> This puts an entry in the backpointers btree that doesn't reflect
> reality.
> 
> So essentially the current rejournaling by the non-fast write buffer
> flush paths creates an on-disk journal ordering inconsistency wrt
> changes in associated btrees.

Yep, sounds like you nailed it.

> I tried to skip journaling in the write buffer flush commit path as a
> quick experiment, but that doesn't work on its own because we can then
> lose key updates in the journal held by the write buffer pin. I.e., the
> tail of the journal can then move past the key before it's been updated
> in the btree because the latter update hasn't journaled it (this results
> in similar backpointer errors and I can see the key getting lost by
> looking at list_journal vs. list_journal -a).

Correct

> OTOH, it looks like the fast flush path gets this right without
> rejournaling because it adds a pin to the btree buffer based on the seq
> of the already journaled key. IIUC, this essentially allows the btree
> update to refer to the journal entry created when the key was inserted
> into the write buffer. I'm wondering if the right way to fix this is to
> allow the trans update/commit path to basically do the same thing. I.e.,
> somehow or another allow a trans update of an already journaled key to
> inherit the seq of the key for a btree pin rather than explicitly
> rejournal it at the current seq of the transaction. Thoughts?

I think that could work.

What if the entry being flushed is already at the tail of the journal,
and journal reclaim is already trying to flush everything at that
sequence number? Since we'll often have multiple write buffer keys per
journal entry, and per btree node, this might lead to btree nodes
getting flushed multiple times redundantly.

I suspect this would only happen in artificial circumstances - it'll be
based on the ratio of write buffer size:journal size - so if we add a
slowpath tracepoint/counter, and make sure to to some testing where we
provoke this condition that'll be sufficient.

Maybe write a small journal, large write buffer test?

I was also looking at adding the same "skip rejournalling" optimization
to key cache flushing - this could take care of that too.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-12 15:40         ` Kent Overstreet
@ 2023-07-12 16:52           ` Brian Foster
  2023-07-12 17:10             ` Kent Overstreet
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2023-07-12 16:52 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Wed, Jul 12, 2023 at 11:40:47AM -0400, Kent Overstreet wrote:
> On Wed, Jul 12, 2023 at 11:01:06AM -0400, Brian Foster wrote:
> > Yep. I think I have a general idea of what's going on here. The first
> > thing I realized after looking through some tracing was that this seems
> > to correlate with the non-fast fallback paths for write buffer flushing.
> > For example, if I unconditionally redirect wb key flushes to the
> > trans_commit path in flush_one(), that 1. dramatically increases the
> > reproduction rate and 2. also reproduces a similar issue for lru
> > entries. I think the only reason we don't reproduce the lru variant of
> > the problem with generic/388 alone is that I don't see nearly as many
> > lru updates as backpointer updates, and so the workload just doesn't
> > happen to facilitate it.
> 
> Fantastic :)
> 
> I just pulled dynamic fault injection into the bcachefs tree - to make
> sure this path gets tested better in the future we can add a
> dynamic_fault("race") here.
> 
> (first, I have to dust off the fault injection test integration...)
> 
> > With that, I tweaked my kernel to unconditionally redirect backpointer
> > btree updates to the non-fast flush paths for debug purposes. From
> > there, I've seen a few different variants of the problem that I think
> > just happen to relate to the different extent modifying operations that
> > fsstress happens to be running. The most simple example that describes
> > the fundamental problem is as follows:
> > 
> > 1. Extent allocated and associated bp inserted to write buffer index 0.
> > 2. Flush begins on idx 0, active idx switches to 1.
> > 3. Previously alloc'd extent is deleted. Associated deleted bp key
> > inserted to buffer idx 1.
> > 4. Flush of idx 0 lands on the initial bp key for the extent. The
> > trans_commit path rejournals the key for the backpointer btree update.
> > 5. The fs shuts down. Recovery finds the key updates in the order they
> > were journaled by this sequence (update->delete->update), which doesn't
> > match the actual runtime changes of the extent btree (update->delete).
> > This puts an entry in the backpointers btree that doesn't reflect
> > reality.
> > 
> > So essentially the current rejournaling by the non-fast write buffer
> > flush paths creates an on-disk journal ordering inconsistency wrt
> > changes in associated btrees.
> 
> Yep, sounds like you nailed it.
> 
> > I tried to skip journaling in the write buffer flush commit path as a
> > quick experiment, but that doesn't work on its own because we can then
> > lose key updates in the journal held by the write buffer pin. I.e., the
> > tail of the journal can then move past the key before it's been updated
> > in the btree because the latter update hasn't journaled it (this results
> > in similar backpointer errors and I can see the key getting lost by
> > looking at list_journal vs. list_journal -a).
> 
> Correct
> 
> > OTOH, it looks like the fast flush path gets this right without
> > rejournaling because it adds a pin to the btree buffer based on the seq
> > of the already journaled key. IIUC, this essentially allows the btree
> > update to refer to the journal entry created when the key was inserted
> > into the write buffer. I'm wondering if the right way to fix this is to
> > allow the trans update/commit path to basically do the same thing. I.e.,
> > somehow or another allow a trans update of an already journaled key to
> > inherit the seq of the key for a btree pin rather than explicitly
> > rejournal it at the current seq of the transaction. Thoughts?
> 
> I think that could work.
> 
> What if the entry being flushed is already at the tail of the journal,
> and journal reclaim is already trying to flush everything at that
> sequence number? Since we'll often have multiple write buffer keys per
> journal entry, and per btree node, this might lead to btree nodes
> getting flushed multiple times redundantly.
> 

I think I follow what you mean, but isn't this what the fast path is
doing already? (Not to say that it isn't a potential issue to be aware
of.)

BTW, I haven't dug much into the btree code... If a btree node has a
bunch of pins of different sequence numbers for whatever key updates are
pending, is the whole set flushed out on the first pin flush (releasing
all pins)? Or is it an incremental flush to update keys just up to the
specified pin seq?

> I suspect this would only happen in artificial circumstances - it'll be
> based on the ratio of write buffer size:journal size - so if we add a
> slowpath tracepoint/counter, and make sure to to some testing where we
> provoke this condition that'll be sufficient.
> 
> Maybe write a small journal, large write buffer test?
> 

Yeah, I was wondering if/what sort of non-default configuration might
more reliably reproduce this issue. That sounds like something to try.

> I was also looking at adding the same "skip rejournalling" optimization
> to key cache flushing - this could take care of that too.
> 

Indeed. Even if that is an optimization for key cache vs. more of a
correctness issue for write buffer, it sounds like quite a bit of
functional overlap.

Anyways, I'll see if I can prototype something to verify whether this
addresses the backpointer issue before getting too far into the weeds.
;) Thanks for the feedback.

Brian


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bcachefs backpointers errors in no write buf mode
  2023-07-12 16:52           ` Brian Foster
@ 2023-07-12 17:10             ` Kent Overstreet
  0 siblings, 0 replies; 9+ messages in thread
From: Kent Overstreet @ 2023-07-12 17:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Wed, Jul 12, 2023 at 12:52:14PM -0400, Brian Foster wrote:
> On Wed, Jul 12, 2023 at 11:40:47AM -0400, Kent Overstreet wrote:
> > What if the entry being flushed is already at the tail of the journal,
> > and journal reclaim is already trying to flush everything at that
> > sequence number? Since we'll often have multiple write buffer keys per
> > journal entry, and per btree node, this might lead to btree nodes
> > getting flushed multiple times redundantly.
> > 
> 
> I think I follow what you mean, but isn't this what the fast path is
> doing already? (Not to say that it isn't a potential issue to be aware
> of.)

Yes, but when we're low on free space in the journal that's when we
switch to the slowpath.

> BTW, I haven't dug much into the btree code... If a btree node has a
> bunch of pins of different sequence numbers for whatever key updates are
> pending, is the whole set flushed out on the first pin flush (releasing
> all pins)? Or is it an incremental flush to update keys just up to the
> specified pin seq?

We write out everything that's dirty for that given btree node (which is
why a btree node only has to retain a single journal pin to the oldest
pinned journal entry).

> Indeed. Even if that is an optimization for key cache vs. more of a
> correctness issue for write buffer, it sounds like quite a bit of
> functional overlap.
> 
> Anyways, I'll see if I can prototype something to verify whether this
> addresses the backpointer issue before getting too far into the weeds.
> ;) Thanks for the feedback.

Sounds good :)

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-07-12 17:11 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-07 17:32 bcachefs backpointers errors in no write buf mode Brian Foster
2023-07-07 17:58 ` Kent Overstreet
2023-07-10 13:37   ` Brian Foster
2023-07-10 14:56     ` Kent Overstreet
2023-07-12 15:01       ` Brian Foster
2023-07-12 15:40         ` Kent Overstreet
2023-07-12 16:52           ` Brian Foster
2023-07-12 17:10             ` Kent Overstreet
2023-07-07 20:37 ` Kent Overstreet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.