* [PATCH] bcache: fix insane -ESRCH error in bch_journal_replay
@ 2022-02-15 8:28 Minlan Wang
2022-05-13 15:22 ` Coly Li
0 siblings, 1 reply; 2+ messages in thread
From: Minlan Wang @ 2022-02-15 8:28 UTC (permalink / raw)
To: colyli, linux-bcache; +Cc: wangminlan
bch_keylist_empty is checked to make sure key insert is successful
Signed-off-by: Minlan Wang <wangminlan@szsandstone.com>
---
We've been using a pretty old version of bcache (from
linux-4.18-rc8) and experiencing occasional data corruption
after a new register session of bcache. It turns out the
data corruption is caused by this code in our old version of
bcache:
in bch_journal_replay:
...
for (k = i->j.start;
k < bset_bkey_last(&i->j);
k = bkey_next(k)) {
trace_bcache_journal_replay_key(k);
bch_keylist_init_single(&keylist, k);
ret = bch_btree_insert(s, &keylist, i->pin, NULL);
if (ret)
goto err;
...
bch_journal_replay returns -ESRCH, then in run_cache_set:
...
if (j->version < BCACHE_JSET_VERSION_UUID)
__uuid_write(c);
bch_journal_replay(c, &journal);
...
There's no message warning about this key insert error in
journal replay, nor does run_cache_set check for return
value of bch_journal_replay either, so later keys in jset
got lost silently and cause data corruption.
We checked the key which caused this failure in
bch_journal_replay, it is a key mapped to 2 btree node, and
the first btree node is full.
The code path causing this problem is:
1. k1 mapped to btree node n1, n2
2. When mapping into n1, bch_btree_insert_node goes into
split case, top half of k1, say k1.1, is inserted into
splitted n1: say n1.1, n1.2. Since the bottom half of k1,
say k1.2, is left in insert_keys, so bch_btree_insert_node
returns -EINTR.
This happens in this code:
in bch_btree_insert_node:
...
} else {
/* Invalidated all iterators */
int ret = btree_split(b, op, insert_keys, replace_key);
if (bch_keylist_empty(insert_keys))
return 0;
else if (!ret)
return -EINTR;
return ret;
}
...
3. For return value -EINTR, another
bch_btree_map_nodes_recurse in bcache_btree_root is
triggered, with the wrong "from" key: which is
START_KEY(&k1), instead of START_KEY(&k1.2)
4. n1.2 is revisted, since it has no overlapping range for
k1.2, bch_btree_insert_keys sets
op->insert_collision = true
in the following code path:
btree_insert_fn
--> bch_btree_insert_node
--> bch_btree_insert_keys
5. n2 is visisted and k1.2 is inserted here
6. bch_btree_insert detects op.op.insert_collision, and
returns -ESRCH
In this case, the key is successfully inserted, though
bch_btree_insert returns -ESRCH.
We tried the latest version of bcache in linux mainline,
which has commit ce3e4cfb59cb (bcache: add failure check to
run_cache_set() for journal replay), the cache disk register
failed with this message:
bcache: replay journal failed, disabling caching
This is not what is expected, since the data in disk in
intact, journal replay should success.
I'm wondering if the checking of bch_btree_insert's return
value is really neccessary in bch_journal_replay, could we
just check bch_keylist_empty(&keylist) and make the replay
continue if that is empty?
We are using this patch for a work around for this issue now.
drivers/md/bcache/journal.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 61bd79babf7a..35b8cbcd73c5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -380,6 +380,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
bch_keylist_init_single(&keylist, k);
ret = bch_btree_insert(s, &keylist, i->pin, NULL);
+ if ((ret == -ESRCH) && (bch_keylist_empty(&keylist)))
+ ret = 0;
if (ret)
goto err;
--
2.18.4
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH] bcache: fix insane -ESRCH error in bch_journal_replay
2022-02-15 8:28 [PATCH] bcache: fix insane -ESRCH error in bch_journal_replay Minlan Wang
@ 2022-05-13 15:22 ` Coly Li
0 siblings, 0 replies; 2+ messages in thread
From: Coly Li @ 2022-05-13 15:22 UTC (permalink / raw)
To: Minlan Wang; +Cc: linux-bcache
On 2/15/22 4:28 PM, Minlan Wang wrote:
> bch_keylist_empty is checked to make sure key insert is successful
>
> Signed-off-by: Minlan Wang <wangminlan@szsandstone.com>
> ---
> We've been using a pretty old version of bcache (from
> linux-4.18-rc8) and experiencing occasional data corruption
> after a new register session of bcache. It turns out the
> data corruption is caused by this code in our old version of
> bcache:
>
> in bch_journal_replay:
> ...
> for (k = i->j.start;
> k < bset_bkey_last(&i->j);
> k = bkey_next(k)) {
> trace_bcache_journal_replay_key(k);
>
> bch_keylist_init_single(&keylist, k);
>
> ret = bch_btree_insert(s, &keylist, i->pin, NULL);
> if (ret)
> goto err;
> ...
> bch_journal_replay returns -ESRCH, then in run_cache_set:
> ...
> if (j->version < BCACHE_JSET_VERSION_UUID)
> __uuid_write(c);
>
> bch_journal_replay(c, &journal);
> ...
> There's no message warning about this key insert error in
> journal replay, nor does run_cache_set check for return
> value of bch_journal_replay either, so later keys in jset
> got lost silently and cause data corruption.
>
> We checked the key which caused this failure in
> bch_journal_replay, it is a key mapped to 2 btree node, and
> the first btree node is full.
> The code path causing this problem is:
> 1. k1 mapped to btree node n1, n2
Hi Minlan,
Could you please provide more details about how the key 'k1' mapped to
btree node n1 and n2 both?
Thanks.
Coly Li
> 2. When mapping into n1, bch_btree_insert_node goes into
> split case, top half of k1, say k1.1, is inserted into
> splitted n1: say n1.1, n1.2. Since the bottom half of k1,
> say k1.2, is left in insert_keys, so bch_btree_insert_node
> returns -EINTR.
> This happens in this code:
> in bch_btree_insert_node:
> ...
> } else {
> /* Invalidated all iterators */
> int ret = btree_split(b, op, insert_keys, replace_key);
>
> if (bch_keylist_empty(insert_keys))
> return 0;
> else if (!ret)
> return -EINTR;
> return ret;
> }
> ...
> 3. For return value -EINTR, another
> bch_btree_map_nodes_recurse in bcache_btree_root is
> triggered, with the wrong "from" key: which is
> START_KEY(&k1), instead of START_KEY(&k1.2)
> 4. n1.2 is revisted, since it has no overlapping range for
> k1.2, bch_btree_insert_keys sets
> op->insert_collision = true
> in the following code path:
> btree_insert_fn
> --> bch_btree_insert_node
> --> bch_btree_insert_keys
> 5. n2 is visisted and k1.2 is inserted here
> 6. bch_btree_insert detects op.op.insert_collision, and
> returns -ESRCH
>
> In this case, the key is successfully inserted, though
> bch_btree_insert returns -ESRCH.
>
> We tried the latest version of bcache in linux mainline,
> which has commit ce3e4cfb59cb (bcache: add failure check to
> run_cache_set() for journal replay), the cache disk register
> failed with this message:
> bcache: replay journal failed, disabling caching
>
> This is not what is expected, since the data in disk in
> intact, journal replay should success.
> I'm wondering if the checking of bch_btree_insert's return
> value is really neccessary in bch_journal_replay, could we
> just check bch_keylist_empty(&keylist) and make the replay
> continue if that is empty?
>
> We are using this patch for a work around for this issue now.
>
> drivers/md/bcache/journal.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 61bd79babf7a..35b8cbcd73c5 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -380,6 +380,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
> bch_keylist_init_single(&keylist, k);
>
> ret = bch_btree_insert(s, &keylist, i->pin, NULL);
> + if ((ret == -ESRCH) && (bch_keylist_empty(&keylist)))
> + ret = 0;
> if (ret)
> goto err;
>
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2022-05-13 15:22 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-15 8:28 [PATCH] bcache: fix insane -ESRCH error in bch_journal_replay Minlan Wang
2022-05-13 15:22 ` Coly Li
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.