[Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave)

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave)
@ 2022-05-24 10:23 Coly Li
  2022-05-24 10:23 ` [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Coly Li @ 2022-05-24 10:23 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Hi Jens,

Thank you for taking the late arrived series, they are all for bcache
fixes when I work on the bcache journal no-space deadlock issue. It
spent me quite long time to fix because other issues interfered my debug
and analysis. When all the depending issues were fixed and my fix for
the journal no-space deadlock is verified, this submission is late for
Linux v5.19 submission. But it is still worthy to take them into v5.19
because real issues are fixed by this series.

The bcache has 4 patches for Linux v5.19 merge window, all from me.
- The first 2 patches are code clean up and potential bug fixes for
multi- threaded btree nodes check (for cache device) and dirty sectors
counting (for backing device), although no report from mailing list for
them, it is good to have the fixes.
- The 3rd patch removes incremental dirty sectors counting because it
is conflicted with multithreaded dirty sectors counting and the latter
one is 10x times faster.
- The last patch fixes a journal no-space deadlock during cache device
registration, it always reserves one journal bucket and only uses it
in registration time, so the no-spance condition won't happen anymore.

There are still 2 fixes are still under the long time I/O pressure
testing, once they are accomplished, I will submit to you in later
RC cycles.

The v2 series fixed previously detectd oversize stack frame issue, in
my test I don't observed the stack frame oversize warning and normal
bcache operations work as expected.

Thank you in advance.

Coly Li
---

Coly Li (4):
  bcache: improve multithreaded bch_btree_check()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: remove incremental dirty sector counting for
    bch_sectors_dirty_init()
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket

 drivers/md/bcache/btree.c     |  58 +++++++++----------
 drivers/md/bcache/btree.h     |   2 +-
 drivers/md/bcache/journal.c   |  31 +++++++++--
 drivers/md/bcache/journal.h   |   2 +
 drivers/md/bcache/super.c     |   1 +
 drivers/md/bcache/writeback.c | 101 +++++++++++++---------------------
 drivers/md/bcache/writeback.h |   2 +-
 7 files changed, 94 insertions(+), 103 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] bcache: improve multithreaded bch_btree_check()
  2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
@ 2022-05-24 10:23 ` Coly Li
  2022-05-24 10:23 ` [PATCH 2/4] bcache: improve multithreaded bch_sectors_dirty_init() Coly Li
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-05-24 10:23 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, stable

Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() to be much faster when checking
all btree nodes during cache device registration. But it isn't in ideal
shap yet, still can be improved.

This patch does the following thing to improve current parallel btree
nodes check by multiple threads in bch_btree_check(),
- Add read lock to root node while checking all the btree nodes with
  multiple threads. Although currently it is not mandatory but it is
  good to have a read lock in code logic.
- Remove local variable 'char name[32]', and generate kernel thread name
  string directly when calling kthread_run().
- Allocate local variable "struct btree_check_state check_state" on the
  stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12 which is enough indeed.
- Increase check_state->started to count created kernel thread after it
  succeeds to create.
- When wait for all checking kernel threads to finish, use wait_event()
  to replace wait_event_interruptible().

With this change, the code is more clear, and some potential error
conditions are avoided.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/btree.c | 58 ++++++++++++++++++---------------------
 drivers/md/bcache/btree.h |  2 +-
 2 files changed, 27 insertions(+), 33 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ad9f16689419..2362bb8ef6d1 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c)
 	int i;
 	struct bkey *k = NULL;
 	struct btree_iter iter;
-	struct btree_check_state *check_state;
-	char name[32];
+	struct btree_check_state check_state;
 
 	/* check and mark root node keys */
 	for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid)
@@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c)
 	if (c->root->level == 0)
 		return 0;
 
-	check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL);
-	if (!check_state)
-		return -ENOMEM;
-
-	check_state->c = c;
-	check_state->total_threads = bch_btree_chkthread_nr();
-	check_state->key_idx = 0;
-	spin_lock_init(&check_state->idx_lock);
-	atomic_set(&check_state->started, 0);
-	atomic_set(&check_state->enough, 0);
-	init_waitqueue_head(&check_state->wait);
+	check_state.c = c;
+	check_state.total_threads = bch_btree_chkthread_nr();
+	check_state.key_idx = 0;
+	spin_lock_init(&check_state.idx_lock);
+	atomic_set(&check_state.started, 0);
+	atomic_set(&check_state.enough, 0);
+	init_waitqueue_head(&check_state.wait);
 
+	rw_lock(0, c->root, c->root->level);
 	/*
 	 * Run multiple threads to check btree nodes in parallel,
-	 * if check_state->enough is non-zero, it means current
+	 * if check_state.enough is non-zero, it means current
 	 * running check threads are enough, unncessary to create
 	 * more.
 	 */
-	for (i = 0; i < check_state->total_threads; i++) {
-		/* fetch latest check_state->enough earlier */
+	for (i = 0; i < check_state.total_threads; i++) {
+		/* fetch latest check_state.enough earlier */
 		smp_mb__before_atomic();
-		if (atomic_read(&check_state->enough))
+		if (atomic_read(&check_state.enough))
 			break;
 
-		check_state->infos[i].result = 0;
-		check_state->infos[i].state = check_state;
-		snprintf(name, sizeof(name), "bch_btrchk[%u]", i);
-		atomic_inc(&check_state->started);
+		check_state.infos[i].result = 0;
+		check_state.infos[i].state = &check_state;
 
-		check_state->infos[i].thread =
+		check_state.infos[i].thread =
 			kthread_run(bch_btree_check_thread,
-				    &check_state->infos[i],
-				    name);
-		if (IS_ERR(check_state->infos[i].thread)) {
+				    &check_state.infos[i],
+				    "bch_btrchk[%d]", i);
+		if (IS_ERR(check_state.infos[i].thread)) {
 			pr_err("fails to run thread bch_btrchk[%d]\n", i);
 			for (--i; i >= 0; i--)
-				kthread_stop(check_state->infos[i].thread);
+				kthread_stop(check_state.infos[i].thread);
 			ret = -ENOMEM;
 			goto out;
 		}
+		atomic_inc(&check_state.started);
 	}
 
 	/*
 	 * Must wait for all threads to stop.
 	 */
-	wait_event_interruptible(check_state->wait,
-				 atomic_read(&check_state->started) == 0);
+	wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
 
-	for (i = 0; i < check_state->total_threads; i++) {
-		if (check_state->infos[i].result) {
-			ret = check_state->infos[i].result;
+	for (i = 0; i < check_state.total_threads; i++) {
+		if (check_state.infos[i].result) {
+			ret = check_state.infos[i].result;
 			goto out;
 		}
 	}
 
 out:
-	kfree(check_state);
+	rw_unlock(0, c->root);
 	return ret;
 }
 
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index 50482107134f..1b5fdbc0d83e 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -226,7 +226,7 @@ struct btree_check_info {
 	int				result;
 };
 
-#define BCH_BTR_CHKTHREAD_MAX	64
+#define BCH_BTR_CHKTHREAD_MAX	12
 struct btree_check_state {
 	struct cache_set		*c;
 	int				total_threads;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/4] bcache: improve multithreaded bch_sectors_dirty_init()
  2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
  2022-05-24 10:23 ` [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
@ 2022-05-24 10:23 ` Coly Li
  2022-05-24 10:23 ` [PATCH 3/4] bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() Coly Li
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-05-24 10:23 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, stable

Commit b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be
multithreaded") makes bch_sectors_dirty_init() to be much faster
when counting dirty sectors by iterating all dirty keys in the btree.
But it isn't in ideal shape yet, still can be improved.

This patch does the following changes to improve current parallel dirty
keys iteration on the btree,
- Add read lock to root node when multiple threads iterating the btree,
  to prevent the root node gets split by I/Os from other registered
  bcache devices.
- Remove local variable "char name[32]" and generate kernel thread name
  string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on stack and
  avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12 which is enough indeed.
- Increase &state->started to count created kernel thread after it
  succeeds to create.
- When wait for all dirty key counting threads to finish, use
  wait_event() to replace wait_event_interruptible().

With the above changes, the code is more clear, and some potential error
conditions are avoided.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/writeback.c | 62 ++++++++++++++---------------------
 drivers/md/bcache/writeback.h |  2 +-
 2 files changed, 26 insertions(+), 38 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 9ee0005874cd..d24c09490f8e 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -948,10 +948,10 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 	struct btree_iter iter;
 	struct sectors_dirty_init op;
 	struct cache_set *c = d->c;
-	struct bch_dirty_init_state *state;
-	char name[32];
+	struct bch_dirty_init_state state;
 
 	/* Just count root keys if no leaf node */
+	rw_lock(0, c->root, c->root->level);
 	if (c->root->level == 0) {
 		bch_btree_op_init(&op.op, -1);
 		op.inode = d->id;
@@ -961,54 +961,42 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		for_each_key_filter(&c->root->keys,
 				    k, &iter, bch_ptr_invalid)
 			sectors_dirty_init_fn(&op.op, c->root, k);
+		rw_unlock(0, c->root);
 		return;
 	}
 
-	state = kzalloc(sizeof(struct bch_dirty_init_state), GFP_KERNEL);
-	if (!state) {
-		pr_warn("sectors dirty init failed: cannot allocate memory\n");
-		return;
-	}
-
-	state->c = c;
-	state->d = d;
-	state->total_threads = bch_btre_dirty_init_thread_nr();
-	state->key_idx = 0;
-	spin_lock_init(&state->idx_lock);
-	atomic_set(&state->started, 0);
-	atomic_set(&state->enough, 0);
-	init_waitqueue_head(&state->wait);
-
-	for (i = 0; i < state->total_threads; i++) {
-		/* Fetch latest state->enough earlier */
+	state.c = c;
+	state.d = d;
+	state.total_threads = bch_btre_dirty_init_thread_nr();
+	state.key_idx = 0;
+	spin_lock_init(&state.idx_lock);
+	atomic_set(&state.started, 0);
+	atomic_set(&state.enough, 0);
+	init_waitqueue_head(&state.wait);
+
+	for (i = 0; i < state.total_threads; i++) {
+		/* Fetch latest state.enough earlier */
 		smp_mb__before_atomic();
-		if (atomic_read(&state->enough))
+		if (atomic_read(&state.enough))
 			break;
 
-		state->infos[i].state = state;
-		atomic_inc(&state->started);
-		snprintf(name, sizeof(name), "bch_dirty_init[%d]", i);
-
-		state->infos[i].thread =
-			kthread_run(bch_dirty_init_thread,
-				    &state->infos[i],
-				    name);
-		if (IS_ERR(state->infos[i].thread)) {
+		state.infos[i].state = &state;
+		state.infos[i].thread =
+			kthread_run(bch_dirty_init_thread, &state.infos[i],
+				    "bch_dirtcnt[%d]", i);
+		if (IS_ERR(state.infos[i].thread)) {
 			pr_err("fails to run thread bch_dirty_init[%d]\n", i);
 			for (--i; i >= 0; i--)
-				kthread_stop(state->infos[i].thread);
+				kthread_stop(state.infos[i].thread);
 			goto out;
 		}
+		atomic_inc(&state.started);
 	}
 
-	/*
-	 * Must wait for all threads to stop.
-	 */
-	wait_event_interruptible(state->wait,
-		 atomic_read(&state->started) == 0);
-
 out:
-	kfree(state);
+	/* Must wait for all threads to stop. */
+	wait_event(state.wait, atomic_read(&state.started) == 0);
+	rw_unlock(0, c->root);
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 02b2f9df73f6..31df716951f6 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -20,7 +20,7 @@
 #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
 #define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
 
-#define BCH_DIRTY_INIT_THRD_MAX	64
+#define BCH_DIRTY_INIT_THRD_MAX	12
 /*
  * 14 (16384ths) is chosen here as something that each backing device
  * should be a reasonable fraction of the share, and not to blow up
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/4] bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
  2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
  2022-05-24 10:23 ` [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
  2022-05-24 10:23 ` [PATCH 2/4] bcache: improve multithreaded bch_sectors_dirty_init() Coly Li
@ 2022-05-24 10:23 ` Coly Li
  2022-05-24 10:23 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
  2022-05-24 12:19 ` [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Jens Axboe
  4 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-05-24 10:23 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, stable

After making bch_sectors_dirty_init() being multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() doesn't
release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
bkeys. Because a read lock is added on btree root node to prevent the
btree to be split during the dirty sectors counting, other I/O requester
has no chance to gain the write lock even restart bcache_btree().

That is to say, the incremental dirty sectors counting is incompatible
to the multhreaded bch_sectors_dirty_init(). We have to choose one and
drop another one.

In my testing, with 512 bytes random writes, I generate 1.2T dirty data
and a btree with 400K nodes. With single thread and incremental dirty
sectors counting, it takes 30+ minites to register the backing device.
And with multithreaded dirty sectors counting, the backing device
registration can be accomplished within 2 minutes.

The 30+ minutes V.S. 2- minutes difference makes me decide to keep
multithreaded bch_sectors_dirty_init() and drop the incremental dirty
sectors counting. This is what this patch does.

But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU
will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This is to avoid the watchdog reports a bogus soft lockup
warning.

Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/writeback.c | 41 +++++++++++------------------------
 1 file changed, 13 insertions(+), 28 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index d24c09490f8e..75b71199800d 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -805,13 +805,11 @@ static int bch_writeback_thread(void *arg)
 
 /* Init */
 #define INIT_KEYS_EACH_TIME	500000
-#define INIT_KEYS_SLEEP_MS	100
 
 struct sectors_dirty_init {
 	struct btree_op	op;
 	unsigned int	inode;
 	size_t		count;
-	struct bkey	start;
 };
 
 static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -827,11 +825,8 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
 					     KEY_START(k), KEY_SIZE(k));
 
 	op->count++;
-	if (atomic_read(&b->c->search_inflight) &&
-	    !(op->count % INIT_KEYS_EACH_TIME)) {
-		bkey_copy_key(&op->start, k);
-		return -EAGAIN;
-	}
+	if (!(op->count % INIT_KEYS_EACH_TIME))
+		cond_resched();
 
 	return MAP_CONTINUE;
 }
@@ -846,24 +841,16 @@ static int bch_root_node_dirty_init(struct cache_set *c,
 	bch_btree_op_init(&op.op, -1);
 	op.inode = d->id;
 	op.count = 0;
-	op.start = KEY(op.inode, 0, 0);
-
-	do {
-		ret = bcache_btree(map_keys_recurse,
-				   k,
-				   c->root,
-				   &op.op,
-				   &op.start,
-				   sectors_dirty_init_fn,
-				   0);
-		if (ret == -EAGAIN)
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(INIT_KEYS_SLEEP_MS));
-		else if (ret < 0) {
-			pr_warn("sectors dirty init failed, ret=%d!\n", ret);
-			break;
-		}
-	} while (ret == -EAGAIN);
+
+	ret = bcache_btree(map_keys_recurse,
+			   k,
+			   c->root,
+			   &op.op,
+			   &KEY(op.inode, 0, 0),
+			   sectors_dirty_init_fn,
+			   0);
+	if (ret < 0)
+		pr_warn("sectors dirty init failed, ret=%d!\n", ret);
 
 	return ret;
 }
@@ -907,7 +894,6 @@ static int bch_dirty_init_thread(void *arg)
 				goto out;
 			}
 			skip_nr--;
-			cond_resched();
 		}
 
 		if (p) {
@@ -917,7 +903,6 @@ static int bch_dirty_init_thread(void *arg)
 
 		p = NULL;
 		prev_idx = cur_idx;
-		cond_resched();
 	}
 
 out:
@@ -956,11 +941,11 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 		bch_btree_op_init(&op.op, -1);
 		op.inode = d->id;
 		op.count = 0;
-		op.start = KEY(op.inode, 0, 0);
 
 		for_each_key_filter(&c->root->keys,
 				    k, &iter, bch_ptr_invalid)
 			sectors_dirty_init_fn(&op.op, c->root, k);
+
 		rw_unlock(0, c->root);
 		return;
 	}
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
                   ` (2 preceding siblings ...)
  2022-05-24 10:23 ` [PATCH 3/4] bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() Coly Li
@ 2022-05-24 10:23 ` Coly Li
  2022-05-24 12:19 ` [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Jens Axboe
  4 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-05-24 10:23 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Nikhil Kshirsagar, stable

The journal no-space deadlock was reported time to time. Such deadlock
can happen in the following situation.

When all journal buckets are fully filled by active jset with heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and inserting them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results btree node split, new journal request might be
triggered. For example, the btree grows one more level after the node
split, then the root node record in cache device super block will be
upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
space in journal buckets, the journal replay has to wait for new journal
bucket to be reclaimed after at least one journal bucket replayed. This
is one example that how the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket in
run time, and only permit the reserved journal bucket to be used during
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, there is no chance for
journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" in struct journal, it is
inititalized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization done in
run_cache_set(). In the run time when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whetheer there is free journal bucket to allocate. By
this method, there is always 1 jouranl bucket reserved in run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	return ret;
 }

+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */

 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
 	}
 }

+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets +  ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;

@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
 	if (c->journal.blocks_free)
 		goto out;

-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;

-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);

 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);

 #endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)

 	flash_devs_run(c);

+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave)
  2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
                   ` (3 preceding siblings ...)
  2022-05-24 10:23 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
@ 2022-05-24 12:19 ` Jens Axboe
  4 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2022-05-24 12:19 UTC (permalink / raw)
  To: colyli; +Cc: linux-bcache, linux-block

On Tue, 24 May 2022 18:23:32 +0800, Coly Li wrote:
> Thank you for taking the late arrived series, they are all for bcache
> fixes when I work on the bcache journal no-space deadlock issue. It
> spent me quite long time to fix because other issues interfered my debug
> and analysis. When all the depending issues were fixed and my fix for
> the journal no-space deadlock is verified, this submission is late for
> Linux v5.19 submission. But it is still worthy to take them into v5.19
> because real issues are fixed by this series.
> 
> [...]

Applied, thanks!

[1/4] bcache: improve multithreaded bch_btree_check()
      commit: 622536443b6731ec82c563aae7807165adbe9178
[2/4] bcache: improve multithreaded bch_sectors_dirty_init()
      commit: 4dc34ae1b45fe26e772a44379f936c72623dd407
[3/4] bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
      commit: 80db4e4707e78cb22287da7d058d7274bd4cb370
[4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
      commit: 32feee36c30ea06e38ccb8ae6e5c44c6eec790a6

Best regards,
-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 0/4] bcache patches for Linux v5.19 (1st wave)
@ 2022-05-22 17:07 Coly Li
  2022-05-22 17:07 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2022-05-22 17:07 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li

Hi Jens,

The bcache has 4 patches for Linux v5.19 merge window, all from me.
- The first 2 patches are code clean up and potential bug fixes for
multi- threaded btree nodes check (for cache device) and dirty sectors
counting (for backing device), although no report from mailing list for
them, it is good to have the fixes.
- The 3rd patch removes incremental dirty sectors counting because it
is conflicted with multithreaded dirty sectors counting and the latter
one is 10x times faster.
- The last patch fixes a journal no-space deadlock during cache device
registration, it always reserves one journal bucket and only uses it
in registration time, so the no-spance condition won't happen anymore.

There are still 2 fixes are still under the long time I/O pressure
testing, once they are accomplished, I will submit to you in later
RC cycles.

Please take them, and thanks in advance.

Coly Li
---

Coly Li (4):
  bcache: improve multithreaded bch_btree_check()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: remove incremental dirty sector counting for
    bch_sectors_dirty_init()
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket

 drivers/md/bcache/btree.c     |  58 +++++++++----------
 drivers/md/bcache/journal.c   |  31 +++++++++--
 drivers/md/bcache/journal.h   |   2 +
 drivers/md/bcache/super.c     |   1 +
 drivers/md/bcache/writeback.c | 101 +++++++++++++---------------------
 5 files changed, 92 insertions(+), 101 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  2022-05-22 17:07 [PATCH 0/4] bcache patches " Coly Li
@ 2022-05-22 17:07 ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-05-22 17:07 UTC (permalink / raw)
  To: axboe; +Cc: linux-bcache, linux-block, Coly Li, Nikhil Kshirsagar, stable

The journal no-space deadlock was reported time to time. Such deadlock
can happen in the following situation.

When all journal buckets are fully filled by active jset with heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and inserting them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results btree node split, new journal request might be
triggered. For example, the btree grows one more level after the node
split, then the root node record in cache device super block will be
upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
space in journal buckets, the journal replay has to wait for new journal
bucket to be reclaimed after at least one journal bucket replayed. This
is one example that how the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket in
run time, and only permit the reserved journal bucket to be used during
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, there is no chance for
journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" in struct journal, it is
inititalized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization done in
run_cache_set(). In the run time when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whetheer there is free journal bucket to allocate. By
this method, there is always 1 jouranl bucket reserved in run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	return ret;
 }

+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */

 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
 	}
 }

+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets + ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;

@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
 	if (c->journal.blocks_free)
 		goto out;

-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;

-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);

 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);

 #endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)

 	flash_devs_run(c);

+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 1/4] bcache: improve multithreaded bch_btree_check()
@ 2022-05-21 17:04 Coly Li
  2022-05-21 17:05 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2022-05-21 17:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

Commit 8e7102273f59 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() to be much faster when checking
all btree nodes during cache device registration. But it isn't in ideal
shap yet, still can be improved.

This patch does the following thing to improve current parallel btree
nodes check by multiple threads in bch_btree_check(),
- Add read lock to root node while checking all the btree nodes with
  multiple threads. Although currently it is not mandatory but it is
  good to have a read lock in code logic.
- Remove local variable 'char name[32]', and generate kernel thread name
  string directly when calling kthread_run().
- Allocate local variable "struct btree_check_state check_state" on the
  stack and avoid unnecessary dynamic memory allocation for it.
- Increase check_state->started to count created kernel thread after it
  succeeds to create.
- When wait for all checking kernel threads to finish, use wait_event()
  to replace wait_event_interruptible().

With this change, the code is more clear, and some potential error
conditions are avoided.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/btree.c | 58 ++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ad9f16689419..2362bb8ef6d1 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -2006,8 +2006,7 @@ int bch_btree_check(struct cache_set *c)
 	int i;
 	struct bkey *k = NULL;
 	struct btree_iter iter;
-	struct btree_check_state *check_state;
-	char name[32];
+	struct btree_check_state check_state;
 
 	/* check and mark root node keys */
 	for_each_key_filter(&c->root->keys, k, &iter, bch_ptr_invalid)
@@ -2018,63 +2017,58 @@ int bch_btree_check(struct cache_set *c)
 	if (c->root->level == 0)
 		return 0;
 
-	check_state = kzalloc(sizeof(struct btree_check_state), GFP_KERNEL);
-	if (!check_state)
-		return -ENOMEM;
-
-	check_state->c = c;
-	check_state->total_threads = bch_btree_chkthread_nr();
-	check_state->key_idx = 0;
-	spin_lock_init(&check_state->idx_lock);
-	atomic_set(&check_state->started, 0);
-	atomic_set(&check_state->enough, 0);
-	init_waitqueue_head(&check_state->wait);
+	check_state.c = c;
+	check_state.total_threads = bch_btree_chkthread_nr();
+	check_state.key_idx = 0;
+	spin_lock_init(&check_state.idx_lock);
+	atomic_set(&check_state.started, 0);
+	atomic_set(&check_state.enough, 0);
+	init_waitqueue_head(&check_state.wait);
 
+	rw_lock(0, c->root, c->root->level);
 	/*
 	 * Run multiple threads to check btree nodes in parallel,
-	 * if check_state->enough is non-zero, it means current
+	 * if check_state.enough is non-zero, it means current
 	 * running check threads are enough, unncessary to create
 	 * more.
 	 */
-	for (i = 0; i < check_state->total_threads; i++) {
-		/* fetch latest check_state->enough earlier */
+	for (i = 0; i < check_state.total_threads; i++) {
+		/* fetch latest check_state.enough earlier */
 		smp_mb__before_atomic();
-		if (atomic_read(&check_state->enough))
+		if (atomic_read(&check_state.enough))
 			break;
 
-		check_state->infos[i].result = 0;
-		check_state->infos[i].state = check_state;
-		snprintf(name, sizeof(name), "bch_btrchk[%u]", i);
-		atomic_inc(&check_state->started);
+		check_state.infos[i].result = 0;
+		check_state.infos[i].state = &check_state;
 
-		check_state->infos[i].thread =
+		check_state.infos[i].thread =
 			kthread_run(bch_btree_check_thread,
-				    &check_state->infos[i],
-				    name);
-		if (IS_ERR(check_state->infos[i].thread)) {
+				    &check_state.infos[i],
+				    "bch_btrchk[%d]", i);
+		if (IS_ERR(check_state.infos[i].thread)) {
 			pr_err("fails to run thread bch_btrchk[%d]\n", i);
 			for (--i; i >= 0; i--)
-				kthread_stop(check_state->infos[i].thread);
+				kthread_stop(check_state.infos[i].thread);
 			ret = -ENOMEM;
 			goto out;
 		}
+		atomic_inc(&check_state.started);
 	}
 
 	/*
 	 * Must wait for all threads to stop.
 	 */
-	wait_event_interruptible(check_state->wait,
-				 atomic_read(&check_state->started) == 0);
+	wait_event(check_state.wait, atomic_read(&check_state.started) == 0);
 
-	for (i = 0; i < check_state->total_threads; i++) {
-		if (check_state->infos[i].result) {
-			ret = check_state->infos[i].result;
+	for (i = 0; i < check_state.total_threads; i++) {
+		if (check_state.infos[i].result) {
+			ret = check_state.infos[i].result;
 			goto out;
 		}
 	}
 
 out:
-	kfree(check_state);
+	rw_unlock(0, c->root);
 	return ret;
 }
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  2022-05-21 17:04 [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
@ 2022-05-21 17:05 ` Coly Li
  2022-06-08 20:45   ` Nix
  0 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2022-05-21 17:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, Nikhil Kshirsagar, stable

The journal no-space deadlock was reported time to time. Such deadlock
can happen in the following situation.

When all journal buckets are fully filled by active jset with heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and inserting them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results btree node split, new journal request might be
triggered. For example, the btree grows one more level after the node
split, then the root node record in cache device super block will be
upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
space in journal buckets, the journal replay has to wait for new journal
bucket to be reclaimed after at least one journal bucket replayed. This
is one example that how the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket in
run time, and only permit the reserved journal bucket to be used during
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, there is no chance for
journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" in struct journal, it is
inititalized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization done in
run_cache_set(). In the run time when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whetheer there is free journal bucket to allocate. By
this method, there is always 1 jouranl bucket reserved in run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/journal.c | 31 ++++++++++++++++++++++++++-----
 drivers/md/bcache/journal.h |  2 ++
 drivers/md/bcache/super.c   |  1 +
 3 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index df5347ea450b..e5da469a4235 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -405,6 +405,11 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	return ret;
 }

+void bch_journal_space_reserve(struct journal *j)
+{
+	j->do_reserve = true;
+}
+
 /* Journalling */

 static void btree_flush_write(struct cache_set *c)
@@ -621,12 +626,30 @@ static void do_journal_discard(struct cache *ca)
 	}
 }

+static unsigned int free_journal_buckets(struct cache_set *c)
+{
+	struct journal *j = &c->journal;
+	struct cache *ca = c->cache;
+	struct journal_device *ja = &c->cache->journal;
+	unsigned int n;
+
+	/* In case njournal_buckets is not power of 2 */
+	if (ja->cur_idx >= ja->discard_idx)
+		n = ca->sb.njournal_buckets +  ja->discard_idx - ja->cur_idx;
+	else
+		n = ja->discard_idx - ja->cur_idx;
+
+	if (n > (1 + j->do_reserve))
+		return n - (1 + j->do_reserve);
+
+	return 0;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
 	struct cache *ca = c->cache;
 	uint64_t last_seq;
-	unsigned int next;
 	struct journal_device *ja = &ca->journal;
 	atomic_t p __maybe_unused;

@@ -649,12 +672,10 @@ static void journal_reclaim(struct cache_set *c)
 	if (c->journal.blocks_free)
 		goto out;

-	next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
-	/* No space available on this device */
-	if (next == ja->discard_idx)
+	if (!free_journal_buckets(c))
 		goto out;

-	ja->cur_idx = next;
+	ja->cur_idx = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
 	k->ptr[0] = MAKE_PTR(0,
 			     bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 			     ca->sb.nr_this_dev);
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index f2ea34d5f431..cd316b4a1e95 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -105,6 +105,7 @@ struct journal {
 	spinlock_t		lock;
 	spinlock_t		flush_write_lock;
 	bool			btree_flushing;
+	bool			do_reserve;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -182,5 +183,6 @@ int bch_journal_replay(struct cache_set *c, struct list_head *list);

 void bch_journal_free(struct cache_set *c);
 int bch_journal_alloc(struct cache_set *c);
+void bch_journal_space_reserve(struct journal *j);

 #endif /* _BCACHE_JOURNAL_H */
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index bf3de149d3c9..2bb55278d22d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2128,6 +2128,7 @@ static int run_cache_set(struct cache_set *c)

 	flash_devs_run(c);

+	bch_journal_space_reserve(&c->journal);
 	set_bit(CACHE_SET_RUNNING, &c->flags);
 	return 0;
 err:
-- 
2.35.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  2022-05-21 17:05 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
@ 2022-06-08 20:45   ` Nix
  2022-06-09 13:54     ` Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Nix @ 2022-06-08 20:45 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block, Nikhil Kshirsagar, stable

On 21 May 2022, Coly Li spake thusly:

> When all journal buckets are fully filled by active jset with heavy
> write I/O load, the cache set registration (after a reboot) will load
> all active jsets and inserting them into the btree again (which is
> called journal replay). If a journaled bkey is inserted into a btree
> node and results btree node split, new journal request might be
> triggered. For example, the btree grows one more level after the node
> split, then the root node record in cache device super block will be
> upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
> space in journal buckets, the journal replay has to wait for new journal
> bucket to be reclaimed after at least one journal bucket replayed. This
> is one example that how the journal no-space deadlock happens.
> 
> The solution to avoid the deadlock is to reserve 1 journal bucket in

It seems to me that this could happen more than once in a single journal
replay (multiple nodes might be split, etc). Is one bucket actually
always enough, or is it merely enough nearly all the time?

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  2022-06-08 20:45   ` Nix
@ 2022-06-09 13:54     ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2022-06-09 13:54 UTC (permalink / raw)
  To: Nix; +Cc: linux-bcache, linux-block, Nikhil Kshirsagar, stable

> 2022年6月9日 04:45，Nix <nix@esperi.org.uk> 写道：
> 
> On 21 May 2022, Coly Li spake thusly:
> 
>> When all journal buckets are fully filled by active jset with heavy
>> write I/O load, the cache set registration (after a reboot) will load
>> all active jsets and inserting them into the btree again (which is
>> called journal replay). If a journaled bkey is inserted into a btree
>> node and results btree node split, new journal request might be
>> triggered. For example, the btree grows one more level after the node
>> split, then the root node record in cache device super block will be
>> upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
>> space in journal buckets, the journal replay has to wait for new journal
>> bucket to be reclaimed after at least one journal bucket replayed. This
>> is one example that how the journal no-space deadlock happens.
>> 
>> The solution to avoid the deadlock is to reserve 1 journal bucket in
> 
> It seems to me that this could happen more than once in a single journal
> replay (multiple nodes might be split, etc). Is one bucket actually
> always enough, or is it merely enough nearly all the time?

It is possible that multiple leaf nodes split during journal replay, but the journal_meta() only gets called when the root node is updated.
For the new bkey of the new split node inserting into root node, it doesn’t go into journal because journal only records inserting bkeys for leaf nodes. Only when the btree node split causes root node split, the new root node location (bkey) has to be recored in journal set.

Therefore almost all the time that btree root node only splits once during journal replay, it is very rare that between two root node splits (that means very large number of bkeys inserted) the oldest journal entry doesn’t get replayed, that is almost impossible in real practice. So reserving 8K journal space is indeed enough for the no-space deadlock situation.

The default bucket size is much larger than 8K, so we don’t have to worry about the reserved journal space exhausted even with a much larger journal buckets number.

Indeed my initial effort was to reserve 8K space within a journal bucket if the bucket size > 8KB. But there are too many locations should be careful, and the logic of the patch is complicated and total change set is 200+ lines. And I find if I reserve a whole bucket, the change set is only 30+ lines. So finally I decide to reserve a whole journal bucket, because the change is much simpler and easier to be understood.

Coly Li

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-06-09 13:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-24 10:23 [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Coly Li
2022-05-24 10:23 ` [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
2022-05-24 10:23 ` [PATCH 2/4] bcache: improve multithreaded bch_sectors_dirty_init() Coly Li
2022-05-24 10:23 ` [PATCH 3/4] bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() Coly Li
2022-05-24 10:23 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
2022-05-24 12:19 ` [Resend PATCH v2 0/4] bcache fixes for Linux v5.19 (1st wave) Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2022-05-22 17:07 [PATCH 0/4] bcache patches " Coly Li
2022-05-22 17:07 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
2022-05-21 17:04 [PATCH 1/4] bcache: improve multithreaded bch_btree_check() Coly Li
2022-05-21 17:05 ` [PATCH 4/4] bcache: avoid journal no-space deadlock by reserving 1 journal bucket Coly Li
2022-06-08 20:45   ` Nix
2022-06-09 13:54     ` Coly Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.