* [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock
@ 2019-04-19 16:04 Coly Li
  2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
                   ` (15 more replies)
  0 siblings, 16 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

The initial journal no-space deadlock issue showed up as several
kthreads or kworkers that the kernel reported as hung for quite a long
time. The cause was a deadlock that happened when no more journal
space was available for new incoming journal requests.

In the v1 RFC series, I thought the journal no-space deadlock came from
two conditions, which was not the whole truth. After a long time of
testing and debugging, I realized the journal deadlock was the result
of a series of problems hidden in the current code.

Now I have made progress in the v2 series, and all known problems
related to the journal no-space deadlock are fixed. I no longer observe
the journal deadlock or the related I/O hang warnings.

Unfortunately we cannot apply this whole series at the moment, because
after fixing the journal no-space deadlock issue, I found a race in
dirty btree node flushing. Besides normal dirty btree node flushing,
when there is no journal space, btree_flush_write() is called to write
out the oldest dirty btree node. Once the oldest dirty btree node is
written from memory to the cache device, its associated journal
reference is released; this operation is necessary to reclaim the
oldest busy journal bucket when there is no space in the journal
buckets.

The problem with this race is that, when building the c->flush_btree
heap, the dirty btree nodes from for_each_cached_btree() are not
protected or referenced. So there is a window, after the heap
c->flush_btree is built and before the oldest node is selected from it,
in which the oldest node may already have been written by the normal
code path and its memory released or reused.

In my testing, a kernel panic triggered by a wild pointer dereference
or an unpaired mutex_lock/unlock can be observed in
btree_flush_write(); this is because the selected btree node was
already written and released, so btree_flush_write() references an
invalid memory object.

So far I don't have a good idea for fixing such a race without hurting
I/O performance, and IMHO a bcache I/O hang caused by the journal is
somewhat better than a kernel panic. Therefore, before the race in
dirty btree node writing gets fixed, I won't apply the whole series.

But there are still some helpful, non-major fixes which can go upstream
now, to reduce the size of the whole patch set and avoid huge changes
in a single kernel merge.

The patch 'bcache: acquire c->journal.lock in bch_btree_leaf_dirty()'
in the v1 series was removed from the v2 series. I still feel it is a
problem to access the journal fifo without any protection, but that fix
was limited and I need to think about a more thorough way to fix it.

Any review comments or suggestions are warmly welcome.

Thanks in advance for your help.

Coly Li
---

Coly Li (16):
  bcache: move definition of 'int ret' out of macro read_bucket()
  bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
  bcache: reload journal key information during journal replay
  bcache: fix journal deadlock during journal replay
  bcache: reserve space for journal_meta() in run time
  bcache: add failure check to run_cache_set() for journal replay
  bcache: add comments for kobj release callback routine
  bcache: return error immediately in bch_journal_replay()
  bcache: add error check for calling register_bdev()
  bcache: Add comments for blkdev_put() in registration code path
  bcache: add comments for closure_fn to be called in closure_queue()
  bcache: add pendings_cleanup to stop pending bcache device
  bcache: fix fifo index swapping condition in btree_flush_write()
  bcache: try to flush btree nodes as many as possible
  bcache: improve bcache_reboot()
  bcache: introduce spinlock_t flush_write_lock in struct journal

 drivers/md/bcache/journal.c | 312 ++++++++++++++++++++++++++++++++++++++++----
 drivers/md/bcache/journal.h |   8 +-
 drivers/md/bcache/super.c   | 112 ++++++++++++++--
 3 files changed, 393 insertions(+), 39 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-21 17:47   ` Chaitanya Kulkarni
  2019-04-23  6:50   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

'int ret' is defined as a local variable inside macro read_bucket().
Since this macro is called multiple times, and following patches will
use an 'int ret' variable in bch_journal_read(), this patch moves the
definition of 'int ret' from macro read_bucket() to the scope of
function bch_journal_read().

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index b2fd412715b1..6e18057d1d82 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
 	({								\
-		int ret = journal_read_bucket(ca, list, b);		\
+		ret = journal_read_bucket(ca, list, b);			\
 		__set_bit(b, bitmap);					\
 		if (ret < 0)						\
 			return ret;					\
@@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 
 	struct cache *ca;
 	unsigned int iter;
+	int ret = 0;
 
 	for_each_cache(ca, c, iter) {
 		struct journal_device *ja = &ca->journal;
@@ -267,7 +268,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
-	return 0;
+	return ret;
 #undef read_bucket
 }
 
-- 
2.16.4



* [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
  2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  6:50   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

In journal_reclaim(), ja->cur_idx of each cache is updated to reclaim
available journal buckets. The variable 'int n' is used to count how
many caches are successfully reclaimed, then n is set on c->journal.key
by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
loop writes the jset data onto each cache.

The problem is, if all journal buckets on each cache are full, in the
following code in journal_reclaim(),

529 for_each_cache(ca, c, iter) {
530         struct journal_device *ja = &ca->journal;
531         unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
532
533         /* No space available on this device */
534         if (next == ja->discard_idx)
535                 continue;
536
537         ja->cur_idx = next;
538         k->ptr[n++] = MAKE_PTR(0,
539                           bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
540                           ca->sb.nr_this_dev);
541 }
542
543 bkey_init(k);
544 SET_KEY_PTRS(k, n);

If there is no available bucket to reclaim, the if() condition at line
534 is always true and n remains 0. Then at line 544, SET_KEY_PTRS()
sets the KEY_PTRS field of c->journal.key to 0.

Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because in
journal_write_unlocked() the journal data is written in the following
loop,

649	for (i = 0; i < KEY_PTRS(k); i++) {
650-671		submit journal data to cache device
672	}

If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal
data won't be written to the cache device here. If the system crashes
or reboots before the bkeys of the lost journal entries are written
into btree nodes, data corruption will be reported during bcache reload
after rebooting the system.

Indeed there is only one cache in a cache set, so there is no need to
set the KEY_PTRS field in journal_reclaim() at all. But in order to
keep the for_each_cache() logic consistent for now, this patch fixes
the above problem by not setting KEY_PTRS of the journal key to 0 if
there is no bucket available to reclaim.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6e18057d1d82..5180bed911ef 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
 				  ca->sb.nr_this_dev);
 	}
 
-	bkey_init(k);
-	SET_KEY_PTRS(k, n);
-
-	if (n)
+	if (n) {
+		bkey_init(k);
+		SET_KEY_PTRS(k, n);
 		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
+	}
 out:
 	if (!journal_full(&c->journal))
 		__closure_wake_up(&c->journal.wait);
@@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
 		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
 	}
 
+	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
+	BUG_ON(i == 0);
+
 	atomic_dec_bug(&fifo_back(&c->journal.pin));
 	bch_journal_next(&c->journal);
 	journal_reclaim(c);
-- 
2.16.4



* [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
  2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  6:54   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay Coly Li
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

When the bcache journal initializes during cache set startup, the cache
set's journal.blocks_free is initialized to 0. Then during journal
replay, if journal_meta() is called and an empty jset is written to the
cache device, journal_reclaim() is called. If there is an available
journal bucket to reclaim, c->journal.blocks_free is set to the number
of blocks of a journal bucket, which is
c->sb.bucket_size >> c->block_bits.

Most of the time the above process works correctly, except for the
condition when the journal space is almost full. "Almost full" means
there is no free journal bucket, but there are still free blocks in the
last available bucket, indexed by ja->cur_idx.

If the system crashes or reboots when the journal space is almost full,
the problem comes. During cache set reload after the reboot,
c->journal.blocks_free is initialized to 0. When the journal replay
process writes the bcache journal, journal_reclaim() is called to
reclaim an available journal bucket and set c->journal.blocks_free to
c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
to reclaim in journal_reclaim(), so the value of c->journal.blocks_free
stays 0. If the first journal entry processed by journal_replay()
causes a btree split and requires journal space via journal_meta(),
journal_meta() goes into an infinite loop trying to reclaim a journal
bucket, and blocks the whole cache set from running.

Such a buggy situation can be solved if we do the following things
before journal replay starts,
- Recover the previous value of c->journal.blocks_free from the last
  run time, and use it as the initial value of the current
  c->journal.blocks_free.
- Recover the previous value of ja->cur_idx from the last run time, and
  set it in the KEY_PTR of the current c->journal.key as the initial
  value.

After c->journal.blocks_free and c->journal.key are recovered, in the
condition when the journal space is almost full and the cache set is
reloaded, the meta journal entry from journal replay can be written
into the free blocks of the last available journal bucket; then old
journal entries can be replayed and reclaimed for further journaling
requests.

This patch adds bch_journal_key_reload() to recover the journal
blocks_free and key ptr values for the above purpose.
bch_journal_key_reload() is called in bch_journal_read() before
replaying the journal with bch_journal_replay().

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 5180bed911ef..a6deb16c15c8 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
 	return ret;
 }
 
+static int bch_journal_key_reload(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int iter, n = 0;
+	struct bkey *k = &c->journal.key;
+	int ret = 0;
+
+	for_each_cache(ca, c, iter) {
+		struct journal_device *ja = &ca->journal;
+		struct bio *bio = &ja->bio;
+		struct jset *j, *data = c->journal.w[0].data;
+		struct closure cl;
+		unsigned int len, left;
+		unsigned int offset = 0, used_blocks = 0;
+		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
+
+		closure_init_stack(&cl);
+
+		while (offset < ca->sb.bucket_size) {
+reread:			left = ca->sb.bucket_size - offset;
+			len = min_t(unsigned int,
+				    left, PAGE_SECTORS << JSET_BITS);
+
+			bio_reset(bio);
+			bio->bi_iter.bi_sector = bucket + offset;
+			bio_set_dev(bio, ca->bdev);
+			bio->bi_iter.bi_size = len << 9;
+
+			bio->bi_end_io = journal_read_endio;
+			bio->bi_private = &cl;
+			bio_set_op_attrs(bio, REQ_OP_READ, 0);
+			bch_bio_map(bio, data);
+
+			closure_bio_submit(c, bio, &cl);
+			closure_sync(&cl);
+
+			j = data;
+			while (len) {
+				size_t blocks, bytes = set_bytes(j);
+
+				if (j->magic != jset_magic(&ca->sb))
+					goto out;
+
+				if (bytes > left << 9 ||
+				    bytes > PAGE_SIZE << JSET_BITS) {
+					pr_err("jset may be corrupted: too big");
+					ret = -EIO;
+					goto err;
+				}
+
+				if (bytes > len << 9)
+					goto reread;
+
+				if (j->csum != csum_set(j)) {
+					pr_err("jset may be corrupted: bad csum");
+					ret = -EIO;
+					goto err;
+				}
+
+				blocks = set_blocks(j, block_bytes(c));
+				used_blocks += blocks;
+
+				offset	+= blocks * ca->sb.block_size;
+				len	-= blocks * ca->sb.block_size;
+				j = ((void *) j) + blocks * block_bytes(ca);
+			}
+		}
+out:
+		c->journal.blocks_free =
+			(c->sb.bucket_size >> c->block_bits) -
+			used_blocks;
+
+		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
+	}
+
+	BUG_ON(n == 0);
+	bkey_init(k);
+	SET_KEY_PTRS(k, n);
+
+err:
+	return ret;
+}
+
 int bch_journal_read(struct cache_set *c, struct list_head *list)
 {
 #define read_bucket(b)							\
@@ -268,6 +351,10 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
+	/* Initial value of c->journal.blocks_free should be 0 */
+	BUG_ON(c->journal.blocks_free != 0);
+	ret = bch_journal_key_reload(c);
+
 	return ret;
 #undef read_bucket
 }
-- 
2.16.4



* [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (2 preceding siblings ...)
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  6:59   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time Coly Li
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

A deadlock in bcache journaling may happen during journal replay. Such
a deadlock happens when,
- The journal space is totally full (no free blocks at all) and the
  system crashes or reboots.
- After reboot, the first journal entry handled by journal replay
  causes a btree split, and journal_meta() is called to write an empty
  jset to the journal space.
- There is no journal space to write to, and journal_reclaim() fails to
  get any available bucket, because the blocked entry is the very first
  one to be replayed and nothing has been reclaimed yet.
Then the whole cache set is blocked from running.

This patch is an effort to fix such a journal replay deadlock in a
simpler way,
- Add a bool variable 'in_replay' to struct journal; set it to true
  when journal replay starts, and set it to false when journal replay
  completes. in_replay is initialized to false.
- Reserve 6 sectors in the journal bucket, and do not use them in
  normal bcache runtime. These sectors are only permitted to be used
  during journal replay (when c->journal.in_replay is true).

Then in normal bcache runtime, the journal space won't become totally
full, and 6 sectors are always reserved for journal replay time. After
the system reboots, if bch_btree_insert() in bch_journal_replay()
causes a btree split and bch_journal_meta() gets called to require 1
sector from the journal buckets to write an empty jset, there is enough
reserved space to serve it.

The reason to reserve 6 sectors is that we should choose a number that
won't fit into a bucket size. If the reserved space happened to be a
whole bucket, more logic would have to be added in journal_replay() to
handle journal.blocks_free with reserved space at journal replay time.
This is why 6 sectors is chosen: it is 3KB and won't match any proper
block size or bucket size.

The bcache btree node size is quite large, so a btree node split won't
be a frequent event. And when a btree node split happens, the newly
added key is inserted directly into the upper level or neighbor nodes
and won't go into the journal again; only bch_journal_meta() is called
to write the jset metadata, which occupies 1 block in the journal
space. If the block size is set to 4K, the reserved 6 sectors are
indeed 2 blocks, so two consecutive btree splits can happen during
journal replay; this is very rare in practice. As the default block
size is the sector size, that equals 6 reserved blocks. Splitting the
btree 6 times in a row during journal replay is almost impossible, so
the reserved space seems to be enough in my humble opinion.

If in the future the reserved space turns out to be not enough, let's
extend it then.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 100 ++++++++++++++++++++++++++++++++++++++++----
 drivers/md/bcache/journal.h |   4 ++
 2 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a6deb16c15c8..c60a702f53a9 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
 	struct keylist keylist;
 
+	s->journal.in_replay = true;
+
 	list_for_each_entry(i, list, list) {
 		BUG_ON(i->pin && atomic_read(i->pin) != 1);
 
@@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	pr_info("journal replay done, %i keys in %i entries, seq %llu",
 		keys, entries, end);
 err:
+	s->journal.in_replay = false;
 	while (!list_empty(list)) {
 		i = list_first_entry(list, struct journal_replay, list);
 		list_del(&i->list);
@@ -577,6 +580,22 @@ static void do_journal_discard(struct cache *ca)
 	}
 }
 
+static inline bool last_available_journal_bucket(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int iter;
+	struct journal_device *ja;
+
+	for_each_cache(ca, c, iter) {
+		ja = &ca->journal;
+		if (unlikely((ja->cur_idx + 1) % ca->sb.njournal_buckets ==
+			     ja->last_idx))
+			return true;
+	}
+
+	return false;
+}
+
 static void journal_reclaim(struct cache_set *c)
 {
 	struct bkey *k = &c->journal.key;
@@ -584,6 +603,7 @@ static void journal_reclaim(struct cache_set *c)
 	uint64_t last_seq;
 	unsigned int iter, n = 0;
 	atomic_t p __maybe_unused;
+	bool last, do_wakeup = false;
 
 	atomic_long_inc(&c->reclaim);
 
@@ -606,8 +626,13 @@ static void journal_reclaim(struct cache_set *c)
 	for_each_cache(ca, c, iter)
 		do_journal_discard(ca);
 
-	if (c->journal.blocks_free)
+	last = last_available_journal_bucket(c);
+	if ((!last && c->journal.blocks_free) ||
+	    (last && (c->journal.blocks_free * c->sb.block_size) >
+		      BCH_JOURNAL_RPLY_RESERVE)) {
+		do_wakeup = true;
 		goto out;
+	}
 
 	/*
 	 * Allocate:
@@ -632,9 +657,10 @@ static void journal_reclaim(struct cache_set *c)
 		bkey_init(k);
 		SET_KEY_PTRS(k, n);
 		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
+		do_wakeup = true;
 	}
 out:
-	if (!journal_full(&c->journal))
+	if (do_wakeup && !journal_full(&c->journal))
 		__closure_wake_up(&c->journal.wait);
 }
 
@@ -692,6 +718,21 @@ static void journal_write_unlock(struct closure *cl)
 	spin_unlock(&c->journal.lock);
 }
 
+static bool should_reclaim(struct cache_set *c,
+			   struct journal_write *w)
+{
+	if (unlikely(journal_full(&c->journal)))
+		return true;
+
+	if (unlikely(last_available_journal_bucket(c) &&
+		     (!c->journal.in_replay) &&
+		     (c->journal.blocks_free * c->sb.block_size <=
+			BCH_JOURNAL_RPLY_RESERVE)))
+		return true;
+
+	return false;
+}
+
 static void journal_write_unlocked(struct closure *cl)
 	__releases(c->journal.lock)
 {
@@ -710,7 +751,7 @@ static void journal_write_unlocked(struct closure *cl)
 	if (!w->need_write) {
 		closure_return_with_destructor(cl, journal_write_unlock);
 		return;
-	} else if (journal_full(&c->journal)) {
+	} else if (should_reclaim(c, w)) {
 		journal_reclaim(c);
 		spin_unlock(&c->journal.lock);
 
@@ -798,6 +839,52 @@ static void journal_try_write(struct cache_set *c)
 	}
 }
 
+static bool no_journal_wait(struct cache_set *c,
+			    size_t sectors)
+{
+	bool last = last_available_journal_bucket(c);
+	size_t reserved_sectors = 0;
+	size_t n = min_t(size_t,
+			 c->journal.blocks_free * c->sb.block_size,
+			 PAGE_SECTORS << JSET_BITS);
+
+	if (last && !c->journal.in_replay)
+		reserved_sectors = BCH_JOURNAL_RPLY_RESERVE;
+
+	if (sectors <= (n - reserved_sectors))
+		return true;
+
+	return false;
+}
+
+static bool should_try_write(struct cache_set *c,
+			     struct journal_write *w)
+{
+	size_t reserved_sectors, n, sectors;
+
+	if (journal_full(&c->journal))
+		return false;
+
+	if (!last_available_journal_bucket(c))
+		return true;
+
+	/* the check in no_journal_wait exceeds BCH_JOURNAL_RPLY_RESERVE */
+	if (w->data->keys == 0)
+		return false;
+
+	reserved_sectors = BCH_JOURNAL_RPLY_RESERVE;
+	n = min_t(size_t,
+		  (c->journal.blocks_free * c->sb.block_size),
+		  PAGE_SECTORS << JSET_BITS);
+	sectors = __set_blocks(w->data, w->data->keys,
+			       block_bytes(c)) * c->sb.block_size;
+	if (sectors <= (n - reserved_sectors))
+		return true;
+
+	return false;
+}
+
+
 static struct journal_write *journal_wait_for_write(struct cache_set *c,
 						    unsigned int nkeys)
 	__acquires(&c->journal.lock)
@@ -816,15 +903,13 @@ static struct journal_write *journal_wait_for_write(struct cache_set *c,
 		sectors = __set_blocks(w->data, w->data->keys + nkeys,
 				       block_bytes(c)) * c->sb.block_size;
 
-		if (sectors <= min_t(size_t,
-				     c->journal.blocks_free * c->sb.block_size,
-				     PAGE_SECTORS << JSET_BITS))
+		if (no_journal_wait(c, sectors))
 			return w;
 
 		if (wait)
 			closure_wait(&c->journal.wait, &cl);
 
-		if (!journal_full(&c->journal)) {
+		if (should_try_write(c, w)) {
 			if (wait)
 				trace_bcache_journal_entry_full(c);
 
@@ -933,6 +1018,7 @@ int bch_journal_alloc(struct cache_set *c)
 	INIT_DELAYED_WORK(&j->work, journal_write_work);
 
 	c->journal_delay_ms = 100;
+	j->in_replay = false;
 
 	j->w[0].c = c;
 	j->w[1].c = c;
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index 66f0facff84b..54408e248a39 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -108,6 +108,7 @@ struct journal {
 	struct closure		io;
 	int			io_in_flight;
 	struct delayed_work	work;
+	bool			in_replay;
 
 	/* Number of blocks free in the bucket(s) we're currently writing to */
 	unsigned int		blocks_free;
@@ -159,6 +160,9 @@ struct journal_device {
 
 #define JOURNAL_PIN	20000
 
+/* Reserved journal space in sectors */
+#define BCH_JOURNAL_RPLY_RESERVE	6U
+
 #define journal_full(j)						\
 	(!(j)->blocks_free || fifo_free(&(j)->pin) <= 1)
 
-- 
2.16.4



* [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (3 preceding siblings ...)
  2019-04-19 16:04 ` [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  7:00   ` Hannes Reinecke
  2019-04-19 16:04 ` [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay Coly Li
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Another journal deadlock of bcache journaling can happen in normal
bcache runtime. It is very rare, but people have reported the bkey
insert work queue being blocked, caused by such a deadlock.

This is how such a journaling deadlock happens at runtime,
- The journal space is totally full and there is no free space to
  reclaim; journaling tasks wait for space to write in
  journal_wait_for_write().
- In order to get free journal space, btree_flush_write() is called to
  flush the earliest journaled in-memory btree key into its btree node.
  Once all journaled bkeys in the earliest used journal bucket are
  flushed to the on-disk btree, this journal bucket can be reclaimed
  for new incoming journal requests.
- But if inserting the earliest journaled bkey into its btree node
  causes a btree node split, journal_meta() will finally be called to
  journal the btree root (and other information) into the journal
  space.
- Unfortunately the journal space is full, and the journal entries have
  to be flushed in linear order, so this bch_journal_meta() from the
  bkey insert is blocked too.
Then the journaling deadlock during bcache run time happens.

A method to fix such a deadlock is to reserve some journal space as
well. The reserved space can only be used when,
- The current journal bucket is the last journal bucket which has
  available space to write into.
- When calling bch_journal(), the current jset is empty and there is no
  key in the inserting key list. This means the journal request is from
  bch_journal_meta() and no non-reserved space can be used.

Then if such a journaling request is from bch_journal_meta() for
inserting the earliest journaled bkey back into the btree, the deadlock
condition won't happen any more, because the reserved space can be used
for such a scenario.

Since there are already 6 sectors reserved for journal replay, here we
reserve 7 more sectors for runtime meta journaling caused by btree
splits while flushing journal entries back to btree nodes. Depending on
the block size, from 1 sector to 4KB, the reserved space can serve from
7 down to 2 journal blocks. Indeed only one reserved journal block is
enough for such a journal deadlock scenario; 2 consecutive btree splits
caused by two adjoining bkeys flushed from the journal are very rare in
practice. So reserving 7 sectors should work.

Another reason for reserving 7 sectors is that there are already 6
sectors reserved for journal replay, so in total 13 sectors are
reserved in the last available journal bucket. 13 sectors won't be a
proper bucket size, so we don't need to add more code to handle
journal.blocks_free initialization for a whole reserved journal bucket.
Even though such code logic would be simple, less code is better in my
humble opinion.

Again, if in the future the reserved space turns out to be not enough,
let's extend it then.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 89 +++++++++++++++++++++++++++++++++------------
 drivers/md/bcache/journal.h |  1 +
 2 files changed, 66 insertions(+), 24 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index c60a702f53a9..6aa68ab7cd78 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -629,7 +629,7 @@ static void journal_reclaim(struct cache_set *c)
 	last = last_available_journal_bucket(c);
 	if ((!last && c->journal.blocks_free) ||
 	    (last && (c->journal.blocks_free * c->sb.block_size) >
-		      BCH_JOURNAL_RPLY_RESERVE)) {
+		      (BCH_JOURNAL_RESERVE + BCH_JOURNAL_RPLY_RESERVE))) {
 		do_wakeup = true;
 		goto out;
 	}
@@ -718,18 +718,27 @@ static void journal_write_unlock(struct closure *cl)
 	spin_unlock(&c->journal.lock);
 }
 
-static bool should_reclaim(struct cache_set *c,
-			   struct journal_write *w)
+static inline bool should_reclaim(struct cache_set *c,
+				  struct journal_write *w)
 {
-	if (unlikely(journal_full(&c->journal)))
-		return true;
+	bool last = last_available_journal_bucket(c);
 
-	if (unlikely(last_available_journal_bucket(c) &&
-		     (!c->journal.in_replay) &&
-		     (c->journal.blocks_free * c->sb.block_size <=
-			BCH_JOURNAL_RPLY_RESERVE)))
+	if (!last && journal_full(&c->journal))
 		return true;
 
+	if (unlikely(last)) {
+		size_t n = c->journal.blocks_free * c->sb.block_size;
+
+		if (!c->journal.in_replay) {
+			if (n <= BCH_JOURNAL_RESERVE +
+				 BCH_JOURNAL_RPLY_RESERVE)
+				return true;
+		} else {
+			if (n <= BCH_JOURNAL_RPLY_RESERVE)
+				return true;
+		}
+	}
+
 	return false;
 }
 
@@ -751,7 +760,9 @@ static void journal_write_unlocked(struct closure *cl)
 	if (!w->need_write) {
 		closure_return_with_destructor(cl, journal_write_unlock);
 		return;
-	} else if (should_reclaim(c, w)) {
+	}
+
+	if (should_reclaim(c, w)) {
 		journal_reclaim(c);
 		spin_unlock(&c->journal.lock);
 
@@ -840,16 +851,26 @@ static void journal_try_write(struct cache_set *c)
 }
 
 static bool no_journal_wait(struct cache_set *c,
-			    size_t sectors)
+			    size_t sectors,
+			    int nkeys)
 {
+	bool is_journal_meta = (nkeys == 0) ? true : false;
 	bool last = last_available_journal_bucket(c);
 	size_t reserved_sectors = 0;
-	size_t n = min_t(size_t,
-			 c->journal.blocks_free * c->sb.block_size,
-			 PAGE_SECTORS << JSET_BITS);
+	size_t n;
+
+	if (unlikely(last)) {
+		if (!is_journal_meta)
+			reserved_sectors = BCH_JOURNAL_RESERVE +
+					   BCH_JOURNAL_RPLY_RESERVE;
+		else
+			reserved_sectors = (!c->journal.in_replay) ?
+				BCH_JOURNAL_RPLY_RESERVE : 0;
+	}
 
-	if (last && !c->journal.in_replay)
-		reserved_sectors = BCH_JOURNAL_RPLY_RESERVE;
+	n = min_t(size_t,
+		  c->journal.blocks_free * c->sb.block_size,
+		  PAGE_SECTORS << JSET_BITS);
 
 	if (sectors <= (n - reserved_sectors))
 		return true;
@@ -858,26 +879,46 @@ static bool no_journal_wait(struct cache_set *c,
 }
 
 static bool should_try_write(struct cache_set *c,
-			     struct journal_write *w)
+			     struct journal_write *w,
+			     int nkeys)
 {
 	size_t reserved_sectors, n, sectors;
+	bool last, empty_jset;
 
 	if (journal_full(&c->journal))
 		return false;
 
-	if (!last_available_journal_bucket(c))
+	last = last_available_journal_bucket(c);
+	empty_jset = (w->data->keys == 0) ? true : false;
+
+	if (!last) {
+		/*
+		 * Not last available journal bucket, no reserved journal
+		 * space restriction, an empty jset should not be here.
+		 */
+		BUG_ON(empty_jset);
 		return true;
+	}
 
-	/* the check in no_journal_wait exceeds BCH_JOURNAL_RPLY_RESERVE */
-	if (w->data->keys == 0)
+	if (empty_jset) {
+		/*
+		 * If nkeys is 0 it means the journaling request is for meta
+		 * data, which should be returned in journal_wait_for_write()
+		 * by checking no_journal_wait(), and won't get here.
+		 */
+		BUG_ON(nkeys == 0);
 		return false;
+	}
 
-	reserved_sectors = BCH_JOURNAL_RPLY_RESERVE;
+	reserved_sectors = BCH_JOURNAL_RESERVE +
+			   BCH_JOURNAL_RPLY_RESERVE;
 	n = min_t(size_t,
 		  (c->journal.blocks_free * c->sb.block_size),
 		  PAGE_SECTORS << JSET_BITS);
-	sectors = __set_blocks(w->data, w->data->keys,
+	sectors = __set_blocks(w->data,
+			       w->data->keys,
 			       block_bytes(c)) * c->sb.block_size;
+
 	if (sectors <= (n - reserved_sectors))
 		return true;
 
@@ -903,13 +944,13 @@ static struct journal_write *journal_wait_for_write(struct cache_set *c,
 		sectors = __set_blocks(w->data, w->data->keys + nkeys,
 				       block_bytes(c)) * c->sb.block_size;
 
-		if (no_journal_wait(c, sectors))
+		if (no_journal_wait(c, sectors, nkeys))
 			return w;
 
 		if (wait)
 			closure_wait(&c->journal.wait, &cl);
 
-		if (should_try_write(c, w)) {
+		if (should_try_write(c, w, nkeys)) {
 			if (wait)
 				trace_bcache_journal_entry_full(c);
 
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index 54408e248a39..55f81443f304 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -162,6 +162,7 @@ struct journal_device {
 
 /* Reserved jouranl space in sectors */
 #define BCH_JOURNAL_RPLY_RESERVE	6U
+#define BCH_JOURNAL_RESERVE		7U
 
 #define journal_full(j)						\
 	(!(j)->blocks_free || fifo_free(&(j)->pin) <= 1)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (4 preceding siblings ...)
  2019-04-19 16:04 ` [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time Coly Li
@ 2019-04-19 16:04 ` Coly Li
  2019-04-23  7:02   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine Coly Li
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:04 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Currently run_cache_set() has no return value. If a failure happens in
bch_journal_replay(), the caller of run_cache_set() has no idea about
the failure and just continues to execute the code following
run_cache_set(). The internal failure is triggered inside
bch_journal_replay() and handled in an asynchronous way. This behavior
is inefficient: while the failure is being handled inside
bch_journal_replay(), the cache register code is still running to
start the cache set. Running registering and unregistering code at the
same time may introduce rare race conditions and makes the code harder
to understand.

This patch adds a return value to run_cache_set(), which returns -EIO
if bch_journal_replay() fails. The caller of run_cache_set() can then
detect the failure and stop the registering code flow immediately
inside register_cache_set().

If journal replay fails, run_cache_set() can report the error
immediately to register_cache_set(). This patch makes the failure
handling for bch_journal_replay() synchronous, easier to understand
and debug, and avoids the potential race condition of running register
and unregister code at the same time.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a697a3a923cd..036bffad0bfe 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1775,7 +1775,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	return NULL;
 }
 
-static void run_cache_set(struct cache_set *c)
+static int run_cache_set(struct cache_set *c)
 {
 	const char *err = "cannot allocate memory";
 	struct cached_dev *dc, *t;
@@ -1869,7 +1869,9 @@ static void run_cache_set(struct cache_set *c)
 		if (j->version < BCACHE_JSET_VERSION_UUID)
 			__uuid_write(c);
 
-		bch_journal_replay(c, &journal);
+		err = "bcache: replay journal failed";
+		if (bch_journal_replay(c, &journal))
+			goto err;
 	} else {
 		pr_notice("invalidating existing data");
 
@@ -1937,11 +1939,13 @@ static void run_cache_set(struct cache_set *c)
 	flash_devs_run(c);
 
 	set_bit(CACHE_SET_RUNNING, &c->flags);
-	return;
+	return 0;
 err:
 	closure_sync(&cl);
 	/* XXX: test this, it's broken */
 	bch_cache_set_error(c, "%s", err);
+
+	return -EIO;
 }
 
 static bool can_attach_cache(struct cache *ca, struct cache_set *c)
@@ -2005,8 +2009,11 @@ static const char *register_cache_set(struct cache *ca)
 	ca->set->cache[ca->sb.nr_this_dev] = ca;
 	c->cache_by_alloc[c->caches_loaded++] = ca;
 
-	if (c->caches_loaded == c->sb.nr_in_set)
-		run_cache_set(c);
+	if (c->caches_loaded == c->sb.nr_in_set) {
+		err = "failed to run cache set";
+		if (run_cache_set(c) < 0)
+			goto err;
+	}
 
 	return NULL;
 err:
-- 
2.16.4



* [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (5 preceding siblings ...)
  2019-04-19 16:04 ` [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-21 17:52   ` Chaitanya Kulkarni
  2019-04-23  7:03   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay() Coly Li
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Bcache has several routines which release resources in an implicit
way; they are called when the associated kobj is released. This patch
adds code comments to note when and which release callback will be
called:
- When dc->disk.kobj released:
  void bch_cached_dev_release(struct kobject *kobj)
- When d->kobj released:
  void bch_flash_dev_release(struct kobject *kobj)
- When c->kobj released:
  void bch_cache_set_release(struct kobject *kobj)
- When ca->kobj released:
  void bch_cache_release(struct kobject *kobj)

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 036bffad0bfe..400af446c372 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1174,6 +1174,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 	return 0;
 }
 
+/* when dc->disk.kobj released */
 void bch_cached_dev_release(struct kobject *kobj)
 {
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
@@ -1326,6 +1327,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 
 /* Flash only volumes */
 
+/* When d->kobj released */
 void bch_flash_dev_release(struct kobject *kobj)
 {
 	struct bcache_device *d = container_of(kobj, struct bcache_device,
@@ -1496,6 +1498,7 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
 	return true;
 }
 
+/* When c->kobj released */
 void bch_cache_set_release(struct kobject *kobj)
 {
 	struct cache_set *c = container_of(kobj, struct cache_set, kobj);
@@ -2023,6 +2026,7 @@ static const char *register_cache_set(struct cache *ca)
 
 /* Cache device */
 
+/* When ca->kobj released */
 void bch_cache_release(struct kobject *kobj)
 {
 	struct cache *ca = container_of(kobj, struct cache, kobj);
-- 
2.16.4



* [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (6 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-23  7:04   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev() Coly Li
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When a failure happens inside bch_journal_replay(), calling
cache_set_err_on() and handling the failure in an asynchronous way is
not a good idea: after bch_journal_replay() returns, the registering
code continues to execute the following steps, while the unregistering
code triggered by cache_set_err_on() runs at the same time. First, it
is unnecessary to handle the failure and unregister the cache set
asynchronously; second, there is a potential race condition between
running the register and unregister code for the same cache set.

So in this patch, if a failure happens in bch_journal_replay(), we
don't call cache_set_err_on(); we just print the same error message to
the kernel message buffer, then return -EIO immediately to the caller.
The caller can then detect the failure and handle it in a synchronized
way.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 6aa68ab7cd78..bdb6f9cefe48 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -420,9 +420,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 	list_for_each_entry(i, list, list) {
 		BUG_ON(i->pin && atomic_read(i->pin) != 1);
 
-		cache_set_err_on(n != i->j.seq, s,
-"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
-				 n, i->j.seq - 1, start, end);
+		if (n != i->j.seq) {
+			pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
+			       n, i->j.seq - 1, start, end);
+			ret = -EIO;
+			goto err;
+		}
 
 		for (k = i->j.start;
 		     k < bset_bkey_last(&i->j);
-- 
2.16.4



* [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (7 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay() Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-21 18:00   ` Chaitanya Kulkarni
  2019-04-23  7:04   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path Coly Li
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds a return value to register_bdev(). If a failure
happens inside register_bdev(), its caller register_bcache() can then
detect and handle the failure properly.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 400af446c372..a435c506edba 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1281,7 +1281,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
 
 /* Cached device - bcache superblock */
 
-static void register_bdev(struct cache_sb *sb, struct page *sb_page,
+static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 				 struct block_device *bdev,
 				 struct cached_dev *dc)
 {
@@ -1319,10 +1319,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
 	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
 		bch_cached_dev_run(dc);
 
-	return;
+	return 0;
 err:
 	pr_notice("error %s: %s", dc->backing_dev_name, err);
 	bcache_device_stop(&dc->disk);
+	return -EIO;
 }
 
 /* Flash only volumes */
@@ -2273,7 +2274,7 @@ static bool bch_is_open(struct block_device *bdev)
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size)
 {
-	ssize_t ret = size;
+	ssize_t ret = -EINVAL;
 	const char *err = "cannot allocate memory";
 	char *path = NULL;
 	struct cache_sb *sb = NULL;
@@ -2307,7 +2308,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			if (!IS_ERR(bdev))
 				bdput(bdev);
 			if (attr == &ksysfs_register_quiet)
-				goto out;
+				goto quiet_out;
 		}
 		goto err;
 	}
@@ -2328,8 +2329,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			goto err_close;
 
 		mutex_lock(&bch_register_lock);
-		register_bdev(sb, sb_page, bdev, dc);
+		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
+		if (ret < 0)
+			goto err;
 	} else {
 		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
 
@@ -2339,6 +2342,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		if (register_cache(sb, sb_page, bdev, ca) != 0)
 			goto err;
 	}
+quiet_out:
+	ret = size;
 out:
 	if (sb_page)
 		put_page(sb_page);
@@ -2351,7 +2356,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 err:
 	pr_info("error %s: %s", path, err);
-	ret = -EINVAL;
 	goto out;
 }
 
-- 
2.16.4



* [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (8 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev() Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-21 17:50   ` Chaitanya Kulkarni
  2019-04-23  7:05   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Add comments to explain why in register_bcache() blkdev_put() won't
be called in two locations, and why blkdev_put() must be called in
register_cache() when cache_alloc() fails.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a435c506edba..83a7cb0e0e45 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2191,6 +2191,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 	ret = cache_alloc(ca);
 	if (ret != 0) {
+		/*
+		 * If we failed here, it means ca->kobj is not initialized yet,
+		 * kobject_put() won't be called and there is no chance to
+		 * call blkdev_put() to bdev in bch_cache_release(). So we
+		 * explicitly call blkdev_put() here.
+		 */
 		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 		if (ret == -ENOMEM)
 			err = "cache_alloc(): -ENOMEM";
@@ -2331,6 +2337,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		mutex_lock(&bch_register_lock);
 		ret = register_bdev(sb, sb_page, bdev, dc);
 		mutex_unlock(&bch_register_lock);
+		/* blkdev_put() will be called in cached_dev_free() */
 		if (ret < 0)
 			goto err;
 	} else {
@@ -2339,6 +2346,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 		if (!ca)
 			goto err_close;
 
+		/* blkdev_put() will be called in bch_cache_release() */
 		if (register_cache(sb, sb_page, bdev, ca) != 0)
 			goto err;
 	}
-- 
2.16.4



* [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (9 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-21 17:43   ` Chaitanya Kulkarni
  2019-04-23  7:05   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device Coly Li
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Add code comments to explain which callback function may be called
by closure_queue(). This is an effort to make the code more
understandable for readers.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 83a7cb0e0e45..9b41e0b62cc0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -662,6 +662,11 @@ static const struct block_device_operations bcache_ops = {
 void bcache_device_stop(struct bcache_device *d)
 {
 	if (!test_and_set_bit(BCACHE_DEV_CLOSING, &d->flags))
+		/*
+		 * closure_fn set to
+		 * - cached device: cached_dev_flush()
+		 * - flash dev: flash_dev_flush()
+		 */
 		closure_queue(&d->cl);
 }
 
@@ -1677,6 +1682,7 @@ static void __cache_set_unregister(struct closure *cl)
 void bch_cache_set_stop(struct cache_set *c)
 {
 	if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
+		/* closure_fn set to __cache_set_unregister() */
 		closure_queue(&c->caching);
 }
 
-- 
2.16.4



* [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (10 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-23  7:08   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

If a bcache device is in dirty state and its cache set is not
registered, this bcache device will not appear in /dev/bcache<N>,
and there is no way to stop it or to remove the bcache kernel module.

This is an as-designed behavior, but sometimes people have to reboot
the whole system to release or stop the pending backing device.

This patch adds a sysfs interface which removes such pending bcache
devices when anything is written into the sysfs file manually.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 9b41e0b62cc0..e988e46a6479 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2246,9 +2246,13 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size);
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+					 struct kobj_attribute *attr,
+					 const char *buffer, size_t size);
 
 kobj_attribute_write(register,		register_bcache);
 kobj_attribute_write(register_quiet,	register_bcache);
+kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
 
 static bool bch_is_open_backing(struct block_device *bdev)
 {
@@ -2373,6 +2377,56 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	goto out;
 }
 
+
+struct pdev {
+	struct list_head list;
+	struct cached_dev *dc;
+};
+
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+					 struct kobj_attribute *attr,
+					 const char *buffer,
+					 size_t size)
+{
+	LIST_HEAD(pending_devs);
+	ssize_t ret = size;
+	struct cached_dev *dc, *tdc;
+	struct pdev *pdev, *tpdev;
+	struct cache_set *c, *tc;
+
+	mutex_lock(&bch_register_lock);
+	list_for_each_entry_safe(dc, tdc, &uncached_devices, list) {
+		pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL);
+		if (!pdev)
+			break;
+		pdev->dc = dc;
+		list_add(&pdev->list, &pending_devs);
+	}
+
+	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+		list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
+			char *pdev_set_uuid = pdev->dc->sb.set_uuid;
+			char *set_uuid = c->sb.uuid;
+
+			if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
+				list_del(&pdev->list);
+				kfree(pdev);
+				break;
+			}
+		}
+	}
+	mutex_unlock(&bch_register_lock);
+
+	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+		pr_info("delete pdev %p", pdev);
+		list_del(&pdev->list);
+		bcache_device_stop(&pdev->dc->disk);
+		kfree(pdev);
+	}
+
+	return ret;
+}
+
 static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 {
 	if (code == SYS_DOWN ||
@@ -2483,6 +2537,7 @@ static int __init bcache_init(void)
 	static const struct attribute *files[] = {
 		&ksysfs_register.attr,
 		&ksysfs_register_quiet.attr,
+		&ksysfs_pendings_cleanup.attr,
 		NULL
 	};
 
-- 
2.16.4



* [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (11 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device Coly Li
@ 2019-04-19 16:05 ` Coly Li
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
  2019-04-23  7:09   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible Coly Li
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

Current journal_max_cmp() and journal_min_cmp() assume that a smaller
fifo index indicates an older journal entry, but this is only true
when the fifo index has not wrapped around.

The fifo structure journal.pin is implemented as a circular buffer:
when the head index reaches the highest location of the buffer, it
wraps around to 0. Once the wrap-around happens, a smaller fifo index
may be associated with a newer journal entry, so the btree node with
the oldest journal entry is not selected by btree_flush_write() to be
flushed out to the cache device. As a result, the oldest journal
entries may never get a chance to be written into the cache device,
and after a reboot bch_journal_replay() may complain that some
journal entries are missing.

This patch handles the fifo index wrap-around conditions properly, so
that in btree_flush_write() the btree node with the oldest journal
entry can be selected from c->flush_btree correctly.

Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index bdb6f9cefe48..bc0e01151155 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
-#define journal_max_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
-#define journal_min_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
+#define journal_max_cmp(l, r)						\
+({									\
+	int l_idx, r_idx, f_idx, b_idx;					\
+	bool _ret = true;						\
+									\
+	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+	f_idx = c->journal.pin.front;					\
+	b_idx = c->journal.pin.back;					\
+									\
+	_ret = (l_idx < r_idx);						\
+	/* in case fifo back pointer is swapped */			\
+	if (b_idx < f_idx) { 						\
+		if (l_idx <= b_idx && r_idx >= f_idx)			\
+			_ret = false;					\
+		else if (l_idx >= f_idx && r_idx <= b_idx)		\
+			_ret = true;					\
+	}								\
+	_ret;								\
+})
+
+#define journal_min_cmp(l, r)						\
+({									\
+	int l_idx, r_idx, f_idx, b_idx;					\
+	bool _ret = true;						\
+									\
+	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
+	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
+	f_idx = c->journal.pin.front;					\
+	b_idx = c->journal.pin.back;					\
+									\
+	_ret = (l_idx > r_idx);						\
+	/* in case fifo back pointer is swapped */			\
+	if (b_idx < f_idx) {						\
+		if (l_idx <= b_idx && r_idx >= f_idx)			\
+			_ret = true;					\
+		else if (l_idx >= f_idx && r_idx <= b_idx)		\
+			_ret = false;					\
+	}								\
+	_ret;								\
+})
 
 static void btree_flush_write(struct cache_set *c)
 {
-- 
2.16.4



* [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (12 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-23  7:10   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 15/16] bcache: improve bcache_reboot() Coly Li
  2019-04-19 16:05 ` [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal Coly Li
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When btree_flush_write() is called, the journal space is already
exhausted. Current code selects only a single btree node to write out,
which may introduce heavy cache bouncing on the spinlock across
multiple cpu cores when many kworkers on the journaling code path call
btree_flush_write() to reclaim journal space.

This patch tries to flush as many btree nodes as possible inside a
single call to btree_flush_write(), so that btree_flush_write() is
called less frequently, which in turn reduces the cache bouncing on
the spinlock across multiple cpu cores. Please notice that this patch
does not reduce the total number of spinlock acquisitions; the
spinlock is still acquired when selecting every single btree node to
write out, but the patch tries its best to keep the spinlock on the
same cpu core, which avoids the cache bouncing that occurs when the
spinlock is acquired by multiple different cpu cores.

With this patch applied, in my pressure testing, 'top' shows more than
50% sys cpu time reduced for the kworkers competing for the spinlock
inside btree_flush_write().

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 7 ++++++-
 drivers/md/bcache/journal.h | 4 ++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index bc0e01151155..8536e76fcac9 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -514,6 +514,7 @@ static void btree_flush_write(struct cache_set *c)
 	 */
 	struct btree *b;
 	int i;
+	int n = FLUSH_BTREE_HEAP;
 
 	atomic_long_inc(&c->flush_write);
 
@@ -552,6 +553,10 @@ static void btree_flush_write(struct cache_set *c)
 
 		__bch_btree_node_write(b, NULL);
 		mutex_unlock(&b->write_lock);
+
+		/* try to flush btree nodes as many as possible */
+		if (--n > 0)
+			goto retry;
 	}
 }
 
@@ -1102,7 +1107,7 @@ int bch_journal_alloc(struct cache_set *c)
 	j->w[0].c = c;
 	j->w[1].c = c;
 
-	if (!(init_heap(&c->flush_btree, 128, GFP_KERNEL)) ||
+	if (!(init_heap(&c->flush_btree, FLUSH_BTREE_HEAP, GFP_KERNEL)) ||
 	    !(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
 	    !(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) ||
 	    !(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)))
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index 55f81443f304..a8be14c6f6d9 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -158,8 +158,8 @@ struct journal_device {
 #define journal_pin_cmp(c, l, r)				\
 	(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))
 
-#define JOURNAL_PIN	20000
-
+#define FLUSH_BTREE_HEAP	128
+#define JOURNAL_PIN		20000
 /* Reserved jouranl space in sectors */
 #define BCH_JOURNAL_RPLY_RESERVE	6U
 #define BCH_JOURNAL_RESERVE		7U
-- 
2.16.4



* [RFC PATCH v2 15/16] bcache: improve bcache_reboot()
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (13 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-23  7:13   ` Hannes Reinecke
  2019-04-19 16:05 ` [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal Coly Li
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch releases the mutex bch_register_lock early, to give a
chance to stop the cache set and bcache devices earlier.

This patch also extends the timeout for stopping all bcache devices
from 2 seconds to 10 seconds, because stopping the writeback rate
update worker may be delayed for 5 seconds; 2 seconds is not enough.

With this patch applied, it is very hard to observe bcache devices
failing to stop during system reboot or shutdown any more.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index e988e46a6479..2d377a4a182f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2453,10 +2453,13 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 		list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
 			bcache_device_stop(&dc->disk);
 
+		mutex_unlock(&bch_register_lock);
+
 		/* What's a condition variable? */
 		while (1) {
-			long timeout = start + 2 * HZ - jiffies;
+			long timeout = start + 10 * HZ - jiffies;
 
+			mutex_lock(&bch_register_lock);
 			stopped = list_empty(&bch_cache_sets) &&
 				list_empty(&uncached_devices);
 
@@ -2468,7 +2471,6 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 
 			mutex_unlock(&bch_register_lock);
 			schedule_timeout(timeout);
-			mutex_lock(&bch_register_lock);
 		}
 
 		finish_wait(&unregister_wait, &wait);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal
  2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
                   ` (14 preceding siblings ...)
  2019-04-19 16:05 ` [RFC PATCH v2 15/16] bcache: improve bcache_reboot() Coly Li
@ 2019-04-19 16:05 ` Coly Li
  2019-04-23  7:14   ` Hannes Reinecke
  15 siblings, 1 reply; 47+ messages in thread
From: Coly Li @ 2019-04-19 16:05 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

In btree_flush_write(), iterating all cached btree nodes and adding them
into the ordered heap c->flush_btree takes quite a long time. To protect
the ordered heap c->flush_btree, spin lock c->journal.lock is held for
the whole iteration and heap ordering. When journal space is fully
occupied, btree_flush_write() might be called frequently; if the cached
btree node iteration takes too much time, the kernel will complain that
normal journal kworkers are blocked for too long. Of course write
performance drops at this moment.

This patch introduces a new spin lock member in struct journal, named
flush_write_lock. This lock is only used in btree_flush_write() and
protects the ordered heap c->flush_btree during the whole cached btree
node iteration. Then there won't be lock contention on c->journal.lock.

After this fix, when journal space is fully occupied, it is very rare to
observe the journal kworker blocking timeout warning.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 5 +++--
 drivers/md/bcache/journal.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 8536e76fcac9..6e38470f6924 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -519,7 +519,7 @@ static void btree_flush_write(struct cache_set *c)
 	atomic_long_inc(&c->flush_write);
 
 retry:
-	spin_lock(&c->journal.lock);
+	spin_lock(&c->journal.flush_write_lock);
 	if (heap_empty(&c->flush_btree)) {
 		for_each_cached_btree(b, c, i)
 			if (btree_current_write(b)->journal) {
@@ -540,7 +540,7 @@ static void btree_flush_write(struct cache_set *c)
 
 	b = NULL;
 	heap_pop(&c->flush_btree, b, journal_min_cmp);
-	spin_unlock(&c->journal.lock);
+	spin_unlock(&c->journal.flush_write_lock);
 
 	if (b) {
 		mutex_lock(&b->write_lock);
@@ -1099,6 +1099,7 @@ int bch_journal_alloc(struct cache_set *c)
 	struct journal *j = &c->journal;
 
 	spin_lock_init(&j->lock);
+	spin_lock_init(&j->flush_write_lock);
 	INIT_DELAYED_WORK(&j->work, journal_write_work);
 
 	c->journal_delay_ms = 100;
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index a8be14c6f6d9..d8ad99f6191b 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -103,6 +103,7 @@ struct journal_write {
 /* Embedded in struct cache_set */
 struct journal {
 	spinlock_t		lock;
+	spinlock_t		flush_write_lock;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
@ 2019-04-20 13:20     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-20 13:20 UTC (permalink / raw)
  To: Sasha Levin, linux-bcache; +Cc: linux-block, stable

On 2019/4/20 7:16 AM, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
> 
> The bot has tested the following trees: v5.0.8, v4.19.35, v4.14.112, v4.9.169, v4.4.178, v3.18.138.
> 
> v5.0.8: Build OK!
> v4.19.35: Build OK!
> v4.14.112: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v4.9.169: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v4.4.178: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> v3.18.138: Failed to apply! Possible dependencies:
>     a728eacbbdd2 ("bcache: add journal statistic")
>     c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
> 
> 
> How should we proceed with this patch?

This patch will go into Linux v5.2. We can have them in stable after
they are upstream.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue()
  2019-04-19 16:05 ` [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
@ 2019-04-21 17:43   ` Chaitanya Kulkarni
  2019-04-23  7:05   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Chaitanya Kulkarni @ 2019-04-21 17:43 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 04/19/2019 11:24 AM, Coly Li wrote:
> Add code comments to explain which call back function might be called
> for the closure_queue(). This is an effort to make code to be more
> understandable for readers.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 83a7cb0e0e45..9b41e0b62cc0 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -662,6 +662,11 @@ static const struct block_device_operations bcache_ops = {
>   void bcache_device_stop(struct bcache_device *d)
>   {
>   	if (!test_and_set_bit(BCACHE_DEV_CLOSING, &d->flags))
> +		/*
> +		 * closure_fn set to
> +		 * - cached device: cached_dev_flush()
> +		 * - flash dev: flash_dev_flush()
> +		 */
>   		closure_queue(&d->cl);
>   }
>
> @@ -1677,6 +1682,7 @@ static void __cache_set_unregister(struct closure *cl)
>   void bch_cache_set_stop(struct cache_set *c)
>   {
>   	if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
> +		/* closure_fn set to __cache_set_unregister() */
>   		closure_queue(&c->caching);
>   }
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket()
  2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
@ 2019-04-21 17:47   ` Chaitanya Kulkarni
  2019-04-22 15:11     ` Coly Li
  2019-04-23  6:50   ` Hannes Reinecke
  1 sibling, 1 reply; 47+ messages in thread
From: Chaitanya Kulkarni @ 2019-04-21 17:47 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

Maybe we should make this an inline function instead of a macro?
Reasons being:
1. It is not a simple substitution.
2. It has multiple function calls.
3. With this patch it now depends on a local variable declared
    outside of the macro.

Any thoughts?


On 04/19/2019 11:24 AM, Coly Li wrote:
> 'int ret' is defined as a local variable inside macro read_bucket().
> Since this macro is called multiple times, and following patches will
> use an 'int ret' variable in bch_journal_read(), this patch moves the
> definition of 'int ret' from macro read_bucket() to the scope of
> function bch_journal_read().
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index b2fd412715b1..6e18057d1d82 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>   {
>   #define read_bucket(b)							\
>   	({								\
> -		int ret = journal_read_bucket(ca, list, b);		\
> +		ret = journal_read_bucket(ca, list, b);			\
>   		__set_bit(b, bitmap);					\
>   		if (ret < 0)						\
>   			return ret;					\
> @@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>
>   	struct cache *ca;
>   	unsigned int iter;
> +	int ret = 0;
>
>   	for_each_cache(ca, c, iter) {
>   		struct journal_device *ja = &ca->journal;
> @@ -267,7 +268,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>   					    struct journal_replay,
>   					    list)->j.seq;
>
> -	return 0;
> +	return ret;
>   #undef read_bucket
>   }
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path
  2019-04-19 16:05 ` [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path Coly Li
@ 2019-04-21 17:50   ` Chaitanya Kulkarni
  2019-04-23  7:05   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Chaitanya Kulkarni @ 2019-04-21 17:50 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 04/19/2019 11:24 AM, Coly Li wrote:
> Add comments to explain why in register_bcache() blkdev_put() won't
> be called in two location. Add comments to explain why blkdev_put()
> must be called in register_cache() when cache_alloc() failed.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index a435c506edba..83a7cb0e0e45 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2191,6 +2191,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
>
>   	ret = cache_alloc(ca);
>   	if (ret != 0) {
> +		/*
> +		 * If we failed here, it means ca->kobj is not initialized yet,
> +		 * kobject_put() won't be called and there is no chance to
> +		 * call blkdev_put() to bdev in bch_cache_release(). So we
> +		 * explicitly call blkdev_put() here.
> +		 */
>   		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>   		if (ret == -ENOMEM)
>   			err = "cache_alloc(): -ENOMEM";
> @@ -2331,6 +2337,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		mutex_lock(&bch_register_lock);
>   		ret = register_bdev(sb, sb_page, bdev, dc);
>   		mutex_unlock(&bch_register_lock);
> +		/* blkdev_put() will be called in cached_dev_free() */
>   		if (ret < 0)
>   			goto err;
>   	} else {
> @@ -2339,6 +2346,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		if (!ca)
>   			goto err_close;
>
> +		/* blkdev_put() will be called in bch_cache_release() */
>   		if (register_cache(sb, sb_page, bdev, ca) != 0)
>   			goto err;
>   	}
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine
  2019-04-19 16:05 ` [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine Coly Li
@ 2019-04-21 17:52   ` Chaitanya Kulkarni
  2019-04-23  7:03   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Chaitanya Kulkarni @ 2019-04-21 17:52 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

Looks good.

Reviewed-by:- Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 04/19/2019 11:24 AM, Coly Li wrote:
> Bcache has several routines to release resources in implicit way, they
> are called when the associated kobj released. This patch adds code
> comments to notice when and which release callback will be called,
> - When dc->disk.kobj released:
>    void bch_cached_dev_release(struct kobject *kobj)
> - When d->kobj released:
>    void bch_flash_dev_release(struct kobject *kobj)
> - When c->kobj released:
>    void bch_cache_set_release(struct kobject *kobj)
> - When ca->kobj released
>    void bch_cache_release(struct kobject *kobj)
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 036bffad0bfe..400af446c372 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1174,6 +1174,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
>   	return 0;
>   }
>
> +/* when dc->disk.kobj released */
>   void bch_cached_dev_release(struct kobject *kobj)
>   {
>   	struct cached_dev *dc = container_of(kobj, struct cached_dev,
> @@ -1326,6 +1327,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>
>   /* Flash only volumes */
>
> +/* When d->kobj released */
>   void bch_flash_dev_release(struct kobject *kobj)
>   {
>   	struct bcache_device *d = container_of(kobj, struct bcache_device,
> @@ -1496,6 +1498,7 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>   	return true;
>   }
>
> +/* When c->kobj released */
>   void bch_cache_set_release(struct kobject *kobj)
>   {
>   	struct cache_set *c = container_of(kobj, struct cache_set, kobj);
> @@ -2023,6 +2026,7 @@ static const char *register_cache_set(struct cache *ca)
>
>   /* Cache device */
>
> +/* When ca->kobj released */
>   void bch_cache_release(struct kobject *kobj)
>   {
>   	struct cache *ca = container_of(kobj, struct cache, kobj);
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev()
  2019-04-19 16:05 ` [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev() Coly Li
@ 2019-04-21 18:00   ` Chaitanya Kulkarni
  2019-04-23  7:04   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Chaitanya Kulkarni @ 2019-04-21 18:00 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 04/19/2019 11:24 AM, Coly Li wrote:
> This patch adds a return value to register_bdev(). Then if a failure happens
> inside register_bdev(), its caller register_bcache() may detect and
> handle the failure properly.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 16 ++++++++++------
>   1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 400af446c372..a435c506edba 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1281,7 +1281,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>
>   /* Cached device - bcache superblock */
>
> -static void register_bdev(struct cache_sb *sb, struct page *sb_page,
> +static int register_bdev(struct cache_sb *sb, struct page *sb_page,
>   				 struct block_device *bdev,
>   				 struct cached_dev *dc)
>   {
> @@ -1319,10 +1319,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>   	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
>   		bch_cached_dev_run(dc);
>
> -	return;
> +	return 0;
>   err:
>   	pr_notice("error %s: %s", dc->backing_dev_name, err);
>   	bcache_device_stop(&dc->disk);
So register_bdev() has a couple of goto err; calls where the kobject
interface is used directly:

register_bdev
	kobject_add

and indirectly:

register_bdev
	bch_cache_accounting_add_kobjs
		kobject_add

Here we return -EIO; we should catch the error code from the kobject
calls and return that error instead of -EIO. Any thoughts?

> +	return -EIO;
>   }
>
>   /* Flash only volumes */
> @@ -2273,7 +2274,7 @@ static bool bch_is_open(struct block_device *bdev)
>   static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			       const char *buffer, size_t size)
>   {
> -	ssize_t ret = size;
> +	ssize_t ret = -EINVAL;
>   	const char *err = "cannot allocate memory";
>   	char *path = NULL;
>   	struct cache_sb *sb = NULL;
> @@ -2307,7 +2308,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			if (!IS_ERR(bdev))
>   				bdput(bdev);
>   			if (attr == &ksysfs_register_quiet)
> -				goto out;
> +				goto quiet_out;
>   		}
>   		goto err;
>   	}
> @@ -2328,8 +2329,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			goto err_close;
>
>   		mutex_lock(&bch_register_lock);
> -		register_bdev(sb, sb_page, bdev, dc);
> +		ret = register_bdev(sb, sb_page, bdev, dc);
>   		mutex_unlock(&bch_register_lock);
> +		if (ret < 0)
> +			goto err;
>   	} else {
>   		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
>
> @@ -2339,6 +2342,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		if (register_cache(sb, sb_page, bdev, ca) != 0)
>   			goto err;
>   	}
> +quiet_out:
> +	ret = size;
>   out:
>   	if (sb_page)
>   		put_page(sb_page);
> @@ -2351,7 +2356,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>   err:
>   	pr_info("error %s: %s", path, err);
> -	ret = -EINVAL;
>   	goto out;
>   }
>
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket()
  2019-04-21 17:47   ` Chaitanya Kulkarni
@ 2019-04-22 15:11     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-22 15:11 UTC (permalink / raw)
  To: Chaitanya Kulkarni, linux-bcache; +Cc: linux-block

On 2019/4/22 1:47 AM, Chaitanya Kulkarni wrote:
> Maybe we should make this an inline function instead of a macro?
> Reasons being:
> 1. It is not a simple substitution.
> 2. It has multiple function calls.
> 3. With this patch it now depends on a local variable declared
>     outside of the macro.
> 
> Any thoughts ?
> 

Hi Chaitanya,

I agree with you. There are some symbol names shared with the "caller" of
read_bucket(), so it is not a very simple change to turn the macro into
an inline function. At this moment I will keep the change as simple as
possible, and later I will think about changing this macro into an
inline routine.

Thanks.

Coly Li


> 
> On 04/19/2019 11:24 AM, Coly Li wrote:
>> 'int ret' is defined as a local variable inside macro read_bucket().
>> Since this macro is called multiple times, and following patches will
>> use an 'int ret' variable in bch_journal_read(), this patch moves the
>> definition of 'int ret' from macro read_bucket() to the scope of
>> function bch_journal_read().
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index b2fd412715b1..6e18057d1d82 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>>   {
>>   #define read_bucket(b)							\
>>   	({								\
>> -		int ret = journal_read_bucket(ca, list, b);		\
>> +		ret = journal_read_bucket(ca, list, b);			\
>>   		__set_bit(b, bitmap);					\
>>   		if (ret < 0)						\
>>   			return ret;					\
>> @@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>>
>>   	struct cache *ca;
>>   	unsigned int iter;
>> +	int ret = 0;
>>
>>   	for_each_cache(ca, c, iter) {
>>   		struct journal_device *ja = &ca->journal;
>> @@ -267,7 +268,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>>   					    struct journal_replay,
>>   					    list)->j.seq;
>>
>> -	return 0;
>> +	return ret;
>>   #undef read_bucket
>>   }
>>
>>
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket()
  2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
  2019-04-21 17:47   ` Chaitanya Kulkarni
@ 2019-04-23  6:50   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:50 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:04 PM, Coly Li wrote:
> 'int ret' is defined as a local variable inside macro read_bucket().
> Since this macro is called multiple times, and following patches will
> use an 'int ret' variable in bch_journal_read(), this patch moves the
> definition of 'int ret' from macro read_bucket() to the scope of
> function bch_journal_read().
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index b2fd412715b1..6e18057d1d82 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -147,7 +147,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>   {
>   #define read_bucket(b)							\
>   	({								\
> -		int ret = journal_read_bucket(ca, list, b);		\
> +		ret = journal_read_bucket(ca, list, b);			\
>   		__set_bit(b, bitmap);					\
>   		if (ret < 0)						\
>   			return ret;					\
> @@ -156,6 +156,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>   
>   	struct cache *ca;
>   	unsigned int iter;
> +	int ret = 0;
>   
>   	for_each_cache(ca, c, iter) {
>   		struct journal_device *ja = &ca->journal;
> @@ -267,7 +268,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
>   					    struct journal_replay,
>   					    list)->j.seq;
>   
> -	return 0;
> +	return ret;
>   #undef read_bucket
>   }
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim()
  2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of journal key in journal_reclaim() Coly Li
@ 2019-04-23  6:50   ` Hannes Reinecke
  0 siblings, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:50 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:04 PM, Coly Li wrote:
> In journal_reclaim(), ja->cur_idx of each cache will be updated to
> reclaim available journal buckets. Variable 'int n' is used to count how
> many caches are successfully reclaimed, then n is set to c->journal.key
> by SET_KEY_PTRS(). Later in journal_write_unlocked(), a for_each_cache()
> loop will write the jset data onto each cache.
> 
> The problem is, if all journal buckets on each cache are full, the
> following code in journal_reclaim(),
> 
> 529 for_each_cache(ca, c, iter) {
> 530         struct journal_device *ja = &ca->journal;
> 531         unsigned int next = (ja->cur_idx + 1) % ca->sb.njournal_buckets;
> 532
> 533         /* No space available on this device */
> 534         if (next == ja->discard_idx)
> 535                 continue;
> 536
> 537         ja->cur_idx = next;
> 538         k->ptr[n++] = MAKE_PTR(0,
> 539                           bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
> 540                           ca->sb.nr_this_dev);
> 541 }
> 542
> 543 bkey_init(k);
> 544 SET_KEY_PTRS(k, n);
> 
> If there is no available bucket to reclaim, the if() condition at line
> 534 will always be true, and n remains 0. Then at line 544, SET_KEY_PTRS()
> will set the KEY_PTRS field of c->journal.key to 0.
> 
> Setting the KEY_PTRS field of c->journal.key to 0 is wrong, because in
> journal_write_unlocked() the journal data is written in the following loop,
> 
> 649	for (i = 0; i < KEY_PTRS(k); i++) {
> 650-671		submit journal data to cache device
> 672	}
> 
> If the KEY_PTRS field is set to 0 in journal_reclaim(), the journal data
> won't be written to the cache device here. If the system crashed or rebooted
> before the bkeys of the lost journal entries were written into btree nodes, data
> corruption will be reported during bcache reload after rebooting the
> system.
> 
> Indeed, there is only one cache in a cache set, so there is no need to set
> the KEY_PTRS field in journal_reclaim() at all. But in order to keep the
> for_each_cache() logic consistent for now, this patch fixes the above
> problem by not setting 0 KEY_PTRS of journal key, if there is no bucket
> available to reclaim.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 11 +++++++----
>   1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 6e18057d1d82..5180bed911ef 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -541,11 +541,11 @@ static void journal_reclaim(struct cache_set *c)
>   				  ca->sb.nr_this_dev);
>   	}
>   
> -	bkey_init(k);
> -	SET_KEY_PTRS(k, n);
> -
> -	if (n)
> +	if (n) {
> +		bkey_init(k);
> +		SET_KEY_PTRS(k, n);
>   		c->journal.blocks_free = c->sb.bucket_size >> c->block_bits;
> +	}
>   out:
>   	if (!journal_full(&c->journal))
>   		__closure_wake_up(&c->journal.wait);
> @@ -672,6 +672,9 @@ static void journal_write_unlocked(struct closure *cl)
>   		ca->journal.seq[ca->journal.cur_idx] = w->data->seq;
>   	}
>   
> +	/* If KEY_PTRS(k) == 0, this jset gets lost in air */
> +	BUG_ON(i == 0);
> +
>   	atomic_dec_bug(&fifo_back(&c->journal.pin));
>   	bch_journal_next(&c->journal);
>   	journal_reclaim(c);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
  2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay Coly Li
@ 2019-04-23  6:54   ` Hannes Reinecke
  2019-04-23  6:56     ` Coly Li
  0 siblings, 1 reply; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:54 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:04 PM, Coly Li wrote:
> When the bcache journal initializes during a running cache set, cache set
> journal.blocks_free is initialized as 0. Then during journal replay, if
> journal_meta() is called and an empty jset is written to the cache device,
> journal_reclaim() is called. If there is an available journal bucket to
> reclaim, c->journal.blocks_free is set to the number of blocks of a journal
> bucket, which is c->sb.bucket_size >> c->block_bits.
> 
> Most of the time the above process works correctly, except the condition
> when journal space is almost full. "Almost full" means there is no free
> journal bucket, but there are still free blocks in the last available
> bucket indexed by ja->cur_idx.
> 
> If system crashes or reboots when journal space is almost full, problem
> comes. During cache set reload after the reboot, c->journal.blocks_free
> is initialized as 0. When the journal replay process writes the bcache journal,
> journal_reclaim() will be called to reclaim available journal bucket and
> set c->journal.blocks_free to c->sb.bucket_size >> c->block_bits. But
> there is no fully free bucket to reclaim in journal_reclaim(), so value
> of c->journal.blocks_free will keep 0. If the first journal entry
> processed by journal_replay() causes btree split and requires writing
> journal space by journal_meta(), journal_meta() has to go into an
> infinite loop to reclaim jouranl bucket, and blocks the whole cache set
> to run.
> 
> Such a buggy situation can be solved if we do the following things before
> journal replay starts,
> - Recover previous value of c->journal.blocks_free in last run time,
>    and set it to current c->journal.blocks_free as initial value.
> - Recover previous value of ja->cur_idx in last run time, and set it to
>    KEY_PTR of current c->journal.key as initial value.
> 
> After c->journal.blocks_free and c->journal.key are recovered, in the
> condition when journal space is almost full and the cache set is reloaded,
> the meta journal entry from journal replay can be written into free blocks of
> the last available journal bucket, then old journal entries can be
> replayed and reclaimed for further journaling requests.
> 
> This patch adds bch_journal_key_reload() to recover the journal blocks_free
> and key ptr values for the above purpose. bch_journal_key_reload() is called
> in bch_journal_read() before replaying the journal by bch_journal_replay().
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 87 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 87 insertions(+)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 5180bed911ef..a6deb16c15c8 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -143,6 +143,89 @@ reread:		left = ca->sb.bucket_size - offset;
>   	return ret;
>   }
>   
> +static int bch_journal_key_reload(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int iter, n = 0;
> +	struct bkey *k = &c->journal.key;
> +	int ret = 0;
> +
> +	for_each_cache(ca, c, iter) {
> +		struct journal_device *ja = &ca->journal;
> +		struct bio *bio = &ja->bio;
> +		struct jset *j, *data = c->journal.w[0].data;
> +		struct closure cl;
> +		unsigned int len, left;
> +		unsigned int offset = 0, used_blocks = 0;
> +		sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
> +
> +		closure_init_stack(&cl);
> +
> +		while (offset < ca->sb.bucket_size) {
> +reread:			left = ca->sb.bucket_size - offset;

Please place the label on a line of its own.

> +			len = min_t(unsigned int,
> +				    left, PAGE_SECTORS << JSET_BITS);
> +
> +			bio_reset(bio);
> +			bio->bi_iter.bi_sector = bucket + offset;
> +			bio_set_dev(bio, ca->bdev);
> +			bio->bi_iter.bi_size = len << 9;
> +
> +			bio->bi_end_io = journal_read_endio;
> +			bio->bi_private = &cl;
> +			bio_set_op_attrs(bio, REQ_OP_READ, 0);
> +			bch_bio_map(bio, data);
> +
> +			closure_bio_submit(c, bio, &cl);
> +			closure_sync(&cl);
> +
> +			j = data;
> +			while (len) {
> +				size_t blocks, bytes = set_bytes(j);
> +
> +				if (j->magic != jset_magic(&ca->sb))
> +					goto out;
> +
> +				if (bytes > left << 9 ||
> +				    bytes > PAGE_SIZE << JSET_BITS) {
> +					pr_err("jset may be corrupted: too big");
> +					ret = -EIO;
> +					goto err;
> +				}
> +
> +				if (bytes > len << 9)
> +					goto reread;
> +
> +				if (j->csum != csum_set(j)) {
> +					pr_err("jset may be corrupted: bad csum");
> +					ret = -EIO;
> +					goto err;
> +				}
> +
> +				blocks = set_blocks(j, block_bytes(c));
> +				used_blocks += blocks;
> +
> +				offset	+= blocks * ca->sb.block_size;
> +				len	-= blocks * ca->sb.block_size;
> +				j = ((void *) j) + blocks * block_bytes(ca);
> +			}
> +		}
> +out:
> +		c->journal.blocks_free =
> +			(c->sb.bucket_size >> c->block_bits) -
> +			used_blocks;
> +
> +		k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
> +	}
> +
> +	BUG_ON(n == 0);
> +	bkey_init(k);
> +	SET_KEY_PTRS(k, n);
> +
> +err:
> +	return ret;
> +}
> +
>   int bch_journal_read(struct cache_set *c, struct list_head *list)
>   {
>   #define read_bucket(b)							\
This is a _quite_ convoluted nested loop.
It would be far better if it could be split into functions, so as to 
avoid the loop-within-loop constructs.
Oh, and some documentation (especially the 'reread' bit) would be nice.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

* Re: [RFC PATCH v2 03/16] bcache: reload journal key information during journal replay
  2019-04-23  6:54   ` Hannes Reinecke
@ 2019-04-23  6:56     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  6:56 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, stable

On 2019/4/23 2:54 PM, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> When the bcache journal initializes while the cache set is running,
>> cache set journal.blocks_free is initialized to 0. Then during journal
>> replay, if journal_meta() is called and an empty jset is written to the
>> cache device, journal_reclaim() is called. If there is an available
>> journal bucket to reclaim, c->journal.blocks_free is set to the number
>> of blocks of a journal bucket, which is c->sb.bucket_size >> c->block_bits.
>>
>> Most of the time the above process works correctly, except in the
>> condition when journal space is almost full. "Almost full" means there
>> is no free journal bucket, but there are still free blocks in the last
>> available bucket, indexed by ja->cur_idx.
>>
>> If the system crashes or reboots when journal space is almost full, a
>> problem arises. During cache set reload after the reboot,
>> c->journal.blocks_free is initialized to 0; when the journal replay
>> process writes the bcache journal, journal_reclaim() will be called to
>> reclaim an available journal bucket and set c->journal.blocks_free to
>> c->sb.bucket_size >> c->block_bits. But there is no fully free bucket
>> to reclaim in journal_reclaim(), so the value of c->journal.blocks_free
>> stays 0. If the first journal entry processed by journal_replay()
>> causes a btree split and requires journal space to be written by
>> journal_meta(), journal_meta() has to go into an infinite loop trying
>> to reclaim a journal bucket, and blocks the whole cache set from
>> running.
>>
>> Such a buggy situation can be solved if we do the following things
>> before journal replay starts,
>> - Recover the value c->journal.blocks_free had in the last run time,
>>    and use it as the initial value of the current c->journal.blocks_free.
>> - Recover the value ja->cur_idx had in the last run time, and use it as
>>    the initial KEY_PTR value of the current c->journal.key.
>>
>> After c->journal.blocks_free and c->journal.key are recovered, in the
>> condition when journal space is almost full and the cache set is
>> reloaded, the meta journal entry from journal replay can be written
>> into the free blocks of the last available journal bucket, then old
>> journal entries can be replayed and reclaimed for further journaling
>> requests.
>>
>> This patch adds bch_journal_key_reload() to recover the journal
>> blocks_free and key ptr values for the above purpose.
>> bch_journal_key_reload() is called in bch_journal_read() before the
>> journal is replayed by bch_journal_replay().
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 87
>> +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 87 insertions(+)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index 5180bed911ef..a6deb16c15c8 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -143,6 +143,89 @@ reread:        left = ca->sb.bucket_size - offset;
>>       return ret;
>>   }
>>   +static int bch_journal_key_reload(struct cache_set *c)
>> +{
>> +    struct cache *ca;
>> +    unsigned int iter, n = 0;
>> +    struct bkey *k = &c->journal.key;
>> +    int ret = 0;
>> +
>> +    for_each_cache(ca, c, iter) {
>> +        struct journal_device *ja = &ca->journal;
>> +        struct bio *bio = &ja->bio;
>> +        struct jset *j, *data = c->journal.w[0].data;
>> +        struct closure cl;
>> +        unsigned int len, left;
>> +        unsigned int offset = 0, used_blocks = 0;
>> +        sector_t bucket = bucket_to_sector(c, ca->sb.d[ja->cur_idx]);
>> +
>> +        closure_init_stack(&cl);
>> +
>> +        while (offset < ca->sb.bucket_size) {
>> +reread:            left = ca->sb.bucket_size - offset;
> 
> Please place the label on a line of its own.
> 
>> +            len = min_t(unsigned int,
>> +                    left, PAGE_SECTORS << JSET_BITS);
>> +
>> +            bio_reset(bio);
>> +            bio->bi_iter.bi_sector = bucket + offset;
>> +            bio_set_dev(bio, ca->bdev);
>> +            bio->bi_iter.bi_size = len << 9;
>> +
>> +            bio->bi_end_io = journal_read_endio;
>> +            bio->bi_private = &cl;
>> +            bio_set_op_attrs(bio, REQ_OP_READ, 0);
>> +            bch_bio_map(bio, data);
>> +
>> +            closure_bio_submit(c, bio, &cl);
>> +            closure_sync(&cl);
>> +
>> +            j = data;
>> +            while (len) {
>> +                size_t blocks, bytes = set_bytes(j);
>> +
>> +                if (j->magic != jset_magic(&ca->sb))
>> +                    goto out;
>> +
>> +                if (bytes > left << 9 ||
>> +                    bytes > PAGE_SIZE << JSET_BITS) {
>> +                    pr_err("jset may be corrupted: too big");
>> +                    ret = -EIO;
>> +                    goto err;
>> +                }
>> +
>> +                if (bytes > len << 9)
>> +                    goto reread;
>> +
>> +                if (j->csum != csum_set(j)) {
>> +                    pr_err("jset may be corrupted: bad csum");
>> +                    ret = -EIO;
>> +                    goto err;
>> +                }
>> +
>> +                blocks = set_blocks(j, block_bytes(c));
>> +                used_blocks += blocks;
>> +
>> +                offset    += blocks * ca->sb.block_size;
>> +                len    -= blocks * ca->sb.block_size;
>> +                j = ((void *) j) + blocks * block_bytes(ca);
>> +            }
>> +        }
>> +out:
>> +        c->journal.blocks_free =
>> +            (c->sb.bucket_size >> c->block_bits) -
>> +            used_blocks;
>> +
>> +        k->ptr[n++] = MAKE_PTR(0, bucket, ca->sb.nr_this_dev);
>> +    }
>> +
>> +    BUG_ON(n == 0);
>> +    bkey_init(k);
>> +    SET_KEY_PTRS(k, n);
>> +
>> +err:
>> +    return ret;
>> +}
>> +
>>   int bch_journal_read(struct cache_set *c, struct list_head *list)
>>   {
>>   #define read_bucket(b)                            \
> This is a _quite_ convoluted nested loop.
> It would be far better if it could be split into functions, so as to
> avoid the loop-within-loop constructs.
> Oh, and some documentation (especially the 'reread' bit) would be nice.

Hi Hannes,

Sure I will fix them in next version. Thanks.


-- 

Coly Li

* Re: [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay
  2019-04-19 16:04 ` [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay Coly Li
@ 2019-04-23  6:59   ` Hannes Reinecke
  2019-04-23  7:07     ` Coly Li
  0 siblings, 1 reply; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  6:59 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:04 PM, Coly Li wrote:
> A deadlock of bcache journaling may happen during journal replay. Such
> a deadlock happens when,
> - Journal space is totally full (no free blocks at all) and the system
>    crashes or reboots.
> - After reboot, the first journal entry handled by journal replay causes
>    a btree split and journal_meta() is called to write an empty jset to
>    the journal space.
> - There is no journal space to write and journal_reclaim() fails to get
>    any available bucket because this is the first replayed journal entry
>    to be blocked.
> Then the whole cache set is blocked from running.
> 
> This patch is an effort to fix such a journal replay deadlock in a
> simpler way,
> - Add a bool variable 'in_replay' in struct journal, set it to true when
>    journal replay starts, and set it to false when journal replay is
>    completed. in_replay is initialized to false.
> - Reserve 6 sectors in the journal bucket, and do not use them in normal
>    bcache runtime. These sectors are only permitted for use during
>    journal replay (when c->journal.in_replay is true)
>
> Then in normal bcache runtime, journal space won't be totally full and
> 6 sectors are always reserved for journal replay time. After the system
> reboots, if bch_btree_insert() in bch_journal_replay() causes a btree
> split and bch_journal_meta() gets called to request 1 sector from the
> journal buckets to write an empty jset, there is enough reserved
> space to serve it.
> 
> The reason to reserve 6 sectors is, we should choose a number that won't
> fit evenly into a bucket size. If the reserved space happened to be a
> whole bucket, more logic would have to be added in journal_replay() to
> handle journal.blocks_free with reserved space at journal replay time.
> This is why 6 sectors was chosen: it is 3KB and won't match any proper
> block size or bucket size.
>
> The bcache btree node size is quite large, so a btree node split won't
> be a frequent event. And when a btree node split happens, the newly
> added key will be inserted directly into the upper level or neighbor
> nodes and won't go into the journal again; only bch_journal_meta() is
> called to write jset metadata, which occupies 1 block in journal space.
> If the block size is set to 4K, the reserved 6 sectors are 2 blocks, so
> two consecutive btree splits can happen during journal replay, which is
> very rare in practice. As the default block size is the sector size,
> that equals 6 reserved blocks. Splitting the btree 6 times in a row
> during journal replay is almost impossible, so the reserved space seems
> to be enough in my humble opinion.
> 
> If in the future the reserved space turns out to be not enough, let's
> extend it then.
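The reservation rule described above can be sketched as a standalone model. The function name, the macro, and the sector accounting are assumptions for illustration, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the rule above: 6 sectors of journal space are off
 * limits in normal runtime and usable only while journal replay is in
 * progress (c->journal.in_replay in the patch). */
#define REPLAY_RESERVED_SECTORS 6

/* Sectors a journal write may actually consume, given the raw free
 * count in the current bucket. */
static int journal_usable_sectors(int free_sectors, bool in_replay)
{
	if (in_replay)
		return free_sectors;	/* replay may dip into the reserve */
	if (free_sectors <= REPLAY_RESERVED_SECTORS)
		return 0;		/* runtime must keep the reserve intact */
	return free_sectors - REPLAY_RESERVED_SECTORS;
}
```

With this shape, runtime writers see the journal as "full" while 6 sectors still remain, so a replay-time bch_journal_meta() always has room.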
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 100 ++++++++++++++++++++++++++++++++++++++++----
>   drivers/md/bcache/journal.h |   4 ++
>   2 files changed, 97 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index a6deb16c15c8..c60a702f53a9 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   	uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
>   	struct keylist keylist;
>   
> +	s->journal.in_replay = true;
> +
>   	list_for_each_entry(i, list, list) {
>   		BUG_ON(i->pin && atomic_read(i->pin) != 1);
>   
> @@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   	pr_info("journal replay done, %i keys in %i entries, seq %llu",
>   		keys, entries, end);
>   err:
> +	s->journal.in_replay = false;
>   	while (!list_empty(list)) {
>   		i = list_first_entry(list, struct journal_replay, list);
>   		list_del(&i->list);

What about locking here?
Doesn't 'in_replay' need to be guarded by a spinlock or something?
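One race-free shape for the flag, if journal writers could ever run concurrently with replay, is release/acquire publication. This is a standalone C11 sketch, not the patch's code; in the patch, replay runs before the cache set is live, so a plain bool may already suffice:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical guarded version of the 'in_replay' flag: stores publish
 * with release ordering, loads observe with acquire ordering, so a
 * writer that sees in_replay == true also sees everything replay did
 * before setting it. */
static atomic_bool in_replay;

static void journal_replay_begin(void)
{
	atomic_store_explicit(&in_replay, true, memory_order_release);
}

static void journal_replay_end(void)
{
	atomic_store_explicit(&in_replay, false, memory_order_release);
}

static bool journal_in_replay(void)
{
	return atomic_load_explicit(&in_replay, memory_order_acquire);
}
```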

Cheers,

Hannes

* Re: [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time
  2019-04-19 16:04 ` [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time Coly Li
@ 2019-04-23  7:00   ` Hannes Reinecke
  0 siblings, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:00 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:04 PM, Coly Li wrote:
> Another journal deadlock of bcache journaling can happen in normal
> bcache runtime. It is very rare, but there are reports of the bkey
> insert work queue being blocked, caused by such a deadlock.
>
> This is how such a journaling deadlock happens at runtime,
> - Journal space is totally full with no free space to reclaim, and
>    journaling tasks are waiting for space to write in
>    journal_wait_for_write().
> - In order to free journal space, btree_flush_write() is called to
>    flush the earliest journaled in-memory btree key into a btree node.
>    Once all journaled bkeys in an early used journal bucket are flushed
>    to the on-disk btree, that journal bucket can be reclaimed for newly
>    coming journal requests.
> - But if the earliest journaled bkey causes a btree node split while it
>    is inserted into the btree node, journal_meta() will finally be
>    called to journal the btree root (and other information) into the
>    journal space.
> - Unfortunately the journal space is full, and the journal entries have
>    to be flushed in linear order. So bch_journal_meta() from the bkey
>    insert is blocked too.
> Then a journaling deadlock during bcache runtime happens.
> 
> A method to fix such a deadlock is to reserve some journal space too.
> The reserved space can only be used when,
> - The current journal bucket is the last journal bucket which has
>    available space to write into.
> - When calling bch_journal(), the current jset is empty and there is no
>    key in the inserting key list. This means the journal request is from
>    bch_journal_meta() and no non-reserved space can be used.
>
> Then if such a journaling request is from bch_journal_meta() while
> inserting the earliest journaled bkey back into the btree, the deadlock
> condition won't happen any more because the reserved space can be used
> for such a scenario.
>
> Since there are already 6 sectors reserved for journal replay, here we
> reserve 7 sectors for runtime meta journal writes from btree splits
> caused by flushing journal entries back to btree nodes. Depending on
> the block size, from 1 sector to 4KB, the reserved space can serve from
> 7 down to 2 journal blocks. Indeed only one journal block reserved for
> such a journal deadlock scenario is enough; 2 consecutive btree splits
> caused by two adjacent bkeys flushed from the journal are very rare to
> happen. So reserving 7 sectors should work.
>
> Another reason for reserving 7 sectors is, there are already 6 sectors
> reserved for journal replay, so in total there are 13 sectors reserved
> in the last available journal bucket. 13 sectors won't match a proper
> bucket size, so we don't need to add more code to handle
> journal.blocks_free initialization for a wholly reserved journal
> bucket. Even though such code logic would be simple, less code is
> better in my humble opinion.
>
> Again, if in the future the reserved space turns out to be not enough,
> let's extend it then.
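The two-part gate described above can be modeled as a tiny predicate. The names are assumptions for illustration, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the gating rule: the runtime-reserved sectors may be
 * consumed only when we are on the last journal bucket that still has
 * space AND the request carries no keys (an empty jset with an empty
 * key list, i.e. it came from bch_journal_meta()). */
static bool may_use_runtime_reserve(bool last_bucket_with_space,
				    bool jset_empty, bool keylist_empty)
{
	return last_bucket_with_space && jset_empty && keylist_empty;
}
```

Any ordinary bch_journal() call with keys fails the gate, so only the deadlock-breaking meta write can dip into the reserve.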
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 89 +++++++++++++++++++++++++++++++++------------
>   drivers/md/bcache/journal.h |  1 +
>   2 files changed, 66 insertions(+), 24 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

* Re: [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay
  2019-04-19 16:04 ` [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay Coly Li
@ 2019-04-23  7:02   ` Hannes Reinecke
  2019-04-23  7:09     ` Coly Li
  2019-04-24 16:06     ` Coly Li
  0 siblings, 2 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:02 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:04 PM, Coly Li wrote:
> Currently run_cache_set() has no return value; if there is a failure in
> bch_journal_replay(), the caller of run_cache_set() has no idea about
> such a failure and just continues to execute the code following
> run_cache_set(). The internal failure is triggered inside
> bch_journal_replay() and handled in an async way. This behavior is
> inefficient: while the failure is handled inside bch_journal_replay(),
> the cache register code is still running to start the cache set.
> Registering and unregistering code running at the same time may
> introduce some rare race condition, and makes the code harder to
> understand.
>
> This patch adds a return value to run_cache_set(), and returns -EIO if
> bch_journal_replay() fails. Then the caller of run_cache_set() may
> detect such a failure and stop the registering code flow immediately
> inside register_cache_set().
>
> If journal replay fails, run_cache_set() can report the error
> immediately to register_cache_set(). This patch makes the failure
> handling for bch_journal_replay() synchronous, easier to understand and
> debug, and avoids a potential race condition between register and
> unregister running at the same time.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 17 ++++++++++++-----
>   1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index a697a3a923cd..036bffad0bfe 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1775,7 +1775,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>   	return NULL;
>   }
>   
> -static void run_cache_set(struct cache_set *c)
> +static int run_cache_set(struct cache_set *c)
>   {
>   	const char *err = "cannot allocate memory";
>   	struct cached_dev *dc, *t;
> @@ -1869,7 +1869,9 @@ static void run_cache_set(struct cache_set *c)
>   		if (j->version < BCACHE_JSET_VERSION_UUID)
>   			__uuid_write(c);
>   
> -		bch_journal_replay(c, &journal);
> +		err = "bcache: replay journal failed";
> +		if (bch_journal_replay(c, &journal))
> +			goto err;
>   	} else {
>   		pr_notice("invalidating existing data");
>   
> @@ -1937,11 +1939,13 @@ static void run_cache_set(struct cache_set *c)
>   	flash_devs_run(c);
>   
>   	set_bit(CACHE_SET_RUNNING, &c->flags);
> -	return;
> +	return 0;
>   err:
>   	closure_sync(&cl);
>   	/* XXX: test this, it's broken */
>   	bch_cache_set_error(c, "%s", err);
> +
> +	return -EIO;
>   }
>   
Shouldn't you differentiate between -EIO if the journal replay failed 
and -ENOMEM?
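One possible shape for the distinction being asked about is to forward the callee's error code instead of collapsing everything to -EIO. A simplified, hypothetical sketch (the helper is a stand-in, not bcache's function):

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for bch_journal_replay(): returns 0 on success or a
 * negative errno describing the kind of failure. */
static int fake_journal_replay(int fail_with)
{
	return fail_with;
}

/* Sketch of run_cache_set() forwarding the callee's code, so an
 * allocation failure stays -ENOMEM while a bad journal stays -EIO. */
static int run_cache_set_sketch(int replay_err)
{
	int ret = fake_journal_replay(replay_err);

	if (ret)
		return ret;	/* propagate -EIO, -ENOMEM, ... unchanged */
	return 0;
}
```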

Cheers,

Hannes

* Re: [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine
  2019-04-19 16:05 ` [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine Coly Li
  2019-04-21 17:52   ` Chaitanya Kulkarni
@ 2019-04-23  7:03   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:03 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> Bcache has several routines that release resources in an implicit way;
> they are called when the associated kobj is released. This patch adds
> code comments to note when and which release callback will be called,
> - When dc->disk.kobj released:
>    void bch_cached_dev_release(struct kobject *kobj)
> - When d->kobj released:
>    void bch_flash_dev_release(struct kobject *kobj)
> - When c->kobj released:
>    void bch_cache_set_release(struct kobject *kobj)
> - When ca->kobj released
>    void bch_cache_release(struct kobject *kobj)
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 036bffad0bfe..400af446c372 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1174,6 +1174,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
>   	return 0;
>   }
>   
> +/* when dc->disk.kobj released */
>   void bch_cached_dev_release(struct kobject *kobj)
>   {
>   	struct cached_dev *dc = container_of(kobj, struct cached_dev,
> @@ -1326,6 +1327,7 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>   
>   /* Flash only volumes */
>   
> +/* When d->kobj released */
>   void bch_flash_dev_release(struct kobject *kobj)
>   {
>   	struct bcache_device *d = container_of(kobj, struct bcache_device,
> @@ -1496,6 +1498,7 @@ bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>   	return true;
>   }
>   
> +/* When c->kobj released */
>   void bch_cache_set_release(struct kobject *kobj)
>   {
>   	struct cache_set *c = container_of(kobj, struct cache_set, kobj);
> @@ -2023,6 +2026,7 @@ static const char *register_cache_set(struct cache *ca)
>   
>   /* Cache device */
>   
> +/* When ca->kobj released */
>   void bch_cache_release(struct kobject *kobj)
>   {
>   	struct cache *ca = container_of(kobj, struct cache, kobj);
> 
This could be inferred from the 'container_of' statement for each release
function, but if you want ...

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

* Re: [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay()
  2019-04-19 16:05 ` [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay() Coly Li
@ 2019-04-23  7:04   ` Hannes Reinecke
  0 siblings, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:04 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> When a failure happens inside bch_journal_replay(), calling
> cache_set_err_on() and handling the failure in an async way is not a
> good idea, because after bch_journal_replay() returns, the registering
> code will continue to execute the following steps while the
> unregistering code triggered by cache_set_err_on() runs at the same
> time. First, it is unnecessary to handle the failure and unregister the
> cache set in an async way; second, there might be a potential race
> condition between running the register and unregister code for the same
> cache set.
>
> So in this patch, if a failure happens in bch_journal_replay(), we
> don't call cache_set_err_on(), and just print the same error message to
> the kernel message buffer, then return -EIO immediately to the caller.
> Then the caller can detect such a failure and handle it in a
> synchronized way.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index 6aa68ab7cd78..bdb6f9cefe48 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -420,9 +420,12 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   	list_for_each_entry(i, list, list) {
>   		BUG_ON(i->pin && atomic_read(i->pin) != 1);
>   
> -		cache_set_err_on(n != i->j.seq, s,
> -"bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
> -				 n, i->j.seq - 1, start, end);
> +		if (n != i->j.seq) {
> +			pr_err("bcache: journal entries %llu-%llu missing! (replaying %llu-%llu)",
> +			n, i->j.seq - 1, start, end);
> +			ret = -EIO;
> +			goto err;
> +		}
>   
>   		for (k = i->j.start;
>   		     k < bset_bkey_last(&i->j);
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

* Re: [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev()
  2019-04-19 16:05 ` [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev() Coly Li
  2019-04-21 18:00   ` Chaitanya Kulkarni
@ 2019-04-23  7:04   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:04 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> This patch adds a return value to register_bdev(). Then if a failure
> happens inside register_bdev(), its caller register_bcache() may detect
> and handle the failure more properly.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 16 ++++++++++------
>   1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 400af446c372..a435c506edba 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1281,7 +1281,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned int block_size)
>   
>   /* Cached device - bcache superblock */
>   
> -static void register_bdev(struct cache_sb *sb, struct page *sb_page,
> +static int register_bdev(struct cache_sb *sb, struct page *sb_page,
>   				 struct block_device *bdev,
>   				 struct cached_dev *dc)
>   {
> @@ -1319,10 +1319,11 @@ static void register_bdev(struct cache_sb *sb, struct page *sb_page,
>   	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
>   		bch_cached_dev_run(dc);
>   
> -	return;
> +	return 0;
>   err:
>   	pr_notice("error %s: %s", dc->backing_dev_name, err);
>   	bcache_device_stop(&dc->disk);
> +	return -EIO;
>   }
>   
>   /* Flash only volumes */
> @@ -2273,7 +2274,7 @@ static bool bch_is_open(struct block_device *bdev)
>   static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			       const char *buffer, size_t size)
>   {
> -	ssize_t ret = size;
> +	ssize_t ret = -EINVAL;
>   	const char *err = "cannot allocate memory";
>   	char *path = NULL;
>   	struct cache_sb *sb = NULL;
> @@ -2307,7 +2308,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			if (!IS_ERR(bdev))
>   				bdput(bdev);
>   			if (attr == &ksysfs_register_quiet)
> -				goto out;
> +				goto quiet_out;
>   		}
>   		goto err;
>   	}
> @@ -2328,8 +2329,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			goto err_close;
>   
>   		mutex_lock(&bch_register_lock);
> -		register_bdev(sb, sb_page, bdev, dc);
> +		ret = register_bdev(sb, sb_page, bdev, dc);
>   		mutex_unlock(&bch_register_lock);
> +		if (ret < 0)
> +			goto err;
>   	} else {
>   		struct cache *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
>   
> @@ -2339,6 +2342,8 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		if (register_cache(sb, sb_page, bdev, ca) != 0)
>   			goto err;
>   	}
> +quiet_out:
> +	ret = size;
>   out:
>   	if (sb_page)
>   		put_page(sb_page);
> @@ -2351,7 +2356,6 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   	blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>   err:
>   	pr_info("error %s: %s", path, err);
> -	ret = -EINVAL;
>   	goto out;
>   }
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

* Re: [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path
  2019-04-19 16:05 ` [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path Coly Li
  2019-04-21 17:50   ` Chaitanya Kulkarni
@ 2019-04-23  7:05   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:05 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> Add comments to explain why in register_bcache() blkdev_put() won't
> be called in two locations. Add comments to explain why blkdev_put()
> must be called in register_cache() when cache_alloc() fails.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index a435c506edba..83a7cb0e0e45 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2191,6 +2191,12 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
>   
>   	ret = cache_alloc(ca);
>   	if (ret != 0) {
> +		/*
> +		 * If we failed here, it means ca->kobj is not initialized yet,
> +		 * kobject_put() won't be called and there is no chance to
> +		 * call blkdev_put() on bdev in bch_cache_release(). So we
> +		 * explicitly call blkdev_put() here.
> +		 */
>   		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>   		if (ret == -ENOMEM)
>   			err = "cache_alloc(): -ENOMEM";
> @@ -2331,6 +2337,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		mutex_lock(&bch_register_lock);
>   		ret = register_bdev(sb, sb_page, bdev, dc);
>   		mutex_unlock(&bch_register_lock);
> +		/* blkdev_put() will be called in cached_dev_free() */
>   		if (ret < 0)
>   			goto err;
>   	} else {
> @@ -2339,6 +2346,7 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   		if (!ca)
>   			goto err_close;
>   
> +		/* blkdev_put() will be called in bch_cache_release() */
>   		if (register_cache(sb, sb_page, bdev, ca) != 0)
>   			goto err;
>   	}
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

* Re: [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue()
  2019-04-19 16:05 ` [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
  2019-04-21 17:43   ` Chaitanya Kulkarni
@ 2019-04-23  7:05   ` Hannes Reinecke
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:05 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> Add code comments to explain which callback function might be called
> for closure_queue(). This is an effort to make the code more
> understandable for readers.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 83a7cb0e0e45..9b41e0b62cc0 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -662,6 +662,11 @@ static const struct block_device_operations bcache_ops = {
>   void bcache_device_stop(struct bcache_device *d)
>   {
>   	if (!test_and_set_bit(BCACHE_DEV_CLOSING, &d->flags))
> +		/*
> +		 * closure_fn set to
> +		 * - cached device: cached_dev_flush()
> +		 * - flash dev: flash_dev_flush()
> +		 */
>   		closure_queue(&d->cl);
>   }
>   
> @@ -1677,6 +1682,7 @@ static void __cache_set_unregister(struct closure *cl)
>   void bch_cache_set_stop(struct cache_set *c)
>   {
>   	if (!test_and_set_bit(CACHE_SET_STOPPING, &c->flags))
> +		/* closure_fn set to __cache_set_unregister() */
>   		closure_queue(&c->caching);
>   }
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 04/16] bcache: fix journal deadlock during journal replay
  2019-04-23  6:59   ` Hannes Reinecke
@ 2019-04-23  7:07     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  7:07 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2019/4/23 2:59 下午, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> A deadlock of bcache journaling may happen during journal replay. Such
>> a deadlock happens when:
>> - Journal space is totally full (no free blocks at all) and the system
>>    crashes or reboots.
>> - After reboot, the first journal entry handled by journal replay causes
>>    a btree split and journal_meta() is called to write an empty jset to
>>    the journal space.
>> - There is no journal space to write to and journal_reclaim() fails to
>>    get any available bucket because this is the first replayed journal
>>    entry to be blocked.
>> Then the whole cache set is blocked from running.
>>
>> This patch is an effort to fix such a journal replay deadlock in a
>> simpler way:
>> - Add a bool variable 'in_replay' in struct journal, set it to true when
>>    journal replay starts, and set it to false when journal replay
>>    completes. in_replay is initialized to false.
>> - Reserve 6 sectors in the journal bucket and do not use them in normal
>>    bcache runtime. These sectors are only permitted to be used during
>>    journal replay (when c->journal.in_replay is true).
>>
>> Then in normal bcache runtime, the journal space won't be totally full,
>> and 6 sectors are always reserved for journal replay time. After the
>> system reboots, if bch_btree_insert() in bch_journal_replay() causes a
>> btree split and bch_journal_meta() gets called to request 1 sector
>> from the journal buckets to write an empty jset, there is enough
>> reserved space to serve it.
>>
>> The reason to reserve 6 sectors is that we should choose a number which
>> won't fit exactly into a bucket size. If the reserved space happens to
>> be a whole bucket, more logic has to be added in journal_replay() to
>> handle journal.blocks_free with reserved space at journal replay time.
>> This is why 6 sectors is chosen: it is 3KB and won't match any valid
>> block size or bucket size.
>>
>> The bcache btree node size is quite large, so a btree node split won't
>> be a frequent event. And when a btree node split happens, the newly
>> added key will be inserted directly into the upper level or neighbor
>> nodes and won't go into the journal again; only bch_journal_meta() is
>> called to write jset metadata, which occupies 1 block in the journal
>> space. If the block size is set to 4K, the reserved 6 sectors are 2
>> blocks, so two consecutive btree splits can happen during journal
>> replay, which is very rare in practice. As the default block size is
>> the sector size, that equals 6 reserved blocks. Splitting the btree 6
>> times in a row during journal replay is almost impossible, so the
>> reserved space seems to be enough in my humble opinion.
>>
>> If in future the reserved space turns out to be not enough, let's extend
>> it then.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 100
>> ++++++++++++++++++++++++++++++++++++++++----
>>   drivers/md/bcache/journal.h |   4 ++
>>   2 files changed, 97 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index a6deb16c15c8..c60a702f53a9 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -415,6 +415,8 @@ int bch_journal_replay(struct cache_set *s, struct
>> list_head *list)
>>       uint64_t start = i->j.last_seq, end = i->j.seq, n = start;
>>       struct keylist keylist;
>>   +    s->journal.in_replay = true;
>> +
>>       list_for_each_entry(i, list, list) {
>>           BUG_ON(i->pin && atomic_read(i->pin) != 1);
>>   @@ -448,6 +450,7 @@ int bch_journal_replay(struct cache_set *s,
>> struct list_head *list)
>>       pr_info("journal replay done, %i keys in %i entries, seq %llu",
>>           keys, entries, end);
>>   err:
>> +    s->journal.in_replay = false;
>>       while (!list_empty(list)) {
>>           i = list_first_entry(list, struct journal_replay, list);
>>           list_del(&i->list);
> 
> What about locking here?
> Doesn't 'in_replay' need to be guarded by a spinlock or something?

Hi Hannes,

s->journal.in_replay is only set at start up time, where the code is
executed sequentially, so there surely is no concurrent access to
s->journal.in_replay. After the cache set runs, s->journal.in_replay is
always false, so there is no concurrent access either.

It is only used to tell the journal code that journal replay is in
progress and the journal-replay-reserved slots can be used. The set and
check sequence is in explicit linear order.
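The rule the flag enables can be modelled roughly like this (a userspace
sketch only; the constant name and the free-space arithmetic are
simplified illustrations, not the actual patch code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: 6 sectors kept back for journal replay. */
#define JOURNAL_REPLAY_RESERVE_SECTORS 6

/*
 * In normal runtime the reserved tail of the journal is treated as
 * unavailable, so the "no space" path triggers slightly early. During
 * replay (in_replay == true) the reserve may be consumed, which lets
 * bch_journal_meta() write its empty jset after a btree split instead
 * of deadlocking on a full journal.
 */
static int journal_sectors_available(int free_sectors, bool in_replay)
{
	if (in_replay)
		return free_sectors;
	if (free_sectors <= JOURNAL_REPLAY_RESERVE_SECTORS)
		return 0;
	return free_sectors - JOURNAL_REPLAY_RESERVE_SECTORS;
}
```

The set/check order stays linear: in_replay is set before any replayed
key is inserted and cleared before the cache set goes live, so no lock
is needed around it.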

I will add code comments for s->journal.in_replay to make it more clear.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device
  2019-04-19 16:05 ` [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device Coly Li
@ 2019-04-23  7:08   ` Hannes Reinecke
  2019-04-23  7:15     ` Coly Li
  0 siblings, 1 reply; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:08 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> If a bcache device is in dirty state and its cache set is not
> registered, this bcache device will not appear in /dev/bcache<N>,
> and there is no way to stop it or remove the bcache kernel module.
> 
> This is an as-designed behavior, but sometimes people have to reboot
> the whole system to release or stop the pending backing device.
> 
> This sysfs interface may remove such pending bcache devices when
> anything is written into the sysfs file manually.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 55 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 9b41e0b62cc0..e988e46a6479 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2246,9 +2246,13 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
>   
>   static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   			       const char *buffer, size_t size);
> +static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
> +					 struct kobj_attribute *attr,
> +					 const char *buffer, size_t size);
>   
>   kobj_attribute_write(register,		register_bcache);
>   kobj_attribute_write(register_quiet,	register_bcache);
> +kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
>   
>   static bool bch_is_open_backing(struct block_device *bdev)
>   {
> @@ -2373,6 +2377,56 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
>   	goto out;
>   }
>   
> +
> +struct pdev {
> +	struct list_head list;
> +	struct cached_dev *dc;
> +};
> +
> +static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
> +					 struct kobj_attribute *attr,
> +					 const char *buffer,
> +					 size_t size)
> +{
> +	LIST_HEAD(pending_devs);
> +	ssize_t ret = size;
> +	struct cached_dev *dc, *tdc;
> +	struct pdev *pdev, *tpdev;
> +	struct cache_set *c, *tc;
> +
> +	mutex_lock(&bch_register_lock);
> +	list_for_each_entry_safe(dc, tdc, &uncached_devices, list) {
> +		pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL);
> +		if (!pdev)
> +			break;
> +		pdev->dc = dc;
> +		list_add(&pdev->list, &pending_devs);
> +	}
> +
> +	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
> +		list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
> +			char *pdev_set_uuid = pdev->dc->sb.set_uuid;
> +			char *set_uuid = c->sb.uuid;
> +
> +			if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
> +				list_del(&pdev->list);
> +				kfree(pdev);
> +				break;
> +			}
> +		}
> +	}
> +	mutex_unlock(&bch_register_lock);
> +
> +	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
> +		pr_info("delete pdev %p", pdev);
> +		list_del(&pdev->list);
> +		bcache_device_stop(&pdev->dc->disk);
> +		kfree(pdev);
> +	}
> +
> +	return ret;
> +}
> +
>   static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>   {
>   	if (code == SYS_DOWN ||
> @@ -2483,6 +2537,7 @@ static int __init bcache_init(void)
>   	static const struct attribute *files[] = {
>   		&ksysfs_register.attr,
>   		&ksysfs_register_quiet.attr,
> +		&ksysfs_pendings_cleanup.attr,
>   		NULL
>   	};
>   
> 
_Actually_ I would like it better if the bcache device would be present 
in sysfs for these cases, too, albeit in a disabled state.
That would allow us to remove the device like normal and we wouldn't 
need to worry about yet another interface.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
  2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
       [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
@ 2019-04-23  7:09   ` Hannes Reinecke
  2019-04-23  7:16     ` Coly Li
  1 sibling, 1 reply; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:09 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block, stable

On 4/19/19 6:05 PM, Coly Li wrote:
> Current journal_max_cmp() and journal_min_cmp() assume that a smaller
> fifo index indicates older journal entries, but this is only true when
> the fifo index is not swapped.
> 
> The fifo structure journal.pin is implemented by a circular buffer; if
> the head index reaches the highest location of the circular buffer, it
> will be swapped to 0. Once the swapping happens, a smaller fifo index
> might be associated with a newer journal entry. So the btree node with
> the oldest journal entry won't be selected by btree_flush_write() to be
> flushed out to the cache device. The result is that the oldest journal
> entries may never get a chance to be written into the cache device, and
> after a reboot bch_journal_replay() may complain that some journal
> entries are missing.
> 
> This patch handles the fifo index swapping conditions properly, so in
> btree_flush_write() the btree node with the oldest journal entry can be
> selected from c->flush_btree correctly.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 47 +++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
> index bdb6f9cefe48..bc0e01151155 100644
> --- a/drivers/md/bcache/journal.c
> +++ b/drivers/md/bcache/journal.c
> @@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
>   }
>   
>   /* Journalling */
> -#define journal_max_cmp(l, r) \
> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> -#define journal_min_cmp(l, r) \
> -	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
> -	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
> +#define journal_max_cmp(l, r)						\
> +({									\
> +	int l_idx, r_idx, f_idx, b_idx;					\
> +	bool _ret = true;						\
> +									\
> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
> +	f_idx = c->journal.pin.front;					\
> +	b_idx = c->journal.pin.back;					\
> +									\
> +	_ret = (l_idx < r_idx);						\
> +	/* in case fifo back pointer is swapped */			\
> +	if (b_idx < f_idx) { 						\
> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
> +			_ret = false;					\
> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
> +			_ret = true;					\
> +	}								\
> +	_ret;								\
> +})
> +
> +#define journal_min_cmp(l, r)						\
> +({									\
> +	int l_idx, r_idx, f_idx, b_idx;					\
> +	bool _ret = true;						\
> +									\
> +	l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
> +	r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
> +	f_idx = c->journal.pin.front;					\
> +	b_idx = c->journal.pin.back;					\
> +									\
> +	_ret = (l_idx > r_idx);						\
> +	/* in case fifo back pointer is swapped */			\
> +	if (b_idx < f_idx) {						\
> +		if (l_idx <= b_idx && r_idx >= f_idx)			\
> +			_ret = true;					\
> +		else if (l_idx >= f_idx && r_idx <= b_idx)		\
> +			_ret = false;					\
> +	}								\
> +	_ret;								\
> +})
>   
>   static void btree_flush_write(struct cache_set *c)
>   {
> 
Please make it a proper function.
This is far too convoluted for being handled via #define, and it would
avoid cluttering the function namespace with hidden variables.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay
  2019-04-23  7:02   ` Hannes Reinecke
@ 2019-04-23  7:09     ` Coly Li
  2019-04-24 16:06     ` Coly Li
  1 sibling, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  7:09 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2019/4/23 3:02 下午, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> Currently run_cache_set() has no return value; if there is a failure in
>> bch_journal_replay(), the caller of run_cache_set() has no idea about
>> such a failure and just continues to execute the following code after
>> run_cache_set(). The internal failure is triggered inside
>> bch_journal_replay() and handled in an async way. This behavior is
>> inefficient: while the failure is handled inside bch_journal_replay(),
>> the cache register code is still running to start the cache set.
>> Registering and unregistering code running at the same time may
>> introduce some rare race conditions, and makes the code harder to
>> understand.
>>
>> This patch adds a return value to run_cache_set(), and returns -EIO if
>> bch_journal_replay() fails. Then the caller of run_cache_set() may
>> detect such a failure and stop the registering code flow immediately
>> inside register_cache_set().
>>
>> If journal replay fails, run_cache_set() can report the error
>> immediately to register_cache_set(). This patch makes the failure
>> handling for bch_journal_replay() synchronous, easier to understand and
>> debug, and avoids the potential race condition of registering and
>> unregistering at the same time.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/super.c | 17 ++++++++++++-----
>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index a697a3a923cd..036bffad0bfe 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1775,7 +1775,7 @@ struct cache_set *bch_cache_set_alloc(struct
>> cache_sb *sb)
>>       return NULL;
>>   }
>>   -static void run_cache_set(struct cache_set *c)
>> +static int run_cache_set(struct cache_set *c)
>>   {
>>       const char *err = "cannot allocate memory";
>>       struct cached_dev *dc, *t;
>> @@ -1869,7 +1869,9 @@ static void run_cache_set(struct cache_set *c)
>>           if (j->version < BCACHE_JSET_VERSION_UUID)
>>               __uuid_write(c);
>>   -        bch_journal_replay(c, &journal);
>> +        err = "bcache: replay journal failed";
>> +        if (bch_journal_replay(c, &journal))
>> +            goto err;
>>       } else {
>>           pr_notice("invalidating existing data");
>>   @@ -1937,11 +1939,13 @@ static void run_cache_set(struct cache_set *c)
>>       flash_devs_run(c);
>>         set_bit(CACHE_SET_RUNNING, &c->flags);
>> -    return;
>> +    return 0;
>>   err:
>>       closure_sync(&cl);
>>       /* XXX: test this, it's broken */
>>       bch_cache_set_error(c, "%s", err);
>> +
>> +    return -EIO;
>>   }
>>   
> Shouldn't you differentiate between -EIO if the journal replay failed
> and -ENOMEM?

Hi Hannes,

Currently I just return an error code to the caller to indicate that
something went wrong; a more specific error code would be more
informative. Let me think about how to do it.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible
  2019-04-19 16:05 ` [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible Coly Li
@ 2019-04-23  7:10   ` Hannes Reinecke
  0 siblings, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:10 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> When btree_flush_write() is called, it means the journal space is
> already exhausted. The current code only selects a single btree node to
> write out, which may introduce huge cache bounce from the spinlock on
> multiple cpu cores when a lot of kworkers on the journaling code path
> call btree_flush_write() for journal space reclaiming.
> 
> This patch tries to flush as many btree nodes as possible inside
> a single call to btree_flush_write(), so the frequency of calling
> btree_flush_write() can be reduced, which in turn reduces the cache
> bounce from the spinlock on multiple cpu cores. Please notice that this
> patch does not reduce the total number of spinlock acquisitions; the
> spinlock is still acquired when selecting every single btree node to
> write out, but this patch tries its best to keep the spinlock on the
> same cpu core, which avoids the cache bounce caused by the spinlock
> being acquired by multiple different cpu cores.
> 
> After the patch is applied, in my pressure testing, 'top' shows more
> than 50% sys cpu time reduced from the kworkers competing for the
> spinlock inside btree_flush_write().
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 7 ++++++-
>   drivers/md/bcache/journal.h | 4 ++--
>   2 files changed, 8 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 15/16] bcache: improve bcache_reboot()
  2019-04-19 16:05 ` [RFC PATCH v2 15/16] bcache: improve bcache_reboot() Coly Li
@ 2019-04-23  7:13   ` Hannes Reinecke
  2019-04-23  7:18     ` Coly Li
  0 siblings, 1 reply; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:13 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> This patch tries to release the mutex bch_register_lock early, to give
> a chance to stop the cache set and bcache devices early.
> 
> This patch also extends the timeout for stopping all bcache devices
> from 2 seconds to 10 seconds, because stopping the writeback rate
> update worker may delay for 5 seconds, and 2 seconds is not enough.
> 
> After this patch is applied, hangs while stopping bcache devices during
> system reboot or shutdown can hardly be observed any more.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/super.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index e988e46a6479..2d377a4a182f 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2453,10 +2453,13 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>   		list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
>   			bcache_device_stop(&dc->disk);
>   
> +		mutex_unlock(&bch_register_lock);
> +
>   		/* What's a condition variable? */
>   		while (1) {
> -			long timeout = start + 2 * HZ - jiffies;
> +			long timeout = start + 10 * HZ - jiffies;
>   
> +			mutex_lock(&bch_register_lock);
>   			stopped = list_empty(&bch_cache_sets) &&
>   				list_empty(&uncached_devices);
>   
> @@ -2468,7 +2471,6 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>   
>   			mutex_unlock(&bch_register_lock);
>   			schedule_timeout(timeout);
> -			mutex_lock(&bch_register_lock);
>   		}
>   
>   		finish_wait(&unregister_wait, &wait);
> Maybe you should even call 'schedule()' after the first mutex_unlock() 
to force execution of other tasks.
But anyway:

Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal
  2019-04-19 16:05 ` [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal Coly Li
@ 2019-04-23  7:14   ` Hannes Reinecke
  0 siblings, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2019-04-23  7:14 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 4/19/19 6:05 PM, Coly Li wrote:
> In btree_flush_write(), iterating all cached btree nodes and adding
> them into the ordered heap c->flush_btree takes quite a long time. In
> order to protect the ordered heap c->flush_btree, the spin lock
> c->journal.lock is held during all the iteration and heap ordering.
> When the journal space is fully occupied, btree_flush_write() might be
> called frequently; if the cached btree node iteration takes too much
> time, the kernel will complain that normal journal kworkers are blocked
> for too long. Of course write performance drops at this moment.
> 
> This patch introduces a new spin lock member in struct journal, named
> flush_write_lock. This lock is only used in btree_flush_write() and
> protects the ordered heap c->flush_btree during all the cached btree
> node iteration. Then there won't be lock contention on c->journal.lock.
> 
> After this fix, when journal space is fully occupied, it is very rare to
> observe the journal kworker blocking timeout warning.
> 
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>   drivers/md/bcache/journal.c | 5 +++--
>   drivers/md/bcache/journal.h | 1 +
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device
  2019-04-23  7:08   ` Hannes Reinecke
@ 2019-04-23  7:15     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  7:15 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2019/4/23 3:08 下午, Hannes Reinecke wrote:
> On 4/19/19 6:05 PM, Coly Li wrote:
>> If a bcache device is in dirty state and its cache set is not
>> registered, this bcache device will not appear in /dev/bcache<N>,
>> and there is no way to stop it or remove the bcache kernel module.
>>
>> This is an as-designed behavior, but sometimes people have to reboot
>> the whole system to release or stop the pending backing device.
>>
>> This sysfs interface may remove such pending bcache devices when
>> anything is written into the sysfs file manually.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/super.c | 55
>> +++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 55 insertions(+)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 9b41e0b62cc0..e988e46a6479 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -2246,9 +2246,13 @@ static int register_cache(struct cache_sb *sb,
>> struct page *sb_page,
>>     static ssize_t register_bcache(struct kobject *k, struct
>> kobj_attribute *attr,
>>                      const char *buffer, size_t size);
>> +static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
>> +                     struct kobj_attribute *attr,
>> +                     const char *buffer, size_t size);
>>     kobj_attribute_write(register,        register_bcache);
>>   kobj_attribute_write(register_quiet,    register_bcache);
>> +kobj_attribute_write(pendings_cleanup,    bch_pending_bdevs_cleanup);
>>     static bool bch_is_open_backing(struct block_device *bdev)
>>   {
>> @@ -2373,6 +2377,56 @@ static ssize_t register_bcache(struct kobject
>> *k, struct kobj_attribute *attr,
>>       goto out;
>>   }
>>   +
>> +struct pdev {
>> +    struct list_head list;
>> +    struct cached_dev *dc;
>> +};
>> +
>> +static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
>> +                     struct kobj_attribute *attr,
>> +                     const char *buffer,
>> +                     size_t size)
>> +{
>> +    LIST_HEAD(pending_devs);
>> +    ssize_t ret = size;
>> +    struct cached_dev *dc, *tdc;
>> +    struct pdev *pdev, *tpdev;
>> +    struct cache_set *c, *tc;
>> +
>> +    mutex_lock(&bch_register_lock);
>> +    list_for_each_entry_safe(dc, tdc, &uncached_devices, list) {
>> +        pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL);
>> +        if (!pdev)
>> +            break;
>> +        pdev->dc = dc;
>> +        list_add(&pdev->list, &pending_devs);
>> +    }
>> +
>> +    list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
>> +        list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
>> +            char *pdev_set_uuid = pdev->dc->sb.set_uuid;
>> +            char *set_uuid = c->sb.uuid;
>> +
>> +            if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
>> +                list_del(&pdev->list);
>> +                kfree(pdev);
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    mutex_unlock(&bch_register_lock);
>> +
>> +    list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
>> +        pr_info("delete pdev %p", pdev);
>> +        list_del(&pdev->list);
>> +        bcache_device_stop(&pdev->dc->disk);
>> +        kfree(pdev);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   static int bcache_reboot(struct notifier_block *n, unsigned long
>> code, void *x)
>>   {
>>       if (code == SYS_DOWN ||
>> @@ -2483,6 +2537,7 @@ static int __init bcache_init(void)
>>       static const struct attribute *files[] = {
>>           &ksysfs_register.attr,
>>           &ksysfs_register_quiet.attr,
>> +        &ksysfs_pendings_cleanup.attr,
>>           NULL
>>       };
>>  
> _Actually_ I would like it better if the bcache device would be present
> in sysfs for these cases, too, albeit in a disabled state.
> That would allow us to remove the device like normal and we wouldn't
> need to worry about yet another interface.

Hi Hannes,

I see. The awkward situation is:
- If the bcache device only shows up in the /sys/block/bcache<?>/
directory but not as a /dev/bcache<?> node, it might be more confusing
for users.
- If the bcache device shows up in both /sys/block/bcache<?> and
/dev/bcache<?>, it means such a device can be read/written, but it is
stale (there is dirty data on the cache device), which may corrupt
existing data.

So I choose not to change the current code behavior and just add a sysfs
interface.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write()
  2019-04-23  7:09   ` Hannes Reinecke
@ 2019-04-23  7:16     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  7:16 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block, stable

On 2019/4/23 3:09 下午, Hannes Reinecke wrote:
> On 4/19/19 6:05 PM, Coly Li wrote:
>> Current journal_max_cmp() and journal_min_cmp() assume that a smaller
>> fifo index indicates older journal entries, but this is only true when
>> the fifo index is not swapped.
>>
>> The fifo structure journal.pin is implemented by a circular buffer; if
>> the head index reaches the highest location of the circular buffer, it
>> will be swapped to 0. Once the swapping happens, a smaller fifo index
>> might be associated with a newer journal entry. So the btree node with
>> the oldest journal entry won't be selected by btree_flush_write() to be
>> flushed out to the cache device. The result is that the oldest journal
>> entries may never get a chance to be written into the cache device, and
>> after a reboot bch_journal_replay() may complain that some journal
>> entries are missing.
>>
>> This patch handles the fifo index swapping conditions properly, so in
>> btree_flush_write() the btree node with the oldest journal entry can be
>> selected from c->flush_btree correctly.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/journal.c | 47
>> +++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 41 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
>> index bdb6f9cefe48..bc0e01151155 100644
>> --- a/drivers/md/bcache/journal.c
>> +++ b/drivers/md/bcache/journal.c
>> @@ -464,12 +464,47 @@ int bch_journal_replay(struct cache_set *s,
>> struct list_head *list)
>>   }
>>     /* Journalling */
>> -#define journal_max_cmp(l, r) \
>> -    (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
>> -     fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>> -#define journal_min_cmp(l, r) \
>> -    (fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
>> -     fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
>> +#define journal_max_cmp(l, r)                        \
>> +({                                    \
>> +    int l_idx, r_idx, f_idx, b_idx;                    \
>> +    bool _ret = true;                        \
>> +                                    \
>> +    l_idx = fifo_idx(&c->journal.pin,
>> btree_current_write(l)->journal); \
>> +    r_idx = fifo_idx(&c->journal.pin,
>> btree_current_write(r)->journal); \
>> +    f_idx = c->journal.pin.front;                    \
>> +    b_idx = c->journal.pin.back;                    \
>> +                                    \
>> +    _ret = (l_idx < r_idx);                        \
>> +    /* in case fifo back pointer is swapped */            \
>> +    if (b_idx < f_idx) {                         \
>> +        if (l_idx <= b_idx && r_idx >= f_idx)            \
>> +            _ret = false;                    \
>> +        else if (l_idx >= f_idx && r_idx <= b_idx)        \
>> +            _ret = true;                    \
>> +    }                                \
>> +    _ret;                                \
>> +})
>> +
>> +#define journal_min_cmp(l, r)                        \
>> +({                                    \
>> +    int l_idx, r_idx, f_idx, b_idx;                    \
>> +    bool _ret = true;                        \
>> +                                    \
>> +    l_idx = fifo_idx(&c->journal.pin, btree_current_write(l)->journal); \
>> +    r_idx = fifo_idx(&c->journal.pin, btree_current_write(r)->journal); \
>> +    f_idx = c->journal.pin.front;                    \
>> +    b_idx = c->journal.pin.back;                    \
>> +                                    \
>> +    _ret = (l_idx > r_idx);                        \
>> +    /* in case fifo back pointer is swapped */            \
>> +    if (b_idx < f_idx) {                        \
>> +        if (l_idx <= b_idx && r_idx >= f_idx)            \
>> +            _ret = true;                    \
>> +        else if (l_idx >= f_idx && r_idx <= b_idx)        \
>> +            _ret = false;                    \
>> +    }                                \
>> +    _ret;                                \
>> +})
>>     static void btree_flush_write(struct cache_set *c)
>>   {
>>
> Please make it a proper function.
> This is far too convoluted for being handled via #define, and it would
> avoid cluttering the function namespace with hidden variables.

Hi Hannes,

Sure let me do it in next version. Thanks.


-- 

Coly Li

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v2 15/16] bcache: improve bcache_reboot()
  2019-04-23  7:13   ` Hannes Reinecke
@ 2019-04-23  7:18     ` Coly Li
  0 siblings, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-23  7:18 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2019/4/23 3:13 PM, Hannes Reinecke wrote:
> On 4/19/19 6:05 PM, Coly Li wrote:
>> This patch releases the mutex bch_register_lock early, to give the
>> cache set and bcache devices a chance to stop earlier.
>>
>> This patch also extends the timeout for stopping all bcache devices
>> from 2 seconds to 10 seconds, because stopping the writeback rate
>> update worker may be delayed for 5 seconds, so 2 seconds is not enough.
>>
>> With this patch applied, hangs while stopping bcache devices during
>> system reboot or shutdown are hardly observed any more.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/super.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index e988e46a6479..2d377a4a182f 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -2453,10 +2453,13 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>>           list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
>>               bcache_device_stop(&dc->disk);
>>   +        mutex_unlock(&bch_register_lock);
>> +
>>           /* What's a condition variable? */
>>           while (1) {
>> -            long timeout = start + 2 * HZ - jiffies;
>> +            long timeout = start + 10 * HZ - jiffies;
>>   +            mutex_lock(&bch_register_lock);
>>               stopped = list_empty(&bch_cache_sets) &&
>>                   list_empty(&uncached_devices);
>> @@ -2468,7 +2471,6 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
>>                 mutex_unlock(&bch_register_lock);
>>               schedule_timeout(timeout);
>> -            mutex_lock(&bch_register_lock);
>>           }
>>             finish_wait(&unregister_wait, &wait);
> Maybe you should even call 'schedule()' after the first mutex_unlock()
> to force execution of other tasks.
> But anyway:
> 

Hi Hannes,

An explicit schedule() makes sense. I will add it in the next
version. Thanks.


> Reviewed-by: Hannes Reinecke <hare@suse.com>
> 
> Cheers,
> 
> Hannes


-- 

Coly Li


* Re: [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay
  2019-04-23  7:02   ` Hannes Reinecke
  2019-04-23  7:09     ` Coly Li
@ 2019-04-24 16:06     ` Coly Li
  1 sibling, 0 replies; 47+ messages in thread
From: Coly Li @ 2019-04-24 16:06 UTC (permalink / raw)
  To: Hannes Reinecke, linux-bcache; +Cc: linux-block

On 2019/4/23 3:02 PM, Hannes Reinecke wrote:
> On 4/19/19 6:04 PM, Coly Li wrote:
>> Currently run_cache_set() has no return value; if bch_journal_replay()
>> fails, the caller of run_cache_set() has no idea about the failure and
>> just continues to execute the code following run_cache_set(). The
>> internal failure is triggered inside bch_journal_replay() and handled
>> asynchronously. This behavior is problematic: while the failure is
>> being handled inside bch_journal_replay(), the cache registration code
>> is still running to start the cache set. Registering and unregistering
>> code running at the same time may introduce rare race conditions and
>> makes the code harder to understand.
>>
>> This patch adds a return value to run_cache_set(), returning -EIO if
>> bch_journal_replay() fails. The caller of run_cache_set() can then
>> detect the failure and stop the registration code flow immediately
>> inside register_cache_set().
>>
>> If journal replay fails, run_cache_set() reports the error immediately
>> to register_cache_set(). This makes the failure handling for
>> bch_journal_replay() synchronous, easier to understand and debug, and
>> avoids the potential race condition between registering and
>> unregistering at the same time.
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>   drivers/md/bcache/super.c | 17 ++++++++++++-----
>>   1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index a697a3a923cd..036bffad0bfe 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -1775,7 +1775,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
>>       return NULL;
>>   }
>>   -static void run_cache_set(struct cache_set *c)
>> +static int run_cache_set(struct cache_set *c)
>>   {
>>       const char *err = "cannot allocate memory";
>>       struct cached_dev *dc, *t;
>> @@ -1869,7 +1869,9 @@ static void run_cache_set(struct cache_set *c)
>>           if (j->version < BCACHE_JSET_VERSION_UUID)
>>               __uuid_write(c);
>>   -        bch_journal_replay(c, &journal);
>> +        err = "bcache: replay journal failed";
>> +        if (bch_journal_replay(c, &journal))
>> +            goto err;
>>       } else {
>>           pr_notice("invalidating existing data");
>>   @@ -1937,11 +1939,13 @@ static void run_cache_set(struct cache_set *c)
>>       flash_devs_run(c);
>>         set_bit(CACHE_SET_RUNNING, &c->flags);
>> -    return;
>> +    return 0;
>>   err:
>>       closure_sync(&cl);
>>       /* XXX: test this, it's broken */
>>       bch_cache_set_error(c, "%s", err);
>> +
>> +    return -EIO;
>>   }
>>   
> Shouldn't you differentiate between -EIO if the journal replay failed
> and -ENOMEM?

Hi Hannes,

After a little research on the code, differentiating the returned error
codes needs more work; I need to make sure some sub-routines also return
proper error codes. That can be later work. At this moment, returning a
non-zero value to the caller of run_cache_set() should work well.

Thanks.

-- 

Coly Li


end of thread, other threads:[~2019-04-24 16:06 UTC | newest]

Thread overview: 47+ messages
2019-04-19 16:04 [RFC PATCH v2 00/16] bcache: fix journal no-space deadlock Coly Li
2019-04-19 16:04 ` [RFC PATCH v2 01/16] bcache: move definition of 'int ret' out of macro read_bucket() Coly Li
2019-04-21 17:47   ` Chaitanya Kulkarni
2019-04-22 15:11     ` Coly Li
2019-04-23  6:50   ` Hannes Reinecke
2019-04-19 16:04 ` [RFC PATCH v2 02/16] bcache: never set 0 to KEY_PTRS of jouranl key in journal_reclaim() Coly Li
2019-04-23  6:50   ` Hannes Reinecke
2019-04-19 16:04 ` [RFC PATCH v2 03/16] bcache: reload jouranl key information during journal replay Coly Li
2019-04-23  6:54   ` Hannes Reinecke
2019-04-23  6:56     ` Coly Li
2019-04-19 16:04 ` [RFC PATCH v2 04/16] bcache: fix journal deadlock during jouranl replay Coly Li
2019-04-23  6:59   ` Hannes Reinecke
2019-04-23  7:07     ` Coly Li
2019-04-19 16:04 ` [RFC PATCH v2 05/16] bcache: reserve space for journal_meta() in run time Coly Li
2019-04-23  7:00   ` Hannes Reinecke
2019-04-19 16:04 ` [RFC PATCH v2 06/16] bcache: add failure check to run_cache_set() for journal replay Coly Li
2019-04-23  7:02   ` Hannes Reinecke
2019-04-23  7:09     ` Coly Li
2019-04-24 16:06     ` Coly Li
2019-04-19 16:05 ` [RFC PATCH v2 07/16] bcache: add comments for kobj release callback routine Coly Li
2019-04-21 17:52   ` Chaitanya Kulkarni
2019-04-23  7:03   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 08/16] bcache: return error immediately in bch_journal_replay() Coly Li
2019-04-23  7:04   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 09/16] bcache: add error check for calling register_bdev() Coly Li
2019-04-21 18:00   ` Chaitanya Kulkarni
2019-04-23  7:04   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 10/16] bcache: Add comments for blkdev_put() in registration code path Coly Li
2019-04-21 17:50   ` Chaitanya Kulkarni
2019-04-23  7:05   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 11/16] bcache: add comments for closure_fn to be called in closure_queue() Coly Li
2019-04-21 17:43   ` Chaitanya Kulkarni
2019-04-23  7:05   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 12/16] bcache: add pendings_cleanup to stop pending bcache device Coly Li
2019-04-23  7:08   ` Hannes Reinecke
2019-04-23  7:15     ` Coly Li
2019-04-19 16:05 ` [RFC PATCH v2 13/16] bcache: fix fifo index swapping condition in btree_flush_write() Coly Li
     [not found]   ` <20190419231642.90AB82171F@mail.kernel.org>
2019-04-20 13:20     ` Coly Li
2019-04-23  7:09   ` Hannes Reinecke
2019-04-23  7:16     ` Coly Li
2019-04-19 16:05 ` [RFC PATCH v2 14/16] bcache: try to flush btree nodes as many as possible Coly Li
2019-04-23  7:10   ` Hannes Reinecke
2019-04-19 16:05 ` [RFC PATCH v2 15/16] bcache: improve bcache_reboot() Coly Li
2019-04-23  7:13   ` Hannes Reinecke
2019-04-23  7:18     ` Coly Li
2019-04-19 16:05 ` [RFC PATCH v2 16/16] bcache: introduce spinlock_t flush_write_lock in struct journal Coly Li
2019-04-23  7:14   ` Hannes Reinecke
