* [PATCH 00/29] bcache candidate patches for Linux v5.3
@ 2019-06-14 13:13 Coly Li
  2019-06-14 13:13 ` [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal" Coly Li
                   ` (29 more replies)
  0 siblings, 30 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Hi folks,

I am now testing the bcache patches for Linux v5.3; here I collect
all previously posted patches for your information. Any code review
and comments are welcome.

Thanks in advance.

Coly Li
---

Alexandru Ardelean (1):
  bcache: use sysfs_match_string() instead of __sysfs_match_string()

Coly Li (28):
  bcache: Revert "bcache: fix high CPU occupancy during journal"
  bcache: Revert "bcache: free heap cache_set->flush_btree in
    bch_journal_free"
  bcache: add code comments for journal_read_bucket()
  bcache: set largest seq to ja->seq[bucket_index] in
    journal_read_bucket()
  bcache: remove retry_flush_write from struct cache_set
  bcache: fix race in btree_flush_write()
  bcache: add reclaimed_journal_buckets to struct cache_set
  bcache: fix return value error in bch_journal_read()
  Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"
  bcache: avoid flushing btree node in cache_set_flush() if io disabled
  bcache: ignore read-ahead request failure on backing device
  bcache: add io error counting in write_bdev_super_endio()
  bcache: remove "XXX:" comment line from run_cache_set()
  bcache: remove unnecessary prefetch() in bset_search_tree()
  bcache: add return value check to bch_cached_dev_run()
  bcache: remove unnecessary code in bch_btree_keys_init()
  bcache: check CACHE_SET_IO_DISABLE in allocator code
  bcache: check CACHE_SET_IO_DISABLE bit in bch_journal()
  bcache: more detailed error message to bcache_device_link()
  bcache: add more error message in bch_cached_dev_attach()
  bcache: shrink btree node cache after bch_btree_check()
  bcache: improve error message in bch_cached_dev_run()
  bcache: make bset_search_tree() be more understandable
  bcache: add pendings_cleanup to stop pending bcache device
  bcache: avoid a deadlock in bcache_reboot()
  bcache: acquire bch_register_lock later in cached_dev_detach_finish()
  bcache: acquire bch_register_lock later in cached_dev_free()
  bcache: fix potential deadlock in cached_dev_free()

 drivers/md/bcache/alloc.c     |   9 ++
 drivers/md/bcache/bcache.h    |   6 +-
 drivers/md/bcache/bset.c      |  61 ++++--------
 drivers/md/bcache/btree.c     |  19 +++-
 drivers/md/bcache/btree.h     |   2 +
 drivers/md/bcache/io.c        |  12 +++
 drivers/md/bcache/journal.c   | 141 +++++++++++++++++++--------
 drivers/md/bcache/journal.h   |   4 +
 drivers/md/bcache/super.c     | 218 +++++++++++++++++++++++++++++++++---------
 drivers/md/bcache/sysfs.c     |  63 ++++++++----
 drivers/md/bcache/util.h      |   2 -
 drivers/md/bcache/writeback.c |   4 +
 12 files changed, 389 insertions(+), 152 deletions(-)

-- 
2.16.4



* [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal"
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 02/29] bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" Coly Li
                   ` (28 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable, Tang Junhui

This reverts commit c4dc2497d50d9c6fb16aa0d07b6a14f3b2adb1e0.

The reverted patch enlarged a race between the normal btree flush
code path and btree_flush_write(), which causes a deadlock when
journal space is exhausted. Reverting it shrinks the race window
from 128 btree nodes back to only 1 btree node.

Fixes: c4dc2497d50d ("bcache: fix high CPU occupancy during journal")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Tang Junhui <tang.junhui.linux@gmail.com>
---
 drivers/md/bcache/bcache.h  |  2 --
 drivers/md/bcache/journal.c | 47 +++++++++++++++------------------------------
 drivers/md/bcache/util.h    |  2 --
 3 files changed, 15 insertions(+), 36 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index fdf75352e16a..e30a983a68cd 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -726,8 +726,6 @@ struct cache_set {
 
 #define BUCKET_HASH_BITS	12
 	struct hlist_head	bucket_hash[1 << BUCKET_HASH_BITS];
-
-	DECLARE_HEAP(struct btree *, flush_btree);
 };
 
 struct bbio {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 12dae9348147..621ebf5c62a5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -391,12 +391,6 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 }
 
 /* Journalling */
-#define journal_max_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) < \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
-#define journal_min_cmp(l, r) \
-	(fifo_idx(&c->journal.pin, btree_current_write(l)->journal) > \
-	 fifo_idx(&(c)->journal.pin, btree_current_write(r)->journal))
 
 static void btree_flush_write(struct cache_set *c)
 {
@@ -404,35 +398,25 @@ static void btree_flush_write(struct cache_set *c)
 	 * Try to find the btree node with that references the oldest journal
 	 * entry, best is our current candidate and is locked if non NULL:
 	 */
-	struct btree *b;
-	int i;
+	struct btree *b, *best;
+	unsigned int i;
 
 	atomic_long_inc(&c->flush_write);
-
 retry:
-	spin_lock(&c->journal.lock);
-	if (heap_empty(&c->flush_btree)) {
-		for_each_cached_btree(b, c, i)
-			if (btree_current_write(b)->journal) {
-				if (!heap_full(&c->flush_btree))
-					heap_add(&c->flush_btree, b,
-						 journal_max_cmp);
-				else if (journal_max_cmp(b,
-					 heap_peek(&c->flush_btree))) {
-					c->flush_btree.data[0] = b;
-					heap_sift(&c->flush_btree, 0,
-						  journal_max_cmp);
-				}
+	best = NULL;
+
+	for_each_cached_btree(b, c, i)
+		if (btree_current_write(b)->journal) {
+			if (!best)
+				best = b;
+			else if (journal_pin_cmp(c,
+					btree_current_write(best)->journal,
+					btree_current_write(b)->journal)) {
+				best = b;
 			}
+		}
 
-		for (i = c->flush_btree.used / 2 - 1; i >= 0; --i)
-			heap_sift(&c->flush_btree, i, journal_min_cmp);
-	}
-
-	b = NULL;
-	heap_pop(&c->flush_btree, b, journal_min_cmp);
-	spin_unlock(&c->journal.lock);
-
+	b = best;
 	if (b) {
 		mutex_lock(&b->write_lock);
 		if (!btree_current_write(b)->journal) {
@@ -870,8 +854,7 @@ int bch_journal_alloc(struct cache_set *c)
 	j->w[0].c = c;
 	j->w[1].c = c;
 
-	if (!(init_heap(&c->flush_btree, 128, GFP_KERNEL)) ||
-	    !(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
+	if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)) ||
 	    !(j->w[0].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)) ||
 	    !(j->w[1].data = (void *) __get_free_pages(GFP_KERNEL, JSET_BITS)))
 		return -ENOMEM;
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 1fbced94e4cc..c029f7443190 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -113,8 +113,6 @@ do {									\
 
 #define heap_full(h)	((h)->used == (h)->size)
 
-#define heap_empty(h)	((h)->used == 0)
-
 #define DECLARE_FIFO(type, name)					\
 	struct {							\
 		size_t front, back, size, mask;				\
-- 
2.16.4



* [PATCH 02/29] bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free"
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
  2019-06-14 13:13 ` [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal" Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 03/29] bcache: add code comments for journal_read_bucket() Coly Li
                   ` (27 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable, Shenghui Wang

This reverts commit 6268dc2c4703aabfb0b35681be709acf4c2826c6.

This patch depends on commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which was reverted in the previous patch,
so revert this one too.

Fixes: 6268dc2c4703 ("bcache: free heap cache_set->flush_btree in bch_journal_free")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Cc: Shenghui Wang <shhuiw@foxmail.com>
---
 drivers/md/bcache/journal.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 621ebf5c62a5..f18c1e7b012c 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -839,7 +839,6 @@ void bch_journal_free(struct cache_set *c)
 	free_pages((unsigned long) c->journal.w[1].data, JSET_BITS);
 	free_pages((unsigned long) c->journal.w[0].data, JSET_BITS);
 	free_fifo(&c->journal.pin);
-	free_heap(&c->flush_btree);
 }
 
 int bch_journal_alloc(struct cache_set *c)
-- 
2.16.4



* [PATCH 03/29] bcache: add code comments for journal_read_bucket()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
  2019-06-14 13:13 ` [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal" Coly Li
  2019-06-14 13:13 ` [PATCH 02/29] bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 04/29] bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() Coly Li
                   ` (26 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds more code comments to journal_read_bucket(), as an
effort to make the code more understandable.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index f18c1e7b012c..50c9a1e405e6 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -100,6 +100,20 @@ reread:		left = ca->sb.bucket_size - offset;
 
 			blocks = set_blocks(j, block_bytes(ca->set));
 
+			/*
+			 * Nodes in 'list' are in linear increasing order of
+			 * i->j.seq, the node on head has the smallest (oldest)
+			 * journal seq, the node on tail has the biggest
+			 * (latest) journal seq.
+			 */
+
+			/*
+			 * Check from the oldest jset for last_seq. If
+			 * i->j.seq < j->last_seq, it means the oldest jset
+			 * in list is expired and useless, remove it from
+			 * this list. Otherwise, j is a candidate jset for
+			 * the following checks.
+			 */
 			while (!list_empty(list)) {
 				i = list_first_entry(list,
 					struct journal_replay, list);
@@ -109,13 +123,22 @@ reread:		left = ca->sb.bucket_size - offset;
 				kfree(i);
 			}
 
+			/* iterate list in reverse order (from latest jset) */
 			list_for_each_entry_reverse(i, list, list) {
 				if (j->seq == i->j.seq)
 					goto next_set;
 
+				/*
+				 * if j->seq is less than any i->j.last_seq
+				 * in list, j is an expired and useless jset.
+				 */
 				if (j->seq < i->j.last_seq)
 					goto next_set;
 
+				/*
+				 * 'where' points to first jset in list which
+				 * is older than j.
+				 */
 				if (j->seq > i->j.seq) {
 					where = &i->list;
 					goto add;
@@ -129,6 +152,7 @@ reread:		left = ca->sb.bucket_size - offset;
 			if (!i)
 				return -ENOMEM;
 			memcpy(&i->j, j, bytes);
+			/* Add to the location after 'where' points to */
 			list_add(&i->list, where);
 			ret = 1;
 
-- 
2.16.4



* [PATCH 04/29] bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (2 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 03/29] bcache: add code comments for journal_read_bucket() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 05/29] bcache: remove retry_flush_write from struct cache_set Coly Li
                   ` (25 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

In journal_read_bucket(), when setting ja->seq[bucket_index] there is
a potential case where a later, non-maximum sequence number overwrites
a larger value already stored in ja->seq[bucket_index]. This patch
adds a check to make sure that ja->seq[bucket_index] is only updated
when the new value is bigger than the current one.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 50c9a1e405e6..a9f1703a8514 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -156,7 +156,8 @@ reread:		left = ca->sb.bucket_size - offset;
 			list_add(&i->list, where);
 			ret = 1;
 
-			ja->seq[bucket_index] = j->seq;
+			if (j->seq > ja->seq[bucket_index])
+				ja->seq[bucket_index] = j->seq;
 next_set:
 			offset	+= blocks * ca->sb.block_size;
 			len	-= blocks * ca->sb.block_size;
-- 
2.16.4



* [PATCH 05/29] bcache: remove retry_flush_write from struct cache_set
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (3 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 04/29] bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 06/29] bcache: fix race in btree_flush_write() Coly Li
                   ` (24 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

In struct cache_set, retry_flush_write is added for commit c4dc2497d50d
("bcache: fix high CPU occupancy during journal") which is reverted in
previous patch.

Now it is useless anymore, and this patch removes it from bcache code.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h | 1 -
 drivers/md/bcache/sysfs.c  | 5 -----
 2 files changed, 6 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index e30a983a68cd..ae87cb01401e 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -706,7 +706,6 @@ struct cache_set {
 
 	atomic_long_t		reclaim;
 	atomic_long_t		flush_write;
-	atomic_long_t		retry_flush_write;
 
 	enum			{
 		ON_ERROR_UNREGISTER,
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 6cd44d3cf906..a67e1e57ea68 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -85,7 +85,6 @@ read_attribute(state);
 read_attribute(cache_read_races);
 read_attribute(reclaim);
 read_attribute(flush_write);
-read_attribute(retry_flush_write);
 read_attribute(writeback_keys_done);
 read_attribute(writeback_keys_failed);
 read_attribute(io_errors);
@@ -691,9 +690,6 @@ SHOW(__bch_cache_set)
 	sysfs_print(flush_write,
 		    atomic_long_read(&c->flush_write));
 
-	sysfs_print(retry_flush_write,
-		    atomic_long_read(&c->retry_flush_write));
-
 	sysfs_print(writeback_keys_done,
 		    atomic_long_read(&c->writeback_keys_done));
 	sysfs_print(writeback_keys_failed,
@@ -910,7 +906,6 @@ static struct attribute *bch_cache_set_internal_files[] = {
 	&sysfs_cache_read_races,
 	&sysfs_reclaim,
 	&sysfs_flush_write,
-	&sysfs_retry_flush_write,
 	&sysfs_writeback_keys_done,
 	&sysfs_writeback_keys_failed,
 
-- 
2.16.4



* [PATCH 06/29] bcache: fix race in btree_flush_write()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (4 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 05/29] bcache: remove retry_flush_write from struct cache_set Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-27  9:16   ` Yaowei Bai
  2019-06-14 13:13 ` [PATCH 07/29] bcache: add reclaimed_journal_buckets to struct cache_set Coly Li
                   ` (23 subsequent siblings)
  29 siblings, 1 reply; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

There is a race between mca_reap(), btree_node_free() and the journal
code btree_flush_write(), which results in a very rare and strange
deadlock or panic that is very hard to reproduce.

Let me explain how the race happens. In btree_flush_write() the btree
node with the oldest journal pin is selected, then it is flushed to
the cache device; the select-and-flush is a two-step operation.
Between these two steps, either of the following may happen inside
the race window (see the timeline sketch after this paragraph),
- The selected btree node is reaped by mca_reap() and its memory is
  allocated to another requester for a different btree node.
- The selected btree node is picked up, flushed and released by the
  mca shrinker callback bch_mca_scan().
When btree_flush_write() tries to flush the selected btree node, it
first takes b->write_lock with mutex_lock(). If the race happens and
the memory of the selected btree node has been allocated to another
btree node whose write_lock is already held, a deadlock very probably
happens here. A worse case is that the memory of the selected btree
node has been released; then any reference to this btree node (e.g.
b->write_lock) will trigger a NULL pointer dereference panic.
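
The race window, sketched as a timeline (illustrative only, not taken
from the code):

	/*
	 * btree_flush_write()          mca_reap() / bch_mca_scan()
	 * -------------------          ---------------------------
	 * select btree node b
	 *                              reap b; its memory is reused
	 *                              for another btree node, or
	 *                              freed entirely
	 * mutex_lock(&b->write_lock)
	 *   => deadlock on a reused node, or NULL pointer
	 *      dereference on a freed one
	 */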

This race was introduced in commit cafe56359144 ("bcache: A block layer
cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high CPU
occupancy during journal"), which selected 128 btree nodes and flushed
them one-by-one over quite a long time period.

Such a race was not easy to reproduce before. On a Lenovo SR650 server
with 48 Xeon cores, with 1 NVMe SSD configured as the cache device and
an MD raid0 device assembled from 3 NVMe SSDs as the backing device,
this race can be observed around once every 10,000 calls to
btree_flush_write(). Both deadlock and kernel panic happened as
aftermath of the race.

The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
It is set when selecting btree nodes, and cleared after the btree
nodes are flushed. When mca_reap() encounters a btree node with this
bit set, the node is skipped. Since mca_reap() only reaps btree nodes
without the BTREE_NODE_journal_flush flag, the race is avoided.

One corner case should be noticed, and that is btree_node_free(). It
might be called in some error handling code paths. For example the
following code piece from btree_split(),
	2149 err_free2:
	2150         bkey_put(b->c, &n2->key);
	2151         btree_node_free(n2);
	2152         rw_unlock(true, n2);
	2153 err_free1:
	2154         bkey_put(b->c, &n1->key);
	2155         btree_node_free(n1);
	2156         rw_unlock(true, n1);
At lines 2151 and 2155, the btree nodes n2 and n1 are released without
going through mca_reap(), so BTREE_NODE_journal_flush also needs to be
checked here. If btree_node_free() is called directly in such an error
handling path and the selected btree node has the
BTREE_NODE_journal_flush bit set, just wait for 1 jiffy and retry. In
this case the btree node won't be skipped; the retry continues until
the BTREE_NODE_journal_flush bit is cleared, and then the btree node
memory is freed.

Waiting for 1 jiffy inside btree_node_free() does not hurt performance
much here, for the following reasons,
- Error handling code paths are not frequently executed, and the race
  inside this cold path should be very rare. If the very rare race
  happens in the cold code path, waiting 1 jiffy should be acceptable.
- If btree_node_free() is called inside mca_reap(), it means the
  BTREE_NODE_journal_flush bit was checked already, and no wait will
  happen here.

Besides the above fix, the way flushing btree nodes are selected is
also changed in this patch. Let me explain what changes,

- Use another spinlock, journal.flush_write_lock, to replace the very
  hot journal.lock. We don't have to use journal.lock here; selecting
  candidate btree nodes takes a lot of time, and holding journal.lock
  would block other journaling threads and drop the overall I/O
  performance.
- Only select btree nodes to flush from the c->btree_cache list. When
  the machine has a large system memory, the mca cache may hold a huge
  number of cached btree nodes. Iterating over all the cached nodes
  takes a lot of CPU time, and most of the nodes on the
  c->btree_cache_freeable and c->btree_cache_freed lists are cleared
  and have no need to be flushed. So traveling only the mca list
  c->btree_cache to select btree nodes to flush should be enough for
  most cases.
- Don't iterate the whole c->btree_cache list; only select the first
  BTREE_FLUSH_NR (32) btree nodes in reverse order to flush. Iterating
  all btree nodes of c->btree_cache to select those with the oldest
  journal pins consumes a huge number of CPU cycles if the list is
  huge (pushing and popping a node into/out of a heap is expensive).
  The last several dirty btree nodes on the tail of the c->btree_cache
  list are the earliest allocated and cached btree nodes, so they are
  likely to be the ones holding the oldest journal pins. Therefore
  flushing only BTREE_FLUSH_NR btree nodes from the tail of
  c->btree_cache probably covers the oldest journal pin btree nodes.

In my testing, the above change decreases CPU consumption by 50%+ when
journal space is full. Sometimes IOPS drops to 0 for 5-8 seconds;
compared to I/O blocking for 120+ seconds with the previous code, this
is much better. Maybe there is room to improve in the future, but at
this moment the fix looks fine and performs well in my testing.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/btree.c   | 15 +++++++-
 drivers/md/bcache/btree.h   |  2 +
 drivers/md/bcache/journal.c | 93 ++++++++++++++++++++++++++++++++++-----------
 drivers/md/bcache/journal.h |  4 ++
 4 files changed, 90 insertions(+), 24 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 773f5fdad25f..881dc238c7cb 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -656,6 +656,13 @@ static int mca_reap(struct btree *b, unsigned int min_order, bool flush)
 	}
 
 	mutex_lock(&b->write_lock);
+	/* don't reap btree node handling in btree_flush_write() */
+	if (btree_node_journal_flush(b)) {
+		pr_debug("bnode %p is flushing by journal, ignore", b);
+		mutex_unlock(&b->write_lock);
+		goto out_unlock;
+	}
+
 	if (btree_node_dirty(b))
 		__bch_btree_node_write(b, &cl);
 	mutex_unlock(&b->write_lock);
@@ -1067,8 +1074,14 @@ static void btree_node_free(struct btree *b)
 
 	BUG_ON(b == b->c->root);
 
+retry:
 	mutex_lock(&b->write_lock);
-
+	if (btree_node_journal_flush(b)) {
+		mutex_unlock(&b->write_lock);
+		pr_debug("bnode %p journal_flush set, retry", b);
+		schedule_timeout_interruptible(1);
+		goto retry;
+	}
 	if (btree_node_dirty(b))
 		btree_complete_write(b, btree_current_write(b));
 	clear_bit(BTREE_NODE_dirty, &b->flags);
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d1c72ef64edf..76cfd121a486 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -158,11 +158,13 @@ enum btree_flags {
 	BTREE_NODE_io_error,
 	BTREE_NODE_dirty,
 	BTREE_NODE_write_idx,
+	BTREE_NODE_journal_flush,
 };
 
 BTREE_FLAG(io_error);
 BTREE_FLAG(dirty);
 BTREE_FLAG(write_idx);
+BTREE_FLAG(journal_flush);
 
 static inline struct btree_write *btree_current_write(struct btree *b)
 {
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index a9f1703a8514..303ef3d1fbc6 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -419,41 +419,87 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list)
 
 static void btree_flush_write(struct cache_set *c)
 {
-	/*
-	 * Try to find the btree node with that references the oldest journal
-	 * entry, best is our current candidate and is locked if non NULL:
-	 */
-	struct btree *b, *best;
-	unsigned int i;
+	struct btree *b, *btree_nodes[BTREE_FLUSH_NR];
+	unsigned int i, n;
+
+	if (c->journal.btree_flushing)
+		return;
+
+	spin_lock(&c->journal.flush_write_lock);
+	if (c->journal.btree_flushing) {
+		spin_unlock(&c->journal.flush_write_lock);
+		return;
+	}
+	c->journal.btree_flushing = true;
+	spin_unlock(&c->journal.flush_write_lock);
 
 	atomic_long_inc(&c->flush_write);
-retry:
-	best = NULL;
-
-	for_each_cached_btree(b, c, i)
-		if (btree_current_write(b)->journal) {
-			if (!best)
-				best = b;
-			else if (journal_pin_cmp(c,
-					btree_current_write(best)->journal,
-					btree_current_write(b)->journal)) {
-				best = b;
-			}
+	memset(btree_nodes, 0, sizeof(btree_nodes));
+	n = 0;
+
+	mutex_lock(&c->bucket_lock);
+	list_for_each_entry_reverse(b, &c->btree_cache, list) {
+		if (btree_node_journal_flush(b))
+			pr_err("BUG: flush_write bit should not be set here!");
+
+		mutex_lock(&b->write_lock);
+
+		if (!btree_node_dirty(b)) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
+		if (!btree_current_write(b)->journal) {
+			mutex_unlock(&b->write_lock);
+			continue;
+		}
+
+		set_btree_node_journal_flush(b);
+
+		mutex_unlock(&b->write_lock);
+
+		btree_nodes[n++] = b;
+		if (n == BTREE_FLUSH_NR)
+			break;
+	}
+	mutex_unlock(&c->bucket_lock);
+
+	for (i = 0; i < n; i++) {
+		b = btree_nodes[i];
+		if (!b) {
+			pr_err("BUG: btree_nodes[%d] is NULL", i);
+			continue;
+		}
+
+		/* safe to check without holding b->write_lock */
+		if (!btree_node_journal_flush(b)) {
+			pr_err("BUG: bnode %p: journal_flush bit cleaned", b);
+			continue;
 		}
 
-	b = best;
-	if (b) {
 		mutex_lock(&b->write_lock);
 		if (!btree_current_write(b)->journal) {
 			mutex_unlock(&b->write_lock);
-			/* We raced */
-			atomic_long_inc(&c->retry_flush_write);
-			goto retry;
+			pr_debug("bnode %p: written by others", b);
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
+			continue;
+		}
+
+		if (!btree_node_dirty(b)) {
+			pr_debug("bnode %p: dirty bit cleaned by others", b);
+			clear_bit(BTREE_NODE_journal_flush, &b->flags);
+			mutex_unlock(&b->write_lock);
+			continue;
 		}
 
 		__bch_btree_node_write(b, NULL);
+		clear_bit(BTREE_NODE_journal_flush, &b->flags);
 		mutex_unlock(&b->write_lock);
 	}
+
+	spin_lock(&c->journal.flush_write_lock);
+	c->journal.btree_flushing = false;
+	spin_unlock(&c->journal.flush_write_lock);
 }
 
 #define last_seq(j)	((j)->seq - fifo_used(&(j)->pin) + 1)
@@ -871,6 +917,7 @@ int bch_journal_alloc(struct cache_set *c)
 	struct journal *j = &c->journal;
 
 	spin_lock_init(&j->lock);
+	spin_lock_init(&j->flush_write_lock);
 	INIT_DELAYED_WORK(&j->work, journal_write_work);
 
 	c->journal_delay_ms = 100;
diff --git a/drivers/md/bcache/journal.h b/drivers/md/bcache/journal.h
index 66f0facff84b..aeed791f05e7 100644
--- a/drivers/md/bcache/journal.h
+++ b/drivers/md/bcache/journal.h
@@ -103,6 +103,8 @@ struct journal_write {
 /* Embedded in struct cache_set */
 struct journal {
 	spinlock_t		lock;
+	spinlock_t		flush_write_lock;
+	bool			btree_flushing;
 	/* used when waiting because the journal was full */
 	struct closure_waitlist	wait;
 	struct closure		io;
@@ -154,6 +156,8 @@ struct journal_device {
 	struct bio_vec		bv[8];
 };
 
+#define BTREE_FLUSH_NR	32
+
 #define journal_pin_cmp(c, l, r)				\
 	(fifo_idx(&(c)->journal.pin, (l)) > fifo_idx(&(c)->journal.pin, (r)))
 
-- 
2.16.4



* [PATCH 07/29] bcache: add reclaimed_journal_buckets to struct cache_set
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (5 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 06/29] bcache: fix race in btree_flush_write() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 08/29] bcache: fix return value error in bch_journal_read() Coly Li
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Now we have counters for how many times the journal is reclaimed and
how many times cached dirty btree nodes are flushed, but we don't know
how many journal buckets are really reclaimed.

This patch adds reclaimed_journal_buckets to struct cache_set. It is
an increase-only counter that tells how many journal buckets have been
reclaimed since the cache set started running. From these three
counters (reclaim, reclaimed_journal_buckets, flush_write) we can get
an idea of how well the current journal space reclaim code works.
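
For reference, all three counters are exported through the cache set's
internal sysfs files; judging from the attribute registration in
sysfs.c they should show up as (paths assumed, not verified):

	/sys/fs/bcache/<set-uuid>/internal/reclaim
	/sys/fs/bcache/<set-uuid>/internal/reclaimed_journal_buckets
	/sys/fs/bcache/<set-uuid>/internal/flush_write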

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h  | 1 +
 drivers/md/bcache/journal.c | 1 +
 drivers/md/bcache/sysfs.c   | 5 +++++
 3 files changed, 7 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index ae87cb01401e..732f9b6cb2bb 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -705,6 +705,7 @@ struct cache_set {
 	atomic_long_t		writeback_keys_failed;
 
 	atomic_long_t		reclaim;
+	atomic_long_t		reclaimed_journal_buckets;
 	atomic_long_t		flush_write;
 
 	enum			{
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 303ef3d1fbc6..d75efeaa4baa 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -614,6 +614,7 @@ static void journal_reclaim(struct cache_set *c)
 		k->ptr[n++] = MAKE_PTR(0,
 				  bucket_to_sector(c, ca->sb.d[ja->cur_idx]),
 				  ca->sb.nr_this_dev);
+		atomic_long_inc(&c->reclaimed_journal_buckets);
 	}
 
 	if (n) {
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index a67e1e57ea68..7a88ba4bfbfb 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -84,6 +84,7 @@ read_attribute(bset_tree_stats);
 read_attribute(state);
 read_attribute(cache_read_races);
 read_attribute(reclaim);
+read_attribute(reclaimed_journal_buckets);
 read_attribute(flush_write);
 read_attribute(writeback_keys_done);
 read_attribute(writeback_keys_failed);
@@ -687,6 +688,9 @@ SHOW(__bch_cache_set)
 	sysfs_print(reclaim,
 		    atomic_long_read(&c->reclaim));
 
+	sysfs_print(reclaimed_journal_buckets,
+		    atomic_long_read(&c->reclaimed_journal_buckets));
+
 	sysfs_print(flush_write,
 		    atomic_long_read(&c->flush_write));
 
@@ -905,6 +909,7 @@ static struct attribute *bch_cache_set_internal_files[] = {
 	&sysfs_bset_tree_stats,
 	&sysfs_cache_read_races,
 	&sysfs_reclaim,
+	&sysfs_reclaimed_journal_buckets,
 	&sysfs_flush_write,
 	&sysfs_writeback_keys_done,
 	&sysfs_writeback_keys_failed,
-- 
2.16.4



* [PATCH 08/29] bcache: fix return value error in bch_journal_read()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (6 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 07/29] bcache: add reclaimed_journal_buckets to struct cache_set Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 09/29] Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" Coly Li
                   ` (21 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When everything is OK in bch_journal_read(), the return value is
finally returned by,
	return ret;
which assumes ret will be 0 here. This assumption is wrong when all
journal buckets are full and filled with valid journal entries. In
such a case, the last call to read_bucket() sets 'ret' to 1, which
means a new jset was added to the jset list. The jset list is the
list 'journal' in the caller run_cache_set().

Returning 1 to run_cache_set() means something went wrong and the
cache set won't start, but indeed everything is OK.

This patch changes the line at the end of bch_journal_read() to
directly return 0 when everything is good, which fixes the bogus
error.
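
For context, a simplified sketch of the caller side in run_cache_set()
(paraphrased from super.c, not a verbatim quote):

	/* run_cache_set(), simplified */
	err = "cannot allocate memory for journal";
	if (bch_journal_read(c, &journal))
		goto err;	/* a bogus ret == 1 lands here too */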

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index d75efeaa4baa..7beff8884dd0 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -293,7 +293,7 @@ int bch_journal_read(struct cache_set *c, struct list_head *list)
 					    struct journal_replay,
 					    list)->j.seq;
 
-	return ret;
+	return 0;
 #undef read_bucket
 }
 
-- 
2.16.4



* [PATCH 09/29] Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (7 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 08/29] bcache: fix return value error in bch_journal_read() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 10/29] bcache: avoid flushing btree node in cache_set_flush() if io disabled Coly Li
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

This reverts commit 6147305c73e4511ca1a975b766b97a779d442567.

Although this patch helped a failed bcache device stop faster when too
many I/O errors were detected on the corresponding cached device,
setting the CACHE_SET_IO_DISABLE bit in cache set c->flags was not a
good idea. This operation disables all I/O on the cache set, which
means other attached bcache devices won't work either.

Without this patch, the failed bcache device can still be stopped
eventually once its internal I/O (e.g. writeback) is accomplished.
Therefore I revert it here.

Fixes: 6147305c73e4 ("bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()")
Reported-by: Yong Li <mr.liyong@qq.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/super.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1b63ac876169..eaaa046fd95d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1437,8 +1437,6 @@ int bch_flash_dev_create(struct cache_set *c, uint64_t size)
 
 bool bch_cached_dev_error(struct cached_dev *dc)
 {
-	struct cache_set *c;
-
 	if (!dc || test_bit(BCACHE_DEV_CLOSING, &dc->disk.flags))
 		return false;
 
@@ -1449,21 +1447,6 @@ bool bch_cached_dev_error(struct cached_dev *dc)
 	pr_err("stop %s: too many IO errors on backing device %s\n",
 		dc->disk.disk->disk_name, dc->backing_dev_name);
 
-	/*
-	 * If the cached device is still attached to a cache set,
-	 * even dc->io_disable is true and no more I/O requests
-	 * accepted, cache device internal I/O (writeback scan or
-	 * garbage collection) may still prevent bcache device from
-	 * being stopped. So here CACHE_SET_IO_DISABLE should be
-	 * set to c->flags too, to make the internal I/O to cache
-	 * device rejected and stopped immediately.
-	 * If c is NULL, that means the bcache device is not attached
-	 * to any cache set, then no CACHE_SET_IO_DISABLE bit to set.
-	 */
-	c = dc->disk.c;
-	if (c && test_and_set_bit(CACHE_SET_IO_DISABLE, &c->flags))
-		pr_info("CACHE_SET_IO_DISABLE already set");
-
 	bcache_device_stop(&dc->disk);
 	return true;
 }
-- 
2.16.4



* [PATCH 10/29] bcache: avoid flushing btree node in cache_set_flush() if io disabled
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (8 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 09/29] Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 11/29] bcache: ignore read-ahead request failure on backing device Coly Li
                   ` (19 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When cache_set_flush() is called because too many I/O errors were
detected on the cache device and the cache set is retiring, it doesn't
make sense to flush cached btree nodes from c->btree_cache inside the
function, because CACHE_SET_IO_DISABLE is already set in c->flags and
all I/O to the cache device will be rejected.

This patch checks in cache_set_flush() whether CACHE_SET_IO_DISABLE is
set. If yes, it avoids flushing the cached btree nodes, to save time
and make cache set retiring faster.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index eaaa046fd95d..da9d6a63b81a 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1553,13 +1553,17 @@ static void cache_set_flush(struct closure *cl)
 	if (!IS_ERR_OR_NULL(c->root))
 		list_add(&c->root->list, &c->btree_cache);
 
-	/* Should skip this if we're unregistering because of an error */
-	list_for_each_entry(b, &c->btree_cache, list) {
-		mutex_lock(&b->write_lock);
-		if (btree_node_dirty(b))
-			__bch_btree_node_write(b, NULL);
-		mutex_unlock(&b->write_lock);
-	}
+	/*
+	 * Avoid flushing cached nodes if cache set is retiring
+	 * due to too many I/O errors detected.
+	 */
+	if (!test_bit(CACHE_SET_IO_DISABLE, &c->flags))
+		list_for_each_entry(b, &c->btree_cache, list) {
+			mutex_lock(&b->write_lock);
+			if (btree_node_dirty(b))
+				__bch_btree_node_write(b, NULL);
+			mutex_unlock(&b->write_lock);
+		}
 
 	for_each_cache(ca, c, i)
 		if (ca->alloc_thread)
-- 
2.16.4



* [PATCH 11/29] bcache: ignore read-ahead request failure on backing device
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (9 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 10/29] bcache: avoid flushing btree node in cache_set_flush() if io disabled Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 12/29] bcache: add io error counting in write_bdev_super_endio() Coly Li
                   ` (18 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li, stable

When an md raid device (e.g. raid456) is used as a backing device,
read-ahead requests on a degraded and recovering md raid device might
be failed immediately by the md raid code, while the md raid array can
still serve normal read or write I/O requests. Therefore such failed
read-ahead requests are not real hardware failures. Furthermore, once
the recovery is accomplished, read-ahead requests will be handled by
the md raid array again.

Under such conditions, I/O failures of read-ahead requests don't
indicate the real health status (because normal I/O is still served),
so they should not be counted into the I/O error counter
dc->io_errors.

Since there is no simple way to detect whether the backing device is
an md raid device, this patch simply ignores I/O failures for
read-ahead bios on the backing device, to avoid a bogus backing device
failure on a degraded md raid array.

Suggested-and-tested-by: Thorsten Knabe <linux@thorsten-knabe.de>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
---
 drivers/md/bcache/io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index c25097968319..4d93f07f63e5 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -58,6 +58,18 @@ void bch_count_backing_io_errors(struct cached_dev *dc, struct bio *bio)
 
 	WARN_ONCE(!dc, "NULL pointer of struct cached_dev");
 
+	/*
+	 * Read-ahead requests on a degrading and recovering md raid
+	 * (e.g. raid6) device might be failed immediately by md
+	 * raid code, which is not a real hardware media failure. So
+	 * we shouldn't count failed REQ_RAHEAD bio to dc->io_errors.
+	 */
+	if (bio->bi_opf & REQ_RAHEAD) {
+		pr_warn_ratelimited("%s: Read-ahead I/O failed on backing device, ignore",
+				    dc->backing_dev_name);
+		return;
+	}
+
 	errors = atomic_add_return(1, &dc->io_errors);
 	if (errors < dc->error_limit)
 		pr_err("%s: IO error on backing device, unrecoverable",
-- 
2.16.4



* [PATCH 12/29] bcache: add io error counting in write_bdev_super_endio()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (10 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 11/29] bcache: ignore read-ahead request failure on backing device Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 13/29] bcache: remove "XXX:" comment line from run_cache_set() Coly Li
                   ` (17 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When the backing device super block is written by
bch_write_bdev_super(), the bio completion callback
write_bdev_super_endio() simply ignores the I/O status. Indeed such
write requests also contribute to the backing device health status if
they fail.

This patch checks bio->bi_status in write_bdev_super_endio(); if there
is an error, bch_count_backing_io_errors() will be called to count an
I/O error in dc->io_errors.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index da9d6a63b81a..877113b62b0f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -197,7 +197,9 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev,
 static void write_bdev_super_endio(struct bio *bio)
 {
 	struct cached_dev *dc = bio->bi_private;
-	/* XXX: error checking */
+
+	if (bio->bi_status)
+		bch_count_backing_io_errors(dc, bio);
 
 	closure_put(&dc->sb_write);
 }
-- 
2.16.4



* [PATCH 13/29] bcache: remove "XXX:" comment line from run_cache_set()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (11 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 12/29] bcache: add io error counting in write_bdev_super_endio() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 14/29] bcache: remove unnecessary prefetch() in bset_search_tree() Coly Li
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

In the previous bcache patches for Linux v5.2, the failure code path
of run_cache_set() was tested and fixed. So now the following comment
line can be removed from run_cache_set(),
	/* XXX: test this, it's broken */

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 877113b62b0f..3364b20567eb 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1946,7 +1946,7 @@ static int run_cache_set(struct cache_set *c)
 	}
 
 	closure_sync(&cl);
-	/* XXX: test this, it's broken */
+
 	bch_cache_set_error(c, "%s", err);
 
 	return -EIO;
-- 
2.16.4



* [PATCH 14/29] bcache: remove unnecessary prefetch() in bset_search_tree()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (12 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 13/29] bcache: remove "XXX:" comment line from run_cache_set() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 15/29] bcache: use sysfs_match_string() instead of __sysfs_match_string() Coly Li
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

In function bset_search_tree(), when p >= t->size, t->tree[0] will be
prefetched by the following code piece,
 974                 unsigned int p = n << 4;
 975
 976                 p &= ((int) (p - t->size)) >> 31;
 977
 978                 prefetch(&t->tree[p]);

The purpose of the above code is to avoid a branch instruction, but
when p >= t->size, prefetch(&t->tree[0]) has no positive performance
contribution at all. This patch avoids the unnecessary prefetch by
only calling prefetch() when p < t->size.
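
For illustration, here is a minimal stand-alone sketch of the
branchless clamp implemented by the removed lines (the concrete values
are made up; the trick relies on arithmetic right shift of a negative
int, which the compilers supported by the kernel implement):

	unsigned int size = 16;		/* stands in for t->size */
	unsigned int p;

	p = 32;				/* case p >= size */
	/* (int)(32 - 16) = 16, sign bit clear, >> 31 yields 0 */
	p &= ((int) (p - size)) >> 31;	/* p becomes 0 */

	p = 8;				/* case p < size */
	/* (int)(8 - 16) = -8, sign bit set, >> 31 yields all ones */
	p &= ((int) (p - size)) >> 31;	/* p stays 8 */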

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bset.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index 8f07fa6e1739..aa2e4ab0fab9 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -960,22 +960,10 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t,
 	unsigned int inorder, j, n = 1;
 
 	do {
-		/*
-		 * A bit trick here.
-		 * If p < t->size, (int)(p - t->size) is a minus value and
-		 * the most significant bit is set, right shifting 31 bits
-		 * gets 1. If p >= t->size, the most significant bit is
-		 * not set, right shifting 31 bits gets 0.
-		 * So the following 2 lines equals to
-		 *	if (p >= t->size)
-		 *		p = 0;
-		 * but a branch instruction is avoided.
-		 */
 		unsigned int p = n << 4;
 
-		p &= ((int) (p - t->size)) >> 31;
-
-		prefetch(&t->tree[p]);
+		if (p < t->size)
+			prefetch(&t->tree[p]);
 
 		j = n;
 		f = &t->tree[j];
-- 
2.16.4



* [PATCH 15/29] bcache: use sysfs_match_string() instead of __sysfs_match_string()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (13 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 14/29] bcache: remove unnecessary prefetch() in bset_search_tree() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 16/29] bcache: add return value check to bch_cached_dev_run() Coly Li
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Alexandru Ardelean, Coly Li

From: Alexandru Ardelean <alexandru.ardelean@analog.com>

The arrays (of strings) that are passed to __sysfs_match_string() are
static, so use sysfs_match_string() which does an implicit ARRAY_SIZE()
over these arrays.

Functionally, this doesn't change anything; the change is purely
cosmetic. It only shrinks each static array by one element (the
now-unneeded NULL sentinel).
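
For reference, sysfs_match_string() is a convenience macro from
include/linux/string.h, roughly as follows (quoted from memory, so
treat it as a sketch):

	#define sysfs_match_string(_a, _s) \
		__sysfs_match_string(_a, ARRAY_SIZE(_a), _s)

Since ARRAY_SIZE() is evaluated at compile time, the arrays no longer
need a terminating NULL entry for the matcher to find their end.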

Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com>
Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/sysfs.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 7a88ba4bfbfb..53f70f97241e 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -21,28 +21,24 @@ static const char * const bch_cache_modes[] = {
 	"writethrough",
 	"writeback",
 	"writearound",
-	"none",
-	NULL
+	"none"
 };
 
 /* Default is 0 ("auto") */
 static const char * const bch_stop_on_failure_modes[] = {
 	"auto",
-	"always",
-	NULL
+	"always"
 };
 
 static const char * const cache_replacement_policies[] = {
 	"lru",
 	"fifo",
-	"random",
-	NULL
+	"random"
 };
 
 static const char * const error_actions[] = {
 	"unregister",
-	"panic",
-	NULL
+	"panic"
 };
 
 write_attribute(attach);
@@ -333,7 +329,7 @@ STORE(__cached_dev)
 		bch_cached_dev_run(dc);
 
 	if (attr == &sysfs_cache_mode) {
-		v = __sysfs_match_string(bch_cache_modes, -1, buf);
+		v = sysfs_match_string(bch_cache_modes, buf);
 		if (v < 0)
 			return v;
 
@@ -344,7 +340,7 @@ STORE(__cached_dev)
 	}
 
 	if (attr == &sysfs_stop_when_cache_set_failed) {
-		v = __sysfs_match_string(bch_stop_on_failure_modes, -1, buf);
+		v = sysfs_match_string(bch_stop_on_failure_modes, buf);
 		if (v < 0)
 			return v;
 
@@ -794,7 +790,7 @@ STORE(__bch_cache_set)
 			    0, UINT_MAX);
 
 	if (attr == &sysfs_errors) {
-		v = __sysfs_match_string(error_actions, -1, buf);
+		v = sysfs_match_string(error_actions, buf);
 		if (v < 0)
 			return v;
 
@@ -1058,7 +1054,7 @@ STORE(__bch_cache)
 	}
 
 	if (attr == &sysfs_cache_replacement_policy) {
-		v = __sysfs_match_string(cache_replacement_policies, -1, buf);
+		v = sysfs_match_string(cache_replacement_policies, buf);
 		if (v < 0)
 			return v;
 
-- 
2.16.4



* [PATCH 16/29] bcache: add return value check to bch_cached_dev_run()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (14 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 15/29] bcache: use sysfs_match_string() instead of __sysfs_match_string() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 17/29] bcache: remove unnecessary code in bch_btree_keys_init() Coly Li
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds a return value to bch_cached_dev_run(); now if an
error happens inside bch_cached_dev_run() it can be caught. With this
change, bch_cached_dev_run() returns -EIO if dc->io_disable is set,
-EBUSY if the device is already running, and -ENOMEM if creating the
sysfs links fails.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h |  2 +-
 drivers/md/bcache/super.c  | 33 ++++++++++++++++++++++++++-------
 drivers/md/bcache/sysfs.c  |  7 +++++--
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 732f9b6cb2bb..013e35a9e317 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -1004,7 +1004,7 @@ int bch_flash_dev_create(struct cache_set *c, uint64_t size);
 int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 			  uint8_t *set_uuid);
 void bch_cached_dev_detach(struct cached_dev *dc);
-void bch_cached_dev_run(struct cached_dev *dc);
+int bch_cached_dev_run(struct cached_dev *dc);
 void bcache_device_stop(struct bcache_device *d);
 
 void bch_cache_set_unregister(struct cache_set *c);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3364b20567eb..74eb18b1af40 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -910,7 +910,7 @@ static int cached_dev_status_update(void *arg)
 }
 
 
-void bch_cached_dev_run(struct cached_dev *dc)
+int bch_cached_dev_run(struct cached_dev *dc)
 {
 	struct bcache_device *d = &dc->disk;
 	char *buf = kmemdup_nul(dc->sb.label, SB_LABEL_SIZE, GFP_KERNEL);
@@ -921,11 +921,14 @@ void bch_cached_dev_run(struct cached_dev *dc)
 		NULL,
 	};
 
+	if (dc->io_disable)
+		return -EIO;
+
 	if (atomic_xchg(&dc->running, 1)) {
 		kfree(env[1]);
 		kfree(env[2]);
 		kfree(buf);
-		return;
+		return -EBUSY;
 	}
 
 	if (!d->c &&
@@ -951,8 +954,11 @@ void bch_cached_dev_run(struct cached_dev *dc)
 	kfree(buf);
 
 	if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
-	    sysfs_create_link(&disk_to_dev(d->disk)->kobj, &d->kobj, "bcache"))
+	    sysfs_create_link(&disk_to_dev(d->disk)->kobj,
+			      &d->kobj, "bcache")) {
 		pr_debug("error creating sysfs link");
+		return -ENOMEM;
+	}
 
 	dc->status_update_thread = kthread_run(cached_dev_status_update,
 					       dc, "bcache_status_update");
@@ -961,6 +967,8 @@ void bch_cached_dev_run(struct cached_dev *dc)
 			"continue to run without monitoring backing "
 			"device status");
 	}
+
+	return 0;
 }
 
 /*
@@ -1056,6 +1064,7 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 	uint32_t rtime = cpu_to_le32((u32)ktime_get_real_seconds());
 	struct uuid_entry *u;
 	struct cached_dev *exist_dc, *t;
+	int ret = 0;
 
 	if ((set_uuid && memcmp(set_uuid, c->sb.set_uuid, 16)) ||
 	    (!set_uuid && memcmp(dc->sb.set_uuid, c->sb.set_uuid, 16)))
@@ -1165,7 +1174,12 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 
 	bch_sectors_dirty_init(&dc->disk);
 
-	bch_cached_dev_run(dc);
+	ret = bch_cached_dev_run(dc);
+	if (ret && (ret != -EBUSY)) {
+		up_write(&dc->writeback_lock);
+		return ret;
+	}
+
 	bcache_device_link(&dc->disk, c, "bdev");
 	atomic_inc(&c->attached_dev_nr);
 
@@ -1292,6 +1306,7 @@ static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 {
 	const char *err = "cannot allocate memory";
 	struct cache_set *c;
+	int ret = -ENOMEM;
 
 	bdevname(bdev, dc->backing_dev_name);
 	memcpy(&dc->sb, sb, sizeof(struct cache_sb));
@@ -1321,14 +1336,18 @@ static int register_bdev(struct cache_sb *sb, struct page *sb_page,
 		bch_cached_dev_attach(dc, c, NULL);
 
 	if (BDEV_STATE(&dc->sb) == BDEV_STATE_NONE ||
-	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE)
-		bch_cached_dev_run(dc);
+	    BDEV_STATE(&dc->sb) == BDEV_STATE_STALE) {
+		err = "failed to run cached device";
+		ret = bch_cached_dev_run(dc);
+		if (ret)
+			goto err;
+	}
 
 	return 0;
 err:
 	pr_notice("error %s: %s", dc->backing_dev_name, err);
 	bcache_device_stop(&dc->disk);
-	return -EIO;
+	return ret;
 }
 
 /* Flash only volumes */
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 53f70f97241e..82a2fe8ac86b 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -325,8 +325,11 @@ STORE(__cached_dev)
 		bch_cache_accounting_clear(&dc->accounting);
 
 	if (attr == &sysfs_running &&
-	    strtoul_or_return(buf))
-		bch_cached_dev_run(dc);
+	    strtoul_or_return(buf)) {
+		v = bch_cached_dev_run(dc);
+		if (v)
+			return v;
+	}
 
 	if (attr == &sysfs_cache_mode) {
 		v = sysfs_match_string(bch_cache_modes, buf);
-- 
2.16.4



* [PATCH 17/29] bcache: remove unnecessary code in bch_btree_keys_init()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (15 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 16/29] bcache: add return value check to bch_cached_dev_run() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 18/29] bcache: check CACHE_SET_IO_DISABLE in allocator code Coly Li
                   ` (12 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Function bch_btree_keys_init() initializes b->set[].size and
b->set[].data to zero. As the code comments indicate, this code is
indeed unnecessary: both struct btree_keys and struct bset_tree are
nested embedded into struct btree, so when struct btree is filled with
0 bits by kzalloc() in mca_bucket_alloc(), b->set[].size and
b->set[].data are initialized to 0 (i.e. NULL) already.

This patch removes the redundant code, and adds comments in
bch_btree_keys_init() and mca_bucket_alloc() to explain why it's safe.
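
The relevant nesting, slightly simplified (field names paraphrased
from btree.h and bset.h, so treat this as a sketch):

	struct bset_tree {
		unsigned int		size;
		struct bset		*data;
		/* ... */
	};

	struct btree_keys {
		struct bset_tree	set[MAX_BSETS];
		/* ... */
	};

	struct btree {
		struct btree_keys	keys;
		/* ... */
	};

	/*
	 * kzalloc() zero-fills the whole struct btree, including
	 * every keys.set[i].size and keys.set[i].data.
	 */
	b = kzalloc(sizeof(struct btree), gfp);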

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bset.c  | 15 ++++++---------
 drivers/md/bcache/btree.c |  4 ++++
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index aa2e4ab0fab9..eedaf0f3e3f0 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -347,22 +347,19 @@ EXPORT_SYMBOL(bch_btree_keys_alloc);
 void bch_btree_keys_init(struct btree_keys *b, const struct btree_keys_ops *ops,
 			 bool *expensive_debug_checks)
 {
-	unsigned int i;
-
 	b->ops = ops;
 	b->expensive_debug_checks = expensive_debug_checks;
 	b->nsets = 0;
 	b->last_set_unwritten = 0;
 
-	/* XXX: shouldn't be needed */
-	for (i = 0; i < MAX_BSETS; i++)
-		b->set[i].size = 0;
 	/*
-	 * Second loop starts at 1 because b->keys[0]->data is the memory we
-	 * allocated
+	 * struct btree_keys is embedded in struct btree, and struct
+	 * bset_tree is embedded into struct btree_keys. They are all
+	 * initialized as 0 by kzalloc() in mca_bucket_alloc(), and
+	 * b->set[0].data is allocated in bch_btree_keys_alloc(), so we
+	 * don't have to initialize b->set[].size and b->set[].data
+	 * here any more.
 	 */
-	for (i = 1; i < MAX_BSETS; i++)
-		b->set[i].data = NULL;
 }
 EXPORT_SYMBOL(bch_btree_keys_init);
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 881dc238c7cb..c0dd8fde37af 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -613,6 +613,10 @@ static void mca_data_alloc(struct btree *b, struct bkey *k, gfp_t gfp)
 static struct btree *mca_bucket_alloc(struct cache_set *c,
 				      struct bkey *k, gfp_t gfp)
 {
+	/*
+	 * kzalloc() is necessary here for initialization,
+	 * see code comments in bch_btree_keys_init().
+	 */
 	struct btree *b = kzalloc(sizeof(struct btree), gfp);
 
 	if (!b)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 18/29] bcache: check CACHE_SET_IO_DISABLE in allocator code
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (16 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 17/29] bcache: remove unncessary code in bch_btree_keys_init() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 19/29] bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() Coly Li
                   ` (11 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

If the CACHE_SET_IO_DISABLE flag of a cache set is set because of too
many I/O errors, the allocator routines can currently still continue
to allocate space, which may introduce an inconsistent metadata state.

This patch checks the CACHE_SET_IO_DISABLE bit in the following
allocator routines,
- bch_bucket_alloc()
- __bch_bucket_alloc_set()
Once CACHE_SET_IO_DISABLE is set on the cache set, the allocator
routines reject allocation requests early to avoid potentially
inconsistent metadata.
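
For illustration, a hypothetical caller-side sketch (not taken from
the actual bcache callers) of how the early rejection is expected to
propagate,

	long r = bch_bucket_alloc(ca, RESERVE_NONE, false);

	if (r < 0) {
		/*
		 * The allocator refused the request, e.g. because
		 * CACHE_SET_IO_DISABLE is set on ca->set; back out
		 * instead of writing metadata with an invalid bucket.
		 */
		return -EIO;
	}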

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/alloc.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index f8986effcb50..6f776823b9ba 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -393,6 +393,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned int reserve, bool wait)
 	struct bucket *b;
 	long r;
 
+
+	/* No allocation if CACHE_SET_IO_DISABLE bit is set */
+	if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags)))
+		return -1;
+
 	/* fastpath */
 	if (fifo_pop(&ca->free[RESERVE_NONE], r) ||
 	    fifo_pop(&ca->free[reserve], r))
@@ -484,6 +489,10 @@ int __bch_bucket_alloc_set(struct cache_set *c, unsigned int reserve,
 {
 	int i;
 
+	/* No allocation if CACHE_SET_IO_DISABLE bit is set */
+	if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
+		return -1;
+
 	lockdep_assert_held(&c->bucket_lock);
 	BUG_ON(!n || n > c->caches_loaded || n > MAX_CACHES_PER_SET);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 19/29] bcache: check CACHE_SET_IO_DISABLE bit in bch_journal()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (17 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 18/29] bcache: check CACHE_SET_IO_DISABLE in allocator code Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 20/29] bcache: more detailed error message to bcache_device_link() Coly Li
                   ` (10 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When too many I/O errors happen on a cache set and the
CACHE_SET_IO_DISABLE bit is set, bch_journal() may continue to work
because the journaling bkey might still be in the in-memory write set.
The caller of bch_journal() may believe the journal still works, but
in truth the in-memory journal write set will never be written to the
cache device any more. This behavior may introduce a potentially
inconsistent metadata state.

This patch checks the CACHE_SET_IO_DISABLE bit at the head of
bch_journal(); if the bit is set, bch_journal() returns NULL
immediately to notify the caller that the journal does not work.
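
For illustration, a sketch of the caller-side convention this relies
on (the parameter names besides 'c' are assumptions, not exact caller
code),

	atomic_t *journal_ref = bch_journal(c, keys, parent);

	/*
	 * A NULL return now covers two cases: the cache set is not in
	 * SYNC mode, or CACHE_SET_IO_DISABLE is set. Either way the
	 * keys were not journaled and no journal pin reference is
	 * held, so the caller must not wait on or put a journal
	 * reference.
	 */
	if (!journal_ref)
		return;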

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/journal.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 7beff8884dd0..87af7c6513c5 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -867,6 +867,10 @@ atomic_t *bch_journal(struct cache_set *c,
 	struct journal_write *w;
 	atomic_t *ret;
 
+	/* No journaling if CACHE_SET_IO_DISABLE set already */
+	if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &c->flags)))
+		return NULL;
+
 	if (!CACHE_SYNC(&c->sb))
 		return NULL;
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 20/29] bcache: more detailed error message to bcache_device_link()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (18 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 19/29] bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 21/29] bcache: add more error message in bch_cached_dev_attach() Coly Li
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds a more accurate error message for each specific
sysfs_create_link() call, to help debug failures during bcache
device startup.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 74eb18b1af40..1fa3f4e26d02 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -693,6 +693,7 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
 {
 	unsigned int i;
 	struct cache *ca;
+	int ret;
 
 	for_each_cache(ca, d->c, i)
 		bd_link_disk_holder(ca->bdev, d->disk);
@@ -700,9 +701,13 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
 	snprintf(d->name, BCACHEDEVNAME_SIZE,
 		 "%s%u", name, d->id);
 
-	WARN(sysfs_create_link(&d->kobj, &c->kobj, "cache") ||
-	     sysfs_create_link(&c->kobj, &d->kobj, d->name),
-	     "Couldn't create device <-> cache set symlinks");
+	ret = sysfs_create_link(&d->kobj, &c->kobj, "cache");
+	if (ret < 0)
+		pr_err("Couldn't create device -> cache set symlink");
+
+	ret = sysfs_create_link(&c->kobj, &d->kobj, d->name);
+	if (ret < 0)
+		pr_err("Couldn't create cache set -> device symlink");
 
 	clear_bit(BCACHE_DEV_UNLINK_DONE, &d->flags);
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 21/29] bcache: add more error message in bch_cached_dev_attach()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (19 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 20/29] bcache: more detailed error message to bcache_device_link() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 22/29] bcache: shrink btree node cache after bch_btree_check() Coly Li
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds more error messages for attaching a cached device,
which is helpful to debug failures during bcache device startup.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 1fa3f4e26d02..cf5673af3143 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1169,6 +1169,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 	down_write(&dc->writeback_lock);
 	if (bch_cached_dev_writeback_start(dc)) {
 		up_write(&dc->writeback_lock);
+		pr_err("Couldn't start writeback facilities for %s",
+		       dc->disk.disk->disk_name);
 		return -ENOMEM;
 	}
 
@@ -1182,6 +1184,8 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
 	ret = bch_cached_dev_run(dc);
 	if (ret && (ret != -EBUSY)) {
 		up_write(&dc->writeback_lock);
+		pr_err("Couldn't run cached device %s",
+		       dc->backing_dev_name);
 		return ret;
 	}
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 22/29] bcache: shrink btree node cache after bch_btree_check()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (20 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 21/29] bcache: add more error message in bch_cached_dev_attach() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 23/29] bcache: improve error message in bch_cached_dev_run() Coly Li
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When a cache set starts, bch_btree_check() checks all bkeys on the
cache device by calculating their checksums. This operation consumes
a huge amount of system memory if a lot of data is cached, because
bcache uses its own mca cache to maintain all read-in btree nodes and
only releases that cache space when the system memory management code
starts to shrink caches. There will be a delay between
bch_btree_check() returning and the bcache shrink code being called,
so memory allocations following bch_btree_check() might fail. The
most frequent failure is creating the allocator kernel thread.

This patch proactively calls the bcache mca shrinker routine to
release around 25% of the cache memory, to help the following memory
allocations succeed. 'Around 25% cache memory' means that when the
mca shrinker tries to release cache memory, it might have to skip
some busy memory objects, so the result might be somewhat less than
the requested amount.
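
The c->shrink callbacks invoked here should be the ones bcache
registers for its btree node cache (bch_mca_count() and
bch_mca_scan() in btree.c), so this drives the same reclaim path the
kernel memory shrinker would use, just proactively. A condensed
sketch of the call pattern, with a made-up count for illustration,

	struct shrink_control sc = { .gfp_mask = GFP_KERNEL };

	/* count_objects() reports how many btree nodes are reclaimable */
	sc.nr_to_scan = c->shrink.count_objects(&c->shrink, &sc) / 4;
	/* e.g. 40000 reclaimable nodes -> ask to scan 10000 of them */
	c->shrink.scan_objects(&c->shrink, &sc);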

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index cf5673af3143..4a6406b53de1 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1866,6 +1866,24 @@ static int run_cache_set(struct cache_set *c)
 		if (bch_btree_check(c))
 			goto err;
 
+		/*
+		 * bch_btree_check() may occupy too much system memory which
+		 * will fail memory allocation operations in the following
+		 * routines before kernel triggers memory shrinker call backs.
+		 * Shrinking 25% mca cache memory proactively here to avoid
+		 * potential memory allocation failure.
+		 */
+		if (!c->shrinker_disabled) {
+			struct shrink_control sc;
+
+			sc.gfp_mask = GFP_KERNEL;
+			sc.nr_to_scan =
+				c->shrink.count_objects(&c->shrink, &sc) / 4;
+			pr_debug("try to shrink %lu (25%%) cached btree node",
+				 sc.nr_to_scan);
+			c->shrink.scan_objects(&c->shrink, &sc);
+		}
+
 		bch_journal_mark(c, &journal);
 		bch_initial_gc_finish(c);
 		pr_debug("btree_check() done");
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 23/29] bcache: improve error message in bch_cached_dev_run()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (21 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 22/29] bcache: shrink btree node cache after bch_btree_check() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 24/29] bcache: make bset_search_tree() be more understandable Coly Li
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

This patch adds more error messages in bch_cached_dev_run() to
indicate the exact reason why an error value is returned. Notice that
pr_info() is used for the "is running already" message: although
-EBUSY is returned in this case, the bcache device can continue to
attach to the cache device and run, so it should not be an
error-level message in the kernel log.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 4a6406b53de1..026e2df358c3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -926,13 +926,18 @@ int bch_cached_dev_run(struct cached_dev *dc)
 		NULL,
 	};
 
-	if (dc->io_disable)
+	if (dc->io_disable) {
+		pr_err("I/O disabled on cached dev %s",
+		       dc->backing_dev_name);
 		return -EIO;
+	}
 
 	if (atomic_xchg(&dc->running, 1)) {
 		kfree(env[1]);
 		kfree(env[2]);
 		kfree(buf);
+		pr_info("cached dev %s is running already",
+		       dc->backing_dev_name);
 		return -EBUSY;
 	}
 
@@ -961,7 +966,7 @@ int bch_cached_dev_run(struct cached_dev *dc)
 	if (sysfs_create_link(&d->kobj, &disk_to_dev(d->disk)->kobj, "dev") ||
 	    sysfs_create_link(&disk_to_dev(d->disk)->kobj,
 			      &d->kobj, "bcache")) {
-		pr_debug("error creating sysfs link");
+		pr_err("Couldn't create bcache dev <-> disk sysfs symlinks");
 		return -ENOMEM;
 	}
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 24/29] bcache: make bset_search_tree() be more understandable
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (22 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 23/29] bcache: improve error message in bch_cached_dev_run() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 25/29] bcache: add pendings_cleanup to stop pending bcache device Coly Li
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

The purpose of the following code in bset_search_tree() is to avoid a
branch instruction,
 994         if (likely(f->exponent != 127))
 995                 n = j * 2 + (((unsigned int)
 996                               (f->mantissa -
 997                                bfloat_mantissa(search, f))) >> 31);
 998         else
 999                 n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
1000                         ? j * 2
1001                         : j * 2 + 1;

This piece of code is not very easy to understand; even when I tried
to add a code comment for it, I made a mistake. This patch removes the
implicit bit operation and uses an explicit branch to calculate the
next location in the binary tree search.
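
The removed trick works because unsigned subtraction exposes the sign
bit: for two small unsigned int values a and b,
((unsigned int)(a - b)) >> 31 is 1 when a < b (the subtraction wraps
around) and 0 otherwise. A standalone example (not bcache code),

	unsigned int a = 5, b = 9;
	unsigned int bit = ((unsigned int)(a - b)) >> 31; /* 1, as a < b */

	/*
	 * So "n = j * 2 + bit" branchlessly picks the left child
	 * (j * 2) when f->mantissa >= bfloat_mantissa(), and the
	 * right child (j * 2 + 1) otherwise -- exactly the explicit
	 * if/else this patch switches to.
	 */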

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bset.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/drivers/md/bcache/bset.c b/drivers/md/bcache/bset.c
index eedaf0f3e3f0..2722e7f83de3 100644
--- a/drivers/md/bcache/bset.c
+++ b/drivers/md/bcache/bset.c
@@ -965,25 +965,17 @@ static struct bset_search_iter bset_search_tree(struct bset_tree *t,
 		j = n;
 		f = &t->tree[j];
 
-		/*
-		 * Similar bit trick, use subtract operation to avoid a branch
-		 * instruction.
-		 *
-		 * n = (f->mantissa > bfloat_mantissa())
-		 *	? j * 2
-		 *	: j * 2 + 1;
-		 *
-		 * We need to subtract 1 from f->mantissa for the sign bit trick
-		 * to work  - that's done in make_bfloat()
-		 */
-		if (likely(f->exponent != 127))
-			n = j * 2 + (((unsigned int)
-				      (f->mantissa -
-				       bfloat_mantissa(search, f))) >> 31);
-		else
-			n = (bkey_cmp(tree_to_bkey(t, j), search) > 0)
-				? j * 2
-				: j * 2 + 1;
+		if (likely(f->exponent != 127)) {
+			if (f->mantissa >= bfloat_mantissa(search, f))
+				n = j * 2;
+			else
+				n = j * 2 + 1;
+		} else {
+			if (bkey_cmp(tree_to_bkey(t, j), search) > 0)
+				n = j * 2;
+			else
+				n = j * 2 + 1;
+		}
 	} while (n < t->size);
 
 	inorder = to_inorder(j, t);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 25/29] bcache: add pendings_cleanup to stop pending bcache device
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (23 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 24/29] bcache: make bset_search_tree() be more understandable Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 26/29] bcache: avoid a deadlock in bcache_reboot() Coly Li
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

If a bcache device is in dirty state and its cache set is not
registered, this bcache device will not appear in /dev/bcache<N>,
and there is no way to stop it or to remove the bcache kernel module.

This is as-designed behavior, but sometimes people have to reboot the
whole system to release or stop the pending backing device.

This patch adds a sysfs interface to remove such pending bcache
devices: writing anything into the new sysfs file (created as
/sys/fs/bcache/pendings_cleanup) stops them manually.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 026e2df358c3..16ed617183f7 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2291,9 +2291,13 @@ static int register_cache(struct cache_sb *sb, struct page *sb_page,
 
 static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 			       const char *buffer, size_t size);
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+					 struct kobj_attribute *attr,
+					 const char *buffer, size_t size);
 
 kobj_attribute_write(register,		register_bcache);
 kobj_attribute_write(register_quiet,	register_bcache);
+kobj_attribute_write(pendings_cleanup,	bch_pending_bdevs_cleanup);
 
 static bool bch_is_open_backing(struct block_device *bdev)
 {
@@ -2418,6 +2422,56 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	goto out;
 }
 
+
+struct pdev {
+	struct list_head list;
+	struct cached_dev *dc;
+};
+
+static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
+					 struct kobj_attribute *attr,
+					 const char *buffer,
+					 size_t size)
+{
+	LIST_HEAD(pending_devs);
+	ssize_t ret = size;
+	struct cached_dev *dc, *tdc;
+	struct pdev *pdev, *tpdev;
+	struct cache_set *c, *tc;
+
+	mutex_lock(&bch_register_lock);
+	list_for_each_entry_safe(dc, tdc, &uncached_devices, list) {
+		pdev = kmalloc(sizeof(struct pdev), GFP_KERNEL);
+		if (!pdev)
+			break;
+		pdev->dc = dc;
+		list_add(&pdev->list, &pending_devs);
+	}
+
+	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+		list_for_each_entry_safe(c, tc, &bch_cache_sets, list) {
+			char *pdev_set_uuid = pdev->dc->sb.set_uuid;
+			char *set_uuid = c->sb.uuid;
+
+			if (!memcmp(pdev_set_uuid, set_uuid, 16)) {
+				list_del(&pdev->list);
+				kfree(pdev);
+				break;
+			}
+		}
+	}
+	mutex_unlock(&bch_register_lock);
+
+	list_for_each_entry_safe(pdev, tpdev, &pending_devs, list) {
+		pr_info("delete pdev %p", pdev);
+		list_del(&pdev->list);
+		bcache_device_stop(&pdev->dc->disk);
+		kfree(pdev);
+	}
+
+	return ret;
+}
+
 static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 {
 	if (code == SYS_DOWN ||
@@ -2536,6 +2590,7 @@ static int __init bcache_init(void)
 	static const struct attribute *files[] = {
 		&ksysfs_register.attr,
 		&ksysfs_register_quiet.attr,
+		&ksysfs_pendings_cleanup.attr,
 		NULL
 	};
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 26/29] bcache: avoid a deadlock in bcache_reboot()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (24 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 25/29] bcache: add pendings_cleanup to stop pending bcache device Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 27/29] bcache: acquire bch_register_lock later in cached_dev_detach_finish() Coly Li
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

It is quite frequent to observe that a deadlock in bcache_reboot()
happens and hangs the system reboot process. The reason is that in
bcache_reboot() the mutex bch_register_lock is held while calling
bch_cache_set_stop() and bcache_device_stop(). But the process of
stopping a cache set or a bcache device acquires bch_register_lock
again; if the mutex is already held here, a deadlock happens inside
the stopping process. The aftermath of the deadlock is that the whole
system reboot gets hung.
The fix is to avoid holding bch_register_lock for the following loops
in bcache_reboot(),
       list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
                bch_cache_set_stop(c);

        list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
                bcache_device_stop(&dc->disk);

A module-scope variable 'bcache_is_reboot' is added; it is set to true
in bcache_reboot(). In register_bcache(), if bcache_is_reboot is true,
the registration is rejected by returning -EBUSY immediately.
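
Put next to each other, the two sides of the flag protocol look like
this (a sketch assembled from the hunks below, not new code),

	/* bcache_reboot(), while holding bch_register_lock */
	bcache_is_reboot = true;
	smp_mb();	/* publish the flag to other CPU cores */

	/* register_bcache() */
	smp_mb();	/* for the latest state of bcache_is_reboot */
	if (bcache_is_reboot)
		return -EBUSY;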

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 40 +++++++++++++++++++++++++++++++++++++++-
 drivers/md/bcache/sysfs.c | 26 ++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 16ed617183f7..2928adfea743 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -40,6 +40,7 @@ static const char invalid_uuid[] = {
 
 static struct kobject *bcache_kobj;
 struct mutex bch_register_lock;
+bool bcache_is_reboot;
 LIST_HEAD(bch_cache_sets);
 static LIST_HEAD(uncached_devices);
 
@@ -49,6 +50,7 @@ static wait_queue_head_t unregister_wait;
 struct workqueue_struct *bcache_wq;
 struct workqueue_struct *bch_journal_wq;
 
+
 #define BTREE_MAX_PAGES		(256 * 1024 / PAGE_SIZE)
 /* limitation of partitions number on single bcache device */
 #define BCACHE_MINORS		128
@@ -2345,6 +2347,11 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
 	if (!try_module_get(THIS_MODULE))
 		return -EBUSY;
 
+	/* For latest state of bcache_is_reboot */
+	smp_mb();
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	path = kstrndup(buffer, size, GFP_KERNEL);
 	if (!path)
 		goto err;
@@ -2474,6 +2481,9 @@ static ssize_t bch_pending_bdevs_cleanup(struct kobject *k,
 
 static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 {
+	if (bcache_is_reboot)
+		return NOTIFY_DONE;
+
 	if (code == SYS_DOWN ||
 	    code == SYS_HALT ||
 	    code == SYS_POWER_OFF) {
@@ -2486,19 +2496,45 @@ static int bcache_reboot(struct notifier_block *n, unsigned long code, void *x)
 
 		mutex_lock(&bch_register_lock);
 
+		if (bcache_is_reboot)
+			goto out;
+
+		/* New registration is rejected since now */
+		bcache_is_reboot = true;
+		/*
+		 * Make registering caller (if there is) on other CPU
+		 * core know bcache_is_reboot set to true earlier
+		 */
+		smp_mb();
+
 		if (list_empty(&bch_cache_sets) &&
 		    list_empty(&uncached_devices))
 			goto out;
 
+		mutex_unlock(&bch_register_lock);
+
 		pr_info("Stopping all devices:");
 
+		/*
+		 * The reason bch_register_lock is not held to call
+		 * bch_cache_set_stop() and bcache_device_stop() is to
+		 * avoid potential deadlock during reboot, because cache
+		 * set or bcache device stopping process will acquire
+		 * bch_register_lock too.
+		 *
+		 * We are safe here because bcache_is_reboot sets to
+		 * true already, register_bcache() will reject new
+		 * registration now. bcache_is_reboot also makes sure
+		 * bcache_reboot() won't be re-entered on by other thread,
+		 * so there is no race in following list iteration by
+		 * list_for_each_entry_safe().
+		 */
 		list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
 			bch_cache_set_stop(c);
 
 		list_for_each_entry_safe(dc, tdc, &uncached_devices, list)
 			bcache_device_stop(&dc->disk);
 
-		mutex_unlock(&bch_register_lock);
 
 		/*
 		 * Give an early chance for other kthreads and
@@ -2626,6 +2662,8 @@ static int __init bcache_init(void)
 	bch_debug_init();
 	closure_debug_init();
 
+	bcache_is_reboot = false;
+
 	return 0;
 err:
 	bcache_exit();
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 82a2fe8ac86b..4ab15442cab5 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -16,6 +16,8 @@
 #include <linux/sort.h>
 #include <linux/sched/clock.h>
 
+extern bool bcache_is_reboot;
+
 /* Default is 0 ("writethrough") */
 static const char * const bch_cache_modes[] = {
 	"writethrough",
@@ -267,6 +269,10 @@ STORE(__cached_dev)
 	struct cache_set *c;
 	struct kobj_uevent_env *env;
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 #define d_strtoul(var)		sysfs_strtoul(var, dc->var)
 #define d_strtoul_nonzero(var)	sysfs_strtoul_clamp(var, dc->var, 1, INT_MAX)
 #define d_strtoi_h(var)		sysfs_hatoi(var, dc->var)
@@ -407,6 +413,10 @@ STORE(bch_cached_dev)
 	struct cached_dev *dc = container_of(kobj, struct cached_dev,
 					     disk.kobj);
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	mutex_lock(&bch_register_lock);
 	size = __cached_dev_store(kobj, attr, buf, size);
 
@@ -505,6 +515,10 @@ STORE(__bch_flash_dev)
 					       kobj);
 	struct uuid_entry *u = &d->c->uuids[d->id];
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	sysfs_strtoul(data_csum,	d->data_csum);
 
 	if (attr == &sysfs_size) {
@@ -740,6 +754,10 @@ STORE(__bch_cache_set)
 	struct cache_set *c = container_of(kobj, struct cache_set, kobj);
 	ssize_t v;
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	if (attr == &sysfs_unregister)
 		bch_cache_set_unregister(c);
 
@@ -859,6 +877,10 @@ STORE(bch_cache_set_internal)
 {
 	struct cache_set *c = container_of(kobj, struct cache_set, internal);
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	return bch_cache_set_store(&c->kobj, attr, buf, size);
 }
 
@@ -1044,6 +1066,10 @@ STORE(__bch_cache)
 	struct cache *ca = container_of(kobj, struct cache, kobj);
 	ssize_t v;
 
+	/* no user space access if system is rebooting */
+	if (bcache_is_reboot)
+		return -EBUSY;
+
 	if (attr == &sysfs_discard) {
 		bool v = strtoul_or_return(buf);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 27/29] bcache: acquire bch_register_lock later in cached_dev_detach_finish()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (25 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 26/29] bcache: avoid a deadlock in bcache_reboot() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 28/29] bcache: acquire bch_register_lock later in cached_dev_free() Coly Li
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

Now that there is the variable bcache_is_reboot to prevent device
registration or unregistration during reboot, it is unnecessary to
hold the mutex bch_register_lock before stopping the
writeback_rate_update kworker and the writeback kthread. Moreover, if
the kworker or kthread being stopped holds bch_register_lock inside
its routine (we used to have such a problem in the writeback thread;
thanks to Junhui Wang for fixing it), it is very easy to introduce a
deadlock during the reboot/shutdown procedure.

Therefore in this patch the acquisition of bch_register_lock is moved
to just before the call to calc_cached_dev_sectors(), which is later
than its original location in cached_dev_detach_finish().

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 2928adfea743..15916b3ec8bf 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1018,7 +1018,6 @@ static void cached_dev_detach_finish(struct work_struct *w)
 	BUG_ON(!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags));
 	BUG_ON(refcount_read(&dc->count));
 
-	mutex_lock(&bch_register_lock);
 
 	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
 		cancel_writeback_rate_update_dwork(dc);
@@ -1034,6 +1033,8 @@ static void cached_dev_detach_finish(struct work_struct *w)
 	bch_write_bdev_super(dc, &cl);
 	closure_sync(&cl);
 
+	mutex_lock(&bch_register_lock);
+
 	calc_cached_dev_sectors(dc->disk.c);
 	bcache_device_detach(&dc->disk);
 	list_move(&dc->list, &uncached_devices);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 28/29] bcache: acquire bch_register_lock later in cached_dev_free()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (26 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 27/29] bcache: acquire bch_register_lock later in cached_dev_detach_finish() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-14 13:13 ` [PATCH 29/29] bcache: fix potential deadlock in cached_def_free() Coly Li
  2019-06-20  9:34 ` [PATCH 00/29] bcache candidate patches for Linux v5.3 Jens Axboe
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When the lockdep engine is enabled, a lockdep warning can be observed
when rebooting or shutting down the system,

[ 3142.764557][    T1] bcache: bcache_reboot() Stopping all devices:
[ 3142.776265][ T2649]
[ 3142.777159][ T2649] ======================================================
[ 3142.780039][ T2649] WARNING: possible circular locking dependency detected
[ 3142.782869][ T2649] 5.2.0-rc4-lp151.20-default+ #1 Tainted: G        W
[ 3142.785684][ T2649] ------------------------------------------------------
[ 3142.788479][ T2649] kworker/3:67/2649 is trying to acquire lock:
[ 3142.790738][ T2649] 00000000aaf02291 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[ 3142.794678][ T2649]
[ 3142.794678][ T2649] but task is already holding lock:
[ 3142.797402][ T2649] 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.801462][ T2649]
[ 3142.801462][ T2649] which lock already depends on the new lock.
[ 3142.801462][ T2649]
[ 3142.805277][ T2649]
[ 3142.805277][ T2649] the existing dependency chain (in reverse order) is:
[ 3142.808902][ T2649]
[ 3142.808902][ T2649] -> #2 (&bch_register_lock){+.+.}:
[ 3142.812396][ T2649]        __mutex_lock+0x7a/0x9d0
[ 3142.814184][ T2649]        cached_dev_free+0x17/0x120 [bcache]
[ 3142.816415][ T2649]        process_one_work+0x2a4/0x640
[ 3142.818413][ T2649]        worker_thread+0x39/0x3f0
[ 3142.820276][ T2649]        kthread+0x125/0x140
[ 3142.822061][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.823965][ T2649]
[ 3142.823965][ T2649] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[ 3142.827244][ T2649]        process_one_work+0x277/0x640
[ 3142.829160][ T2649]        worker_thread+0x39/0x3f0
[ 3142.830958][ T2649]        kthread+0x125/0x140
[ 3142.832674][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.834915][ T2649]
[ 3142.834915][ T2649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[ 3142.838121][ T2649]        lock_acquire+0xb4/0x1c0
[ 3142.840025][ T2649]        flush_workqueue+0xae/0x4c0
[ 3142.842035][ T2649]        drain_workqueue+0xa9/0x180
[ 3142.844042][ T2649]        destroy_workqueue+0x17/0x250
[ 3142.846142][ T2649]        cached_dev_free+0x52/0x120 [bcache]
[ 3142.848530][ T2649]        process_one_work+0x2a4/0x640
[ 3142.850663][ T2649]        worker_thread+0x39/0x3f0
[ 3142.852464][ T2649]        kthread+0x125/0x140
[ 3142.854106][ T2649]        ret_from_fork+0x3a/0x50
[ 3142.855880][ T2649]
[ 3142.855880][ T2649] other info that might help us debug this:
[ 3142.855880][ T2649]
[ 3142.859663][ T2649] Chain exists of:
[ 3142.859663][ T2649]   (wq_completion)bcache_writeback_wq --> (work_completion)(&cl->work)#2 --> &bch_register_lock
[ 3142.859663][ T2649]
[ 3142.865424][ T2649]  Possible unsafe locking scenario:
[ 3142.865424][ T2649]
[ 3142.868022][ T2649]        CPU0                    CPU1
[ 3142.869885][ T2649]        ----                    ----
[ 3142.871751][ T2649]   lock(&bch_register_lock);
[ 3142.873379][ T2649]                                lock((work_completion)(&cl->work)#2);
[ 3142.876399][ T2649]                                lock(&bch_register_lock);
[ 3142.879727][ T2649]   lock((wq_completion)bcache_writeback_wq);
[ 3142.882064][ T2649]
[ 3142.882064][ T2649]  *** DEADLOCK ***
[ 3142.882064][ T2649]
[ 3142.885060][ T2649] 3 locks held by kworker/3:67/2649:
[ 3142.887245][ T2649]  #0: 00000000e774cdd0 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.890815][ T2649]  #1: 00000000f7df89da ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[ 3142.894884][ T2649]  #2: 000000004fcf89c5 (&bch_register_lock){+.+.}, at: cached_dev_free+0x17/0x120 [bcache]
[ 3142.898797][ T2649]
[ 3142.898797][ T2649] stack backtrace:
[ 3142.900961][ T2649] CPU: 3 PID: 2649 Comm: kworker/3:67 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[ 3142.904789][ T2649] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 3142.909168][ T2649] Workqueue: events cached_dev_free [bcache]
[ 3142.911422][ T2649] Call Trace:
[ 3142.912656][ T2649]  dump_stack+0x85/0xcb
[ 3142.914181][ T2649]  print_circular_bug+0x19a/0x1f0
[ 3142.916193][ T2649]  __lock_acquire+0x16cd/0x1850
[ 3142.917936][ T2649]  ? __lock_acquire+0x6a8/0x1850
[ 3142.919704][ T2649]  ? lock_acquire+0xb4/0x1c0
[ 3142.921335][ T2649]  ? find_held_lock+0x34/0xa0
[ 3142.923052][ T2649]  lock_acquire+0xb4/0x1c0
[ 3142.924635][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.926375][ T2649]  flush_workqueue+0xae/0x4c0
[ 3142.928047][ T2649]  ? flush_workqueue+0x87/0x4c0
[ 3142.929824][ T2649]  ? drain_workqueue+0xa9/0x180
[ 3142.931686][ T2649]  drain_workqueue+0xa9/0x180
[ 3142.933534][ T2649]  destroy_workqueue+0x17/0x250
[ 3142.935787][ T2649]  cached_dev_free+0x52/0x120 [bcache]
[ 3142.937795][ T2649]  process_one_work+0x2a4/0x640
[ 3142.939803][ T2649]  worker_thread+0x39/0x3f0
[ 3142.941487][ T2649]  ? process_one_work+0x640/0x640
[ 3142.943389][ T2649]  kthread+0x125/0x140
[ 3142.944894][ T2649]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3142.947744][ T2649]  ret_from_fork+0x3a/0x50
[ 3142.970358][ T2649] bcache: bcache_device_free() bcache0 stopped

Here is how the deadlock happens.
1) bcache_reboot() calls bcache_device_stop(), then inside
   bcache_device_stop() BCACHE_DEV_CLOSING bit is set on d->flags.
   Then closure_queue(&d->cl) is called to invoke cached_dev_flush().
2) In cached_dev_flush(), cached_dev_free() is called by continue_at().
3) In cached_dev_free(), when stopping the writeback kthread of the
   cached device by kthread_stop(), dc->writeback_thread will be woken
   up to quit the kthread while-loop, then cached_dev_put() is called
   in bch_writeback_thread().
4) Calling cached_dev_put() in the writeback kthread may drop dc->count
   to 0, then the dc->detach kworker, which is initialized with
   cached_dev_detach_finish(), is scheduled.
5) Inside cached_dev_detach_finish(), the last line of code calls
   closure_put(&dc->disk.cl), which drops the last reference counter of
   closure dc->disk.cl; then the callback cached_dev_flush() gets
   called.
Now cached_dev_flush() is called for the second time in this code path
(the first time was in step 2), bch_register_lock will be acquired
again, and an A-A lock (in lockdep terminology) happens.

The root cause of the above A-A lock is that in cached_dev_free() the
mutex bch_register_lock is held before stopping the writeback kthread
and other kworkers. Fortunately now we have the variable
'bcache_is_reboot', which prevents device registration and
unregistration during reboot/shutdown, so it is unnecessary to hold
bch_register_lock that early any more.

This is how this patch fixes the reboot/shutdown time A-A lock issue:
after moving mutex_lock(&bch_register_lock) to a later location, just
before atomic_read(&dc->running) in cached_dev_free(), the A-A lock
problem is solved without any reboot time registration race.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 15916b3ec8bf..f376ba7e4d3f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1223,8 +1223,6 @@ static void cached_dev_free(struct closure *cl)
 {
 	struct cached_dev *dc = container_of(cl, struct cached_dev, disk.cl);
 
-	mutex_lock(&bch_register_lock);
-
 	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
 		cancel_writeback_rate_update_dwork(dc);
 
@@ -1235,6 +1233,8 @@ static void cached_dev_free(struct closure *cl)
 	if (!IS_ERR_OR_NULL(dc->status_update_thread))
 		kthread_stop(dc->status_update_thread);
 
+	mutex_lock(&bch_register_lock);
+
 	if (atomic_read(&dc->running))
 		bd_unlink_disk_holder(dc->bdev, dc->disk.disk);
 	bcache_device_free(&dc->disk);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 29/29] bcache: fix potential deadlock in cached_def_free()
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (27 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 28/29] bcache: acquire bch_register_lock later in cached_dev_free() Coly Li
@ 2019-06-14 13:13 ` Coly Li
  2019-06-20  9:34 ` [PATCH 00/29] bcache candidate patches for Linux v5.3 Jens Axboe
  29 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-14 13:13 UTC (permalink / raw)
  To: linux-bcache; +Cc: linux-block, Coly Li

When lockdep is enabled and the system is rebooted with a writeback
mode bcache device, the following potential deadlock warning is
reported by the lockdep engine.

[  101.536569][  T401] kworker/2:2/401 is trying to acquire lock:
[  101.538575][  T401] 00000000bbf6e6c7 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x87/0x4c0
[  101.542054][  T401]
[  101.542054][  T401] but task is already holding lock:
[  101.544587][  T401] 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.548386][  T401]
[  101.548386][  T401] which lock already depends on the new lock.
[  101.548386][  T401]
[  101.551874][  T401]
[  101.551874][  T401] the existing dependency chain (in reverse order) is:
[  101.555000][  T401]
[  101.555000][  T401] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[  101.557860][  T401]        process_one_work+0x277/0x640
[  101.559661][  T401]        worker_thread+0x39/0x3f0
[  101.561340][  T401]        kthread+0x125/0x140
[  101.562963][  T401]        ret_from_fork+0x3a/0x50
[  101.564718][  T401]
[  101.564718][  T401] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[  101.567701][  T401]        lock_acquire+0xb4/0x1c0
[  101.569651][  T401]        flush_workqueue+0xae/0x4c0
[  101.571494][  T401]        drain_workqueue+0xa9/0x180
[  101.573234][  T401]        destroy_workqueue+0x17/0x250
[  101.575109][  T401]        cached_dev_free+0x44/0x120 [bcache]
[  101.577304][  T401]        process_one_work+0x2a4/0x640
[  101.579357][  T401]        worker_thread+0x39/0x3f0
[  101.581055][  T401]        kthread+0x125/0x140
[  101.582709][  T401]        ret_from_fork+0x3a/0x50
[  101.584592][  T401]
[  101.584592][  T401] other info that might help us debug this:
[  101.584592][  T401]
[  101.588355][  T401]  Possible unsafe locking scenario:
[  101.588355][  T401]
[  101.590974][  T401]        CPU0                    CPU1
[  101.592889][  T401]        ----                    ----
[  101.594743][  T401]   lock((work_completion)(&cl->work)#2);
[  101.596785][  T401]                                lock((wq_completion)bcache_writeback_wq);
[  101.600072][  T401]                                lock((work_completion)(&cl->work)#2);
[  101.602971][  T401]   lock((wq_completion)bcache_writeback_wq);
[  101.605255][  T401]
[  101.605255][  T401]  *** DEADLOCK ***
[  101.605255][  T401]
[  101.608310][  T401] 2 locks held by kworker/2:2/401:
[  101.610208][  T401]  #0: 00000000cf2c7d17 ((wq_completion)events){+.+.}, at: process_one_work+0x21e/0x640
[  101.613709][  T401]  #1: 00000000f5f305b3 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x21e/0x640
[  101.617480][  T401]
[  101.617480][  T401] stack backtrace:
[  101.619539][  T401] CPU: 2 PID: 401 Comm: kworker/2:2 Tainted: G        W         5.2.0-rc4-lp151.20-default+ #1
[  101.623225][  T401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  101.627210][  T401] Workqueue: events cached_dev_free [bcache]
[  101.629239][  T401] Call Trace:
[  101.630360][  T401]  dump_stack+0x85/0xcb
[  101.631777][  T401]  print_circular_bug+0x19a/0x1f0
[  101.633485][  T401]  __lock_acquire+0x16cd/0x1850
[  101.635184][  T401]  ? __lock_acquire+0x6a8/0x1850
[  101.636863][  T401]  ? lock_acquire+0xb4/0x1c0
[  101.638421][  T401]  ? find_held_lock+0x34/0xa0
[  101.640015][  T401]  lock_acquire+0xb4/0x1c0
[  101.641513][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.643248][  T401]  flush_workqueue+0xae/0x4c0
[  101.644832][  T401]  ? flush_workqueue+0x87/0x4c0
[  101.646476][  T401]  ? drain_workqueue+0xa9/0x180
[  101.648303][  T401]  drain_workqueue+0xa9/0x180
[  101.649867][  T401]  destroy_workqueue+0x17/0x250
[  101.651503][  T401]  cached_dev_free+0x44/0x120 [bcache]
[  101.653328][  T401]  process_one_work+0x2a4/0x640
[  101.655029][  T401]  worker_thread+0x39/0x3f0
[  101.656693][  T401]  ? process_one_work+0x640/0x640
[  101.658501][  T401]  kthread+0x125/0x140
[  101.660012][  T401]  ? kthread_create_worker_on_cpu+0x70/0x70
[  101.661985][  T401]  ret_from_fork+0x3a/0x50
[  101.691318][  T401] bcache: bcache_device_free() bcache0 stopped

Here is how the above potential deadlock may happen in the
reboot/shutdown code path,
1) bcache_reboot() is called first in the reboot/shutdown code path,
   then in bcache_reboot(), bcache_device_stop() is called.
2) bcache_device_stop() sets BCACHE_DEV_CLOSING on d->flags, then calls
   closure_queue(&d->cl) to invoke cached_dev_flush(). In turn
   cached_dev_flush() calls cached_dev_free() via continue_at().
3) In cached_dev_free(), after the writeback kthread
   dc->writeback_thread is stopped, the workqueue
   dc->writeback_write_wq is stopped by destroy_workqueue().
4) Inside destroy_workqueue(), drain_workqueue() is called. Inside
   drain_workqueue(), flush_workqueue() is called. Then wq->lockdep_map
   is acquired by lock_map_acquire() in flush_workqueue(). After the
   lock is acquired, the rest of flush_workqueue() just waits for the
   workqueue to complete.
5) Now look back at the writeback thread routine
   bch_writeback_thread(): in the main while-loop, write_dirty() is
   called via continue_at() in read_dirty_submit(), which is itself
   called via continue_at() from read_dirty() at while-loop level.
   Inside write_dirty() it may re-queue itself on workqueue
   dc->writeback_write_wq via continue_at(). This means that when the
   writeback kthread is stopped in cached_dev_free() there might still
   be one kworker queued on dc->writeback_write_wq to execute
   write_dirty() again.
6) This kworker is then scheduled on dc->writeback_write_wq and run by
   process_one_work() (which is called by worker_thread()). Before the
   work routine is called, wq->lockdep_map is acquired.
7) But wq->lockdep_map was already acquired in step 4), so an A-A lock
   (in lockdep terminology) scenario happens.

Indeed, on multi-core systems the above deadlock rarely happens, just
as the code comment in process_one_work() says,
2263     * AFAICT there is no possible deadlock scenario between the
2264     * flush_work() and complete() primitives (except for single-threaded
2265     * workqueues), so hiding them isn't a problem.

But it is still good to fix such a lockdep warning, even if no one
runs bcache on a single-core system.

The fix is simple. This patch solves the above potential deadlock by,
- Not destroying workqueue dc->writeback_write_wq in cached_dev_free().
- Flushing and destroying dc->writeback_write_wq in the writeback
  kthread routine bch_writeback_thread(), after the thread quits its
  main while-loop and before cached_dev_put() is called.

With this fix, dc->writeback_write_wq is flushed and destroyed before
the writeback kthread stops, so the chance of an A-A lock on
wq->lockdep_map disappears, and such an A-A deadlock won't happen
any more.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c     | 2 --
 drivers/md/bcache/writeback.c | 4 ++++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f376ba7e4d3f..06b4cc0cecce 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1228,8 +1228,6 @@ static void cached_dev_free(struct closure *cl)
 
 	if (!IS_ERR_OR_NULL(dc->writeback_thread))
 		kthread_stop(dc->writeback_thread);
-	if (dc->writeback_write_wq)
-		destroy_workqueue(dc->writeback_write_wq);
 	if (!IS_ERR_OR_NULL(dc->status_update_thread))
 		kthread_stop(dc->status_update_thread);
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 73f0efac2b9f..df0f4e5a051a 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -735,6 +735,10 @@ static int bch_writeback_thread(void *arg)
 		}
 	}
 
+	if (dc->writeback_write_wq) {
+		flush_workqueue(dc->writeback_write_wq);
+		destroy_workqueue(dc->writeback_write_wq);
+	}
 	cached_dev_put(dc);
 	wait_for_kthread_stop();
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 00/29] bcache candidate patches for Linux v5.3
  2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
                   ` (28 preceding siblings ...)
  2019-06-14 13:13 ` [PATCH 29/29] bcache: fix potential deadlock in cached_def_free() Coly Li
@ 2019-06-20  9:34 ` Jens Axboe
  29 siblings, 0 replies; 34+ messages in thread
From: Jens Axboe @ 2019-06-20  9:34 UTC (permalink / raw)
  To: Coly Li, linux-bcache; +Cc: linux-block

On 6/14/19 7:13 AM, Coly Li wrote:
> Hi folks,
> 
> Now I am testing the bcache patches for Linux v5.3, here I collect
> all previously posted patches for your information. Any code review
> and comment is welcome.

Applied for 5.3.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/29] bcache: fix race in btree_flush_write()
  2019-06-14 13:13 ` [PATCH 06/29] bcache: fix race in btree_flush_write() Coly Li
@ 2019-06-27  9:16   ` Yaowei Bai
  2019-06-27 11:47     ` Coly Li
  0 siblings, 1 reply; 34+ messages in thread
From: Yaowei Bai @ 2019-06-27  9:16 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache, linux-block

On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
> There is a race between mca_reap(), btree_node_free() and the journal
> code btree_flush_write(), which results in very rare and strange
> deadlocks or panics that are very hard to reproduce.
> 
> Let me explain how the race happens. In btree_flush_write() the btree
> node with the oldest journal pin is selected, then it is flushed to
> the cache device; the select-and-flush is a two-step operation.
> Between these two steps, several things may happen inside the race
> window,
> - The selected btree node was reaped by mca_reap() and allocated to
>   other requesters as another btree node.
> - The selected btree node was selected, flushed and released by the
>   mca shrink callback bch_mca_scan().
> When btree_flush_write() tries to flush the selected btree node, first
> b->write_lock is taken by mutex_lock(). If the race happens and the
> memory of the selected btree node has been allocated to another btree
> node whose write_lock is already held, a deadlock very probably
> happens here. A worse case is that the memory of the selected btree
> node has been released; then any reference to this btree node (e.g.
> b->write_lock) will trigger a NULL pointer dereference panic.
> 
> This race was introduced in commit cafe56359144 ("bcache: A block
> layer cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high
> CPU occupancy during journal"), which selected 128 btree nodes and
> flushed them one-by-one over a quite long time period.
> 
> Such a race was not easy to reproduce before. On a Lenovo SR650 server
> with 48 Xeon cores, with 1 NVMe SSD configured as the cache device and
> an MD raid0 device assembled from 3 NVMe SSDs as the backing device,
> this race can be observed around once every 10,000 calls to
> btree_flush_write(). Both deadlocks and kernel panics happened as
> aftermath of the race.
> 
> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
> It is set when selecting btree nodes, and cleared after the btree
> nodes are flushed. Then when mca_reap() selects a btree node with this
> bit set, that btree node will be skipped. Since mca_reap() only reaps
> btree nodes without the BTREE_NODE_journal_flush flag, the race is
> avoided.
> 
> One corner case should be noticed, that is btree_node_free(). It might
> be called in some error handling code paths. For example the following
> code piece from btree_split(),
> 	2149 err_free2:
> 	2150         bkey_put(b->c, &n2->key);
> 	2151         btree_node_free(n2);
> 	2152         rw_unlock(true, n2);
> 	2153 err_free1:
> 	2154         bkey_put(b->c, &n1->key);
> 	2155         btree_node_free(n1);
> 	2156         rw_unlock(true, n1);
> At lines 2151 and 2155, the btree nodes n2 and n1 are released without
> mca_reap(), so BTREE_NODE_journal_flush also needs to be checked here.
> If btree_node_free() is called directly in such an error handling
> path, and the selected btree node has the BTREE_NODE_journal_flush bit
> set, just wait for 1 jiffy and retry. In this case the btree node
> won't be skipped; just retry until the BTREE_NODE_journal_flush bit is
> cleared, then free the btree node memory.
> 
> Waiting for 1 jiffy inside btree_node_free() does not hurt performance
> too much here, the reasons are,
> - The error handling code path is not frequently executed, and the
>   race inside this cold path should be very rare. If the very rare
>   race does happen in the cold code path, waiting 1 jiffy is
>   acceptable.
> - If btree_node_free() is called inside mca_reap(), it means the
>   BTREE_NODE_journal_flush bit was checked already; no wait will
>   happen here.
> 
> Besides the above fix, the way to select flushing btree nodes is also
> changed in this patch. Let me explain what is changed in this patch.

Then this change should be split into another patch. :)

> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/29] bcache: fix race in btree_flush_write()
  2019-06-27  9:16   ` Yaowei Bai
@ 2019-06-27 11:47     ` Coly Li
  2019-06-27 12:45       ` Coly Li
  0 siblings, 1 reply; 34+ messages in thread
From: Coly Li @ 2019-06-27 11:47 UTC (permalink / raw)
  To: baiyaowei; +Cc: linux-bcache, linux-block

On 2019/6/27 5:16 PM, Yaowei Bai wrote:
> On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
>> [...]
>>
>> Besides the above fix, the way to select flushing btree nodes is also
>> changed in this patch. Let me explain what is changed in this patch.
> 
> Then this change should be split into another patch. :)

Hi Bai,

Sure, it makes sense. I also realize that splitting it into two patches
may be helpful for long term kernel maintainers to backport the fix
without breaking KABI.

I will send a two-patch version in the for-5.3 submission.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 06/29] bcache: fix race in btree_flush_write()
  2019-06-27 11:47     ` Coly Li
@ 2019-06-27 12:45       ` Coly Li
  0 siblings, 0 replies; 34+ messages in thread
From: Coly Li @ 2019-06-27 12:45 UTC (permalink / raw)
  To: baiyaowei; +Cc: linux-bcache, linux-block

On 2019/6/27 7:47 PM, Coly Li wrote:
> On 2019/6/27 5:16 PM, Yaowei Bai wrote:
>> On Fri, Jun 14, 2019 at 09:13:35PM +0800, Coly Li wrote:
>>> There is a race between mca_reap(), btree_node_free() and journal code
>>> btree_flush_write(), which results very rare and strange deadlock or
>>> panic and are very hard to reproduce.
>>>
>>> Let me explain how the race happens. In btree_flush_write() the btree
>>> node with the oldest journal pin is selected, then it is flushed to
>>> the cache device; the select-and-flush is a two-step operation.
>>> Between these two steps, several things may happen inside the race
>>> window:
>>> - The selected btree node was reaped by mca_reap() and allocated to
>>>   another requester for another btree node.
>>> - The selected btree node was selected, flushed and released by the
>>>   mca shrink callback bch_mca_scan().
>>> When btree_flush_write() tries to flush the selected btree node, first
>>> b->write_lock is held by mutex_lock(). If the race happens and the
>>> memory of the selected btree node has been allocated to another btree
>>> node whose write_lock is already held, a deadlock very probably
>>> happens here. A worse case is when the memory of the selected btree
>>> node has been released; then any reference to this btree node (e.g.
>>> b->write_lock) will trigger a NULL pointer dereference panic.
>>>
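To make the race window concrete, here is a minimal userspace sketch of
the two-step select-and-flush pattern described above. Everything in it
is a made-up stand-in (the struct layout, select_oldest(), the dirty
flag); it only models the shape of the race, not the actual bcache code:

	#include <pthread.h>
	#include <stddef.h>

	struct btree {
		pthread_mutex_t write_lock;
		int dirty;
	};

	/* Step 1: pick a node to flush (stand-in for the
	 * oldest-journal-pin scan). */
	static struct btree *select_oldest(struct btree *nodes, size_t n)
	{
		for (size_t i = 0; i < n; i++)
			if (nodes[i].dirty)
				return &nodes[i];
		return NULL;
	}

	static void flush_write(struct btree *nodes, size_t n)
	{
		struct btree *b = select_oldest(nodes, n);

		if (!b)
			return;
		/*
		 * Race window: before the lock below is taken, another
		 * thread (mca_reap() or bch_mca_scan() in real bcache)
		 * may reuse or free b's memory, so mutex_lock() can
		 * deadlock on a lock now owned by an unrelated node,
		 * or dereference freed memory.
		 */
		pthread_mutex_lock(&b->write_lock);
		b->dirty = 0;		/* step 2: the "flush" */
		pthread_mutex_unlock(&b->write_lock);
	}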
>>> This race was introduced in commit cafe56359144 ("bcache: A block
>>> layer cache"), and enlarged by commit c4dc2497d50d ("bcache: fix high
>>> CPU occupancy during journal"), which selected 128 btree nodes and
>>> flushed them one-by-one over quite a long time period.
>>>
>>> Such a race was not easy to reproduce before. On a Lenovo SR650
>>> server with 48 Xeon cores, configured with 1 NVMe SSD as the cache
>>> device and an MD raid0 device assembled from 3 NVMe SSDs as the
>>> backing device, this race can be observed roughly once per 10,000
>>> calls to btree_flush_write(). Both deadlocks and kernel panics
>>> happened as an aftermath of the race.
>>>
>>> The idea of the fix is to add a btree flag BTREE_NODE_journal_flush.
>>> It is set when btree nodes are selected for flushing, and cleared
>>> after the btree nodes are flushed. When mca_reap() encounters a btree
>>> node with this bit set, the node is skipped. Since mca_reap() only
>>> reaps btree nodes without the BTREE_NODE_journal_flush flag, the race
>>> is avoided.
>>>
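A minimal sketch of the flag idea in the same userspace toy model; the
atomic flag stands in for the BTREE_NODE_journal_flush bit, and the
function names only mirror bcache's, with different real signatures:

	#include <stdatomic.h>
	#include <stdbool.h>

	struct btree {
		atomic_bool journal_flush; /* models BTREE_NODE_journal_flush */
		bool dirty;
	};

	static void flush_write(struct btree *b)
	{
		/* Set while the node is selected and being flushed ... */
		atomic_store(&b->journal_flush, true);
		b->dirty = false;		/* the "flush" */
		/* ... and cleared once the flush is done. */
		atomic_store(&b->journal_flush, false);
	}

	/* The reaper refuses nodes the journal code is still flushing. */
	static bool mca_reap(struct btree *b)
	{
		if (atomic_load(&b->journal_flush))
			return false;	/* skip this node, pick another */
		/* ... safe to reuse or free b's memory here ... */
		return true;
	}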
>>> One corner case should be noticed, and that is btree_node_free(). It
>>> might be called in some error handling code paths. For example the
>>> following code piece from btree_split(),
>>> 	2149 err_free2:
>>> 	2150         bkey_put(b->c, &n2->key);
>>> 	2151         btree_node_free(n2);
>>> 	2152         rw_unlock(true, n2);
>>> 	2153 err_free1:
>>> 	2154         bkey_put(b->c, &n1->key);
>>> 	2155         btree_node_free(n1);
>>> 	2156         rw_unlock(true, n1);
>>> At lines 2151 and 2155, the btree nodes n2 and n1 are released
>>> without going through mca_reap(), so BTREE_NODE_journal_flush also
>>> needs to be checked here. If btree_node_free() is called directly in
>>> such an error handling path, and the selected btree node has the
>>> BTREE_NODE_journal_flush bit set, just wait for 1 jiffy and retry. In
>>> this case the btree node won't be skipped; just retry until the
>>> BTREE_NODE_journal_flush bit is cleared, then free the btree node
>>> memory.
>>>
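Continuing the same toy model, the btree_node_free() corner case could
look like the sketch below; usleep() stands in for the kernel's 1-jiffy
wait (this is userspace), and the actual free is elided:

	#include <stdatomic.h>
	#include <unistd.h>

	static void btree_node_free(struct btree *b)
	{
		/*
		 * Unlike mca_reap(), this path cannot skip to another
		 * node: b itself must be freed.  So wait and retry
		 * until the journal code clears the flag.
		 */
		while (atomic_load(&b->journal_flush))
			usleep(1000);	/* stand-in for a 1-jiffy wait */
		/* ... now safe to free b's memory ... */
	}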
>>> Waiting for 1 jiffy inside btree_node_free() does not hurt
>>> performance much here, for two reasons:
>>> - Error handling code paths are not frequently executed, and the race
>>>   inside this cold path should be very rare. If the very rare race
>>>   happens in the cold code path, waiting 1 jiffy should be acceptable.
>>> - If btree_node_free() is called inside mca_reap(), it means the
>>>   BTREE_NODE_journal_flush bit is checked already, so no wait will
>>>   happen here.
>>>
>>> Besides the above fix, the way btree nodes are selected for flushing
>>> is also changed in this patch. Let me explain what is changed.
>>
>> Then this change should be split into another patch. :)
> 
> Hi Bai,
> 
> Sure, it makes sense. I also realize that splitting it into two patches
> may be helpful for long-term kernel maintainers to backport patches
> without breaking KABI.
> 
> I will send a two-patch version in the for-5.3 submission.

I just realized that breaking KABI is unavoidable, but splitting this
patch into two still makes sense; the performance optimization should
not go into the race fix.

Thanks.

-- 

Coly Li

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-06-27 12:45 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-14 13:13 [PATCH 00/29] bcache candidate patches for Linux v5.3 Coly Li
2019-06-14 13:13 ` [PATCH 01/29] bcache: Revert "bcache: fix high CPU occupancy during journal" Coly Li
2019-06-14 13:13 ` [PATCH 02/29] bcache: Revert "bcache: free heap cache_set->flush_btree in bch_journal_free" Coly Li
2019-06-14 13:13 ` [PATCH 03/29] bcache: add code comments for journal_read_bucket() Coly Li
2019-06-14 13:13 ` [PATCH 04/29] bcache: set largest seq to ja->seq[bucket_index] in journal_read_bucket() Coly Li
2019-06-14 13:13 ` [PATCH 05/29] bcache: remove retry_flush_write from struct cache_set Coly Li
2019-06-14 13:13 ` [PATCH 06/29] bcache: fix race in btree_flush_write() Coly Li
2019-06-27  9:16   ` Yaowei Bai
2019-06-27 11:47     ` Coly Li
2019-06-27 12:45       ` Coly Li
2019-06-14 13:13 ` [PATCH 07/29] bcache: add reclaimed_journal_buckets to struct cache_set Coly Li
2019-06-14 13:13 ` [PATCH 08/29] bcache: fix return value error in bch_journal_read() Coly Li
2019-06-14 13:13 ` [PATCH 09/29] Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()" Coly Li
2019-06-14 13:13 ` [PATCH 10/29] bcache: avoid flushing btree node in cache_set_flush() if io disabled Coly Li
2019-06-14 13:13 ` [PATCH 11/29] bcache: ignore read-ahead request failure on backing device Coly Li
2019-06-14 13:13 ` [PATCH 12/29] bcache: add io error counting in write_bdev_super_endio() Coly Li
2019-06-14 13:13 ` [PATCH 13/29] bcache: remove "XXX:" comment line from run_cache_set() Coly Li
2019-06-14 13:13 ` [PATCH 14/29] bcache: remove unnecessary prefetch() in bset_search_tree() Coly Li
2019-06-14 13:13 ` [PATCH 15/29] bcache: use sysfs_match_string() instead of __sysfs_match_string() Coly Li
2019-06-14 13:13 ` [PATCH 16/29] bcache: add return value check to bch_cached_dev_run() Coly Li
2019-06-14 13:13 ` [PATCH 17/29] bcache: remove unncessary code in bch_btree_keys_init() Coly Li
2019-06-14 13:13 ` [PATCH 18/29] bcache: check CACHE_SET_IO_DISABLE in allocator code Coly Li
2019-06-14 13:13 ` [PATCH 19/29] bcache: check CACHE_SET_IO_DISABLE bit in bch_journal() Coly Li
2019-06-14 13:13 ` [PATCH 20/29] bcache: more detailed error message to bcache_device_link() Coly Li
2019-06-14 13:13 ` [PATCH 21/29] bcache: add more error message in bch_cached_dev_attach() Coly Li
2019-06-14 13:13 ` [PATCH 22/29] bcache: shrink btree node cache after bch_btree_check() Coly Li
2019-06-14 13:13 ` [PATCH 23/29] bcache: improve error message in bch_cached_dev_run() Coly Li
2019-06-14 13:13 ` [PATCH 24/29] bcache: make bset_search_tree() be more understandable Coly Li
2019-06-14 13:13 ` [PATCH 25/29] bcache: add pendings_cleanup to stop pending bcache device Coly Li
2019-06-14 13:13 ` [PATCH 26/29] bcache: avoid a deadlock in bcache_reboot() Coly Li
2019-06-14 13:13 ` [PATCH 27/29] bcache: acquire bch_register_lock later in cached_dev_detach_finish() Coly Li
2019-06-14 13:13 ` [PATCH 28/29] bcache: acquire bch_register_lock later in cached_dev_free() Coly Li
2019-06-14 13:13 ` [PATCH 29/29] bcache: fix potential deadlock in cached_def_free() Coly Li
2019-06-20  9:34 ` [PATCH 00/29] bcache candidate patches for Linux v5.3 Jens Axboe
