* [416 PATCH 00/13] Bcache changes for 4.16
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe

Jens,

Please pick up the following reviewed changes for 4.16.  There are some
small cleanliness changes, a few minor bug fixes (some in preparation
for larger work), and ongoing work on writeback performance:

 11 files changed, 303 insertions(+), 133 deletions(-)

Coly Li (2):
      bcache: reduce cache_set devices iteration by devices_max_used
      bcache: fix misleading error message in bch_count_io_errors()

Kent Overstreet (2):
      bcache: Fix, improve efficiency of closure_sync()
      bcache: mark closure_sync() __sched

Michael Lyle (3):
      bcache: writeback: properly order backing device IO
      bcache: allow quick writeback when backing idle
      bcache: fix writeback target calc on large devices

Rui Hua (1):
      bcache: ret IOERR when read meets metadata error

Tang Junhui (3):
      bcache: stop writeback thread after detaching
      bcache: segregate flash only volume write streams
      bcache: fix wrong return value in bch_debug_init()

Vasyl Gomonovych (1):
      bcache: Use PTR_ERR_OR_ZERO()

Zhai Zhaoxuan (1):
      bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()

Thanks,

Mike


* [416 PATCH 01/13] bcache: ret IOERR when read meets metadata error
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Rui Hua, Michael Lyle

From: Rui Hua <huarui.dev@gmail.com>

A read request might encounter an error while searching the btree, but the
error was not handled in cache_lookup(); this kind of metadata failure
never goes into cached_dev_read_error(), so the upper layer finally
receives bi_status=0.  In this patch we detect the metadata error from the
return value of bch_btree_map_keys(); there are two potential paths that
give rise to the error:

1. Because the btree is not entirely cached in memory, we may get an error
   when reading a btree node from the cache device (see
   bch_btree_node_get()); the likely errnos are -EIO and -ENOMEM.

2. When a read miss happens, bch_btree_insert_check_key() is called to
   insert a "replace_key" into the btree (see cached_dev_cache_miss();
   this is just preparatory work before inserting the missed data into
   the cache device); a failure can also happen in this situation, and
   the likely errno is -ENOMEM.

bch_btree_map_keys() returns MAP_DONE in the normal scenario, but we will
get either -EIO or -ENOMEM in the two cases above. If this happens, we
should NOT recover data from the backing device (when the cache device is
dirty) because we don't know whether the bkeys the read request covers are
all clean.  And since s->iop.status still holds its initial value (0)
before we submit s->bio.bio, we set it to BLK_STS_IOERR, so the request
can go into cached_dev_read_error() and finally be passed to the upper
layer, or recovered by a reread from the backing device.

[edit by mlyle: patch formatting, word-wrap, comment spelling,
commit log format]

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/request.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index c493fb947dc9..52b4ce24f9e2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -576,6 +576,7 @@ static void cache_lookup(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, iop.cl);
 	struct bio *bio = &s->bio.bio;
+	struct cached_dev *dc;
 	int ret;
 
 	bch_btree_op_init(&s->op, -1);
@@ -588,6 +589,27 @@ static void cache_lookup(struct closure *cl)
 		return;
 	}
 
+	/*
+	 * We might meet an error when searching the btree; if that happens,
+	 * we get a negative ret. In this scenario we should not recover data
+	 * from the backing device (when the cache device is dirty) because we
+	 * don't know whether the bkeys the read request covers are all clean.
+	 *
+	 * After that happens, s->iop.status still holds its initial value
+	 * before we submit s->bio.bio.
+	 */
+	if (ret < 0) {
+		BUG_ON(ret == -EINTR);
+		if (s->d && s->d->c &&
+				!UUID_FLASH_ONLY(&s->d->c->uuids[s->d->id])) {
+			dc = container_of(s->d, struct cached_dev, disk);
+			if (dc && atomic_read(&dc->has_dirty))
+				s->recoverable = false;
+		}
+		if (!s->iop.status)
+			s->iop.status = BLK_STS_IOERR;
+	}
+
 	closure_return(cl);
 }
 
-- 
2.14.1
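
To make the recovery rule concrete, here is a minimal, self-contained
sketch of the decision the hunk above implements (an illustrative helper;
the name may_recover_from_backing is hypothetical and not from the patch):

#include <stdbool.h>

/*
 * A failed btree lookup (negative ret, e.g. -EIO or -ENOMEM) must not
 * fall back to reading the backing device while the cache still holds
 * dirty data, because some of the covered bkeys may be dirty and the
 * backing copy stale.
 */
static bool may_recover_from_backing(int lookup_ret, bool cache_has_dirty)
{
	if (lookup_ret >= 0)		/* lookup succeeded: normal path */
		return true;
	return !cache_has_dirty;	/* on error: recover only if clean */
}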


* [416 PATCH 02/13] bcache: stop writeback thread after detaching
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Tang Junhui, Michael Lyle

From: Tang Junhui <tang.junhui@zte.com.cn>

Currently, when a cached device is detached from the cache, the writeback
thread is not stopped and the writeback_rate_update work is not canceled.
For example, after the following command:
echo 1 >/sys/block/sdb/bcache/detach
you can still see the writeback thread. If you then attach the device to
the cache again, bcache will create another writeback thread; for example,
after the command below:
echo  ba0fb5cd-658a-4533-9806-6ce166d883b9 > /sys/block/sdb/bcache/attach
you will see two writeback threads.
This patch stops the writeback thread and cancels the writeback_rate_update
work when a cached device is detached from the cache.

Compared with patch v1, this v2 patch moves the code down into the register
lock for safety, in case of any future changes, as Coly and Mike suggested.

[edit by mlyle: commit log spelling/formatting]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/super.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8399fe0651f2..553e841e897d 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -906,6 +906,12 @@ static void cached_dev_detach_finish(struct work_struct *w)
 
 	mutex_lock(&bch_register_lock);
 
+	cancel_delayed_work_sync(&dc->writeback_rate_update);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread)) {
+		kthread_stop(dc->writeback_thread);
+		dc->writeback_thread = NULL;
+	}
+
 	memset(&dc->sb.set_uuid, 0, 16);
 	SET_BDEV_STATE(&dc->sb, BDEV_STATE_NONE);
 
-- 
2.14.1


* [416 PATCH 03/13] bcache: Use PTR_ERR_OR_ZERO()
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Vasyl Gomonovych

From: Vasyl Gomonovych <gomonovych@gmail.com>

Fix ptr_ret.cocci warnings:
drivers/md/bcache/btree.c:1800:1-3: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci

Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/btree.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index ebb1874218e7..9e30713dbdb8 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1804,10 +1804,7 @@ static int bch_gc_thread(void *arg)
 int bch_gc_thread_start(struct cache_set *c)
 {
 	c->gc_thread = kthread_run(bch_gc_thread, c, "bcache_gc");
-	if (IS_ERR(c->gc_thread))
-		return PTR_ERR(c->gc_thread);
-
-	return 0;
+	return PTR_ERR_OR_ZERO(c->gc_thread);
 }
 
 /* Initial partial gc */
-- 
2.14.1
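
For reference, PTR_ERR_OR_ZERO() in include/linux/err.h is essentially the
open-coded pattern being removed (annotations such as __must_check omitted
here):

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	if (IS_ERR(ptr))
		return PTR_ERR(ptr);
	return 0;
}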


* [416 PATCH 04/13] bcache: segregate flash only volume write streams
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Tang Junhui, Michael Lyle

From: Tang Junhui <tang.junhui@zte.com.cn>

In a scenario where there are some flash only volumes and some cached
devices, when many tasks issue requests to these devices in writeback
mode, the write IOs may fall into the same bucket as below:
| cached data | flash data | cached data | cached data | flash data |
then, after writeback of these cached devices, the bucket would look
like:
| free | flash data | free | free | flash data |

So there is a lot of free space in this bucket, but since data of flash
only volumes still exists, this bucket cannot be reclaimed, which wastes
bucket space.

In this patch, we segregate flash only volume write streams from
cached devices, so data from flash only volumes and cached devices
is stored in different buckets.

Compared to the v1 patch, this patch does not add an additional open
bucket list; it makes a best effort to segregate flash only volume write
streams from cached devices. Sectors of flash only volumes may still be
mixed with dirty sectors of a cached device, but the number is very small.

[mlyle: fixed commit log formatting, permissions, line endings]

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/alloc.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc6d884..6cc6c0f9c3a9 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -525,15 +525,21 @@ struct open_bucket {
 
 /*
  * We keep multiple buckets open for writes, and try to segregate different
- * write streams for better cache utilization: first we look for a bucket where
- * the last write to it was sequential with the current write, and failing that
- * we look for a bucket that was last used by the same task.
+ * write streams for better cache utilization: first we try to segregate flash
+ * only volume write streams from cached devices, secondly we look for a bucket
+ * where the last write to it was sequential with the current write, and
+ * failing that we look for a bucket that was last used by the same task.
  *
  * The ideas is if you've got multiple tasks pulling data into the cache at the
  * same time, you'll get better cache utilization if you try to segregate their
  * data and preserve locality.
  *
- * For example, say you've starting Firefox at the same time you're copying a
+ * For example, dirty sectors of a flash only volume are not reclaimable; if
+ * they are mixed with dirty sectors of a cached device, such buckets will be
+ * marked as dirty and won't be reclaimed, even though the dirty data of the
+ * cached device has been written back to the backing device.
+ *
+ * And say you're starting Firefox at the same time you're copying a
  * bunch of files. Firefox will likely end up being fairly hot and stay in the
  * cache awhile, but the data you copied might not be; if you wrote all that
  * data to the same buckets it'd get invalidated at the same time.
@@ -550,7 +556,10 @@ static struct open_bucket *pick_data_bucket(struct cache_set *c,
 	struct open_bucket *ret, *ret_task = NULL;
 
 	list_for_each_entry_reverse(ret, &c->data_buckets, list)
-		if (!bkey_cmp(&ret->key, search))
+		if (UUID_FLASH_ONLY(&c->uuids[KEY_INODE(&ret->key)]) !=
+		    UUID_FLASH_ONLY(&c->uuids[KEY_INODE(search)]))
+			continue;
+		else if (!bkey_cmp(&ret->key, search))
 			goto found;
 		else if (ret->last_write_point == write_point)
 			ret_task = ret;
-- 
2.14.1
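
The test added to pick_data_bucket() amounts to a stream-class check;
written as a hypothetical standalone helper (not part of the patch), it
would read:

/*
 * Two writes may share an open bucket only if both target flash only
 * volumes or both target cached devices, so that the data in a bucket
 * ages out on a similar schedule.
 */
static bool same_stream_class(struct cache_set *c,
			      const struct bkey *a, const struct bkey *b)
{
	return UUID_FLASH_ONLY(&c->uuids[KEY_INODE(a)]) ==
	       UUID_FLASH_ONLY(&c->uuids[KEY_INODE(b)]);
}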


* [416 PATCH 05/13] bcache: fix wrong return value in bch_debug_init()
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Tang Junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

In bch_debug_init(), ret is always 0, so the return value is useless.
Change it to return 0 on success after calling debugfs_create_dir(),
and a non-zero value otherwise.

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/bcache.h    |   6 ---
 drivers/md/bcache/debug.c     |   5 +-
 drivers/md/bcache/writeback.c | 120 +++++++++++++++++++++++++++++-------------
 drivers/md/bcache/writeback.h |   3 ++
 4 files changed, 87 insertions(+), 47 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e017e1..1784e50eb857 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -323,12 +323,6 @@ struct cached_dev {
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
-	/*
-	 * Internal to the writeback code, so read_dirty() can keep track of
-	 * where it's at.
-	 */
-	sector_t		last_read;
-
 	/* Limit number of writeback bios in flight */
 	struct semaphore	in_flight;
 	struct task_struct	*writeback_thread;
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 879ab21074c6..af89408befe8 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -251,8 +251,7 @@ void bch_debug_exit(void)
 
 int __init bch_debug_init(struct kobject *kobj)
 {
-	int ret = 0;
-
 	debug = debugfs_create_dir("bcache", NULL);
-	return ret;
+
+	return IS_ERR_OR_NULL(debug);
 }
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 1ac2af6128b1..479095987f22 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -237,7 +237,9 @@ static void read_dirty_submit(struct closure *cl)
 static void read_dirty(struct cached_dev *dc)
 {
 	unsigned delay = 0;
-	struct keybuf_key *w;
+	struct keybuf_key *next, *keys[MAX_WRITEBACKS_IN_PASS], *w;
+	size_t size;
+	int nk, i;
 	struct dirty_io *io;
 	struct closure cl;
 
@@ -248,45 +250,87 @@ static void read_dirty(struct cached_dev *dc)
 	 * mempools.
 	 */
 
-	while (!kthread_should_stop()) {
-
-		w = bch_keybuf_next(&dc->writeback_keys);
-		if (!w)
-			break;
-
-		BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
-
-		if (KEY_START(&w->key) != dc->last_read ||
-		    jiffies_to_msecs(delay) > 50)
-			while (!kthread_should_stop() && delay)
-				delay = schedule_timeout_interruptible(delay);
-
-		dc->last_read	= KEY_OFFSET(&w->key);
-
-		io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
-			     * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
-			     GFP_KERNEL);
-		if (!io)
-			goto err;
-
-		w->private	= io;
-		io->dc		= dc;
-
-		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
-		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
-		bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
-		io->bio.bi_end_io	= read_dirty_endio;
-
-		if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
-			goto err_free;
-
-		trace_bcache_writeback(&w->key);
+	next = bch_keybuf_next(&dc->writeback_keys);
+
+	while (!kthread_should_stop() && next) {
+		size = 0;
+		nk = 0;
+
+		do {
+			BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
+
+			/*
+			 * Don't combine too many operations, even if they
+			 * are all small.
+			 */
+			if (nk >= MAX_WRITEBACKS_IN_PASS)
+				break;
+
+			/*
+			 * If the current operation is very large, don't
+			 * further combine operations.
+			 */
+			if (size >= MAX_WRITESIZE_IN_PASS)
+				break;
+
+			/*
+			 * Operations are only eligible to be combined
+			 * if they are contiguous.
+			 *
+			 * TODO: add a heuristic willing to fire a
+			 * certain amount of non-contiguous IO per pass,
+			 * so that we can benefit from backing device
+			 * command queueing.
+			 */
+			if ((nk != 0) && bkey_cmp(&keys[nk-1]->key,
+						&START_KEY(&next->key)))
+				break;
+
+			size += KEY_SIZE(&next->key);
+			keys[nk++] = next;
+		} while ((next = bch_keybuf_next(&dc->writeback_keys)));
+
+		/* Now we have gathered a set of 1..5 keys to write back. */
+		for (i = 0; i < nk; i++) {
+			w = keys[i];
+
+			io = kzalloc(sizeof(struct dirty_io) +
+				     sizeof(struct bio_vec) *
+				     DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
+				     GFP_KERNEL);
+			if (!io)
+				goto err;
+
+			w->private	= io;
+			io->dc		= dc;
+
+			dirty_init(w);
+			bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
+			io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
+			bio_set_dev(&io->bio,
+				    PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
+			io->bio.bi_end_io	= read_dirty_endio;
+
+			if (bch_bio_alloc_pages(&io->bio, GFP_KERNEL))
+				goto err_free;
+
+			trace_bcache_writeback(&w->key);
+
+			down(&dc->in_flight);
+
+			/* We've acquired a semaphore for the maximum
+			 * simultaneous number of writebacks; from here
+			 * everything happens asynchronously.
+			 */
+			closure_call(&io->cl, read_dirty_submit, NULL, &cl);
+		}
 
-		down(&dc->in_flight);
-		closure_call(&io->cl, read_dirty_submit, NULL, &cl);
+		delay = writeback_delay(dc, size);
 
-		delay = writeback_delay(dc, KEY_SIZE(&w->key));
+		while (!kthread_should_stop() && delay) {
+			schedule_timeout_interruptible(delay);
+			delay = writeback_delay(dc, 0);
+		}
 	}
 
 	if (0) {
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index a9e3ffb4b03c..6d26927267f8 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -5,6 +5,9 @@
 #define CUTOFF_WRITEBACK	40
 #define CUTOFF_WRITEBACK_SYNC	70
 
+#define MAX_WRITEBACKS_IN_PASS  5
+#define MAX_WRITESIZE_IN_PASS   5000	/* *512b */
+
 static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
 {
 	uint64_t i, ret = 0;
-- 
2.14.1
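
A note on the batching rules in the read_dirty() hunk above: in bcache a
bkey's offset marks the end of its extent, and START_KEY() derives the
start (offset minus size). So if the previous key covers sectors 100..119
(KEY_OFFSET 120), a next key starting at sector 120 makes
bkey_cmp(&keys[nk-1]->key, &START_KEY(&next->key)) return 0, and the two
operations are eligible to be combined. A pass gathers at most
MAX_WRITEBACKS_IN_PASS (5) keys or MAX_WRITESIZE_IN_PASS (5000 sectors,
about 2.4 MiB), whichever limit is reached first.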


* [416 PATCH 06/13] bcache: writeback: properly order backing device IO
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Michael Lyle

Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple keys
may be read from the cache device in parallel and then written back
(especially when there is contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Tested-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/bcache.h    |  8 ++++++++
 drivers/md/bcache/writeback.c | 29 +++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 1784e50eb857..3be0fcc19b1f 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -330,6 +330,14 @@ struct cached_dev {
 
 	struct keybuf		writeback_keys;
 
+	/*
+	 * Order the write-half of writeback operations strongly in dispatch
+	 * order.  (Maintain LBA order; don't allow reads completing out of
+	 * order to re-order the writes...)
+	 */
+	struct closure_waitlist writeback_ordering_wait;
+	atomic_t		writeback_sequence_next;
+
 	/* For tracking sequential IO */
 #define RECENT_IO_BITS	7
 #define RECENT_IO	(1 << RECENT_IO_BITS)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 479095987f22..6e1d2fde43df 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -116,6 +116,7 @@ static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors)
 struct dirty_io {
 	struct closure		cl;
 	struct cached_dev	*dc;
+	uint16_t		sequence;
 	struct bio		bio;
 };
 
@@ -194,6 +195,27 @@ static void write_dirty(struct closure *cl)
 {
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 	struct keybuf_key *w = io->bio.bi_private;
+	struct cached_dev *dc = io->dc;
+
+	uint16_t next_sequence;
+
+	if (atomic_read(&dc->writeback_sequence_next) != io->sequence) {
+		/* Not our turn to write; wait for a write to complete */
+		closure_wait(&dc->writeback_ordering_wait, cl);
+
+		if (atomic_read(&dc->writeback_sequence_next) == io->sequence) {
+			/*
+			 * Edge case -- it happened in indeterminate order
+			 * relative to when we were added to the wait list.
+			 */
+			closure_wake_up(&dc->writeback_ordering_wait);
+		}
+
+		continue_at(cl, write_dirty, io->dc->writeback_write_wq);
+		return;
+	}
+
+	next_sequence = io->sequence + 1;
 
 	/*
 	 * IO errors are signalled using the dirty bit on the key.
@@ -211,6 +233,9 @@ static void write_dirty(struct closure *cl)
 		closure_bio_submit(&io->bio, cl);
 	}
 
+	atomic_set(&dc->writeback_sequence_next, next_sequence);
+	closure_wake_up(&dc->writeback_ordering_wait);
+
 	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
 }
 
@@ -242,7 +267,10 @@ static void read_dirty(struct cached_dev *dc)
 	int nk, i;
 	struct dirty_io *io;
 	struct closure cl;
+	uint16_t sequence = 0;
 
+	BUG_ON(!llist_empty(&dc->writeback_ordering_wait.list));
+	atomic_set(&dc->writeback_sequence_next, sequence);
 	closure_init_stack(&cl);
 
 	/*
@@ -303,6 +331,7 @@ static void read_dirty(struct cached_dev *dc)
 
 			w->private	= io;
 			io->dc		= dc;
+			io->sequence    = sequence++;
 
 			dirty_init(w);
 			bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
-- 
2.14.1
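
The scheme is essentially a ticket lock for completions: each dirty_io
takes a sequence number at issue time, and its write half proceeds only
when writeback_sequence_next matches. A minimal userspace analogue in C
using pthreads (illustrative only, not bcache code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  turn = PTHREAD_COND_INITIALIZER;
static unsigned next_seq;	/* plays the role of writeback_sequence_next */

/* Reads may finish out of order; writes are still dispatched in order. */
static void write_in_order(unsigned seq)
{
	pthread_mutex_lock(&lock);
	while (seq != next_seq)		/* not our turn: wait to be woken */
		pthread_cond_wait(&turn, &lock);
	printf("dispatch write %u\n", seq);
	next_seq = seq + 1;		/* hand the turn to the next I/O */
	pthread_cond_broadcast(&turn);
	pthread_mutex_unlock(&lock);
}

static void *completion(void *arg)
{
	write_in_order((unsigned)(size_t)arg);
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	/* simulate completions arriving in reverse order: 3, 2, 1, 0 */
	for (i = 3; i >= 0; i--)
		pthread_create(&t[i], NULL, completion, (void *)(size_t)i);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;	/* prints 0, 1, 2, 3 regardless of start order */
}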


* [416 PATCH 07/13] bcache: allow quick writeback when backing idle
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Michael Lyle

If the control system would wait for at least half a second, and there
have been no requests hitting the backing disk for a while: use an
alternate mode where we have at most one contiguous set of writebacks in
flight at a time (but don't otherwise delay).  If front-end IO appears,
it will still be quick, as it will only have to contend with one real
operation in flight.  But otherwise, we'll be sending data to the backing
disk as quickly as it can accept it (with one op at a time).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/bcache.h    |  7 +++++++
 drivers/md/bcache/request.c   |  1 +
 drivers/md/bcache/writeback.c | 21 +++++++++++++++++++++
 3 files changed, 29 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 3be0fcc19b1f..5f7b0b2513cc 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -320,6 +320,13 @@ struct cached_dev {
 	 */
 	atomic_t		has_dirty;
 
+	/*
+	 * Set to zero by things that touch the backing volume-- except
+	 * writeback.  Incremented by writeback.  Used to determine when to
+	 * accelerate idle writeback.
+	 */
+	atomic_t		backing_idle;
+
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 52b4ce24f9e2..ddd941056f3c 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -996,6 +996,7 @@ static blk_qc_t cached_dev_make_request(struct request_queue *q,
 	struct cached_dev *dc = container_of(d, struct cached_dev, disk);
 	int rw = bio_data_dir(bio);
 
+	atomic_set(&dc->backing_idle, 0);
 	generic_start_io_acct(q, rw, bio_sectors(bio), &d->disk->part0);
 
 	bio_set_dev(bio, dc->bdev);
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 6e1d2fde43df..f82ffb2e9b9b 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -356,6 +356,27 @@ static void read_dirty(struct cached_dev *dc)
 
 		delay = writeback_delay(dc, size);
 
+		/* If the control system would wait for at least half a
+		 * second, and there have been no requests hitting the
+		 * backing disk for a while: use an alternate mode where
+		 * we have at most one contiguous set of writebacks in
+		 * flight at a time.  If someone wants to do IO it will
+		 * be quick, as it will only have to contend with one
+		 * operation in flight, and we'll be round-tripping data
+		 * to the backing disk as quickly as it can accept it.
+		 */
+		if (delay >= HZ / 2) {
+			/* 3 means at least 1.5 seconds, up to 7.5 if we
+			 * have slowed way down.
+			 */
+			if (atomic_inc_return(&dc->backing_idle) >= 3) {
+				/* Wait for current I/Os to finish */
+				closure_sync(&cl);
+				/* And immediately launch a new set. */
+				delay = 0;
+			}
+		}
+
 		while (!kthread_should_stop() && delay) {
 			schedule_timeout_interruptible(delay);
 			delay = writeback_delay(dc, 0);
-- 
2.14.1
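
The arithmetic behind the new comment: the fast path only arms when the
computed delay is at least HZ/2 (half a second), and it fires once
backing_idle reaches 3, i.e. after three consecutive such waits. Three
waits of 0.5 s each give the 1.5 s minimum; the 7.5 s figure implies each
wait is capped at about 2.5 s once the controller has slowed way down.
Any front-end request resets backing_idle to 0 (the request.c hunk),
immediately disarming the accelerated mode.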


* [416 PATCH 08/13] bcache: Fix, improve efficiency of closure_sync()
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Kent Overstreet, Michael Lyle

From: Kent Overstreet <kent.overstreet@gmail.com>

Eliminates cases where sync can race and fail to complete / get stuck.
Removes many status flags and simplifies entering-and-exiting closure
sleeping behaviors.

[mlyle: fixed conflicts due to changed return behavior in mainline.
extended commit comment, and squashed down two commits that were mostly
contradictory to get to this state.  Changed __set_current_state to
set_current_state per Jens review comment]

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/closure.c | 46 +++++++++++++++++-----------------
 drivers/md/bcache/closure.h | 60 +++++++++++++++++----------------------------
 2 files changed, 46 insertions(+), 60 deletions(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index 1841d0359bac..ca7ace6962a4 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -18,10 +18,6 @@ static inline void closure_put_after_sub(struct closure *cl, int flags)
 	BUG_ON(flags & CLOSURE_GUARD_MASK);
 	BUG_ON(!r && (flags & ~CLOSURE_DESTRUCTOR));
 
-	/* Must deliver precisely one wakeup */
-	if (r == 1 && (flags & CLOSURE_SLEEPING))
-		wake_up_process(cl->task);
-
 	if (!r) {
 		if (cl->fn && !(flags & CLOSURE_DESTRUCTOR)) {
 			atomic_set(&cl->remaining,
@@ -100,28 +96,34 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
 }
 EXPORT_SYMBOL(closure_wait);
 
-/**
- * closure_sync - sleep until a closure has nothing left to wait on
- *
- * Sleeps until the refcount hits 1 - the thread that's running the closure owns
- * the last refcount.
- */
-void closure_sync(struct closure *cl)
+struct closure_syncer {
+	struct task_struct	*task;
+	int			done;
+};
+
+static void closure_sync_fn(struct closure *cl)
 {
-	while (1) {
-		__closure_start_sleep(cl);
-		closure_set_ret_ip(cl);
+	cl->s->done = 1;
+	wake_up_process(cl->s->task);
+}
 
-		if ((atomic_read(&cl->remaining) &
-		     CLOSURE_REMAINING_MASK) == 1)
-			break;
+void __closure_sync(struct closure *cl)
+{
+	struct closure_syncer s = { .task = current };
 
+	cl->s = &s;
+	continue_at(cl, closure_sync_fn, NULL);
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (s.done)
+			break;
 		schedule();
 	}
 
-	__closure_end_sleep(cl);
+	__set_current_state(TASK_RUNNING);
 }
-EXPORT_SYMBOL(closure_sync);
+EXPORT_SYMBOL(__closure_sync);
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
@@ -168,12 +170,10 @@ static int debug_seq_show(struct seq_file *f, void *data)
 			   cl, (void *) cl->ip, cl->fn, cl->parent,
 			   r & CLOSURE_REMAINING_MASK);
 
-		seq_printf(f, "%s%s%s%s\n",
+		seq_printf(f, "%s%s\n",
 			   test_bit(WORK_STRUCT_PENDING_BIT,
 				    work_data_bits(&cl->work)) ? "Q" : "",
-			   r & CLOSURE_RUNNING	? "R" : "",
-			   r & CLOSURE_STACK	? "S" : "",
-			   r & CLOSURE_SLEEPING	? "Sl" : "");
+			   r & CLOSURE_RUNNING	? "R" : "");
 
 		if (r & CLOSURE_WAITING)
 			seq_printf(f, " W %pF\n",
diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index ccfbea6f9f6b..392a87cf1b92 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -103,6 +103,7 @@
  */
 
 struct closure;
+struct closure_syncer;
 typedef void (closure_fn) (struct closure *);
 
 struct closure_waitlist {
@@ -115,10 +116,6 @@ enum closure_state {
 	 * the thread that owns the closure, and cleared by the thread that's
 	 * waking up the closure.
 	 *
-	 * CLOSURE_SLEEPING: Must be set before a thread uses a closure to sleep
-	 * - indicates that cl->task is valid and closure_put() may wake it up.
-	 * Only set or cleared by the thread that owns the closure.
-	 *
 	 * The rest are for debugging and don't affect behaviour:
 	 *
 	 * CLOSURE_RUNNING: Set when a closure is running (i.e. by
@@ -128,22 +125,16 @@ enum closure_state {
 	 * continue_at() and closure_return() clear it for you, if you're doing
 	 * something unusual you can use closure_set_dead() which also helps
 	 * annotate where references are being transferred.
-	 *
-	 * CLOSURE_STACK: Sanity check - remaining should never hit 0 on a
-	 * closure with this flag set
 	 */
 
-	CLOSURE_BITS_START	= (1 << 23),
-	CLOSURE_DESTRUCTOR	= (1 << 23),
-	CLOSURE_WAITING		= (1 << 25),
-	CLOSURE_SLEEPING	= (1 << 27),
-	CLOSURE_RUNNING		= (1 << 29),
-	CLOSURE_STACK		= (1 << 31),
+	CLOSURE_BITS_START	= (1U << 27),
+	CLOSURE_DESTRUCTOR	= (1U << 27),
+	CLOSURE_WAITING		= (1U << 29),
+	CLOSURE_RUNNING		= (1U << 31),
 };
 
 #define CLOSURE_GUARD_MASK					\
-	((CLOSURE_DESTRUCTOR|CLOSURE_WAITING|CLOSURE_SLEEPING|	\
-	  CLOSURE_RUNNING|CLOSURE_STACK) << 1)
+	((CLOSURE_DESTRUCTOR|CLOSURE_WAITING|CLOSURE_RUNNING) << 1)
 
 #define CLOSURE_REMAINING_MASK		(CLOSURE_BITS_START - 1)
 #define CLOSURE_REMAINING_INITIALIZER	(1|CLOSURE_RUNNING)
@@ -152,7 +143,7 @@ struct closure {
 	union {
 		struct {
 			struct workqueue_struct *wq;
-			struct task_struct	*task;
+			struct closure_syncer	*s;
 			struct llist_node	list;
 			closure_fn		*fn;
 		};
@@ -178,7 +169,19 @@ void closure_sub(struct closure *cl, int v);
 void closure_put(struct closure *cl);
 void __closure_wake_up(struct closure_waitlist *list);
 bool closure_wait(struct closure_waitlist *list, struct closure *cl);
-void closure_sync(struct closure *cl);
+void __closure_sync(struct closure *cl);
+
+/**
+ * closure_sync - sleep until a closure has nothing left to wait on
+ *
+ * Sleeps until the refcount hits 1 - the thread that's running the closure owns
+ * the last refcount.
+ */
+static inline void closure_sync(struct closure *cl)
+{
+	if ((atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK) != 1)
+		__closure_sync(cl);
+}
 
 #ifdef CONFIG_BCACHE_CLOSURES_DEBUG
 
@@ -215,24 +218,6 @@ static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 #endif
 }
 
-static inline void __closure_end_sleep(struct closure *cl)
-{
-	__set_current_state(TASK_RUNNING);
-
-	if (atomic_read(&cl->remaining) & CLOSURE_SLEEPING)
-		atomic_sub(CLOSURE_SLEEPING, &cl->remaining);
-}
-
-static inline void __closure_start_sleep(struct closure *cl)
-{
-	closure_set_ip(cl);
-	cl->task = current;
-	set_current_state(TASK_UNINTERRUPTIBLE);
-
-	if (!(atomic_read(&cl->remaining) & CLOSURE_SLEEPING))
-		atomic_add(CLOSURE_SLEEPING, &cl->remaining);
-}
-
 static inline void closure_set_stopped(struct closure *cl)
 {
 	atomic_sub(CLOSURE_RUNNING, &cl->remaining);
@@ -241,7 +226,6 @@ static inline void closure_set_stopped(struct closure *cl)
 static inline void set_closure_fn(struct closure *cl, closure_fn *fn,
 				  struct workqueue_struct *wq)
 {
-	BUG_ON(object_is_on_stack(cl));
 	closure_set_ip(cl);
 	cl->fn = fn;
 	cl->wq = wq;
@@ -300,7 +284,7 @@ static inline void closure_init(struct closure *cl, struct closure *parent)
 static inline void closure_init_stack(struct closure *cl)
 {
 	memset(cl, 0, sizeof(struct closure));
-	atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER|CLOSURE_STACK);
+	atomic_set(&cl->remaining, CLOSURE_REMAINING_INITIALIZER);
 }
 
 /**
@@ -322,6 +306,8 @@ static inline void closure_wake_up(struct closure_waitlist *list)
  * This is because after calling continue_at() you no longer have a ref on @cl,
  * and whatever @cl owns may be freed out from under you - a running closure fn
  * has a ref on its own closure which continue_at() drops.
+ *
+ * Note you are expected to immediately return after using this macro.
  */
 #define continue_at(_cl, _fn, _wq)					\
 do {									\
-- 
2.14.1
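
One detail worth spelling out in __closure_sync() above: the loop issues
set_current_state(TASK_UNINTERRUPTIBLE) before it tests s.done, which is
the standard kernel sleep idiom for avoiding a lost wakeup. If
closure_sync_fn() sets done and calls wake_up_process() between the test
and schedule(), the wakeup simply makes the task runnable again and
schedule() returns promptly. The barrier implied by set_current_state()
(unlike __set_current_state()) keeps the state store ordered against the
read of s.done; that is what the "__set_current_state to
set_current_state" note from Jens's review refers to.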


* [416 PATCH 09/13] bcache: mark closure_sync() __sched
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Kent Overstreet, Michael Lyle

From: Kent Overstreet <kent.overstreet@gmail.com>

[edit by mlyle: include sched/debug.h to get __sched]

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/closure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/closure.c b/drivers/md/bcache/closure.c
index ca7ace6962a4..7f12920c14f7 100644
--- a/drivers/md/bcache/closure.c
+++ b/drivers/md/bcache/closure.c
@@ -8,6 +8,7 @@
 #include <linux/debugfs.h>
 #include <linux/module.h>
 #include <linux/seq_file.h>
+#include <linux/sched/debug.h>
 
 #include "closure.h"
 
@@ -107,7 +108,7 @@ static void closure_sync_fn(struct closure *cl)
 	wake_up_process(cl->s->task);
 }
 
-void __closure_sync(struct closure *cl)
+void __sched __closure_sync(struct closure *cl)
 {
 	struct closure_syncer s = { .task = current };
 
-- 
2.14.1


* [416 PATCH 10/13] bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Zhai Zhaoxuan

From: Zhai Zhaoxuan <kxuanobj@gmail.com>

The functions cached_dev_make_request() and flash_dev_make_request() call
generic_start_io_acct() with (struct bcache_device)->disk when they start
a closure. Then the function bio_complete() calls generic_end_io_acct()
with (struct search)->orig_bio->bi_disk when the closure is done.
Since `bi_disk` is not the bcache device, generic_end_io_acct() is called
with the wrong device queue.

This causes the "inflight" counter (in struct hd_struct) to keep
increasing without ever decreasing.

This patch fixes the problem by calling generic_end_io_acct() with
(struct bcache_device)->disk.

Signed-off-by: Zhai Zhaoxuan <kxuanobj@gmail.com>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/request.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index ddd941056f3c..1a46b41dac70 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -633,8 +633,8 @@ static void request_endio(struct bio *bio)
 static void bio_complete(struct search *s)
 {
 	if (s->orig_bio) {
-		struct request_queue *q = s->orig_bio->bi_disk->queue;
-		generic_end_io_acct(q, bio_data_dir(s->orig_bio),
+		generic_end_io_acct(s->d->disk->queue,
+				    bio_data_dir(s->orig_bio),
 				    &s->d->disk->part0, s->start_time);
 
 		trace_bcache_request_end(s->d, s->orig_bio);
-- 
2.14.1
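
Put differently, generic_start_io_acct() and generic_end_io_acct()
increment and decrement the in-flight count of the hd_struct they are
handed, so the two calls must reference the same partition and queue.
The start is accounted against the bcache device's own part0 in
cached_dev_make_request() and flash_dev_make_request(); before this fix
the end was taken from s->orig_bio->bi_disk, which is not the bcache
device, so part0's in-flight count was never decremented. Pointing both
calls at s->d->disk restores the balance.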


* [416 PATCH 11/13] bcache: reduce cache_set devices iteration by devices_max_used
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Coly Li

From: Coly Li <colyli@suse.de>

The member 'devices' of struct cache_set is used to reference all bcache
devices attached to this cache set. It is an array of pointers whose size
is indicated by the member nr_uuids of struct cache_set.

nr_uuids is calculated in drivers/md/bcache/super.c:bch_cache_set_alloc(),
	bucket_bytes(c) / sizeof(struct uuid_entry)
Bucket size is determined by the user space tool "make-bcache"; by default
it is 1024 sectors (defined in bcache-tools/make-bcache.c:main()). So the
default nr_uuids value is 4096 from the above calculation.

Every time the bcache code iterates the bcache devices of a cache set, all
4096 pointers are checked even if only 1 bcache device is attached to the
cache set; that is a waste of time and unnecessary.

This patch adds a member devices_max_used to struct cache_set. Its value
is 1 + the maximum used index of devices[] in a cache set. When iterating
all valid bcache devices of a cache set, using c->devices_max_used in the
for-loop avoids a lot of useless checking.

Personally, my motivation for this patch is not performance; I use it in
bcache debugging, where it helps me to narrow down the scope when checking
the valid bcache devices of a cache set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/bcache.h    | 1 +
 drivers/md/bcache/btree.c     | 2 +-
 drivers/md/bcache/super.c     | 9 ++++++---
 drivers/md/bcache/writeback.h | 2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5f7b0b2513cc..9117da5f494b 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -497,6 +497,7 @@ struct cache_set {
 	int			caches_loaded;
 
 	struct bcache_device	**devices;
+	unsigned		devices_max_used;
 	struct list_head	cached_devs;
 	uint64_t		cached_dev_sectors;
 	struct closure		caching;
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 9e30713dbdb8..bf3a48aa9a9a 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1679,7 +1679,7 @@ static void bch_btree_gc_finish(struct cache_set *c)
 
 	/* don't reclaim buckets to which writeback keys point */
 	rcu_read_lock();
-	for (i = 0; i < c->nr_uuids; i++) {
+	for (i = 0; i < c->devices_max_used; i++) {
 		struct bcache_device *d = c->devices[i];
 		struct cached_dev *dc;
 		struct keybuf_key *w, *n;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 553e841e897d..d13e4ccb30a0 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -721,6 +721,9 @@ static void bcache_device_attach(struct bcache_device *d, struct cache_set *c,
 	d->c = c;
 	c->devices[id] = d;
 
+	if (id >= c->devices_max_used)
+		c->devices_max_used = id + 1;
+
 	closure_get(&c->caching);
 }
 
@@ -1267,7 +1270,7 @@ static int flash_devs_run(struct cache_set *c)
 	struct uuid_entry *u;
 
 	for (u = c->uuids;
-	     u < c->uuids + c->nr_uuids && !ret;
+	     u < c->uuids + c->devices_max_used && !ret;
 	     u++)
 		if (UUID_FLASH_ONLY(u))
 			ret = flash_dev_run(c, u);
@@ -1433,7 +1436,7 @@ static void __cache_set_unregister(struct closure *cl)
 
 	mutex_lock(&bch_register_lock);
 
-	for (i = 0; i < c->nr_uuids; i++)
+	for (i = 0; i < c->devices_max_used; i++)
 		if (c->devices[i]) {
 			if (!UUID_FLASH_ONLY(&c->uuids[i]) &&
 			    test_bit(CACHE_SET_UNREGISTERING, &c->flags)) {
@@ -1496,7 +1499,7 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 	c->bucket_bits		= ilog2(sb->bucket_size);
 	c->block_bits		= ilog2(sb->block_size);
 	c->nr_uuids		= bucket_bytes(c) / sizeof(struct uuid_entry);
-
+	c->devices_max_used	= 0;
 	c->btree_pages		= bucket_pages(c);
 	if (c->btree_pages > BTREE_MAX_PAGES)
 		c->btree_pages = max_t(int, c->btree_pages / 4,
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 6d26927267f8..f102b1f9bc51 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -24,7 +24,7 @@ static inline uint64_t  bcache_flash_devs_sectors_dirty(struct cache_set *c)
 
 	mutex_lock(&bch_register_lock);
 
-	for (i = 0; i < c->nr_uuids; i++) {
+	for (i = 0; i < c->devices_max_used; i++) {
 		struct bcache_device *d = c->devices[i];
 
 		if (!d || !UUID_FLASH_ONLY(&c->uuids[i]))
-- 
2.14.1
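
The 4096 figure checks out: a default bucket is 1024 sectors x 512 bytes
= 524288 bytes, and with sizeof(struct uuid_entry) = 128 bytes (the
structure is padded to 128) the quoted calculation gives 524288 / 128 =
4096. So without this patch a cache set with a single attached device
still walks 4096 pointers on every iteration; with it, the loop stops at
devices_max_used.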


* [416 PATCH 12/13] bcache: fix misleading error message in bch_count_io_errors()
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Coly Li

From: Coly Li <colyli@suse.de>

Bcache only does recoverable I/O for read operations, by calling
cached_dev_read_error(). For write operations there is no I/O recovery
for failed requests.

But in bch_count_io_errors(), no matter whether the I/O is a read or a
write, before the error counter reaches the io error limit, pr_err()
always prints "IO error on %s, recovering". For write requests this
information is misleading, because there is no I/O recovery at all.

This patch adds a parameter 'is_read' to bch_count_io_errors(), and only
prints "recovering" via pr_err() when the bio direction is READ.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/bcache.h    |  2 +-
 drivers/md/bcache/io.c        | 13 +++++++++----
 drivers/md/bcache/super.c     |  4 +++-
 drivers/md/bcache/writeback.c |  4 +++-
 4 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 9117da5f494b..5e2d4e80198e 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -862,7 +862,7 @@ static inline void wake_up_allocators(struct cache_set *c)
 
 /* Forward declarations */
 
-void bch_count_io_errors(struct cache *, blk_status_t, const char *);
+void bch_count_io_errors(struct cache *, blk_status_t, int, const char *);
 void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
 			      blk_status_t, const char *);
 void bch_bbio_endio(struct cache_set *, struct bio *, blk_status_t,
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index fac97ec2d0e2..a783c5a41ff1 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -51,7 +51,10 @@ void bch_submit_bbio(struct bio *bio, struct cache_set *c,
 
 /* IO errors */
 
-void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
+void bch_count_io_errors(struct cache *ca,
+			 blk_status_t error,
+			 int is_read,
+			 const char *m)
 {
 	/*
 	 * The halflife of an error is:
@@ -94,8 +97,9 @@ void bch_count_io_errors(struct cache *ca, blk_status_t error, const char *m)
 		errors >>= IO_ERROR_SHIFT;
 
 		if (errors < ca->set->error_limit)
-			pr_err("%s: IO error on %s, recovering",
-			       bdevname(ca->bdev, buf), m);
+			pr_err("%s: IO error on %s%s",
+			       bdevname(ca->bdev, buf), m,
+			       is_read ? ", recovering." : ".");
 		else
 			bch_cache_set_error(ca->set,
 					    "%s: too many IO errors %s",
@@ -108,6 +112,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 {
 	struct bbio *b = container_of(bio, struct bbio, bio);
 	struct cache *ca = PTR_CACHE(c, &b->key, 0);
+	int is_read = (bio_data_dir(bio) == READ ? 1 : 0);
 
 	unsigned threshold = op_is_write(bio_op(bio))
 		? c->congested_write_threshold_us
@@ -129,7 +134,7 @@ void bch_bbio_count_io_errors(struct cache_set *c, struct bio *bio,
 			atomic_inc(&c->congested);
 	}
 
-	bch_count_io_errors(ca, error, m);
+	bch_count_io_errors(ca, error, is_read, m);
 }
 
 void bch_bbio_endio(struct cache_set *c, struct bio *bio,
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index d13e4ccb30a0..133b81225ea9 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -274,7 +274,9 @@ static void write_super_endio(struct bio *bio)
 {
 	struct cache *ca = bio->bi_private;
 
-	bch_count_io_errors(ca, bio->bi_status, "writing superblock");
+	/* is_read = 0 */
+	bch_count_io_errors(ca, bio->bi_status, 0,
+			    "writing superblock");
 	closure_put(&ca->set->sb_write);
 }
 
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index f82ffb2e9b9b..31b0a292a619 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -244,8 +244,10 @@ static void read_dirty_endio(struct bio *bio)
 	struct keybuf_key *w = bio->bi_private;
 	struct dirty_io *io = w->private;
 
+	/* is_read = 1 */
 	bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
-			    bio->bi_status, "reading dirty data from cache");
+			    bio->bi_status, 1,
+			    "reading dirty data from cache");
 
 	dirty_endio(bio);
 }
-- 
2.14.1


* [416 PATCH 13/13] bcache: fix writeback target calc on large devices
From: Michael Lyle @ 2018-01-08 20:21 UTC (permalink / raw)
  To: linux-bcache, linux-block; +Cc: axboe, Michael Lyle

Bcache needs to scale the dirty data in the cache over the multiple
backing disks in order to calculate writeback rates for each.
The previous code did this by multiplying the target number of dirty
sectors by the backing device size, and expected it to fit into a
uint64_t; this blows up on relatively large backing devices.

The new approach figures out the bdev's share in 16384ths of the overall
cached data.  This is chosen to cope well when bdevs drastically vary in
size and to ensure that bcache can cross the petabyte boundary for each
backing device.

This has been improved based on Tang Junhui's feedback to ensure that
every device gets a share of dirty data, no matter how small it is
compared to the total backing pool.

The existing mechanism is very limited; this is purely a bug fix to
remove limits on volume size.  However, changes are still needed to make
this "fair" over many volumes where some are idle.

Reported-by: Jack Douglas <jack@douglastechnology.co.uk>
Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reviewed-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/writeback.c | 31 +++++++++++++++++++++++++++----
 drivers/md/bcache/writeback.h |  7 +++++++
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 31b0a292a619..51306a19ab03 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -18,17 +18,39 @@
 #include <trace/events/bcache.h>
 
 /* Rate limiting */
-
-static void __update_writeback_rate(struct cached_dev *dc)
+static uint64_t __calc_target_rate(struct cached_dev *dc)
 {
 	struct cache_set *c = dc->disk.c;
+
+	/*
+	 * This is the size of the cache, minus the amount used for
+	 * flash-only devices
+	 */
 	uint64_t cache_sectors = c->nbuckets * c->sb.bucket_size -
 				bcache_flash_devs_sectors_dirty(c);
+
+	/*
+	 * Unfortunately there is no control of global dirty data.  If the
+	 * user states that they want 10% dirty data in the cache, and has,
+	 * e.g., 5 backing volumes of equal size, we try and ensure each
+	 * backing volume uses about 2% of the cache for dirty data.
+	 */
+	uint32_t bdev_share =
+		div64_u64(bdev_sectors(dc->bdev) << WRITEBACK_SHARE_SHIFT,
+				c->cached_dev_sectors);
+
 	uint64_t cache_dirty_target =
 		div_u64(cache_sectors * dc->writeback_percent, 100);
-	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
-				   c->cached_dev_sectors);
 
+	/* Ensure each backing dev gets at least one dirty share */
+	if (bdev_share < 1)
+		bdev_share = 1;
+
+	return (cache_dirty_target * bdev_share) >> WRITEBACK_SHARE_SHIFT;
+}
+
+static void __update_writeback_rate(struct cached_dev *dc)
+{
 	/*
 	 * PI controller:
 	 * Figures out the amount that should be written per second.
@@ -49,6 +71,7 @@ static void __update_writeback_rate(struct cached_dev *dc)
 	 * This acts as a slow, long-term average that is not subject to
 	 * variations in usage like the p term.
 	 */
+	int64_t target = __calc_target_rate(dc);
 	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
 	int64_t error = dirty - target;
 	int64_t proportional_scaled =
diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index f102b1f9bc51..66f1c527fa24 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -8,6 +8,13 @@
 #define MAX_WRITEBACKS_IN_PASS  5
 #define MAX_WRITESIZE_IN_PASS   5000	/* *512b */
 
+/*
+ * 14 (16384ths) is chosen here as something that each backing device
+ * should be a reasonable fraction of the share, and not to blow up
+ * until individual backing devices are a petabyte.
+ */
+#define WRITEBACK_SHARE_SHIFT   14
+
 static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
 {
 	uint64_t i, ret = 0;
-- 
2.14.1
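
Rough numbers behind the fix: for a 1 PiB backing device, bdev_sectors()
is 2^41, so the old cache_dirty_target * bdev_sectors(dc->bdev) product
overflows a uint64_t once the dirty target exceeds 2^23 sectors, i.e.
only 4 GiB of target dirty data. In the new scheme bdev_share is at most
2^14 (a device can be at most the whole pool), so the product
cache_dirty_target * bdev_share keeps roughly 50 bits of headroom, and
the final >> WRITEBACK_SHARE_SHIFT restores the scale. The clamp to 1
guarantees that a device smaller than 1/16384th of the pool still gets a
non-zero target.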


* Re: [416 PATCH 00/13] Bcache changes for 4.16
From: Jens Axboe @ 2018-01-08 20:42 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache, linux-block

On 1/8/18 1:21 PM, Michael Lyle wrote:
> Jens,
> 
> Please pick up the following reviewed changes for 4.16.  There are some
> small cleanliness changes, a few minor bug fixes (some in preparation
> for larger work), and ongoing work on writeback performance:
> 
>  11 files changed, 303 insertions(+), 133 deletions(-)
> 
> Coly Li (2):
>       bcache: reduce cache_set devices iteration by devices_max_used
>       bcache: fix misleading error message in bch_count_io_errors()
> 
> Kent Overstreet (2):
>       bcache: Fix, improve efficiency of closure_sync()
>       bcache: mark closure_sync() __sched
> 
> Michael Lyle (3):
>       bcache: writeback: properly order backing device IO
>       bcache: allow quick writeback when backing idle
>       bcache: fix writeback target calc on large devices
> 
> Rui Hua (1):
>       bcache: ret IOERR when read meets metadata error
> 
> Tang Junhui (3):
>       bcache: stop writeback thread after detaching
>       bcache: segregate flash only volume write streams
>       bcache: fix wrong return value in bch_debug_init()
> 
> Vasyl Gomonovych (1):
>       bcache: Use PTR_ERR_OR_ZERO()
> 
> Zhai Zhaoxuan (1):
>       bcache: fix unmatched generic_end_io_acct() & generic_start_io_acct()

Looks good, thanks Mike. Applied for 4.16.

-- 
Jens Axboe
