* [PATCH 0/4] bcache: incremental GC and dirty data init
@ 2018-04-12  6:38 tang.junhui
  2018-04-12  6:38 ` [PATCH 1/4] bcache: finish incremental GC tang.junhui
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: tang.junhui @ 2018-04-12  6:38 UTC (permalink / raw)
  To: kent.overstreet, colyli, mlyle; +Cc: linux-bcache, linux-block, tang.junhui

Hi maintainers and folks,

Some patches in this set have been sent before but are not merged yet, and
I have added two new patches to solve issues I found while testing. Since
they are interdependent, I have collected them into one patch set and am
resending them together.

[PATCH 1/4] bcache: finish incremental GC
[PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
			the total of btree nodes
[PATCH 3/4] bcache: notify allocator to prepare for GC
[PATCH 4/4] bcache: fix I/O significant decline while backend devices registering

These patches help prevent I/O fluctuations, or even drops to zero, during GC
or while cached devices are registering. I have tested them several times; I
hope somebody could review these patches, and any comments are welcome.


* [PATCH 1/4] bcache: finish incremental GC
  2018-04-12  6:38 [PATCH 0/4] bcache: incremental GC and dirty data init tang.junhui
@ 2018-04-12  6:38 ` tang.junhui
  2018-04-12 14:08   ` Coly Li
  2018-04-12  6:38 ` [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes tang.junhui
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: tang.junhui @ 2018-04-12  6:38 UTC (permalink / raw)
  To: kent.overstreet, colyli, mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code never makes use of it.
When GC runs, front-side I/O is blocked until GC finishes, which can take a
long time when there are a lot of btree nodes.

This patch implements incremental GC. The main idea is that, when front-side
I/Os are in flight, after GC has processed some nodes (100), we stop GC,
release the lock on the btree node, and let the front-side I/Os proceed for
a while (100 ms) before going back to GC again.

With this patch, I/Os are no longer blocked for the whole duration of GC,
and the obvious drops of I/O to zero no longer occur.
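
Condensed, the two pieces added by this patch interact roughly as below (a
sketch following the hunks in the diff, with surrounding code elided, not
the exact kernel functions):

	/* btree_gc_recurse(): bail out periodically when requests are in flight */
	if (atomic_read(&b->c->io_inflight) &&
	    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
		gc->nodes_pre = gc->nodes;	/* remember where we stopped */
		ret = -EAGAIN;			/* unwind and drop btree locks */
		break;
	}

	/* bch_btree_gc(): -EAGAIN now means "back off and retry", not failure */
	do {
		/* ... walk the btree from the root, setting ret ... */
		if (ret == -EAGAIN)
			schedule_timeout_interruptible(
				msecs_to_jiffies(GC_SLEEP_TIME));	/* 100 ms */
		else if (ret)
			pr_warn("gc failed!");
	} while (ret);

The io_inflight counter is incremented in search_alloc() and decremented in
search_free(), so the GC loop can see whether any front-side request is
currently queued.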

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/bcache.h  |  5 +++++
 drivers/md/bcache/btree.c   | 15 ++++++++++++++-
 drivers/md/bcache/request.c |  3 +++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 843877e..ab4c9ca 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -445,6 +445,7 @@ struct cache {
 
 struct gc_stat {
 	size_t			nodes;
+	size_t			nodes_pre;
 	size_t			key_bytes;
 
 	size_t			nkeys;
@@ -568,6 +569,10 @@ struct cache_set {
 	 */
 	atomic_t		rescale;
 	/*
+	 * used for GC, to identify whether any front-side I/O is in flight
+	 */
+	atomic_t		io_inflight;
+	/*
 	 * When we invalidate buckets, we use both the priority and the amount
 	 * of good data to determine which buckets to reuse first - to weight
 	 * those together consistently we keep track of the smallest nonzero
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 81e8dc3..b36d276 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,6 +90,9 @@
 
 #define MAX_NEED_GC		64
 #define MAX_SAVE_PRIO		72
+#define MIN_GC_NODES		100
+#define GC_SLEEP_TIME		100
+
 
 #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
 
@@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
 		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
 		r->b = NULL;
 
+		if (atomic_read(&b->c->io_inflight) &&
+		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+			gc->nodes_pre =  gc->nodes;
+			ret = -EAGAIN;
+			break;
+		}
+
 		if (need_resched()) {
 			ret = -EAGAIN;
 			break;
@@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
 		closure_sync(&writes);
 		cond_resched();
 
-		if (ret && ret != -EAGAIN)
+		if (ret == -EAGAIN)
+			schedule_timeout_interruptible(msecs_to_jiffies
+						       (GC_SLEEP_TIME));
+		else if (ret)
 			pr_warn("gc failed!");
 	} while (ret);
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 643c3021..26750a9 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
 static void search_free(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
+
 	bio_complete(s);
+	atomic_dec(&s->d->c->io_inflight);
 
 	if (s->iop.bio)
 		bio_put(s->iop.bio);
@@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
 
 	closure_init(&s->cl, NULL);
 	do_bio_hook(s, bio);
+	atomic_inc(&d->c->io_inflight);
 
 	s->orig_bio		= bio;
 	s->cache_miss		= NULL;
-- 
1.8.3.1


* [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes
  2018-04-12  6:38 [PATCH 0/4] bcache: incremental GC and dirty data init tang.junhui
  2018-04-12  6:38 ` [PATCH 1/4] bcache: finish incremental GC tang.junhui
@ 2018-04-12  6:38 ` tang.junhui
  2018-04-12 14:21   ` Coly Li
  2018-04-12  6:38 ` [PATCH 3/4] bcache: notify allocator to prepare for GC tang.junhui
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: tang.junhui @ 2018-04-12  6:38 UTC (permalink / raw)
  To: kent.overstreet, colyli, mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

This patch is based on "[PATCH] bcache: finish incremental GC".

Since incremental GC stops for 100 ms whenever front-side I/O arrives, if GC
only processes a constant number (100) of nodes each time, GC would last a
long time when there are many btree nodes, and the front-side I/Os would run
out of buckets (since no new bucket can be allocated during GC) and be
blocked again.

So GC should not process a constant number of nodes, but a number that
varies with the total number of btree nodes. In this patch, GC is divided
into a constant number (100) of passes, so when there are many btree nodes,
GC processes more nodes each time; otherwise it processes fewer nodes each
time (but no fewer than MIN_GC_NODES).
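
To make the scaling concrete, the per-pass budget computed by
btree_gc_min_nodes() in the diff below works out as follows (the numbers in
the comment are illustrative only):

	min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;	/* total / 100 */
	if (min_nodes < MIN_GC_NODES)
		min_nodes = MIN_GC_NODES;		/* floor of 100 */

	/* e.g. 50000 btree nodes -> 500 nodes per pass (~100 passes);       */
	/*      5000 btree nodes  -> 50 < 100, so the 100-node floor applies */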

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/btree.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index b36d276..10f967e 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,10 +90,10 @@
 
 #define MAX_NEED_GC		64
 #define MAX_SAVE_PRIO		72
+#define MAX_GC_TIMES		100
 #define MIN_GC_NODES		100
 #define GC_SLEEP_TIME		100
 
-
 #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
 
 #define PTR_HASH(c, k)							\
@@ -1519,6 +1519,31 @@ static unsigned btree_gc_count_keys(struct btree *b)
 	return ret;
 }
 
+static size_t btree_gc_min_nodes(struct cache_set *c)
+{
+	size_t			min_nodes;
+
+	/*
+	 * Since incremental GC would stop 100ms when front
+	 * side I/O comes, so when there are many btree nodes,
+	 * if GC only processes constant (100) nodes each time,
+	 * GC would last a long time, and the front side I/Os
+	 * would run out of the buckets (since no new bucket
+	 * can be allocated during GC), and be blocked again.
+	 * So GC should not process constant nodes, but varied
+	 * nodes according to the number of btree nodes, which
+	 * realized by dividing GC into constant(100) times,
+	 * so when there are many btree nodes, GC can process
+	 * more nodes each time, otherwise, GC will process less
+	 * nodes each time (but no less than MIN_GC_NODES)
+	 */
+	min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
+	if (min_nodes < MIN_GC_NODES)
+		min_nodes = MIN_GC_NODES;
+
+	return min_nodes;
+}
+
 static int btree_gc_recurse(struct btree *b, struct btree_op *op,
 			    struct closure *writes, struct gc_stat *gc)
 {
@@ -1585,7 +1610,7 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
 		r->b = NULL;
 
 		if (atomic_read(&b->c->io_inflight) &&
-		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+		    gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
 			gc->nodes_pre =  gc->nodes;
 			ret = -EAGAIN;
 			break;
@@ -1841,8 +1866,14 @@ static int bch_btree_check_recurse(struct btree *b, struct btree_op *op)
 		do {
 			k = bch_btree_iter_next_filter(&iter, &b->keys,
 						       bch_ptr_bad);
-			if (k)
+			if (k) {
 				btree_node_prefetch(b, k);
+				/*
+				 * initialize c->gc_stats.nodes
+				 * for incremental GC
+				 */
+				b->c->gc_stats.nodes++;
+			}
 
 			if (p)
 				ret = btree(check_recurse, p, b, op);
-- 
1.8.3.1


* [PATCH 3/4] bcache: notify allocator to prepare for GC
  2018-04-12  6:38 [PATCH 0/4] bcache: incremental GC and dirty data init tang.junhui
  2018-04-12  6:38 ` [PATCH 1/4] bcache: finish incremental GC tang.junhui
  2018-04-12  6:38 ` [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes tang.junhui
@ 2018-04-12  6:38 ` tang.junhui
  2018-04-12 14:46   ` Coly Li
  2018-04-12  6:38 ` [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering tang.junhui
  2018-04-12  6:45 ` [PATCH 0/4] bcache: incremental GC and dirty data init Coly Li
  4 siblings, 1 reply; 10+ messages in thread
From: tang.junhui @ 2018-04-12  6:38 UTC (permalink / raw)
  To: kent.overstreet, colyli, mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

Since no new bucket can be allocated during GC, front-side I/Os could run
out of all the buckets. So notify the allocator to pack the free_inc queue
full of buckets before GC starts; then we have enough buckets for front-side
I/Os during the GC period.
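
The handshake this patch introduces between gc_should_run() and the
allocator thread is roughly the following (condensed from the diff below;
the cache_gc_*()/set_cache_gc_*() calls stand for the small for_each_cache()
helpers added by the patch, and the sketch is illustrative rather than the
exact code):

	/* GC thread, gc_should_run(), under bucket_lock */
	if (atomic_read(&c->sectors_to_gc) < 0) {
		if (cache_gc_prepare_none(c)) {		/* first time round */
			set_cache_gc_preparing(c);	/* ask the allocator */
			return false;			/* do not start GC yet */
		} else if (cache_gc_prepare_ok(c)) {	/* allocator is done */
			set_cache_gc_prepare_none(c);
			return true;			/* start GC now */
		}
		return false;				/* still preparing */
	}

	/* allocator thread: while GC_PREPARING it stops draining free_inc,
	 * so invalidate_buckets() can pack it full, then it hands over: */
	if (ca->prepare_gc == GC_PREPARING) {
		ca->prepare_gc = GC_PREPARED;
		wake_up_gc(ca->set);
	}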

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/alloc.c  | 11 +++++++-
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/btree.c  | 69 ++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/bcache/btree.h  |  4 +++
 4 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a0cc1bc..85020cc 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -323,7 +323,8 @@ static int bch_allocator_thread(void *arg)
 		 * possibly issue discards to them, then we add the bucket to
 		 * the free list:
 		 */
-		while (!fifo_empty(&ca->free_inc)) {
+		while (!fifo_empty(&ca->free_inc) &&
+		       ca->prepare_gc != GC_PREPARING) {
 			long bucket;
 
 			fifo_pop(&ca->free_inc, bucket);
@@ -353,6 +354,14 @@ static int bch_allocator_thread(void *arg)
 		invalidate_buckets(ca);
 
 		/*
+		 * Let GC continue
+		 */
+		if (ca->prepare_gc == GC_PREPARING) {
+			ca->prepare_gc = GC_PREPARED;
+			wake_up_gc(ca->set);
+		}
+
+		/*
 		 * Now, we write their new gens to disk so we can start writing
 		 * new stuff to them:
 		 */
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index ab4c9ca..61a6ff2 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -428,6 +428,8 @@ struct cache {
 	 * cpu
 	 */
 	unsigned		invalidate_needs_gc;
+	/* used to notify allocator to prepare for GC */
+	unsigned int		prepare_gc;
 
 	bool			discard; /* Get rid of? */
 
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 10f967e..aba0700 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1805,6 +1805,46 @@ static void bch_btree_gc(struct cache_set *c)
 	bch_moving_gc(c);
 }
 
+static bool cache_gc_prepare_none(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int i;
+
+	for_each_cache(ca, c, i)
+		if (ca->prepare_gc != GC_PREPARE_NONE)
+			return false;
+	return true;
+}
+
+static bool cache_gc_prepare_ok(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int i;
+
+	for_each_cache(ca, c, i)
+		if (ca->prepare_gc != GC_PREPARED)
+			return false;
+	return true;
+}
+
+static void set_cache_gc_prepare_none(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int i;
+
+	for_each_cache(ca, c, i)
+		ca->prepare_gc = GC_PREPARE_NONE;
+}
+
+static void set_cache_gc_preparing(struct cache_set *c)
+{
+	struct cache *ca;
+	unsigned int i;
+
+	for_each_cache(ca, c, i)
+		ca->prepare_gc = GC_PREPARING;
+}
+
 static bool gc_should_run(struct cache_set *c)
 {
 	struct cache *ca;
@@ -1814,8 +1854,33 @@ static bool gc_should_run(struct cache_set *c)
 		if (ca->invalidate_needs_gc)
 			return true;
 
-	if (atomic_read(&c->sectors_to_gc) < 0)
-		return true;
+	if (atomic_read(&c->sectors_to_gc) < 0) {
+		mutex_lock(&c->bucket_lock);
+		if (cache_gc_prepare_none(c)) {
+			/*
+			 * notify allocator thread to prepare for GC
+			 */
+			set_cache_gc_preparing(c);
+			mutex_unlock(&c->bucket_lock);
+			pr_info("gc preparing");
+			return false;
+		} else if (cache_gc_prepare_ok(c)) {
+			/*
+			 * alloc thread finished preparing,
+			 * and continue to GC
+			 */
+			set_cache_gc_prepare_none(c);
+			mutex_unlock(&c->bucket_lock);
+			pr_info("gc prepared ok");
+			return true;
+		} else {
+			/*
+			 * waiting for the allocator to finish preparing
+			 */
+			mutex_unlock(&c->bucket_lock);
+			return false;
+		}
+	}
 
 	return false;
 }
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
index d211e2c..e60bd7a 100644
--- a/drivers/md/bcache/btree.h
+++ b/drivers/md/bcache/btree.h
@@ -102,6 +102,10 @@
 #include "bset.h"
 #include "debug.h"
 
+#define GC_PREPARE_NONE		0
+#define GC_PREPARING		1
+#define GC_PREPARED		2
+
 struct btree_write {
 	atomic_t		*journal;
 
-- 
1.8.3.1


* [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering
  2018-04-12  6:38 [PATCH 0/4] bcache: incremental GC and dirty data init tang.junhui
                   ` (2 preceding siblings ...)
  2018-04-12  6:38 ` [PATCH 3/4] bcache: notify allocator to prepare for GC tang.junhui
@ 2018-04-12  6:38 ` tang.junhui
  2018-04-12 15:20   ` Coly Li
  2018-04-12  6:45 ` [PATCH 0/4] bcache: incremental GC and dirty data init Coly Li
  4 siblings, 1 reply; 10+ messages in thread
From: tang.junhui @ 2018-04-12  6:38 UTC (permalink / raw)
  To: kent.overstreet, colyli, mlyle; +Cc: linux-bcache, linux-block, tang.junhui

From: Tang Junhui <tang.junhui@zte.com.cn>

I attached several backing devices to the same cache set and produced lots
of dirty data by running small random writes for a long time. Then, while
continuing to run I/O on the other cached devices, I stopped one cached
device and, after a while, registered it again. The running I/O on the
other cached devices dropped significantly, sometimes even to zero.

In the current code, bcache traverses every key and btree node to count the
dirty data while holding the btree read lock, so the write threads cannot
take the btree write lock. When there are a lot of keys and btree nodes in
the registering device, this can last several seconds, so the write I/Os on
the other cached devices are blocked and decline significantly.

With this patch, when a device registers to a cache set in which other
cached devices have running I/Os, we count the dirty data of the device
incrementally, so the other cached devices are not blocked the whole time.
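
Condensed, the incremental walk looks like the sketch below (mirroring the
diff that follows; INIT_KEYS_MIN and INIT_KEYS_SLEEP_TIME are the constants
the patch adds, and surrounding code is elided):

	/* sectors_dirty_init_fn(): yield every INIT_KEYS_MIN keys when other
	 * front-side requests are in flight, remembering where to resume */
	op->count++;
	if (atomic_read(&b->c->io_inflight) && !(op->count % INIT_KEYS_MIN)) {
		bkey_copy_key(&op->start, k);
		return -EAGAIN;
	}
	return MAP_CONTINUE;

	/* bch_sectors_dirty_init(): restart the map from op.start after a
	 * short sleep instead of holding the btree lock for the whole pass */
	do {
		ret = bch_btree_map_keys(&op.op, d->c, &op.start,
					 sectors_dirty_init_fn, 0);
		if (ret == -EAGAIN)
			schedule_timeout_interruptible(
				msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
	} while (ret == -EAGAIN);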

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
---
 drivers/md/bcache/writeback.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 56a3788..2467d3a 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
 }
 
 /* Init */
+#define INIT_KEYS_MIN			500000
+#define INIT_KEYS_SLEEP_TIME		100
 
 struct sectors_dirty_init {
 	struct btree_op	op;
 	unsigned	inode;
+	size_t		count;
+	struct bkey	start;
 };
 
 static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
@@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
 		bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
 					     KEY_START(k), KEY_SIZE(k));
 
+	op->count++;
+	if (atomic_read(&b->c->io_inflight) &&
+	   !(op->count % INIT_KEYS_MIN)) {
+		bkey_copy_key(&op->start, k);
+		return -EAGAIN;
+	}
+
 	return MAP_CONTINUE;
 }
 
 void bch_sectors_dirty_init(struct bcache_device *d)
 {
 	struct sectors_dirty_init op;
+	int ret;
 
 	bch_btree_op_init(&op.op, -1);
 	op.inode = d->id;
+	op.count = 0;
+	op.start = KEY(op.inode, 0, 0);
 
-	bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
+	do {
+		ret = bch_btree_map_keys(&op.op, d->c, &op.start,
 			   sectors_dirty_init_fn, 0);
+		if (ret == -EAGAIN)
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
+		else if (ret < 0) {
+			pr_warn("sectors dirty init failed, ret=%d!", ret);
+			break;
+		}
+	} while (ret == -EAGAIN);
+
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
-- 
1.8.3.1


* Re: [PATCH 0/4] bcache: incremental GC and dirty data init
  2018-04-12  6:38 [PATCH 0/4] bcache: incremental GC and dirty data init tang.junhui
                   ` (3 preceding siblings ...)
  2018-04-12  6:38 ` [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering tang.junhui
@ 2018-04-12  6:45 ` Coly Li
  4 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2018-04-12  6:45 UTC (permalink / raw)
  To: tang.junhui, kent.overstreet, mlyle; +Cc: linux-bcache, linux-block

On 2018/4/12 2:38 PM, tang.junhui@zte.com.cn wrote:
> Hi maintainers and folks,
> 
> Some patches in this set have been sent before but are not merged yet, and
> I have added two new patches to solve issues I found while testing. Since
> they are interdependent, I have collected them into one patch set and am
> resending them together.
> 
> [PATCH 1/4] bcache: finish incremental GC
> [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to
> 			the total of btree nodes
> [PATCH 3/4] bcache: notify allocator to prepare for GC
> [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering
> 
> These patches help prevent I/O fluctuations, or even drops to zero, during GC
> or while cached devices are registering. I have tested them several times; I
> hope somebody could review these patches, and any comments are welcome.
> 

Hi Junhui,

Copied, these patches are in my to-review list :-)

Thanks.

Coly Li


* Re: [PATCH 1/4] bcache: finish incremental GC
  2018-04-12  6:38 ` [PATCH 1/4] bcache: finish incremental GC tang.junhui
@ 2018-04-12 14:08   ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2018-04-12 14:08 UTC (permalink / raw)
  To: tang.junhui; +Cc: kent.overstreet, mlyle, linux-bcache, linux-block

On 2018/4/12 2:38 PM, tang.junhui@zte.com.cn wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
> 
> In the GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but the current code never makes use of it.
> When GC runs, front-side I/O is blocked until GC finishes, which can take a
> long time when there are a lot of btree nodes.
> 
> This patch implements incremental GC. The main idea is that, when front-side
> I/Os are in flight, after GC has processed some nodes (100), we stop GC,
> release the lock on the btree node, and let the front-side I/Os proceed for
> a while (100 ms) before going back to GC again.
> 
> With this patch, I/Os are no longer blocked for the whole duration of GC,
> and the obvious drops of I/O to zero no longer occur.
> 
> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>

Hi Junhui,

My comments are inline below.

> ---
>  drivers/md/bcache/bcache.h  |  5 +++++
>  drivers/md/bcache/btree.c   | 15 ++++++++++++++-
>  drivers/md/bcache/request.c |  3 +++
>  3 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 843877e..ab4c9ca 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -445,6 +445,7 @@ struct cache {
>  
>  struct gc_stat {
>  	size_t			nodes;
> +	size_t			nodes_pre;
>  	size_t			key_bytes;
>  
>  	size_t			nkeys;
> @@ -568,6 +569,10 @@ struct cache_set {
>  	 */
>  	atomic_t		rescale;
>  	/*
> +	 * used for GC, to identify whether any front-side I/O is in flight
> +	 */
> +	atomic_t		io_inflight;

Could you please rename the above variable to something like
search_inflight? Just to be more explicit, considering we have many
types of I/O requests.

> +	/*
>  	 * When we invalidate buckets, we use both the priority and the amount
>  	 * of good data to determine which buckets to reuse first - to weight
>  	 * those together consistently we keep track of the smallest nonzero
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 81e8dc3..b36d276 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,6 +90,9 @@
>  
>  #define MAX_NEED_GC		64
>  #define MAX_SAVE_PRIO		72
> +#define MIN_GC_NODES		100
> +#define GC_SLEEP_TIME		100

How about naming the above macro GC_SLEEP_MS? Then people can tell the
sleep time is 100 ms without checking more context.

> +
>  
>  #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
>  
> @@ -1581,6 +1584,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
>  		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
>  		r->b = NULL;
>  
> +		if (atomic_read(&b->c->io_inflight) &&
> +		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {

On 32-bit machines, gc->nodes is a 32-bit count; if it overflows, the
above check will be false for a very long time, and in some conditions it
might stay false forever after gc->nodes overflows.

How about adding the following check before the if() statement ?
		/* in case gc->nodes has overflowed */
		if (gc->nodes < gc->nodes_pre)
			gc->nodes_pre = gc->nodes;

> +			gc->nodes_pre =  gc->nodes;
> +			ret = -EAGAIN;
> +			break;
> +		}
> +
>  		if (need_resched()) {
>  			ret = -EAGAIN;
>  			break;
> @@ -1748,7 +1758,10 @@ static void bch_btree_gc(struct cache_set *c)
>  		closure_sync(&writes);
>  		cond_resched();
>  
> -		if (ret && ret != -EAGAIN)
> +		if (ret == -EAGAIN)
> +			schedule_timeout_interruptible(msecs_to_jiffies
> +						       (GC_SLEEP_TIME));
> +		else if (ret)
>  			pr_warn("gc failed!");
>  	} while (ret);
>  
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 643c3021..26750a9 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -637,7 +637,9 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
>  static void search_free(struct closure *cl)
>  {
>  	struct search *s = container_of(cl, struct search, cl);
> +
>  	bio_complete(s);
> +	atomic_dec(&s->d->c->io_inflight);
>  
>  	if (s->iop.bio)
>  		bio_put(s->iop.bio);
> @@ -655,6 +657,7 @@ static inline struct search *search_alloc(struct bio *bio,
>  
>  	closure_init(&s->cl, NULL);
>  	do_bio_hook(s, bio);
> +	atomic_inc(&d->c->io_inflight);
>  

Similar variable naming.

>  	s->orig_bio		= bio;
>  	s->cache_miss		= NULL;
> 

Thanks for your effort.

Coly Li


* Re: [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes
  2018-04-12  6:38 ` [PATCH 2/4] bcache: calculate the number of incremental GC nodes according to the total of btree nodes tang.junhui
@ 2018-04-12 14:21   ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2018-04-12 14:21 UTC (permalink / raw)
  To: tang.junhui; +Cc: kent.overstreet, mlyle, linux-bcache, linux-block

On 2018/4/12 2:38 PM, tang.junhui@zte.com.cn wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
> 
> This patch is based on "[PATCH] bcache: finish incremental GC".
> 
> Since incremental GC stops for 100 ms whenever front-side I/O arrives, if GC
> only processes a constant number (100) of nodes each time, GC would last a
> long time when there are many btree nodes, and the front-side I/Os would run
> out of buckets (since no new bucket can be allocated during GC) and be
> blocked again.
> 
> So GC should not process a constant number of nodes, but a number that
> varies with the total number of btree nodes. In this patch, GC is divided
> into a constant number (100) of passes, so when there are many btree nodes,
> GC processes more nodes each time; otherwise it processes fewer nodes each
> time (but no fewer than MIN_GC_NODES).
> 
> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
> ---
>  drivers/md/bcache/btree.c | 37 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index b36d276..10f967e 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -90,10 +90,10 @@
>  
>  #define MAX_NEED_GC		64
>  #define MAX_SAVE_PRIO		72
> +#define MAX_GC_TIMES		100
>  #define MIN_GC_NODES		100
>  #define GC_SLEEP_TIME		100
>  
> -
>  #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))
>  
>  #define PTR_HASH(c, k)							\
> @@ -1519,6 +1519,31 @@ static unsigned btree_gc_count_keys(struct btree *b)
>  	return ret;
>  }
>  
> +static size_t btree_gc_min_nodes(struct cache_set *c)
> +{
> +	size_t			min_nodes;
> +
> +	/*
> +	 * Since incremental GC would stop 100ms when front
> +	 * side I/O comes, so when there are many btree nodes,
> +	 * if GC only processes constant (100) nodes each time,
> +	 * GC would last a long time, and the front side I/Os
> +	 * would run out of the buckets (since no new bucket
> +	 * can be allocated during GC), and be blocked again.
> +	 * So GC should not process constant nodes, but varied
> +	 * nodes according to the number of btree nodes, which
> +	 * realized by dividing GC into constant(100) times,
> +	 * so when there are many btree nodes, GC can process
> +	 * more nodes each time, otherwise, GC will process less
> +	 * nodes each time (but no less than MIN_GC_NODES)
> +	 */
> +	min_nodes = c->gc_stats.nodes / MAX_GC_TIMES;
> +	if (min_nodes < MIN_GC_NODES)
> +		min_nodes = MIN_GC_NODES;
> +
> +	return min_nodes;
> +}
> +
>  static int btree_gc_recurse(struct btree *b, struct btree_op *op,
>  			    struct closure *writes, struct gc_stat *gc)
>  {
> @@ -1585,7 +1610,7 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
>  		r->b = NULL;
>  
>  		if (atomic_read(&b->c->io_inflight) &&
> -		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
> +		    gc->nodes >= gc->nodes_pre + btree_gc_min_nodes(b->c)) {
>  			gc->nodes_pre =  gc->nodes;
>  			ret = -EAGAIN;
>  			break;
> @@ -1841,8 +1866,14 @@ static int bch_btree_check_recurse(struct btree *b, struct btree_op *op)
>  		do {
>  			k = bch_btree_iter_next_filter(&iter, &b->keys,
>  						       bch_ptr_bad);
> -			if (k)
> +			if (k) {

Hi Junhui,

>  				btree_node_prefetch(b, k);
> +				/*
> +				 * initialize c->gc_stats.nodes
> +				 * for incremental GC
> +				 */
> +				b->c->gc_stats.nodes++;

I don't get why gc_stats.nodes is increased here. Could you please
explain a bit more?

Thanks.

Coly Li


> +			}
>  
>  			if (p)
>  				ret = btree(check_recurse, p, b, op);
> 


* Re: [PATCH 3/4] bcache: notify allocator to prepare for GC
  2018-04-12  6:38 ` [PATCH 3/4] bcache: notify allocator to prepare for GC tang.junhui
@ 2018-04-12 14:46   ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2018-04-12 14:46 UTC (permalink / raw)
  To: tang.junhui; +Cc: kent.overstreet, mlyle, linux-bcache, linux-block

On 2018/4/12 2:38 PM, tang.junhui@zte.com.cn wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
> 
> Since no new bucket can be allocated during GC, front-side I/Os could run
> out of all the buckets. So notify the allocator to pack the free_inc queue
> full of buckets before GC starts; then we have enough buckets for front-side
> I/Os during the GC period.
> 
> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>

Hi Junhui,

This method is complicated IMHO. Why not simply call wake_up_gc(ca->set)
in bch_allocator_thread() after invalidate_buckets()?
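
Something like this (an untested sketch of the idea, not a submitted patch;
the surrounding allocator-thread code is elided):

	/* in bch_allocator_thread(), right after the free_inc refill */
	invalidate_buckets(ca);		/* free_inc has just been repacked */
	wake_up_gc(ca->set);		/* let GC run while buckets are plentiful */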

Thanks.

Coly Li

> ---
>  drivers/md/bcache/alloc.c  | 11 +++++++-
>  drivers/md/bcache/bcache.h |  2 ++
>  drivers/md/bcache/btree.c  | 69 ++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/bcache/btree.h  |  4 +++
>  4 files changed, 83 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index a0cc1bc..85020cc 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -323,7 +323,8 @@ static int bch_allocator_thread(void *arg)
>  		 * possibly issue discards to them, then we add the bucket to
>  		 * the free list:
>  		 */
> -		while (!fifo_empty(&ca->free_inc)) {
> +		while (!fifo_empty(&ca->free_inc) &&
> +		       ca->prepare_gc != GC_PREPARING) {
>  			long bucket;
>  
>  			fifo_pop(&ca->free_inc, bucket);
> @@ -353,6 +354,14 @@ static int bch_allocator_thread(void *arg)
>  		invalidate_buckets(ca);
>  
>  		/*
> +		 * Let GC continue
> +		 */
> +		if (ca->prepare_gc == GC_PREPARING) {
> +			ca->prepare_gc = GC_PREPARED;
> +			wake_up_gc(ca->set);
> +		}
> +
> +		/*
>  		 * Now, we write their new gens to disk so we can start writing
>  		 * new stuff to them:
>  		 */
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index ab4c9ca..61a6ff2 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -428,6 +428,8 @@ struct cache {
>  	 * cpu
>  	 */
>  	unsigned		invalidate_needs_gc;
> +	/* used to notify allocator to prepare for GC */
> +	unsigned int		prepare_gc;
>  
>  	bool			discard; /* Get rid of? */
>  
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 10f967e..aba0700 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1805,6 +1805,46 @@ static void bch_btree_gc(struct cache_set *c)
>  	bch_moving_gc(c);
>  }
>  
> +static bool cache_gc_prepare_none(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int i;
> +
> +	for_each_cache(ca, c, i)
> +		if (ca->prepare_gc != GC_PREPARE_NONE)
> +			return false;
> +	return true;
> +}
> +
> +static bool cache_gc_prepare_ok(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int i;
> +
> +	for_each_cache(ca, c, i)
> +		if (ca->prepare_gc != GC_PREPARED)
> +			return false;
> +	return true;
> +}
> +
> +static void set_cache_gc_prepare_none(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int i;
> +
> +	for_each_cache(ca, c, i)
> +		ca->prepare_gc = GC_PREPARE_NONE;
> +}
> +
> +static void set_cache_gc_preparing(struct cache_set *c)
> +{
> +	struct cache *ca;
> +	unsigned int i;
> +
> +	for_each_cache(ca, c, i)
> +		ca->prepare_gc = GC_PREPARING;
> +}
> +
>  static bool gc_should_run(struct cache_set *c)
>  {
>  	struct cache *ca;
> @@ -1814,8 +1854,33 @@ static bool gc_should_run(struct cache_set *c)
>  		if (ca->invalidate_needs_gc)
>  			return true;
>  
> -	if (atomic_read(&c->sectors_to_gc) < 0)
> -		return true;
> +	if (atomic_read(&c->sectors_to_gc) < 0) {
> +		mutex_lock(&c->bucket_lock);
> +		if (cache_gc_prepare_none(c)) {
> +			/*
> +			 * notify allocator thread to prepare for GC
> +			 */
> +			set_cache_gc_preparing(c);
> +			mutex_unlock(&c->bucket_lock);
> +			pr_info("gc preparing");
> +			return false;
> +		} else if (cache_gc_prepare_ok(c)) {
> +			/*
> +			 * alloc thread finished preparing,
> +			 * and continue to GC
> +			 */
> +			set_cache_gc_prepare_none(c);
> +			mutex_unlock(&c->bucket_lock);
> +			pr_info("gc prepared ok");
> +			return true;
> +		} else {
> +			/*
> +			 * waiting for the allocator to finish preparing
> +			 */
> +			mutex_unlock(&c->bucket_lock);
> +			return false;
> +		}
> +	}
>  
>  	return false;
>  }
> diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
> index d211e2c..e60bd7a 100644
> --- a/drivers/md/bcache/btree.h
> +++ b/drivers/md/bcache/btree.h
> @@ -102,6 +102,10 @@
>  #include "bset.h"
>  #include "debug.h"
>  
> +#define GC_PREPARE_NONE		0
> +#define GC_PREPARING		1
> +#define GC_PREPARED		2
> +
>  struct btree_write {
>  	atomic_t		*journal;
>  
> 


* Re: [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering
  2018-04-12  6:38 ` [PATCH 4/4] bcache: fix I/O significant decline while backend devices registering tang.junhui
@ 2018-04-12 15:20   ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2018-04-12 15:20 UTC (permalink / raw)
  To: tang.junhui; +Cc: kent.overstreet, mlyle, linux-bcache, linux-block

On 2018/4/12 2:38 PM, tang.junhui@zte.com.cn wrote:
> From: Tang Junhui <tang.junhui@zte.com.cn>
> 
> I attached several backing devices to the same cache set and produced lots
> of dirty data by running small random writes for a long time. Then, while
> continuing to run I/O on the other cached devices, I stopped one cached
> device and, after a while, registered it again. The running I/O on the
> other cached devices dropped significantly, sometimes even to zero.
> 
> In the current code, bcache traverses every key and btree node to count the
> dirty data while holding the btree read lock, so the write threads cannot
> take the btree write lock. When there are a lot of keys and btree nodes in
> the registering device, this can last several seconds, so the write I/Os on
> the other cached devices are blocked and decline significantly.
> 
> With this patch, when a device registers to a cache set in which other
> cached devices have running I/Os, we count the dirty data of the device
> incrementally, so the other cached devices are not blocked the whole time.
> 
> Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
> ---
>  drivers/md/bcache/writeback.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 56a3788..2467d3a 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -488,10 +488,14 @@ static int bch_writeback_thread(void *arg)
>  }
>  

Hi Junhui,

>  /* Init */
> +#define INIT_KEYS_MIN			500000
> +#define INIT_KEYS_SLEEP_TIME		100
>  

How about renaming the above macros to something like
INIT_KEYS_WAIT_COUNTER and INIT_KEYS_SLEEP_MS ?

Also, I suggested renaming io_inflight in my review comments on the
previous patch.

The rest of this patch looks good to me.

Reviewed-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li

>  struct sectors_dirty_init {
>  	struct btree_op	op;
>  	unsigned	inode;
> +	size_t		count;
> +	struct bkey	start;
>  };
>  
>  static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
> @@ -506,18 +510,38 @@ static int sectors_dirty_init_fn(struct btree_op *_op, struct btree *b,
>  		bcache_dev_sectors_dirty_add(b->c, KEY_INODE(k),
>  					     KEY_START(k), KEY_SIZE(k));
>  
> +	op->count++;
> +	if (atomic_read(&b->c->io_inflight) &&
> +	   !(op->count % INIT_KEYS_MIN)) {
> +		bkey_copy_key(&op->start, k);
> +		return -EAGAIN;
> +	}
> +
>  	return MAP_CONTINUE;
>  }
>  
>  void bch_sectors_dirty_init(struct bcache_device *d)
>  {
>  	struct sectors_dirty_init op;
> +	int ret;
>  
>  	bch_btree_op_init(&op.op, -1);
>  	op.inode = d->id;
> +	op.count = 0;
> +	op.start = KEY(op.inode, 0, 0);
>  
> -	bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
> +	do {
> +		ret = bch_btree_map_keys(&op.op, d->c, &op.start,
>  			   sectors_dirty_init_fn, 0);
> +		if (ret == -EAGAIN)
> +			schedule_timeout_interruptible(
> +				msecs_to_jiffies(INIT_KEYS_SLEEP_TIME));
> +		else if (ret < 0) {
> +			pr_warn("sectors dirty init failed, ret=%d!", ret);
> +			break;
> +		}
> +	} while (ret == -EAGAIN);
> +
>  }
>  
>  void bch_cached_dev_writeback_init(struct cached_dev *dc)
> 

