All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/5] bcache: don't write back data if reading it failed
@ 2017-09-26 19:24 Michael Lyle
  2017-09-26 19:24 ` [PATCH 2/5] bcache: implement PI controller for writeback rate Michael Lyle
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Michael Lyle @ 2017-09-26 19:24 UTC (permalink / raw)
  To: linux-bcache; +Cc: i, Michael Lyle

If an IO operation fails, and we didn't successfully read data from the
cache, don't writeback invalid/partial data to the backing disk.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/writeback.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index e663ca082183..eea49bf36401 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -179,13 +179,20 @@ static void write_dirty(struct closure *cl)
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 	struct keybuf_key *w = io->bio.bi_private;
 
-	dirty_init(w);
-	bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
-	io->bio.bi_iter.bi_sector = KEY_START(&w->key);
-	bio_set_dev(&io->bio, io->dc->bdev);
-	io->bio.bi_end_io	= dirty_endio;
+	/* IO errors are signalled using the dirty bit on the key.
+	 * If we failed to read, we should not attempt to write to the
+	 * backing device.  Instead, immediately go to write_dirty_finish
+	 * to clean up.
+	 */
+	if (KEY_DIRTY(&w->key)) {
+		dirty_init(w);
+		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
+		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
+		bio_set_dev(&io->bio, io->dc->bdev);
+		io->bio.bi_end_io	= dirty_endio;
 
-	closure_bio_submit(&io->bio, cl);
+		closure_bio_submit(&io->bio, cl);
+	}
 
 	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/5] bcache: implement PI controller for writeback rate
  2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
@ 2017-09-26 19:24 ` Michael Lyle
  2017-09-27  8:35   ` Coly Li
  2017-09-26 19:24 ` [PATCH 3/5] bcache: smooth writeback rate control Michael Lyle
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Michael Lyle @ 2017-09-26 19:24 UTC (permalink / raw)
  To: linux-bcache; +Cc: i, Michael Lyle

bcache uses a control system to attempt to keep the amount of dirty data
in cache at a user-configured level, while not responding excessively to
transients and variations in write rate.  Previously, the system was a
PD controller; but the output from it was integrated, turning the
Proportional term into an Integral term, and turning the Derivative term
into a crude Proportional term.  Performance of the controller has been
uneven in production, and it has tended to respond slowly, oscillate,
and overshoot.

This patch set replaces the current control system with an explicit PI
controller and tuning that should be correct for most hardware.  By
default, it attempts to write at a rate that would retire 1/40th of the
current excess blocks per second.  An integral term in turn works to
remove steady state errors.

IMO, this yields benefits in simplicity (removing weighted average
filtering, etc) and system performance.

Another small change: a tunable parameter is introduced to allow the
user to specify a minimum rate at which dirty blocks are retired.

There is a slight difference from earlier versions of the patch in
integral handling to prevent excessive negative integral windup.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/bcache.h    |  9 ++---
 drivers/md/bcache/sysfs.c     | 18 +++++----
 drivers/md/bcache/writeback.c | 89 ++++++++++++++++++++++++-------------------
 3 files changed, 64 insertions(+), 52 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 2ed9bd231d84..eb83be693d60 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -265,9 +265,6 @@ struct bcache_device {
 	atomic_t		*stripe_sectors_dirty;
 	unsigned long		*full_dirty_stripes;
 
-	unsigned long		sectors_dirty_last;
-	long			sectors_dirty_derivative;
-
 	struct bio_set		*bio_split;
 
 	unsigned		data_csum:1;
@@ -362,12 +359,14 @@ struct cached_dev {
 
 	uint64_t		writeback_rate_target;
 	int64_t			writeback_rate_proportional;
-	int64_t			writeback_rate_derivative;
+	int64_t			writeback_rate_integral;
+	int64_t			writeback_rate_integral_scaled;
 	int64_t			writeback_rate_change;
 
 	unsigned		writeback_rate_update_seconds;
-	unsigned		writeback_rate_d_term;
+	unsigned		writeback_rate_i_term_inverse;
 	unsigned		writeback_rate_p_term_inverse;
+	unsigned		writeback_rate_minimum;
 };
 
 enum alloc_reserve {
diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
index 104c57cd666c..eb493814759c 100644
--- a/drivers/md/bcache/sysfs.c
+++ b/drivers/md/bcache/sysfs.c
@@ -81,8 +81,9 @@ rw_attribute(writeback_delay);
 rw_attribute(writeback_rate);
 
 rw_attribute(writeback_rate_update_seconds);
-rw_attribute(writeback_rate_d_term);
+rw_attribute(writeback_rate_i_term_inverse);
 rw_attribute(writeback_rate_p_term_inverse);
+rw_attribute(writeback_rate_minimum);
 read_attribute(writeback_rate_debug);
 
 read_attribute(stripe_size);
@@ -130,15 +131,16 @@ SHOW(__bch_cached_dev)
 	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
 
 	var_print(writeback_rate_update_seconds);
-	var_print(writeback_rate_d_term);
+	var_print(writeback_rate_i_term_inverse);
 	var_print(writeback_rate_p_term_inverse);
+	var_print(writeback_rate_minimum);
 
 	if (attr == &sysfs_writeback_rate_debug) {
 		char rate[20];
 		char dirty[20];
 		char target[20];
 		char proportional[20];
-		char derivative[20];
+		char integral[20];
 		char change[20];
 		s64 next_io;
 
@@ -146,7 +148,7 @@ SHOW(__bch_cached_dev)
 		bch_hprint(dirty,	bcache_dev_sectors_dirty(&dc->disk) << 9);
 		bch_hprint(target,	dc->writeback_rate_target << 9);
 		bch_hprint(proportional,dc->writeback_rate_proportional << 9);
-		bch_hprint(derivative,	dc->writeback_rate_derivative << 9);
+		bch_hprint(integral,	dc->writeback_rate_integral_scaled << 9);
 		bch_hprint(change,	dc->writeback_rate_change << 9);
 
 		next_io = div64_s64(dc->writeback_rate.next - local_clock(),
@@ -157,11 +159,11 @@ SHOW(__bch_cached_dev)
 			       "dirty:\t\t%s\n"
 			       "target:\t\t%s\n"
 			       "proportional:\t%s\n"
-			       "derivative:\t%s\n"
+			       "integral:\t%s\n"
 			       "change:\t\t%s/sec\n"
 			       "next io:\t%llims\n",
 			       rate, dirty, target, proportional,
-			       derivative, change, next_io);
+			       integral, change, next_io);
 	}
 
 	sysfs_hprint(dirty_data,
@@ -213,7 +215,7 @@ STORE(__cached_dev)
 			    dc->writeback_rate.rate, 1, INT_MAX);
 
 	d_strtoul_nonzero(writeback_rate_update_seconds);
-	d_strtoul(writeback_rate_d_term);
+	d_strtoul(writeback_rate_i_term_inverse);
 	d_strtoul_nonzero(writeback_rate_p_term_inverse);
 
 	d_strtoi_h(sequential_cutoff);
@@ -319,7 +321,7 @@ static struct attribute *bch_cached_dev_files[] = {
 	&sysfs_writeback_percent,
 	&sysfs_writeback_rate,
 	&sysfs_writeback_rate_update_seconds,
-	&sysfs_writeback_rate_d_term,
+	&sysfs_writeback_rate_i_term_inverse,
 	&sysfs_writeback_rate_p_term_inverse,
 	&sysfs_writeback_rate_debug,
 	&sysfs_dirty_data,
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index eea49bf36401..5a8f5655151b 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -25,48 +25,60 @@ static void __update_writeback_rate(struct cached_dev *dc)
 				bcache_flash_devs_sectors_dirty(c);
 	uint64_t cache_dirty_target =
 		div_u64(cache_sectors * dc->writeback_percent, 100);
-
 	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
 				   c->cached_dev_sectors);
 
-	/* PD controller */
-
+	/* PI controller:
+	 * Figures out the amount that should be written per second.
+	 *
+	 * First, the error (number of sectors that are dirty beyond our
+	 * target) is calculated.  The error is accumulated (numerically
+	 * integrated).
+	 *
+	 * Then, the proportional value and integral value are scaled
+	 * based on configured values.  These are stored as inverses to
+	 * avoid fixed point math and to make configuration easy-- e.g.
+	 * the default value of 40 for writeback_rate_p_term_inverse
+	 * attempts to write at a rate that would retire all the dirty
+	 * blocks in 40 seconds.
+	 *
+	 * The writeback_rate_i_term_inverse value of 10000 means that 1/10000th
+	 * of the error is accumulated in the integral term per second.
+	 * This acts as a slow, long-term average that is not subject to
+	 * variations in usage like the p term.
+	 */
 	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
-	int64_t derivative = dirty - dc->disk.sectors_dirty_last;
-	int64_t proportional = dirty - target;
-	int64_t change;
-
-	dc->disk.sectors_dirty_last = dirty;
-
-	/* Scale to sectors per second */
-
-	proportional *= dc->writeback_rate_update_seconds;
-	proportional = div_s64(proportional, dc->writeback_rate_p_term_inverse);
-
-	derivative = div_s64(derivative, dc->writeback_rate_update_seconds);
-
-	derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
-			      (dc->writeback_rate_d_term /
-			       dc->writeback_rate_update_seconds) ?: 1, 0);
-
-	derivative *= dc->writeback_rate_d_term;
-	derivative = div_s64(derivative, dc->writeback_rate_p_term_inverse);
-
-	change = proportional + derivative;
+	int64_t error = dirty - target;
+	int64_t proportional_scaled =
+		div_s64(error, dc->writeback_rate_p_term_inverse);
+	int64_t integral_scaled, new_rate;
+
+	if ((error < 0 && dc->writeback_rate_integral > 0) ||
+	    (error > 0 && time_before64(local_clock(),
+			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
+		/* Only decrease the integral term if it's more than
+		 * zero.  Only increase the integral term if the device
+		 * is keeping up.  (Don't wind up the integral
+		 * ineffectively in either case).
+		 *
+		 * It's necessary to scale this by
+		 * writeback_rate_update_seconds to keep the integral
+		 * term dimensioned properly.
+		 */
+		dc->writeback_rate_integral += error *
+			dc->writeback_rate_update_seconds;
+	}
 
-	/* Don't increase writeback rate if the device isn't keeping up */
-	if (change > 0 &&
-	    time_after64(local_clock(),
-			 dc->writeback_rate.next + NSEC_PER_MSEC))
-		change = 0;
+	integral_scaled = div_s64(dc->writeback_rate_integral,
+			dc->writeback_rate_i_term_inverse);
 
-	dc->writeback_rate.rate =
-		clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
-			1, NSEC_PER_MSEC);
+	new_rate = clamp_t(int64_t, (proportional_scaled + integral_scaled),
+			dc->writeback_rate_minimum, NSEC_PER_MSEC);
 
-	dc->writeback_rate_proportional = proportional;
-	dc->writeback_rate_derivative = derivative;
-	dc->writeback_rate_change = change;
+	dc->writeback_rate_proportional = proportional_scaled;
+	dc->writeback_rate_integral_scaled = integral_scaled;
+	dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
+	dc->writeback_rate.rate = new_rate;
 	dc->writeback_rate_target = target;
 }
 
@@ -498,8 +510,6 @@ void bch_sectors_dirty_init(struct bcache_device *d)
 
 	bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
 			   sectors_dirty_init_fn, 0);
-
-	d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
 }
 
 void bch_cached_dev_writeback_init(struct cached_dev *dc)
@@ -513,10 +523,11 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_percent		= 10;
 	dc->writeback_delay		= 30;
 	dc->writeback_rate.rate		= 1024;
+	dc->writeback_rate_minimum	= 1;
 
 	dc->writeback_rate_update_seconds = 5;
-	dc->writeback_rate_d_term	= 30;
-	dc->writeback_rate_p_term_inverse = 6000;
+	dc->writeback_rate_p_term_inverse = 40;
+	dc->writeback_rate_i_term_inverse = 10000;
 
 	INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
 }
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/5] bcache: smooth writeback rate control
  2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
  2017-09-26 19:24 ` [PATCH 2/5] bcache: implement PI controller for writeback rate Michael Lyle
@ 2017-09-26 19:24 ` Michael Lyle
  2017-09-27  8:44   ` Coly Li
  2017-09-26 19:24 ` [PATCH 4/5] bcache: writeback: collapse contiguous IO better Michael Lyle
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Michael Lyle @ 2017-09-26 19:24 UTC (permalink / raw)
  To: linux-bcache; +Cc: i, Michael Lyle

This works in conjunction with the new PI controller.  Currently, in
real-world workloads, the rate controller attempts to write back 1
sector per second.  In practice, these minimum-rate writebacks are
between 4k and 60k in test scenarios, since bcache aggregates and
attempts to do contiguous writes and because filesystems on top of
bcache typically write 4k or more.

Previously, bcache used to guarantee to write at least once per second.
This means that the actual writeback rate would exceed the configured
amount by a factor of 8-120 or more.

This patch adjusts to be willing to sleep up to 2.5 seconds, and to
target writing 4k/second.  On the smallest writes, it will sleep 1
second like before, but many times it will sleep longer and load the
backing device less.  This keeps the loading on the cache and backing
device related to writeback more consistent when writing back at low
rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/util.c      | 10 ++++++++--
 drivers/md/bcache/writeback.c |  2 +-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 176d3c2ef5f5..4dbe37e82877 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -232,8 +232,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
 
 	d->next += div_u64(done * NSEC_PER_SEC, d->rate);
 
-	if (time_before64(now + NSEC_PER_SEC, d->next))
-		d->next = now + NSEC_PER_SEC;
+	/* Bound the time.  Don't let us fall further than 2 seconds behind
+	 * (this prevents unnecessary backlog that would make it impossible
+	 * to catch up).  If we're ahead of the desired writeback rate,
+	 * don't let us sleep more than 2.5 seconds (so we can notice/respond
+	 * if the control system tells us to speed up!).
+	 */
+	if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next))
+		d->next = now + NSEC_PER_SEC * 5 / 2;
 
 	if (time_after64(now - NSEC_PER_SEC * 2, d->next))
 		d->next = now - NSEC_PER_SEC * 2;
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 5a8f5655151b..0b7c89813635 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -523,7 +523,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
 	dc->writeback_percent		= 10;
 	dc->writeback_delay		= 30;
 	dc->writeback_rate.rate		= 1024;
-	dc->writeback_rate_minimum	= 1;
+	dc->writeback_rate_minimum	= 8;
 
 	dc->writeback_rate_update_seconds = 5;
 	dc->writeback_rate_p_term_inverse = 40;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/5] bcache: writeback: collapse contiguous IO better
  2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
  2017-09-26 19:24 ` [PATCH 2/5] bcache: implement PI controller for writeback rate Michael Lyle
  2017-09-26 19:24 ` [PATCH 3/5] bcache: smooth writeback rate control Michael Lyle
@ 2017-09-26 19:24 ` Michael Lyle
  2017-09-26 19:24 ` [PATCH 5/5] bcache: writeback: properly order backing device IO Michael Lyle
  2017-09-27  8:22 ` [PATCH 1/5] bcache: don't write back data if reading it failed Coly Li
  4 siblings, 0 replies; 11+ messages in thread
From: Michael Lyle @ 2017-09-26 19:24 UTC (permalink / raw)
  To: linux-bcache; +Cc: i, Michael Lyle

Previously, there was some logic that attempted to immediately issue
writeback of backing-contiguous blocks when the writeback rate was
fast.

The previous logic did not have any limits on the aggregate size it
would issue, nor the number of keys it would combine at once.  It
would also discard the chance to do a contiguous write when the
writeback rate was low-- e.g. at "background" writeback of target
rate = 8, it would not combine two adjacent 4k writes and would
instead seek the disk twice.

This patch imposes limits and explicitly understands the size of
contiguous I/O during issue.  It also will combine contiguous I/O
in all circumstances, not just when writeback is requested to be
relatively fast.

It is a win on its own, but also lays the groundwork for skip writes to
short keys to make the I/O more sequential/contiguous.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/bcache.h    |   6 --
 drivers/md/bcache/writeback.c | 131 ++++++++++++++++++++++++++++++------------
 2 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index eb83be693d60..da803a3b1981 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -321,12 +321,6 @@ struct cached_dev {
 	struct bch_ratelimit	writeback_rate;
 	struct delayed_work	writeback_rate_update;
 
-	/*
-	 * Internal to the writeback code, so read_dirty() can keep track of
-	 * where it's at.
-	 */
-	sector_t		last_read;
-
 	/* Limit number of writeback bios in flight */
 	struct semaphore	in_flight;
 	struct task_struct	*writeback_thread;
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 0b7c89813635..cf29c44605b9 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -229,10 +229,26 @@ static void read_dirty_submit(struct closure *cl)
 	continue_at(cl, write_dirty, io->dc->writeback_write_wq);
 }
 
+static inline bool keys_contiguous(struct cached_dev *dc,
+		struct keybuf_key *first, struct keybuf_key *second)
+{
+	if (PTR_CACHE(dc->disk.c, &second->key, 0)->bdev !=
+			PTR_CACHE(dc->disk.c, &first->key, 0)->bdev)
+		return false;
+
+	if (PTR_OFFSET(&second->key, 0) !=
+			(PTR_OFFSET(&first->key, 0) + KEY_SIZE(&first->key)))
+		return false;
+
+	return true;
+}
+
 static void read_dirty(struct cached_dev *dc)
 {
 	unsigned delay = 0;
-	struct keybuf_key *w;
+	struct keybuf_key *next, *keys[5], *w;
+	size_t size;
+	int nk, i;
 	struct dirty_io *io;
 	struct closure cl;
 
@@ -243,45 +259,84 @@ static void read_dirty(struct cached_dev *dc)
 	 * mempools.
 	 */
 
-	while (!kthread_should_stop()) {
-
-		w = bch_keybuf_next(&dc->writeback_keys);
-		if (!w)
-			break;
-
-		BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
-
-		if (KEY_START(&w->key) != dc->last_read ||
-		    jiffies_to_msecs(delay) > 50)
-			while (!kthread_should_stop() && delay)
-				delay = schedule_timeout_interruptible(delay);
-
-		dc->last_read	= KEY_OFFSET(&w->key);
-
-		io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
-			     * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
-			     GFP_KERNEL);
-		if (!io)
-			goto err;
-
-		w->private	= io;
-		io->dc		= dc;
-
-		dirty_init(w);
-		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
-		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
-		bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
-		io->bio.bi_end_io	= read_dirty_endio;
-
-		if (bio_alloc_pages(&io->bio, GFP_KERNEL))
-			goto err_free;
-
-		trace_bcache_writeback(&w->key);
+	next = bch_keybuf_next(&dc->writeback_keys);
+
+	while (!kthread_should_stop() && next) {
+		size = 0;
+		nk = 0;
+
+		do {
+			BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
+
+			/* Don't combine too many operations, even if they
+			 * are all small.
+			 */
+			if (nk >= 5)
+				break;
+
+			/* If the current operation is very large, don't
+			 * further combine operations.
+			 */
+			if (size > 5000)
+				break;
+
+			/* Operations are only eligible to be combined
+			 * if they are contiguous.
+			 *
+			 * TODO: add a heuristic willing to fire a
+			 * certain amount of non-contiguous IO per pass,
+			 * so that we can benefit from backing device
+			 * command queueing.
+			 */
+			if (nk != 0 && !keys_contiguous(dc, keys[nk-1], next))
+				break;
+
+			size += KEY_SIZE(&next->key);
+			keys[nk++] = next;
+		} while ((next = bch_keybuf_next(&dc->writeback_keys)));
+
+		/* Now we have gathered a set of 1..5 keys to write back. */
+
+		for (i = 0; i < nk; i++) {
+			w = keys[i];
+
+			io = kzalloc(sizeof(struct dirty_io) +
+				     sizeof(struct bio_vec) *
+				     DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
+				     GFP_KERNEL);
+			if (!io)
+				goto err;
+
+			w->private	= io;
+			io->dc		= dc;
+
+			dirty_init(w);
+			bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
+			io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
+			bio_set_dev(&io->bio,
+				    PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
+			io->bio.bi_end_io	= read_dirty_endio;
+
+			if (bio_alloc_pages(&io->bio, GFP_KERNEL))
+				goto err_free;
+
+			trace_bcache_writeback(&w->key);
+
+			down(&dc->in_flight);
+
+			/* We've acquired a semaphore for the maximum
+			 * simultaneous number of writebacks; from here
+			 * everything happens asynchronously.
+			 */
+			closure_call(&io->cl, read_dirty_submit, NULL, &cl);
+		}
 
-		down(&dc->in_flight);
-		closure_call(&io->cl, read_dirty_submit, NULL, &cl);
+		delay = writeback_delay(dc, size);
 
-		delay = writeback_delay(dc, KEY_SIZE(&w->key));
+		while (!kthread_should_stop() && delay) {
+			schedule_timeout_interruptible(delay);
+			delay = writeback_delay(dc, 0);
+		}
 	}
 
 	if (0) {
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/5] bcache: writeback: properly order backing device IO
  2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
                   ` (2 preceding siblings ...)
  2017-09-26 19:24 ` [PATCH 4/5] bcache: writeback: collapse contiguous IO better Michael Lyle
@ 2017-09-26 19:24 ` Michael Lyle
  2017-09-27  8:22 ` [PATCH 1/5] bcache: don't write back data if reading it failed Coly Li
  4 siblings, 0 replies; 11+ messages in thread
From: Michael Lyle @ 2017-09-26 19:24 UTC (permalink / raw)
  To: linux-bcache; +Cc: i, Michael Lyle

Writeback keys are presently iterated and dispatched for writeback in
order of the logical block address on the backing device.  Multiple may
be, in parallel, read from the cache device and then written back
(especially when there are contiguous I/O).

However-- there was no guarantee with the existing code that the writes
would be issued in LBA order, as the reads from the cache device are
often re-ordered.  In turn, when writing back quickly, the backing disk
often has to seek backwards-- this slows writeback and increases
utilization.

This patch introduces an ordering mechanism that guarantees that the
original order of issue is maintained for the write portion of the I/O.
Performance for writeback is significantly improved when there are
multiple contiguous keys or high writeback rates.

Signed-off-by: Michael Lyle <mlyle@lyle.org>
---
 drivers/md/bcache/bcache.h    |  7 +++++++
 drivers/md/bcache/writeback.c | 28 ++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index da803a3b1981..6ae6efda6c57 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -328,6 +328,13 @@ struct cached_dev {
 
 	struct keybuf		writeback_keys;
 
+	/* Order the write-half of writeback operations strongly in dispatch
+	 * order.  (Maintain LBA order; don't allow reads completing out of
+	 * order to re-order the writes...)
+	 */
+	struct closure_waitlist writeback_ordering_wait;
+	atomic_t		writeback_sequence_next;
+
 	/* For tracking sequential IO */
 #define RECENT_IO_BITS	7
 #define RECENT_IO	(1 << RECENT_IO_BITS)
diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index cf29c44605b9..3ca775fbae7f 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -112,6 +112,7 @@ static unsigned writeback_delay(struct cached_dev *dc, unsigned sectors)
 struct dirty_io {
 	struct closure		cl;
 	struct cached_dev	*dc;
+	uint16_t		sequence;
 	struct bio		bio;
 };
 
@@ -190,6 +191,26 @@ static void write_dirty(struct closure *cl)
 {
 	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
 	struct keybuf_key *w = io->bio.bi_private;
+	struct cached_dev *dc = io->dc;
+
+	uint16_t next_sequence;
+
+	if (atomic_read(&dc->writeback_sequence_next) != io->sequence) {
+		/* Not our turn to write; wait for a write to complete */
+		closure_wait(&dc->writeback_ordering_wait, cl);
+
+		if (atomic_read(&dc->writeback_sequence_next) == io->sequence) {
+			/* Edge case-- it happened in indeterminate order
+			 * relative to when we were added to wait list..
+			 */
+			closure_wake_up(&dc->writeback_ordering_wait);
+		}
+
+		continue_at(cl, write_dirty, io->dc->writeback_write_wq);
+		return;
+	}
+
+	next_sequence = io->sequence + 1;
 
 	/* IO errors are signalled using the dirty bit on the key.
 	 * If we failed to read, we should not attempt to write to the
@@ -206,6 +227,9 @@ static void write_dirty(struct closure *cl)
 		closure_bio_submit(&io->bio, cl);
 	}
 
+	atomic_set(&dc->writeback_sequence_next, next_sequence);
+	closure_wake_up(&dc->writeback_ordering_wait);
+
 	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
 }
 
@@ -251,7 +275,10 @@ static void read_dirty(struct cached_dev *dc)
 	int nk, i;
 	struct dirty_io *io;
 	struct closure cl;
+	uint16_t sequence = 0;
 
+	BUG_ON(!llist_empty(&dc->writeback_ordering_wait.list));
+	atomic_set(&dc->writeback_sequence_next, sequence);
 	closure_init_stack(&cl);
 
 	/*
@@ -309,6 +336,7 @@ static void read_dirty(struct cached_dev *dc)
 
 			w->private	= io;
 			io->dc		= dc;
+			io->sequence    = sequence++;
 
 			dirty_init(w);
 			bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/5] bcache: don't write back data if reading it failed
  2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
                   ` (3 preceding siblings ...)
  2017-09-26 19:24 ` [PATCH 5/5] bcache: writeback: properly order backing device IO Michael Lyle
@ 2017-09-27  8:22 ` Coly Li
  2017-09-27  8:28   ` Michael Lyle
  4 siblings, 1 reply; 11+ messages in thread
From: Coly Li @ 2017-09-27  8:22 UTC (permalink / raw)
  To: Michael Lyle; +Cc: linux-bcache

On 2017/9/27 上午3:24, Michael Lyle wrote:
> If an IO operation fails, and we didn't successfully read data from the
> cache, don't writeback invalid/partial data to the backing disk.
> 
> Signed-off-by: Michael Lyle <mlyle@lyle.org>
> ---
>  drivers/md/bcache/writeback.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index e663ca082183..eea49bf36401 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -179,13 +179,20 @@ static void write_dirty(struct closure *cl)
>  	struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>  	struct keybuf_key *w = io->bio.bi_private;
>  
> -	dirty_init(w);
> -	bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
> -	io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> -	bio_set_dev(&io->bio, io->dc->bdev);
> -	io->bio.bi_end_io	= dirty_endio;
> +	/* IO errors are signalled using the dirty bit on the key.
> +	 * If we failed to read, we should not attempt to write to the
> +	 * backing device.  Instead, immediately go to write_dirty_finish
> +	 * to clean up.
> +	 */
> +	if (KEY_DIRTY(&w->key)) {
Maybe it should be "if (!KEY_DIRTY(&w->key))" ? In dirty_endio(),
SET_KEY_DIRTY() is called only when bio->bi_status is not 0.


> +		dirty_init(w);
> +		bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
> +		io->bio.bi_iter.bi_sector = KEY_START(&w->key);
> +		bio_set_dev(&io->bio, io->dc->bdev);
> +		io->bio.bi_end_io	= dirty_endio;
>  
> -	closure_bio_submit(&io->bio, cl);
> +		closure_bio_submit(&io->bio, cl);
> +	}
>  
>  	continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
>  }
> 


-- 
Coly Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/5] bcache: don't write back data if reading it failed
  2017-09-27  8:22 ` [PATCH 1/5] bcache: don't write back data if reading it failed Coly Li
@ 2017-09-27  8:28   ` Michael Lyle
  2017-09-27  8:40     ` Coly Li
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Lyle @ 2017-09-27  8:28 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache

SET_KEY_DIRTY sets it to false when bi_status is not 0.

We do the write if KEY_DIRTY is true, else it is skipped.

Mike

On Wed, Sep 27, 2017 at 1:22 AM, Coly Li <i@coly.li> wrote:
> On 2017/9/27 上午3:24, Michael Lyle wrote:
>> If an IO operation fails, and we didn't successfully read data from the
>> cache, don't writeback invalid/partial data to the backing disk.
>>
>> Signed-off-by: Michael Lyle <mlyle@lyle.org>
>> ---
>>  drivers/md/bcache/writeback.c | 19 +++++++++++++------
>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index e663ca082183..eea49bf36401 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -179,13 +179,20 @@ static void write_dirty(struct closure *cl)
>>       struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>>       struct keybuf_key *w = io->bio.bi_private;
>>
>> -     dirty_init(w);
>> -     bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
>> -     io->bio.bi_iter.bi_sector = KEY_START(&w->key);
>> -     bio_set_dev(&io->bio, io->dc->bdev);
>> -     io->bio.bi_end_io       = dirty_endio;
>> +     /* IO errors are signalled using the dirty bit on the key.
>> +      * If we failed to read, we should not attempt to write to the
>> +      * backing device.  Instead, immediately go to write_dirty_finish
>> +      * to clean up.
>> +      */
>> +     if (KEY_DIRTY(&w->key)) {
> Maybe it should be "if (!KEY_DIRTY(&w->key))" ? In dirty_endio(),
> SET_KEY_DIRTY() is called only when bio->bi_status is not 0.
>
>
>> +             dirty_init(w);
>> +             bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
>> +             io->bio.bi_iter.bi_sector = KEY_START(&w->key);
>> +             bio_set_dev(&io->bio, io->dc->bdev);
>> +             io->bio.bi_end_io       = dirty_endio;
>>
>> -     closure_bio_submit(&io->bio, cl);
>> +             closure_bio_submit(&io->bio, cl);
>> +     }
>>
>>       continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
>>  }
>>
>
>
> --
> Coly Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/5] bcache: implement PI controller for writeback rate
  2017-09-26 19:24 ` [PATCH 2/5] bcache: implement PI controller for writeback rate Michael Lyle
@ 2017-09-27  8:35   ` Coly Li
  2017-09-27  8:40     ` Michael Lyle
  0 siblings, 1 reply; 11+ messages in thread
From: Coly Li @ 2017-09-27  8:35 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache

On 2017/9/27 上午3:24, Michael Lyle wrote:
> bcache uses a control system to attempt to keep the amount of dirty data
> in cache at a user-configured level, while not responding excessively to
> transients and variations in write rate.  Previously, the system was a
> PD controller; but the output from it was integrated, turning the
> Proportional term into an Integral term, and turning the Derivative term
> into a crude Proportional term.  Performance of the controller has been
> uneven in production, and it has tended to respond slowly, oscillate,
> and overshoot.
> 
> This patch set replaces the current control system with an explicit PI
> controller and tuning that should be correct for most hardware.  By
> default, it attempts to write at a rate that would retire 1/40th of the
> current excess blocks per second.  An integral term in turn works to
> remove steady state errors.
> 
> IMO, this yields benefits in simplicity (removing weighted average
> filtering, etc) and system performance.
> 
> Another small change is a tunable parameter is introduced to allow the
> user to specify a minimum rate at which dirty blocks are retired.
> 
> There is a slight difference from earlier versions of the patch in
> integral handling to prevent excessive negative integral windup.
> 
> Signed-off-by: Michael Lyle <mlyle@lyle.org>

Reviewed-by: Coly Li <colyli@suse.de>

One last comment: could you please use the current code comment style in
bcache, like this,

/*
 * The allocation code needs gc_mark in struct bucket to be correct, but
 * it's not while a gc is in progress. Protected by bucket_lock.
 */

Or like this,

/*
 * For any bio we don't skip we subtract the number of sectors from
 * rescale; when it hits 0 we rescale all the bucket priorities.
 */

The rest looks OK to me. Thanks for your effort to implement a simpler
PI controller.

Coly Li

> ---
>  drivers/md/bcache/bcache.h    |  9 ++---
>  drivers/md/bcache/sysfs.c     | 18 +++++----
>  drivers/md/bcache/writeback.c | 89 ++++++++++++++++++++++++-------------------
>  3 files changed, 64 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 2ed9bd231d84..eb83be693d60 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -265,9 +265,6 @@ struct bcache_device {
>  	atomic_t		*stripe_sectors_dirty;
>  	unsigned long		*full_dirty_stripes;
>  
> -	unsigned long		sectors_dirty_last;
> -	long			sectors_dirty_derivative;
> -
>  	struct bio_set		*bio_split;
>  
>  	unsigned		data_csum:1;
> @@ -362,12 +359,14 @@ struct cached_dev {
>  
>  	uint64_t		writeback_rate_target;
>  	int64_t			writeback_rate_proportional;
> -	int64_t			writeback_rate_derivative;
> +	int64_t			writeback_rate_integral;
> +	int64_t			writeback_rate_integral_scaled;
>  	int64_t			writeback_rate_change;
>  
>  	unsigned		writeback_rate_update_seconds;
> -	unsigned		writeback_rate_d_term;
> +	unsigned		writeback_rate_i_term_inverse;
>  	unsigned		writeback_rate_p_term_inverse;
> +	unsigned		writeback_rate_minimum;
>  };
>  
>  enum alloc_reserve {
> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
> index 104c57cd666c..eb493814759c 100644
> --- a/drivers/md/bcache/sysfs.c
> +++ b/drivers/md/bcache/sysfs.c
> @@ -81,8 +81,9 @@ rw_attribute(writeback_delay);
>  rw_attribute(writeback_rate);
>  
>  rw_attribute(writeback_rate_update_seconds);
> -rw_attribute(writeback_rate_d_term);
> +rw_attribute(writeback_rate_i_term_inverse);
>  rw_attribute(writeback_rate_p_term_inverse);
> +rw_attribute(writeback_rate_minimum);
>  read_attribute(writeback_rate_debug);
>  
>  read_attribute(stripe_size);
> @@ -130,15 +131,16 @@ SHOW(__bch_cached_dev)
>  	sysfs_hprint(writeback_rate,	dc->writeback_rate.rate << 9);
>  
>  	var_print(writeback_rate_update_seconds);
> -	var_print(writeback_rate_d_term);
> +	var_print(writeback_rate_i_term_inverse);
>  	var_print(writeback_rate_p_term_inverse);
> +	var_print(writeback_rate_minimum);
>  
>  	if (attr == &sysfs_writeback_rate_debug) {
>  		char rate[20];
>  		char dirty[20];
>  		char target[20];
>  		char proportional[20];
> -		char derivative[20];
> +		char integral[20];
>  		char change[20];
>  		s64 next_io;
>  
> @@ -146,7 +148,7 @@ SHOW(__bch_cached_dev)
>  		bch_hprint(dirty,	bcache_dev_sectors_dirty(&dc->disk) << 9);
>  		bch_hprint(target,	dc->writeback_rate_target << 9);
>  		bch_hprint(proportional,dc->writeback_rate_proportional << 9);
> -		bch_hprint(derivative,	dc->writeback_rate_derivative << 9);
> +		bch_hprint(integral,	dc->writeback_rate_integral_scaled << 9);
>  		bch_hprint(change,	dc->writeback_rate_change << 9);
>  
>  		next_io = div64_s64(dc->writeback_rate.next - local_clock(),
> @@ -157,11 +159,11 @@ SHOW(__bch_cached_dev)
>  			       "dirty:\t\t%s\n"
>  			       "target:\t\t%s\n"
>  			       "proportional:\t%s\n"
> -			       "derivative:\t%s\n"
> +			       "integral:\t%s\n"
>  			       "change:\t\t%s/sec\n"
>  			       "next io:\t%llims\n",
>  			       rate, dirty, target, proportional,
> -			       derivative, change, next_io);
> +			       integral, change, next_io);
>  	}
>  
>  	sysfs_hprint(dirty_data,
> @@ -213,7 +215,7 @@ STORE(__cached_dev)
>  			    dc->writeback_rate.rate, 1, INT_MAX);
>  
>  	d_strtoul_nonzero(writeback_rate_update_seconds);
> -	d_strtoul(writeback_rate_d_term);
> +	d_strtoul(writeback_rate_i_term_inverse);
>  	d_strtoul_nonzero(writeback_rate_p_term_inverse);
>  
>  	d_strtoi_h(sequential_cutoff);
> @@ -319,7 +321,7 @@ static struct attribute *bch_cached_dev_files[] = {
>  	&sysfs_writeback_percent,
>  	&sysfs_writeback_rate,
>  	&sysfs_writeback_rate_update_seconds,
> -	&sysfs_writeback_rate_d_term,
> +	&sysfs_writeback_rate_i_term_inverse,
>  	&sysfs_writeback_rate_p_term_inverse,
>  	&sysfs_writeback_rate_debug,
>  	&sysfs_dirty_data,
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index eea49bf36401..5a8f5655151b 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -25,48 +25,60 @@ static void __update_writeback_rate(struct cached_dev *dc)
>  				bcache_flash_devs_sectors_dirty(c);
>  	uint64_t cache_dirty_target =
>  		div_u64(cache_sectors * dc->writeback_percent, 100);
> -
>  	int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
>  				   c->cached_dev_sectors);
>  
> -	/* PD controller */
> -
> +	/* PI controller:
> +	 * Figures out the amount that should be written per second.
> +	 *
> +	 * First, the error (number of sectors that are dirty beyond our
> +	 * target) is calculated.  The error is accumulated (numerically
> +	 * integrated).
> +	 *
> +	 * Then, the proportional value and integral value are scaled
> +	 * based on configured values.  These are stored as inverses to
> +	 * avoid fixed point math and to make configuration easy-- e.g.
> +	 * the default value of 40 for writeback_rate_p_term_inverse
> +	 * attempts to write at a rate that would retire all the dirty
> +	 * blocks in 40 seconds.
> +	 *
> +	 * The writeback_rate_i_inverse value of 10000 means that 1/10000th
> +	 * of the error is accumulated in the integral term per second.
> +	 * This acts as a slow, long-term average that is not subject to
> +	 * variations in usage like the p term.
> +	 */
>  	int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
> -	int64_t derivative = dirty - dc->disk.sectors_dirty_last;
> -	int64_t proportional = dirty - target;
> -	int64_t change;
> -
> -	dc->disk.sectors_dirty_last = dirty;
> -
> -	/* Scale to sectors per second */
> -
> -	proportional *= dc->writeback_rate_update_seconds;
> -	proportional = div_s64(proportional, dc->writeback_rate_p_term_inverse);
> -
> -	derivative = div_s64(derivative, dc->writeback_rate_update_seconds);
> -
> -	derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
> -			      (dc->writeback_rate_d_term /
> -			       dc->writeback_rate_update_seconds) ?: 1, 0);
> -
> -	derivative *= dc->writeback_rate_d_term;
> -	derivative = div_s64(derivative, dc->writeback_rate_p_term_inverse);
> -
> -	change = proportional + derivative;
> +	int64_t error = dirty - target;
> +	int64_t proportional_scaled =
> +		div_s64(error, dc->writeback_rate_p_term_inverse);
> +	int64_t integral_scaled, new_rate;
> +
> +	if ((error < 0 && dc->writeback_rate_integral > 0) ||
> +	    (error > 0 && time_before64(local_clock(),
> +			 dc->writeback_rate.next + NSEC_PER_MSEC))) {
> +		/* Only decrease the integral term if it's more than
> +		 * zero.  Only increase the integral term if the device
> +		 * is keeping up.  (Don't wind up the integral
> +		 * ineffectively in either case).
> +		 *
> +		 * It's necessary to scale this by
> +		 * writeback_rate_update_seconds to keep the integral
> +		 * term dimensioned properly.
> +		 */
> +		dc->writeback_rate_integral += error *
> +			dc->writeback_rate_update_seconds;
> +	}
>  
> -	/* Don't increase writeback rate if the device isn't keeping up */
> -	if (change > 0 &&
> -	    time_after64(local_clock(),
> -			 dc->writeback_rate.next + NSEC_PER_MSEC))
> -		change = 0;
> +	integral_scaled = div_s64(dc->writeback_rate_integral,
> +			dc->writeback_rate_i_term_inverse);
>  
> -	dc->writeback_rate.rate =
> -		clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
> -			1, NSEC_PER_MSEC);
> +	new_rate = clamp_t(int64_t, (proportional_scaled + integral_scaled),
> +			dc->writeback_rate_minimum, NSEC_PER_MSEC);
>  
> -	dc->writeback_rate_proportional = proportional;
> -	dc->writeback_rate_derivative = derivative;
> -	dc->writeback_rate_change = change;
> +	dc->writeback_rate_proportional = proportional_scaled;
> +	dc->writeback_rate_integral_scaled = integral_scaled;
> +	dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
> +	dc->writeback_rate.rate = new_rate;
>  	dc->writeback_rate_target = target;
>  }
>  
> @@ -498,8 +510,6 @@ void bch_sectors_dirty_init(struct bcache_device *d)
>  
>  	bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
>  			   sectors_dirty_init_fn, 0);
> -
> -	d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
>  }
>  
>  void bch_cached_dev_writeback_init(struct cached_dev *dc)
> @@ -513,10 +523,11 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  	dc->writeback_percent		= 10;
>  	dc->writeback_delay		= 30;
>  	dc->writeback_rate.rate		= 1024;
> +	dc->writeback_rate_minimum	= 1;
>  
>  	dc->writeback_rate_update_seconds = 5;
> -	dc->writeback_rate_d_term	= 30;
> -	dc->writeback_rate_p_term_inverse = 6000;
> +	dc->writeback_rate_p_term_inverse = 40;
> +	dc->writeback_rate_i_term_inverse = 10000;
>  
>  	INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
>  }
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/5] bcache: don't write back data if reading it failed
  2017-09-27  8:28   ` Michael Lyle
@ 2017-09-27  8:40     ` Coly Li
  0 siblings, 0 replies; 11+ messages in thread
From: Coly Li @ 2017-09-27  8:40 UTC (permalink / raw)
  To: Michael Lyle; +Cc: linux-bcache

On 2017/9/27 下午4:28, Michael Lyle wrote:
> SET_KEY_DIRTY sets it to false when bi_status is not 0.
> 
> We do the write if KEY_DIRTY is true, else it is skipped.

I see your point, yes you are right.

Reviewed-by: Coly Li <colyli@suse.de>

Thanks!


> On Wed, Sep 27, 2017 at 1:22 AM, Coly Li <i@coly.li> wrote:
>> On 2017/9/27 上午3:24, Michael Lyle wrote:
>>> If an IO operation fails, and we didn't successfully read data from the
>>> cache, don't writeback invalid/partial data to the backing disk.
>>>
>>> Signed-off-by: Michael Lyle <mlyle@lyle.org>
>>> ---
>>>  drivers/md/bcache/writeback.c | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>>> index e663ca082183..eea49bf36401 100644
>>> --- a/drivers/md/bcache/writeback.c
>>> +++ b/drivers/md/bcache/writeback.c
>>> @@ -179,13 +179,20 @@ static void write_dirty(struct closure *cl)
>>>       struct dirty_io *io = container_of(cl, struct dirty_io, cl);
>>>       struct keybuf_key *w = io->bio.bi_private;
>>>
>>> -     dirty_init(w);
>>> -     bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
>>> -     io->bio.bi_iter.bi_sector = KEY_START(&w->key);
>>> -     bio_set_dev(&io->bio, io->dc->bdev);
>>> -     io->bio.bi_end_io       = dirty_endio;
>>> +     /* IO errors are signalled using the dirty bit on the key.
>>> +      * If we failed to read, we should not attempt to write to the
>>> +      * backing device.  Instead, immediately go to write_dirty_finish
>>> +      * to clean up.
>>> +      */
>>> +     if (KEY_DIRTY(&w->key)) {
>> Maybe it should be "if (!KEY_DIRTY(&w->key))" ? In dirty_endio(),
>> SET_KEY_DIRTY() is called only when bio->bi_status is not 0.
>>
>>
>>> +             dirty_init(w);
>>> +             bio_set_op_attrs(&io->bio, REQ_OP_WRITE, 0);
>>> +             io->bio.bi_iter.bi_sector = KEY_START(&w->key);
>>> +             bio_set_dev(&io->bio, io->dc->bdev);
>>> +             io->bio.bi_end_io       = dirty_endio;
>>>
>>> -     closure_bio_submit(&io->bio, cl);
>>> +             closure_bio_submit(&io->bio, cl);
>>> +     }
>>>
>>>       continue_at(cl, write_dirty_finish, io->dc->writeback_write_wq);
>>>  }
>>>
>>
>>
>> --
>> Coly Li


-- 
Coly Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/5] bcache: implement PI controller for writeback rate
  2017-09-27  8:35   ` Coly Li
@ 2017-09-27  8:40     ` Michael Lyle
  0 siblings, 0 replies; 11+ messages in thread
From: Michael Lyle @ 2017-09-27  8:40 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache

Coly--

Excellent, thank you for all your help.  I will send out another
version of these patches after a little more time for review.  I have
made the change you've suggested.

Mike

On Wed, Sep 27, 2017 at 1:35 AM, Coly Li <i@coly.li> wrote:
> On 2017/9/27 上午3:24, Michael Lyle wrote:
>> bcache uses a control system to attempt to keep the amount of dirty data
>> in cache at a user-configured level, while not responding excessively to
>> transients and variations in write rate.  Previously, the system was a
>> PD controller; but the output from it was integrated, turning the
>> Proportional term into an Integral term, and turning the Derivative term
>> into a crude Proportional term.  Performance of the controller has been
>> uneven in production, and it has tended to respond slowly, oscillate,
>> and overshoot.
>>
>> This patch set replaces the current control system with an explicit PI
>> controller and tuning that should be correct for most hardware.  By
>> default, it attempts to write at a rate that would retire 1/40th of the
>> current excess blocks per second.  An integral term in turn works to
>> remove steady state errors.
>>
>> IMO, this yields benefits in simplicity (removing weighted average
>> filtering, etc) and system performance.
>>
>> Another small change is a tunable parameter is introduced to allow the
>> user to specify a minimum rate at which dirty blocks are retired.
>>
>> There is a slight difference from earlier versions of the patch in
>> integral handling to prevent excessive negative integral windup.
>>
>> Signed-off-by: Michael Lyle <mlyle@lyle.org>
>
> Reviewed-by: Coly Li <colyli@suse.de>
>
> Last comment, could you please to use current code comments style in
> bcache, like this,
>
> /*
>  * The allocation code needs gc_mark in struct bucket to be correct, but
>  * it's not while a gc is in progress. Protected by bucket_lock.
>  */
>
> Or like this,
>
> /*
>  * For any bio we don't skip we subtract the number of sectors from
>  * rescale; when it hits 0 we rescale all the bucket priorities.
>  */
>
> Rested part are OK to me. Thanks for your effort to implement a simpler
> PI controller.
>
> Coly Li
>
>> ---
>>  drivers/md/bcache/bcache.h    |  9 ++---
>>  drivers/md/bcache/sysfs.c     | 18 +++++----
>>  drivers/md/bcache/writeback.c | 89 ++++++++++++++++++++++++-------------------
>>  3 files changed, 64 insertions(+), 52 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 2ed9bd231d84..eb83be693d60 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -265,9 +265,6 @@ struct bcache_device {
>>       atomic_t                *stripe_sectors_dirty;
>>       unsigned long           *full_dirty_stripes;
>>
>> -     unsigned long           sectors_dirty_last;
>> -     long                    sectors_dirty_derivative;
>> -
>>       struct bio_set          *bio_split;
>>
>>       unsigned                data_csum:1;
>> @@ -362,12 +359,14 @@ struct cached_dev {
>>
>>       uint64_t                writeback_rate_target;
>>       int64_t                 writeback_rate_proportional;
>> -     int64_t                 writeback_rate_derivative;
>> +     int64_t                 writeback_rate_integral;
>> +     int64_t                 writeback_rate_integral_scaled;
>>       int64_t                 writeback_rate_change;
>>
>>       unsigned                writeback_rate_update_seconds;
>> -     unsigned                writeback_rate_d_term;
>> +     unsigned                writeback_rate_i_term_inverse;
>>       unsigned                writeback_rate_p_term_inverse;
>> +     unsigned                writeback_rate_minimum;
>>  };
>>
>>  enum alloc_reserve {
>> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>> index 104c57cd666c..eb493814759c 100644
>> --- a/drivers/md/bcache/sysfs.c
>> +++ b/drivers/md/bcache/sysfs.c
>> @@ -81,8 +81,9 @@ rw_attribute(writeback_delay);
>>  rw_attribute(writeback_rate);
>>
>>  rw_attribute(writeback_rate_update_seconds);
>> -rw_attribute(writeback_rate_d_term);
>> +rw_attribute(writeback_rate_i_term_inverse);
>>  rw_attribute(writeback_rate_p_term_inverse);
>> +rw_attribute(writeback_rate_minimum);
>>  read_attribute(writeback_rate_debug);
>>
>>  read_attribute(stripe_size);
>> @@ -130,15 +131,16 @@ SHOW(__bch_cached_dev)
>>       sysfs_hprint(writeback_rate,    dc->writeback_rate.rate << 9);
>>
>>       var_print(writeback_rate_update_seconds);
>> -     var_print(writeback_rate_d_term);
>> +     var_print(writeback_rate_i_term_inverse);
>>       var_print(writeback_rate_p_term_inverse);
>> +     var_print(writeback_rate_minimum);
>>
>>       if (attr == &sysfs_writeback_rate_debug) {
>>               char rate[20];
>>               char dirty[20];
>>               char target[20];
>>               char proportional[20];
>> -             char derivative[20];
>> +             char integral[20];
>>               char change[20];
>>               s64 next_io;
>>
>> @@ -146,7 +148,7 @@ SHOW(__bch_cached_dev)
>>               bch_hprint(dirty,       bcache_dev_sectors_dirty(&dc->disk) << 9);
>>               bch_hprint(target,      dc->writeback_rate_target << 9);
>>               bch_hprint(proportional,dc->writeback_rate_proportional << 9);
>> -             bch_hprint(derivative,  dc->writeback_rate_derivative << 9);
>> +             bch_hprint(integral,    dc->writeback_rate_integral_scaled << 9);
>>               bch_hprint(change,      dc->writeback_rate_change << 9);
>>
>>               next_io = div64_s64(dc->writeback_rate.next - local_clock(),
>> @@ -157,11 +159,11 @@ SHOW(__bch_cached_dev)
>>                              "dirty:\t\t%s\n"
>>                              "target:\t\t%s\n"
>>                              "proportional:\t%s\n"
>> -                            "derivative:\t%s\n"
>> +                            "integral:\t%s\n"
>>                              "change:\t\t%s/sec\n"
>>                              "next io:\t%llims\n",
>>                              rate, dirty, target, proportional,
>> -                            derivative, change, next_io);
>> +                            integral, change, next_io);
>>       }
>>
>>       sysfs_hprint(dirty_data,
>> @@ -213,7 +215,7 @@ STORE(__cached_dev)
>>                           dc->writeback_rate.rate, 1, INT_MAX);
>>
>>       d_strtoul_nonzero(writeback_rate_update_seconds);
>> -     d_strtoul(writeback_rate_d_term);
>> +     d_strtoul(writeback_rate_i_term_inverse);
>>       d_strtoul_nonzero(writeback_rate_p_term_inverse);
>>
>>       d_strtoi_h(sequential_cutoff);
>> @@ -319,7 +321,7 @@ static struct attribute *bch_cached_dev_files[] = {
>>       &sysfs_writeback_percent,
>>       &sysfs_writeback_rate,
>>       &sysfs_writeback_rate_update_seconds,
>> -     &sysfs_writeback_rate_d_term,
>> +     &sysfs_writeback_rate_i_term_inverse,
>>       &sysfs_writeback_rate_p_term_inverse,
>>       &sysfs_writeback_rate_debug,
>>       &sysfs_dirty_data,
>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index eea49bf36401..5a8f5655151b 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -25,48 +25,60 @@ static void __update_writeback_rate(struct cached_dev *dc)
>>                               bcache_flash_devs_sectors_dirty(c);
>>       uint64_t cache_dirty_target =
>>               div_u64(cache_sectors * dc->writeback_percent, 100);
>> -
>>       int64_t target = div64_u64(cache_dirty_target * bdev_sectors(dc->bdev),
>>                                  c->cached_dev_sectors);
>>
>> -     /* PD controller */
>> -
>> +     /* PI controller:
>> +      * Figures out the amount that should be written per second.
>> +      *
>> +      * First, the error (number of sectors that are dirty beyond our
>> +      * target) is calculated.  The error is accumulated (numerically
>> +      * integrated).
>> +      *
>> +      * Then, the proportional value and integral value are scaled
>> +      * based on configured values.  These are stored as inverses to
>> +      * avoid fixed point math and to make configuration easy-- e.g.
>> +      * the default value of 40 for writeback_rate_p_term_inverse
>> +      * attempts to write at a rate that would retire all the dirty
>> +      * blocks in 40 seconds.
>> +      *
>> +      * The writeback_rate_i_inverse value of 10000 means that 1/10000th
>> +      * of the error is accumulated in the integral term per second.
>> +      * This acts as a slow, long-term average that is not subject to
>> +      * variations in usage like the p term.
>> +      */
>>       int64_t dirty = bcache_dev_sectors_dirty(&dc->disk);
>> -     int64_t derivative = dirty - dc->disk.sectors_dirty_last;
>> -     int64_t proportional = dirty - target;
>> -     int64_t change;
>> -
>> -     dc->disk.sectors_dirty_last = dirty;
>> -
>> -     /* Scale to sectors per second */
>> -
>> -     proportional *= dc->writeback_rate_update_seconds;
>> -     proportional = div_s64(proportional, dc->writeback_rate_p_term_inverse);
>> -
>> -     derivative = div_s64(derivative, dc->writeback_rate_update_seconds);
>> -
>> -     derivative = ewma_add(dc->disk.sectors_dirty_derivative, derivative,
>> -                           (dc->writeback_rate_d_term /
>> -                            dc->writeback_rate_update_seconds) ?: 1, 0);
>> -
>> -     derivative *= dc->writeback_rate_d_term;
>> -     derivative = div_s64(derivative, dc->writeback_rate_p_term_inverse);
>> -
>> -     change = proportional + derivative;
>> +     int64_t error = dirty - target;
>> +     int64_t proportional_scaled =
>> +             div_s64(error, dc->writeback_rate_p_term_inverse);
>> +     int64_t integral_scaled, new_rate;
>> +
>> +     if ((error < 0 && dc->writeback_rate_integral > 0) ||
>> +         (error > 0 && time_before64(local_clock(),
>> +                      dc->writeback_rate.next + NSEC_PER_MSEC))) {
>> +             /* Only decrease the integral term if it's more than
>> +              * zero.  Only increase the integral term if the device
>> +              * is keeping up.  (Don't wind up the integral
>> +              * ineffectively in either case).
>> +              *
>> +              * It's necessary to scale this by
>> +              * writeback_rate_update_seconds to keep the integral
>> +              * term dimensioned properly.
>> +              */
>> +             dc->writeback_rate_integral += error *
>> +                     dc->writeback_rate_update_seconds;
>> +     }
>>
>> -     /* Don't increase writeback rate if the device isn't keeping up */
>> -     if (change > 0 &&
>> -         time_after64(local_clock(),
>> -                      dc->writeback_rate.next + NSEC_PER_MSEC))
>> -             change = 0;
>> +     integral_scaled = div_s64(dc->writeback_rate_integral,
>> +                     dc->writeback_rate_i_term_inverse);
>>
>> -     dc->writeback_rate.rate =
>> -             clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
>> -                     1, NSEC_PER_MSEC);
>> +     new_rate = clamp_t(int64_t, (proportional_scaled + integral_scaled),
>> +                     dc->writeback_rate_minimum, NSEC_PER_MSEC);
>>
>> -     dc->writeback_rate_proportional = proportional;
>> -     dc->writeback_rate_derivative = derivative;
>> -     dc->writeback_rate_change = change;
>> +     dc->writeback_rate_proportional = proportional_scaled;
>> +     dc->writeback_rate_integral_scaled = integral_scaled;
>> +     dc->writeback_rate_change = new_rate - dc->writeback_rate.rate;
>> +     dc->writeback_rate.rate = new_rate;
>>       dc->writeback_rate_target = target;
>>  }
>>
>> @@ -498,8 +510,6 @@ void bch_sectors_dirty_init(struct bcache_device *d)
>>
>>       bch_btree_map_keys(&op.op, d->c, &KEY(op.inode, 0, 0),
>>                          sectors_dirty_init_fn, 0);
>> -
>> -     d->sectors_dirty_last = bcache_dev_sectors_dirty(d);
>>  }
>>
>>  void bch_cached_dev_writeback_init(struct cached_dev *dc)
>> @@ -513,10 +523,11 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>>       dc->writeback_percent           = 10;
>>       dc->writeback_delay             = 30;
>>       dc->writeback_rate.rate         = 1024;
>> +     dc->writeback_rate_minimum      = 1;
>>
>>       dc->writeback_rate_update_seconds = 5;
>> -     dc->writeback_rate_d_term       = 30;
>> -     dc->writeback_rate_p_term_inverse = 6000;
>> +     dc->writeback_rate_p_term_inverse = 40;
>> +     dc->writeback_rate_i_term_inverse = 10000;
>>
>>       INIT_DELAYED_WORK(&dc->writeback_rate_update, update_writeback_rate);
>>  }
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/5] bcache: smooth writeback rate control
  2017-09-26 19:24 ` [PATCH 3/5] bcache: smooth writeback rate control Michael Lyle
@ 2017-09-27  8:44   ` Coly Li
  0 siblings, 0 replies; 11+ messages in thread
From: Coly Li @ 2017-09-27  8:44 UTC (permalink / raw)
  To: Michael Lyle, linux-bcache

On 2017/9/27 上午3:24, Michael Lyle wrote:
> This works in conjunction with the new PI controller.  Currently, in
> real-world workloads, the rate controller attempts to write back 1
> sector per second.  In practice, these minimum-rate writebacks are
> between 4k and 60k in test scenarios, since bcache aggregates and
> attempts to do contiguous writes and because filesystems on top of
> bcachefs typically write 4k or more.
> 
> Previously, bcache used to guarantee to write at least once per second.
> This means that the actual writeback rate would exceed the configured
> amount by a factor of 8-120 or more.
> 
> This patch adjusts to be willing to sleep up to 2.5 seconds, and to
> target writing 4k/second.  On the smallest writes, it will sleep 1
> second like before, but many times it will sleep longer and load the
> backing device less.  This keeps the loading on the cache and backing
> device related to writeback more consistent when writing back at low
> rates.
> 
> Signed-off-by: Michael Lyle <mlyle@lyle.org>

Reviewed-by: Coly Li <colyli@suse.de>

Added to for-test. Thanks.

> ---
>  drivers/md/bcache/util.c      | 10 ++++++++--
>  drivers/md/bcache/writeback.c |  2 +-
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index 176d3c2ef5f5..4dbe37e82877 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -232,8 +232,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
>  
>  	d->next += div_u64(done * NSEC_PER_SEC, d->rate);
>  
> -	if (time_before64(now + NSEC_PER_SEC, d->next))
> -		d->next = now + NSEC_PER_SEC;
> +	/* Bound the time.  Don't let us fall further than 2 seconds behind
> +	 * (this prevents unnecessary backlog that would make it impossible
> +	 * to catch up).  If we're ahead of the desired writeback rate,
> +	 * don't let us sleep more than 2.5 seconds (so we can notice/respond
> +	 * if the control system tells us to speed up!).
> +	 */
> +	if (time_before64(now + NSEC_PER_SEC * 5 / 2, d->next))
> +		d->next = now + NSEC_PER_SEC * 5 / 2;
>  
>  	if (time_after64(now - NSEC_PER_SEC * 2, d->next))
>  		d->next = now - NSEC_PER_SEC * 2;
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index 5a8f5655151b..0b7c89813635 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -523,7 +523,7 @@ void bch_cached_dev_writeback_init(struct cached_dev *dc)
>  	dc->writeback_percent		= 10;
>  	dc->writeback_delay		= 30;
>  	dc->writeback_rate.rate		= 1024;
> -	dc->writeback_rate_minimum	= 1;
> +	dc->writeback_rate_minimum	= 8;
>  
>  	dc->writeback_rate_update_seconds = 5;
>  	dc->writeback_rate_p_term_inverse = 40;
> 


-- 
Coly Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-09-27  8:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-26 19:24 [PATCH 1/5] bcache: don't write back data if reading it failed Michael Lyle
2017-09-26 19:24 ` [PATCH 2/5] bcache: implement PI controller for writeback rate Michael Lyle
2017-09-27  8:35   ` Coly Li
2017-09-27  8:40     ` Michael Lyle
2017-09-26 19:24 ` [PATCH 3/5] bcache: smooth writeback rate control Michael Lyle
2017-09-27  8:44   ` Coly Li
2017-09-26 19:24 ` [PATCH 4/5] bcache: writeback: collapse contiguous IO better Michael Lyle
2017-09-26 19:24 ` [PATCH 5/5] bcache: writeback: properly order backing device IO Michael Lyle
2017-09-27  8:22 ` [PATCH 1/5] bcache: don't write back data if reading it failed Coly Li
2017-09-27  8:28   ` Michael Lyle
2017-09-27  8:40     ` Coly Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.