* [PATCH V2 00/13] block-throttle: proportional throttle
@ 2016-02-22 22:01 Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 01/13] block: estimate disk performance Shaohua Li
                   ` (13 more replies)
  0 siblings, 14 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

Hi,

Currently we have 2 iocontrollers. blk-throttling is bandwidth/iops based; CFQ
is weight based. It would be great to have a unified iocontroller for the two.
Also, blk-mq doesn't support ioschedulers, leaving blk-throttling as the only
option for blk-mq. It's time to have a scalable iocontroller supporting both
bandwidth/weight based control and working with blk-mq.

blk-throttling is a good candidate; it works for both blk-mq and the legacy
queue. It has a global lock, which is worrying for scalability, but it's not
terrible in practice. In my test, NVMe IOPS can reach 1M/s with all CPUs
issuing IO. Enabling blk-throttle costs around 2~3% IOPS and 10% CPU
utilization. I'd expect this isn't a big problem for today's workloads. This
patchset therefore tries to build a unified iocontroller on top of
blk-throttling.

The idea is pretty simple. If we know the disk's total capability, we can
divide that capability among cgroups according to their weights. blk-throttling
can account IO cost and use the calculated capability to throttle each cgroup.
The problem is how to estimate the disk's total capability and the IO cost. We
can use bandwidth (capability is bandwidth, IO cost is request size), IOPS
(capability is IOPS, IO cost is 1), or anything better. There has been a lot of
discussion about this topic, but we can't find the best option right now.
Bandwidth/IOPS isn't optimal, but it's the only option at hand and should work
generally. This patch set tries to create a framework and make bandwidth/IOPS
based proportional throttling work. Later we can add other options for
capability/cost measurement. I'll focus on the bandwidth based approach here.
The first 9 patches demonstrate the ideas and should be pretty straightforward.
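
As a rough illustration of how capability is divided by weight (a standalone
userspace sketch with made-up numbers, not the kernel code):

/* Sketch: split an estimated disk capability among cgroups by weight. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t disk_bw = 200;			/* estimated capability, MB/s */
	unsigned int weight[] = { 20, 80 };	/* per-cgroup weights */
	unsigned int total_weight = 100;
	int i;

	for (i = 0; i < 2; i++)
		printf("cg%d limit = %llu MB/s\n", i + 1,
		       (unsigned long long)(disk_bw * weight[i] / total_weight));
	return 0;
}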

The problem is that we don't know the max bandwidth a disk can provide for a
specific workload; it depends on the device and the IO pattern. The bandwidth
estimated by patch 1 will never be accurate unless the disk is already running
at max bandwidth. To solve this issue, we always overestimate the bandwidth.
With an overestimated bandwidth, the workload dispatches more IO, the estimated
bandwidth becomes higher, and the workload dispatches even more IO. The loop
runs until we enter a stable state, in which the disk reaches max bandwidth.
This 'slightly adjust and run into a stable state' is the core algorithm the
patch series uses. We also use it to detect inactive cgroups.

Overestimating bandwidth can introduce a fairness issue, because it's possible
some cgroups can use the extra bandwidth while others can't; a cgroup using the
extra bandwidth gets more share than expected. On the other hand, a smaller
extra bandwidth means the disk reaches max bandwidth more slowly. We assign 1/8
extra bandwidth in the patches.
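
A minimal sketch of the overestimation feedback loop (illustrative only; the
1/8 figure is the one the patches use, everything else is made up):

/* Keep targeting 1/8 above the last measurement until the measured
 * bandwidth stops growing, i.e. the disk is at its real maximum. */
#include <stdio.h>
#include <stdint.h>

static uint64_t measure_bw(uint64_t allowed, uint64_t disk_max)
{
	/* assume the workload uses whatever it is allowed, capped by the disk */
	return allowed < disk_max ? allowed : disk_max;
}

int main(void)
{
	uint64_t disk_max = 200, est = 100;	/* MB/s, start below the max */
	int round;

	for (round = 0; round < 16; round++) {
		uint64_t target = est + est / 8;	/* 1/8 extra headroom */
		uint64_t got = measure_bw(target, disk_max);

		printf("round %d: target %llu, measured %llu\n", round,
		       (unsigned long long)target, (unsigned long long)got);
		if (got == est)				/* stable state */
			break;
		est = got;
	}
	return 0;
}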

The tricky part is that a cgroup might not use its share fully. The cgroup can
run into this because it is not willing to (for example, a fio job with the
'--rate' option) or not able to (for example, a limited IO depth caps the
bandwidth) dispatch enough IO. Let's look at some examples. Assume cg1 has 20%
share and cg2 has 80% share.

1. disk bandwidth 200M/s. neither cgroup has a limit
cg1 bps = 40M/s, cg2 bps = 160M/s

2. disk bandwidth 200M/s. cg2 has rate limit 10M/s
cg1 bps = 190M/s, cg2 bps = 10M/s
In this case, cg1 isn't held to the strict 20:80 ratio (which would give it
only 10/4 = 2.5M/s); it takes the spare bandwidth instead.

3. disk bandwidth 200M/s. cg2 has rate limit 10M/s, cg1 has limit 100M/s
cg1 bps = 100M/s, cg2 bps = 10M/s

We should detect cgroups that have a big share but can't use it. Otherwise, we
can't drive the disk to max bandwidth. To solve the issue, if a cgroup doesn't
use its share, we adjust its weight/share. For example, cg2 in example 2 will
get a 5% share even though the user set its share to 80%. The adjustment is
slight to avoid spikes, but we reach a stable state eventually.
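
The shrinking itself can be pictured with a small standalone sketch (made-up
numbers and an arbitrary per-round step, not the kernel code); example 2 above
converges to roughly the ~5% share mentioned:

/* Shrink cg2's acting weight a little at a time while it cannot use the
 * bandwidth its share entitles it to. */
#include <stdio.h>

int main(void)
{
	unsigned int w_cg1 = 20, w_cg2 = 80;		/* user-configured weights */
	unsigned int disk_bw = 200, cg2_used = 10;	/* cg2 rate-limited to 10M/s */
	unsigned int cg2_share;

	for (;;) {
		cg2_share = disk_bw * w_cg2 / (w_cg1 + w_cg2);
		if (cg2_share <= cg2_used || w_cg2 <= 1)
			break;
		w_cg2 -= (w_cg2 + 7) / 8;	/* slight adjustment per round */
	}
	printf("cg2 acting weight %u (~%u%% of total), share %u M/s\n",
	       w_cg2, 100 * w_cg2 / (w_cg1 + w_cg2), cg2_share);
	return 0;
}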

It's possible the adjustment is wrong. If a cgroup with a shrunk share hits its
bandwidth limit, we restore its share. There is another tricky adjustment case:

4. disk bandwidth 100M/s. each cgroup's workload can dispatch 80M/s.
cg1 bps = 20M/s, cg2 bps = 80M/s

This is the ideal state. cg1 uses its share fully. But since we assign 1/8
extra bandwidth, cg2 doesn't use its share fully. The adjustment algorithm will
shrink cg2's share, say by 5M/s; cg1 then gets 25M/s. cg2 can only get 75M/s
because the disk's max bandwidth is 100M/s. If this adjustment continued, both
cg1 and cg2 would end up with 50M/s. To mitigate this issue, if a cgroup's
bandwidth drops after its share is shrunk, we restore its share. This can be
controversial. In the above case, when cg2 gets 75M/s, cg1 might get 40M/s,
because the IO pattern changes and the max bandwidth changes with it. Somebody
might think 40M/75M is better, but this patch series chooses 20M/80M.

We don't bias read/sync IO in the patch set yet. Ideally we should divide a
cgroup's share between read and write IO and give read IO more share. The
problem is that some cgroups might only do reads or only writes; a fixed
read/write ratio would make such cgroups waste their share. This issue can be
fixed if we introduce sub service queues for reads and writes of a cgroup in
the future.

I have tested the patches in different setups. The test setups, scripts and
results are uploaded at:
https://github.com/shligit/iocontroller-test

There are still some tests where we don't get optimal performance yet, and some
calculations must be revised, but I think the code is good enough to
demonstrate the idea and the different issues. Comments and benchmarks are
warmly welcome!

-----------------------------------------------------------------

Shaohua Li (13):
  block: estimate disk performance
  blk-throttle: cleanup io cost related stuff
  blk-throttle: add abstract to index data
  blk-throttle: weight based throttling
  blk-throttling: detect inactive cgroup
  blk-throttle: add per-cgroup data
  blk-throttle: add interface for proporation based throttle
  blk-throttle: add cgroup2 interface
  blk-throttle: add trace for new proporation throttle
  blk-throttle: over estimate bandwidth
  blk-throttle: shrink cgroup share if its target is overestimated
  blk-throttle: restore shrinked cgroup share
  blk-throttle: detect wrong shrink

 block/blk-core.c           |   56 ++
 block/blk-sysfs.c          |   13 +
 block/blk-throttle.c       | 1217 ++++++++++++++++++++++++++++++++++++++------
 include/linux/blk-cgroup.h |   10 +
 include/linux/blkdev.h     |    7 +
 5 files changed, 1158 insertions(+), 145 deletions(-)

-- 
2.6.5


* [PATCH V2 01/13] block: estimate disk performance
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 02/13] blk-throttle: cleanup io cost related stuff Shaohua Li
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

Weight based blk-throttling can use the estimated performance to calculate
cgroup bandwidth/IOPS. That said, we can never get accurate disk performance
for a complex workload; a rough estimation is enough.

One issue is that the workload might not send enough IO to make the disk 100%
busy, so the calculation should compensate for disk utilization. The equations
are:
bps = (bytes * HZ / time) * (time / busy_time) = bytes * HZ / busy_time
iops = (ios * HZ / time) * (time / busy_time) = ios * HZ / busy_time
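
For example (made-up numbers, HZ=1000): if 64MB was transferred during a 500ms
window in which the disk was busy for only 250 jiffies, the estimate is
bps = 64MB * 1000 / 250 = 256MB/s, i.e. roughly what the disk could sustain at
100% utilization rather than the 128MB/s the workload actually achieved.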

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c       | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-sysfs.c      | 13 ++++++++++++
 include/linux/blkdev.h |  7 +++++++
 3 files changed, 76 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ab51685..4244f28 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1919,6 +1919,59 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
 	return 0;
 }
 
+#define UPDATE_TIME (HZ / 2)
+static void blk_update_perf(struct request_queue *q,
+	struct hd_struct *p)
+{
+	unsigned long now = jiffies;
+	unsigned long last = q->bw_timestamp;
+	sector_t read_sect, write_sect, tmp_sect;
+	unsigned long read_ios, write_ios, tmp_ios;
+	unsigned long current_ticks;
+	unsigned int busy_ticks;
+
+	if (time_before(now, last + UPDATE_TIME))
+		return;
+
+	if (cmpxchg(&q->bw_timestamp, last, now) != last)
+		return;
+
+	tmp_sect = part_stat_read(p, sectors[READ]);
+	read_sect = tmp_sect - q->last_sects[READ];
+	q->last_sects[READ] = tmp_sect;
+	tmp_sect = part_stat_read(p, sectors[WRITE]);
+	write_sect = tmp_sect - q->last_sects[WRITE];
+	q->last_sects[WRITE] = tmp_sect;
+
+	tmp_ios = part_stat_read(p, ios[READ]);
+	read_ios = tmp_ios - q->last_ios[READ];
+	q->last_ios[READ] = tmp_ios;
+	tmp_ios = part_stat_read(p, ios[WRITE]);
+	write_ios = tmp_ios - q->last_ios[WRITE];
+	q->last_ios[WRITE] = tmp_ios;
+
+	current_ticks = part_stat_read(p, io_ticks);
+	busy_ticks = current_ticks - q->last_ticks;
+	q->last_ticks = current_ticks;
+
+	/* Don't account for long idle */
+	if (now - last > UPDATE_TIME * 2)
+		return;
+	/* Disk load is too low or driver doesn't account io_ticks */
+	if (busy_ticks == 0)
+		return;
+
+	if (busy_ticks > now - last)
+		busy_ticks = now - last;
+
+	tmp_sect = (read_sect + write_sect) * HZ;
+	sector_div(tmp_sect, busy_ticks);
+	q->disk_bw = tmp_sect;
+
+	tmp_ios = (read_ios + write_ios) * HZ / busy_ticks;
+	q->disk_iops = tmp_ios;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -1991,6 +2044,9 @@ generic_make_request_checks(struct bio *bio)
 	 */
 	create_io_context(GFP_ATOMIC, q->node);
 
+	blk_update_perf(q,
+		part->partno ? &part_to_disk(part)->part0 : part);
+
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e140cc4..b59461c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -348,6 +348,13 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	return ret;
 }
 
+static ssize_t queue_avg_perf_show(struct request_queue *q, char *page)
+{
+	return sprintf(page, "%llu %llu\n",
+		       (unsigned long long)q->disk_bw * 512,
+		       (unsigned long long)q->disk_iops);
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -479,6 +486,11 @@ static struct queue_sysfs_entry queue_poll_entry = {
 	.store = queue_poll_store,
 };
 
+static struct queue_sysfs_entry queue_avg_perf_entry = {
+	.attr = {.name = "average_perf", .mode = S_IRUGO },
+	.show = queue_avg_perf_show,
+};
+
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
 	&queue_ra_entry.attr,
@@ -504,6 +516,7 @@ static struct attribute *default_attrs[] = {
 	&queue_iostats_entry.attr,
 	&queue_random_entry.attr,
 	&queue_poll_entry.attr,
+	&queue_avg_perf_entry.attr,
 	NULL,
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 29189ae..d2d6a7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -466,6 +466,13 @@ struct request_queue {
 	struct bio_set		*bio_split;
 
 	bool			mq_sysfs_init_done;
+
+	unsigned long bw_timestamp;
+	unsigned long last_ticks;
+	sector_t last_sects[2];
+	unsigned long last_ios[2];
+	sector_t disk_bw;
+	unsigned long disk_iops;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
2.6.5


* [PATCH V2 02/13] blk-throttle: cleanup io cost related stuff
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 01/13] block: estimate disk performance Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 03/13] blk-throttle: add abstract to index data Shaohua Li
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

Move the IO cost related data into a separate data structure and the
corresponding code into separate functions. In the future we will seek a more
advanced algorithm to measure IO cost; the separation makes that easier.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 230 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 142 insertions(+), 88 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2149a1d..a4ae0a7 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,19 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
+struct throtl_io_cost {
+	/* bytes per second rate limits */
+	uint64_t bps[2];
+
+	/* IOPS limits */
+	unsigned int iops[2];
+
+	/* Number of bytes disptached in current slice */
+	uint64_t bytes_disp[2];
+	/* Number of bio's dispatched in current slice */
+	unsigned int io_disp[2];
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -119,16 +132,7 @@ struct throtl_grp {
 	/* are there any throtl rules between this group and td? */
 	bool has_rules[2];
 
-	/* bytes per second rate limits */
-	uint64_t bps[2];
-
-	/* IOPS limits */
-	unsigned int iops[2];
-
-	/* Number of bytes disptached in current slice */
-	uint64_t bytes_disp[2];
-	/* Number of bio's dispatched in current slice */
-	unsigned int io_disp[2];
+	struct throtl_io_cost io_cost;
 
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
@@ -323,6 +327,14 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 		    (unsigned long)sq);
 }
 
+static inline void io_cost_init(struct throtl_grp *tg)
+{
+	tg->io_cost.bps[READ] = -1;
+	tg->io_cost.bps[WRITE] = -1;
+	tg->io_cost.iops[READ] = -1;
+	tg->io_cost.iops[WRITE] = -1;
+}
+
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
 	struct throtl_grp *tg;
@@ -340,10 +352,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = -1;
-	tg->bps[WRITE] = -1;
-	tg->iops[READ] = -1;
-	tg->iops[WRITE] = -1;
+
+	io_cost_init(tg);
 
 	return &tg->pd;
 }
@@ -374,6 +384,11 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	tg->td = td;
 }
 
+static inline bool io_cost_has_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->io_cost.bps[rw] != -1 || tg->io_cost.iops[rw] != -1;
+}
+
 /*
  * Set has_rules[] if @tg or any of its parents have limits configured.
  * This doesn't require walking up to the top of the hierarchy as the
@@ -386,7 +401,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+				    io_cost_has_limit(tg, rw);
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -547,11 +562,17 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
 	return false;
 }
 
+static inline void io_cost_start_new_slice(struct throtl_grp *tg,
+		bool rw)
+{
+	tg->io_cost.bytes_disp[rw] = 0;
+	tg->io_cost.io_disp[rw] = 0;
+}
+
 static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 		bool rw, unsigned long start)
 {
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
+	io_cost_start_new_slice(tg, rw);
 
 	/*
 	 * Previous slice has expired. We must have trimmed it after last
@@ -571,8 +592,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
+	io_cost_start_new_slice(tg, rw);
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + throtl_slice;
 	throtl_log(&tg->service_queue,
@@ -606,11 +626,38 @@ static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
 	return 1;
 }
 
+static inline bool io_cost_trim_slice(struct throtl_grp *tg, bool rw,
+	unsigned long nr_slices)
+{
+	unsigned long io_trim;
+	u64 bytes_trim, tmp;
+
+	tmp = tg->io_cost.bps[rw] * throtl_slice * nr_slices;
+	do_div(tmp, HZ);
+	bytes_trim = tmp;
+
+	io_trim = (tg->io_cost.iops[rw] * throtl_slice * nr_slices) / HZ;
+
+	if (!bytes_trim && !io_trim)
+		return false;
+
+	if (tg->io_cost.bytes_disp[rw] >= bytes_trim)
+		tg->io_cost.bytes_disp[rw] -= bytes_trim;
+	else
+		tg->io_cost.bytes_disp[rw] = 0;
+
+	if (tg->io_cost.io_disp[rw] >= io_trim)
+		tg->io_cost.io_disp[rw] -= io_trim;
+	else
+		tg->io_cost.io_disp[rw] = 0;
+
+	return true;
+}
+
 /* Trim the used slices and adjust slice start accordingly */
 static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 {
-	unsigned long nr_slices, time_elapsed, io_trim;
-	u64 bytes_trim, tmp;
+	unsigned long nr_slices, time_elapsed;
 
 	BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
 
@@ -638,34 +685,18 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!nr_slices)
 		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
-	do_div(tmp, HZ);
-	bytes_trim = tmp;
-
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
 
-	if (!bytes_trim && !io_trim)
+	if (!io_cost_trim_slice(tg, rw, nr_slices))
 		return;
-
-	if (tg->bytes_disp[rw] >= bytes_trim)
-		tg->bytes_disp[rw] -= bytes_trim;
-	else
-		tg->bytes_disp[rw] = 0;
-
-	if (tg->io_disp[rw] >= io_trim)
-		tg->io_disp[rw] -= io_trim;
-	else
-		tg->io_disp[rw] = 0;
-
 	tg->slice_start[rw] += nr_slices * throtl_slice;
 
 	throtl_log(&tg->service_queue,
-		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
+		   "[%c] trim slice nr=%lu start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', nr_slices,
 		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
 }
 
-static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
+static bool io_cost_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
@@ -688,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg->io_cost.iops[rw] * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -696,14 +727,15 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	else
 		io_allowed = tmp;
 
-	if (tg->io_disp[rw] + 1 <= io_allowed) {
+	if (tg->io_cost.io_disp[rw] + 1 <= io_allowed) {
 		if (wait)
 			*wait = 0;
 		return true;
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+	jiffy_wait = ((tg->io_cost.io_disp[rw] + 1) * HZ) /
+		tg->io_cost.iops[rw] + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -712,10 +744,10 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	if (wait)
 		*wait = jiffy_wait;
-	return 0;
+	return false;
 }
 
-static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
+static bool io_cost_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
@@ -730,19 +762,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg->io_cost.bps[rw] * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
-	if (tg->bytes_disp[rw] + bio->bi_iter.bi_size <= bytes_allowed) {
+	if (tg->io_cost.bytes_disp[rw] + bio->bi_iter.bi_size <=
+	    bytes_allowed) {
 		if (wait)
 			*wait = 0;
 		return true;
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+	extra_bytes = tg->io_cost.bytes_disp[rw] +
+		bio->bi_iter.bi_size - bytes_allowed;
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg->io_cost.bps[rw]);
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -754,7 +788,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
 	if (wait)
 		*wait = jiffy_wait;
-	return 0;
+	return false;
+}
+
+static bool io_cost_with_in_limit(struct throtl_grp *tg, struct bio *bio,
+		unsigned long *wait)
+{
+	unsigned long bps_wait = 0, iops_wait = 0;
+
+	if (io_cost_with_in_bps_limit(tg, bio, &bps_wait) &&
+	    io_cost_with_in_iops_limit(tg, bio, &iops_wait)) {
+		*wait = 0;
+		return true;
+	}
+	*wait = max(bps_wait, iops_wait);
+	return false;
 }
 
 /*
@@ -765,7 +813,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 			    unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+	unsigned long max_wait = 0;
 
 	/*
  	 * Currently whole state machine of group depends on first bio
@@ -777,7 +825,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+	if (!io_cost_has_limit(tg, rw)) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -795,32 +843,33 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
 	}
 
-	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
-	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
+	if (io_cost_with_in_limit(tg, bio, &max_wait)) {
 		if (wait)
 			*wait = 0;
-		return 1;
+		return true;
 	}
 
-	max_wait = max(bps_wait, iops_wait);
-
 	if (wait)
 		*wait = max_wait;
 
 	if (time_before(tg->slice_end[rw], jiffies + max_wait))
 		throtl_extend_slice(tg, rw, jiffies + max_wait);
 
-	return 0;
+	return false;
 }
 
-static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
+static inline void io_cost_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
 
 	/* Charge the bio to the group */
-	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
-	tg->io_disp[rw]++;
+	tg->io_cost.bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->io_cost.io_disp[rw]++;
+}
 
+static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
+{
+	io_cost_charge_bio(tg, bio);
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
 	 * more than once as a throttled bio will go through blk-throtl the
@@ -1152,8 +1201,8 @@ static void tg_conf_updated(struct throtl_grp *tg)
 
 	throtl_log(&tg->service_queue,
 		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		   tg->bps[READ], tg->bps[WRITE],
-		   tg->iops[READ], tg->iops[WRITE]);
+		   tg->io_cost.bps[READ], tg->io_cost.bps[WRITE],
+		   tg->io_cost.iops[READ], tg->io_cost.iops[WRITE]);
 
 	/*
 	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
@@ -1230,25 +1279,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 static struct cftype throtl_legacy_files[] = {
 	{
 		.name = "throttle.read_bps_device",
-		.private = offsetof(struct throtl_grp, bps[READ]),
+		.private = offsetof(struct throtl_grp, io_cost.bps[READ]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.write_bps_device",
-		.private = offsetof(struct throtl_grp, bps[WRITE]),
+		.private = offsetof(struct throtl_grp, io_cost.bps[WRITE]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.read_iops_device",
-		.private = offsetof(struct throtl_grp, iops[READ]),
+		.private = offsetof(struct throtl_grp, io_cost.iops[READ]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
 	{
 		.name = "throttle.write_iops_device",
-		.private = offsetof(struct throtl_grp, iops[WRITE]),
+		.private = offsetof(struct throtl_grp, io_cost.iops[WRITE]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
@@ -1274,18 +1323,22 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-	    tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+	if (tg->io_cost.bps[READ] == -1 && tg->io_cost.bps[WRITE] == -1 &&
+	    tg->io_cost.iops[READ] == -1 && tg->io_cost.iops[WRITE] == -1)
 		return 0;
 
-	if (tg->bps[READ] != -1)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != -1)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != -1)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != -1)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+	if (tg->io_cost.bps[READ] != -1)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+			tg->io_cost.bps[READ]);
+	if (tg->io_cost.bps[WRITE] != -1)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+			tg->io_cost.bps[WRITE]);
+	if (tg->io_cost.iops[READ] != -1)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u",
+			tg->io_cost.iops[READ]);
+	if (tg->io_cost.iops[WRITE] != -1)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u",
+			tg->io_cost.iops[WRITE]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1314,10 +1367,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ];
-	v[1] = tg->bps[WRITE];
-	v[2] = tg->iops[READ];
-	v[3] = tg->iops[WRITE];
+	v[0] = tg->io_cost.bps[READ];
+	v[1] = tg->io_cost.bps[WRITE];
+	v[2] = tg->io_cost.iops[READ];
+	v[3] = tg->io_cost.iops[WRITE];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1354,10 +1407,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ] = v[0];
-	tg->bps[WRITE] = v[1];
-	tg->iops[READ] = v[2];
-	tg->iops[WRITE] = v[3];
+	tg->io_cost.bps[READ] = v[0];
+	tg->io_cost.bps[WRITE] = v[1];
+	tg->io_cost.iops[READ] = v[2];
+	tg->io_cost.iops[WRITE] = v[3];
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1455,8 +1508,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->io_cost.bytes_disp[rw], bio->bi_iter.bi_size,
+		   tg->io_cost.bps[rw], tg->io_cost.io_disp[rw],
+		   tg->io_cost.iops[rw],
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-- 
2.6.5


* [PATCH V2 03/13] blk-throttle: add abstract to index data
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 01/13] block: estimate disk performance Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 02/13] blk-throttle: cleanup io cost related stuff Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 04/13] blk-throttle: weight based throttling Shaohua Li
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

For bandwidth/IOPS based throttling, we have separate data to track READ and
WRITE IO. For weight based throttling, we combine read and write bandwidth and
check the combined bandwidth. Tracking them separately doesn't work, because
it's possible one cgroup only does READs and another only does WRITEs. This
patch adds an abstraction to index the data. With it, a later patch can easily
track the read/write combination for weight based throttling.

In the future, we can introduce sub queues for reads and writes in each cgroup
and give reads more share.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 182 +++++++++++++++++++++++++++++----------------------
 1 file changed, 102 insertions(+), 80 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a4ae0a7..90f937f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -207,6 +207,11 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static inline int tg_data_index(struct throtl_grp *tg, bool rw)
+{
+	return rw;
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -338,7 +343,7 @@ static inline void io_cost_init(struct throtl_grp *tg)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
 	struct throtl_grp *tg;
-	int rw;
+	int i;
 
 	tg = kzalloc_node(sizeof(*tg), gfp, node);
 	if (!tg)
@@ -346,9 +351,9 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 
 	throtl_service_queue_init(&tg->service_queue);
 
-	for (rw = READ; rw <= WRITE; rw++) {
-		throtl_qnode_init(&tg->qnode_on_self[rw], tg);
-		throtl_qnode_init(&tg->qnode_on_parent[rw], tg);
+	for (i = READ; i <= WRITE; i++) {
+		throtl_qnode_init(&tg->qnode_on_self[i], tg);
+		throtl_qnode_init(&tg->qnode_on_parent[i], tg);
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
@@ -384,9 +389,9 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	tg->td = td;
 }
 
-static inline bool io_cost_has_limit(struct throtl_grp *tg, int rw)
+static inline bool io_cost_has_limit(struct throtl_grp *tg, int index)
 {
-	return tg->io_cost.bps[rw] != -1 || tg->io_cost.iops[rw] != -1;
+	return tg->io_cost.bps[index] != -1 || tg->io_cost.iops[index] != -1;
 }
 
 /*
@@ -397,11 +402,11 @@ static inline bool io_cost_has_limit(struct throtl_grp *tg, int rw)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
 	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
-	int rw;
+	int i;
 
-	for (rw = READ; rw <= WRITE; rw++)
-		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    io_cost_has_limit(tg, rw);
+	for (i = READ; i <= WRITE; i++)
+		tg->has_rules[i] = (parent_tg && parent_tg->has_rules[i]) ||
+				    io_cost_has_limit(tg, i);
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -565,13 +570,15 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
 static inline void io_cost_start_new_slice(struct throtl_grp *tg,
 		bool rw)
 {
-	tg->io_cost.bytes_disp[rw] = 0;
-	tg->io_cost.io_disp[rw] = 0;
+	int index = tg_data_index(tg, rw);
+	tg->io_cost.bytes_disp[index] = 0;
+	tg->io_cost.io_disp[index] = 0;
 }
 
 static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 		bool rw, unsigned long start)
 {
+	int index = tg_data_index(tg, rw);
 	io_cost_start_new_slice(tg, rw);
 
 	/*
@@ -580,47 +587,52 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 	 * that bandwidth. Do try to make use of that bandwidth while giving
 	 * credit.
 	 */
-	if (time_after_eq(start, tg->slice_start[rw]))
-		tg->slice_start[rw] = start;
+	if (time_after_eq(start, tg->slice_start[index]))
+		tg->slice_start[index] = start;
 
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[index] = jiffies + throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
+		   rw == READ ? 'R' : 'W', tg->slice_start[index],
+		   tg->slice_end[index], jiffies);
 }
 
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
+	int index = tg_data_index(tg, rw);
+
 	io_cost_start_new_slice(tg, rw);
-	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_start[index] = jiffies;
+	tg->slice_end[index] = jiffies + throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
+		   rw == READ ? 'R' : 'W', tg->slice_start[index],
+		   tg->slice_end[index], jiffies);
 }
 
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
 					unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	int index = tg_data_index(tg, rw);
+	tg->slice_end[index] = roundup(jiffy_end, throtl_slice);
 }
 
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	int index = tg_data_index(tg, rw);
+	tg->slice_end[index] = roundup(jiffy_end, throtl_slice);
 	throtl_log(&tg->service_queue,
 		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
+		   rw == READ ? 'R' : 'W', tg->slice_start[index],
+		   tg->slice_end[index], jiffies);
 }
 
 /* Determine if previously allocated or extended slice is complete or not */
 static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
 {
-	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
+	int index = tg_data_index(tg, rw);
+	if (time_in_range(jiffies, tg->slice_start[index], tg->slice_end[index]))
 		return false;
 
 	return 1;
@@ -631,25 +643,26 @@ static inline bool io_cost_trim_slice(struct throtl_grp *tg, bool rw,
 {
 	unsigned long io_trim;
 	u64 bytes_trim, tmp;
+	int index = tg_data_index(tg, rw);
 
-	tmp = tg->io_cost.bps[rw] * throtl_slice * nr_slices;
+	tmp = tg->io_cost.bps[index] * throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg->io_cost.iops[rw] * throtl_slice * nr_slices) / HZ;
+	io_trim = (tg->io_cost.iops[index] * throtl_slice * nr_slices) / HZ;
 
 	if (!bytes_trim && !io_trim)
 		return false;
 
-	if (tg->io_cost.bytes_disp[rw] >= bytes_trim)
-		tg->io_cost.bytes_disp[rw] -= bytes_trim;
+	if (tg->io_cost.bytes_disp[index] >= bytes_trim)
+		tg->io_cost.bytes_disp[index] -= bytes_trim;
 	else
-		tg->io_cost.bytes_disp[rw] = 0;
+		tg->io_cost.bytes_disp[index] = 0;
 
-	if (tg->io_cost.io_disp[rw] >= io_trim)
-		tg->io_cost.io_disp[rw] -= io_trim;
+	if (tg->io_cost.io_disp[index] >= io_trim)
+		tg->io_cost.io_disp[index] -= io_trim;
 	else
-		tg->io_cost.io_disp[rw] = 0;
+		tg->io_cost.io_disp[index] = 0;
 
 	return true;
 }
@@ -658,8 +671,9 @@ static inline bool io_cost_trim_slice(struct throtl_grp *tg, bool rw,
 static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 {
 	unsigned long nr_slices, time_elapsed;
+	int index = tg_data_index(tg, rw);
 
-	BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
+	BUG_ON(time_before(tg->slice_end[index], tg->slice_start[index]));
 
 	/*
 	 * If bps are unlimited (-1), then time slice don't get
@@ -679,7 +693,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
 
-	time_elapsed = jiffies - tg->slice_start[rw];
+	time_elapsed = jiffies - tg->slice_start[index];
 
 	nr_slices = time_elapsed / throtl_slice;
 
@@ -688,23 +702,24 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!io_cost_trim_slice(tg, rw, nr_slices))
 		return;
-	tg->slice_start[rw] += nr_slices * throtl_slice;
+	tg->slice_start[index] += nr_slices * throtl_slice;
 
 	throtl_log(&tg->service_queue,
 		   "[%c] trim slice nr=%lu start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', nr_slices,
-		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
+		   tg->slice_start[index], tg->slice_end[index], jiffies);
 }
 
 static bool io_cost_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 	unsigned int io_allowed;
 	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
 	u64 tmp;
 
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
+	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[index];
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
@@ -719,7 +734,7 @@ static bool io_cost_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->io_cost.iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg->io_cost.iops[index] * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -727,15 +742,15 @@ static bool io_cost_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	else
 		io_allowed = tmp;
 
-	if (tg->io_cost.io_disp[rw] + 1 <= io_allowed) {
+	if (tg->io_cost.io_disp[index] + 1 <= io_allowed) {
 		if (wait)
 			*wait = 0;
 		return true;
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_cost.io_disp[rw] + 1) * HZ) /
-		tg->io_cost.iops[rw] + 1;
+	jiffy_wait = ((tg->io_cost.io_disp[index] + 1) * HZ) /
+		tg->io_cost.iops[index] + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -751,10 +766,11 @@ static bool io_cost_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 	u64 bytes_allowed, extra_bytes, tmp;
 	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
 
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
+	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[index];
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
@@ -762,11 +778,11 @@ static bool io_cost_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->io_cost.bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg->io_cost.bps[index] * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
-	if (tg->io_cost.bytes_disp[rw] + bio->bi_iter.bi_size <=
+	if (tg->io_cost.bytes_disp[index] + bio->bi_iter.bi_size <=
 	    bytes_allowed) {
 		if (wait)
 			*wait = 0;
@@ -774,9 +790,9 @@ static bool io_cost_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->io_cost.bytes_disp[rw] +
+	extra_bytes = tg->io_cost.bytes_disp[index] +
 		bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->io_cost.bps[rw]);
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg->io_cost.bps[index]);
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -813,6 +829,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 			    unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 	unsigned long max_wait = 0;
 
 	/*
@@ -821,11 +838,11 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * this function with a different bio if there are other bios
 	 * queued.
 	 */
-	BUG_ON(tg->service_queue.nr_queued[rw] &&
-	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
+	BUG_ON(tg->service_queue.nr_queued[index] &&
+	       bio != throtl_peek_queued(&tg->service_queue.queued[index]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (!io_cost_has_limit(tg, rw)) {
+	if (!io_cost_has_limit(tg, index)) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -839,7 +856,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (throtl_slice_used(tg, rw))
 		throtl_start_new_slice(tg, rw);
 	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
+		if (time_before(tg->slice_end[index], jiffies + throtl_slice))
 			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
 	}
 
@@ -852,7 +869,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (wait)
 		*wait = max_wait;
 
-	if (time_before(tg->slice_end[rw], jiffies + max_wait))
+	if (time_before(tg->slice_end[index], jiffies + max_wait))
 		throtl_extend_slice(tg, rw, jiffies + max_wait);
 
 	return false;
@@ -861,10 +878,11 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 static inline void io_cost_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 
 	/* Charge the bio to the group */
-	tg->io_cost.bytes_disp[rw] += bio->bi_iter.bi_size;
-	tg->io_cost.io_disp[rw]++;
+	tg->io_cost.bytes_disp[index] += bio->bi_iter.bi_size;
+	tg->io_cost.io_disp[index]++;
 }
 
 static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
@@ -894,9 +912,10 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 
 	if (!qn)
-		qn = &tg->qnode_on_self[rw];
+		qn = &tg->qnode_on_self[index];
 
 	/*
 	 * If @tg doesn't currently have any bios queued in the same
@@ -904,12 +923,12 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
 	 * dispatched.  Mark that @tg was empty.  This is automatically
 	 * cleaered on the next tg_update_disptime().
 	 */
-	if (!sq->nr_queued[rw])
+	if (!sq->nr_queued[index])
 		tg->flags |= THROTL_TG_WAS_EMPTY;
 
-	throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
+	throtl_qnode_add_bio(bio, qn, &sq->queued[index]);
 
-	sq->nr_queued[rw]++;
+	sq->nr_queued[index]++;
 	throtl_enqueue_tg(tg);
 }
 
@@ -940,9 +959,10 @@ static void tg_update_disptime(struct throtl_grp *tg)
 static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
 					struct throtl_grp *parent_tg, bool rw)
 {
+	int index = tg_data_index(child_tg, rw);
 	if (throtl_slice_used(parent_tg, rw)) {
 		throtl_start_new_slice_with_credit(parent_tg, rw,
-				child_tg->slice_start[rw]);
+				child_tg->slice_start[index]);
 	}
 
 }
@@ -954,6 +974,7 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	struct throtl_grp *parent_tg = sq_to_tg(parent_sq);
 	struct throtl_grp *tg_to_put = NULL;
 	struct bio *bio;
+	int index = tg_data_index(tg, rw);
 
 	/*
 	 * @bio is being transferred from @tg to @parent_sq.  Popping a bio
@@ -961,8 +982,8 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	 * getting released prematurely.  Remember the tg to put and put it
 	 * after @bio is transferred to @parent_sq.
 	 */
-	bio = throtl_pop_queued(&sq->queued[rw], &tg_to_put);
-	sq->nr_queued[rw]--;
+	bio = throtl_pop_queued(&sq->queued[index], &tg_to_put);
+	sq->nr_queued[index]--;
 
 	throtl_charge_bio(tg, bio);
 
@@ -974,13 +995,13 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	 * responsible for issuing these bios.
 	 */
 	if (parent_tg) {
-		throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
+		throtl_add_bio_tg(bio, &tg->qnode_on_parent[index], parent_tg);
 		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
-		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
-				     &parent_sq->queued[rw]);
-		BUG_ON(tg->td->nr_queued[rw] <= 0);
-		tg->td->nr_queued[rw]--;
+		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[index],
+				     &parent_sq->queued[index]);
+		BUG_ON(tg->td->nr_queued[index] <= 0);
+		tg->td->nr_queued[index]--;
 	}
 
 	throtl_trim_slice(tg, rw);
@@ -1139,13 +1160,13 @@ static void blk_throtl_dispatch_work_fn(struct work_struct *work)
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
 	struct blk_plug plug;
-	int rw;
+	int i;
 
 	bio_list_init(&bio_list_on_stack);
 
 	spin_lock_irq(q->queue_lock);
-	for (rw = READ; rw <= WRITE; rw++)
-		while ((bio = throtl_pop_queued(&td_sq->queued[rw], NULL)))
+	for (i = READ; i <= WRITE; i++)
+		while ((bio = throtl_pop_queued(&td_sq->queued[i], NULL)))
 			bio_list_add(&bio_list_on_stack, bio);
 	spin_unlock_irq(q->queue_lock);
 
@@ -1453,12 +1474,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
+	int index = tg_data_index(tg, rw);
 	bool throttled = false;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
 	/* see throtl_charge_bio() */
-	if ((bio->bi_rw & REQ_THROTTLED) || !tg->has_rules[rw])
+	if ((bio->bi_rw & REQ_THROTTLED) || !tg->has_rules[index])
 		goto out;
 
 	spin_lock_irq(q->queue_lock);
@@ -1470,7 +1492,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
-		if (sq->nr_queued[rw])
+		if (sq->nr_queued[index])
 			break;
 
 		/* if above limits, break to queue */
@@ -1498,7 +1520,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
 		 */
-		qn = &tg->qnode_on_parent[rw];
+		qn = &tg->qnode_on_parent[index];
 		sq = sq->parent_sq;
 		tg = sq_to_tg(sq);
 		if (!tg)
@@ -1508,13 +1530,13 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->io_cost.bytes_disp[rw], bio->bi_iter.bi_size,
-		   tg->io_cost.bps[rw], tg->io_cost.io_disp[rw],
-		   tg->io_cost.iops[rw],
+		   tg->io_cost.bytes_disp[index], bio->bi_iter.bi_size,
+		   tg->io_cost.bps[index], tg->io_cost.io_disp[index],
+		   tg->io_cost.iops[index],
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-	tg->td->nr_queued[rw]++;
+	tg->td->nr_queued[index]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
@@ -1577,7 +1599,7 @@ void blk_throtl_drain(struct request_queue *q)
 	struct blkcg_gq *blkg;
 	struct cgroup_subsys_state *pos_css;
 	struct bio *bio;
-	int rw;
+	int i;
 
 	queue_lockdep_assert_held(q);
 	rcu_read_lock();
@@ -1598,8 +1620,8 @@ void blk_throtl_drain(struct request_queue *q)
 	spin_unlock_irq(q->queue_lock);
 
 	/* all bios now should be in td->service_queue, issue them */
-	for (rw = READ; rw <= WRITE; rw++)
-		while ((bio = throtl_pop_queued(&td->service_queue.queued[rw],
+	for (i = READ; i <= WRITE; i++)
+		while ((bio = throtl_pop_queued(&td->service_queue.queued[i],
 						NULL)))
 			generic_make_request(bio);
 
-- 
2.6.5


* [PATCH V2 04/13] blk-throttle: weight based throttling
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (2 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 03/13] blk-throttle: add abstract to index data Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 05/13] blk-throttling: detect inactive cgroup Shaohua Li
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

We know the total bandwidth of a disk and can calculate a cgroup's percentage
of that bandwidth according to its weight, so we can easily calculate the
cgroup's bandwidth limit.
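
For example (made-up numbers): with SHARE_SHIFT = 14, the root service queue's
share is 1 << 14 = 16384. A child whose weight is 100 among siblings with a
combined weight of 400 gets share = 16384 * 100 / 400 = 4096, so with an
estimated disk bandwidth of 400MB/s its bps limit becomes
(400MB/s * 4096) >> 14 = 100MB/s.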

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 160 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 156 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 90f937f..fafe9c2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2,6 +2,7 @@
  * Interface for controlling IO bandwidth on a request queue
  *
  * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
+ * Proportional throttle - Shaohua Li <shli@kernel.org>
  */
 
 #include <linux/module.h>
@@ -12,6 +13,12 @@
 #include <linux/blk-cgroup.h>
 #include "blk.h"
 
+#define MAX_WEIGHT (10000)
+#define MIN_WEIGHT (1)
+#define DFT_WEIGHT (100)
+#define SHARE_SHIFT (14)
+#define MAX_SHARE (1 << SHARE_SHIFT)
+
 /* Max dispatch from a group in 1 round */
 static int throtl_grp_quantum = 8;
 
@@ -74,6 +81,10 @@ struct throtl_service_queue {
 	unsigned int		nr_pending;	/* # queued in the tree */
 	unsigned long		first_pending_disptime;	/* disptime of the first tg */
 	struct timer_list	pending_timer;	/* fires on first_pending_disptime */
+
+	unsigned int		weight; /* this queue's weight against siblings */
+	unsigned int		children_weight; /* children weight */
+	unsigned int		share; /* disk bandwidth share of the queue */
 };
 
 enum tg_state_flags {
@@ -139,6 +150,22 @@ struct throtl_grp {
 	unsigned long slice_end[2];
 };
 
+enum run_mode {
+	MODE_NONE = 0,
+	MODE_THROTTLE = 1, /* bandwidth/iops based throttle */
+	/* below are weight based */
+	MODE_WEIGHT_BANDWIDTH = 2,
+	MODE_WEIGHT_IOPS = 3,
+	MAX_MODE = 4,
+};
+
+static char *run_mode_name[MAX_MODE] = {
+	[MODE_NONE] = "none",
+	[MODE_THROTTLE] = "throttle",
+	[MODE_WEIGHT_BANDWIDTH] = "weight_bw",
+	[MODE_WEIGHT_IOPS] = "weight_iops",
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -156,8 +183,14 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
+	enum run_mode mode;
 };
 
+static bool td_weight_based(struct throtl_data *td)
+{
+	return td->mode > MODE_THROTTLE;
+}
+
 static void throtl_pending_timer_fn(unsigned long arg);
 
 static inline struct throtl_grp *pd_to_tg(struct blkg_policy_data *pd)
@@ -209,9 +242,33 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 
 static inline int tg_data_index(struct throtl_grp *tg, bool rw)
 {
+	if (td_weight_based(tg->td))
+		return 0;
 	return rw;
 }
 
+static inline uint64_t queue_bandwidth(struct throtl_data *td)
+{
+	uint64_t bw;
+
+	bw = td->queue->disk_bw * 512;
+	/* can't estimate bandwidth, can't do proporation control */
+	if (bw == 0)
+		bw = -1;
+	return bw;
+}
+
+static inline uint64_t queue_iops(struct throtl_data *td)
+{
+	uint64_t iops;
+
+	iops = td->queue->disk_iops;
+	/* can't estimate iops, can't do proporation control */
+	if (iops == 0)
+		iops = -1;
+	return iops;
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -386,6 +443,8 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	sq->parent_sq = &td->service_queue;
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
 		sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
+	sq->weight = DFT_WEIGHT;
+	sq->parent_sq->children_weight += sq->weight;
 	tg->td = td;
 }
 
@@ -406,7 +465,8 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 	for (i = READ; i <= WRITE; i++)
 		tg->has_rules[i] = (parent_tg && parent_tg->has_rules[i]) ||
-				    io_cost_has_limit(tg, i);
+				    io_cost_has_limit(tg, i) ||
+				    (td_weight_based(tg->td));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -421,6 +481,10 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
+	struct throtl_service_queue *sq = &tg->service_queue;
+
+	if (sq->parent_sq)
+		sq->parent_sq->children_weight -= sq->weight;
 
 	del_timer_sync(&tg->service_queue.pending_timer);
 	kfree(tg);
@@ -812,6 +876,10 @@ static bool io_cost_with_in_limit(struct throtl_grp *tg, struct bio *bio,
 {
 	unsigned long bps_wait = 0, iops_wait = 0;
 
+	if (tg->td->mode == MODE_WEIGHT_BANDWIDTH)
+		return io_cost_with_in_bps_limit(tg, bio, wait);
+	if (tg->td->mode == MODE_WEIGHT_IOPS)
+		return io_cost_with_in_iops_limit(tg, bio, wait);
 	if (io_cost_with_in_bps_limit(tg, bio, &bps_wait) &&
 	    io_cost_with_in_iops_limit(tg, bio, &iops_wait)) {
 		*wait = 0;
@@ -967,6 +1035,77 @@ static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
 
 }
 
+static void tg_update_perf(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq;
+	u64 new_bps, abs_bps = 0;
+	unsigned int new_iops, abs_iops = 0;
+
+	sq = &tg->service_queue;
+
+	/* '/' cgroup in cgroup2 */
+	if (!td_weight_based(tg->td) || !sq->parent_sq ||
+	    (cgroup_subsys_on_dfl(io_cgrp_subsys) && !tg_to_blkg(tg)->parent))
+		return;
+
+	if (tg->td->mode == MODE_WEIGHT_BANDWIDTH) {
+		new_bps = max_t(uint64_t,
+			(queue_bandwidth(tg->td) * sq->share) >> SHARE_SHIFT,
+			1024);
+		if (new_bps > tg->io_cost.bps[0])
+			abs_bps = new_bps - tg->io_cost.bps[0];
+		if (new_bps < tg->io_cost.bps[0])
+			abs_bps = tg->io_cost.bps[0] - new_bps;
+		if (abs_bps > (tg->io_cost.bps[0] >> 3))
+			throtl_start_new_slice(tg, 0);
+		tg->io_cost.bps[0] = new_bps;
+		tg->io_cost.iops[0] = -1;
+	} else {
+		new_iops = max_t(uint64_t,
+			(queue_iops(tg->td) * sq->share) >> SHARE_SHIFT,
+			1);
+		if (new_iops > tg->io_cost.iops[0])
+			abs_iops = new_iops - tg->io_cost.iops[0];
+		if (new_iops < tg->io_cost.iops[0])
+			abs_iops = tg->io_cost.iops[0] - new_iops;
+		if (abs_iops > (tg->io_cost.iops[0] >> 3))
+			throtl_start_new_slice(tg, 0);
+		tg->io_cost.iops[0] = new_iops;
+		tg->io_cost.bps[0] = -1;
+	}
+}
+
+/* update share of tg's siblings */
+static void tg_update_share(struct throtl_data *td, struct throtl_grp *tg)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg, *parent_blkg;
+	struct throtl_grp *child;
+
+	if (!td_weight_based(td))
+		return;
+	if (!tg || !tg->service_queue.parent_sq ||
+	    !tg->service_queue.parent_sq->parent_sq)
+		parent_blkg = td->queue->root_blkg;
+	else
+		parent_blkg = tg_to_blkg(sq_to_tg(tg->service_queue.parent_sq));
+
+	blkg_for_each_descendant_pre(blkg, pos_css, parent_blkg) {
+		struct throtl_service_queue *sq;
+
+		child = blkg_to_tg(blkg);
+		sq = &child->service_queue;
+
+		if (!sq->parent_sq)
+			continue;
+
+		sq->share = max_t(unsigned int,
+			sq->parent_sq->share * sq->weight /
+				sq->parent_sq->children_weight,
+			1);
+	}
+}
+
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1014,11 +1153,18 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned int nr_reads = 0, nr_writes = 0;
-	unsigned int max_nr_reads = throtl_grp_quantum*3/4;
-	unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
+	unsigned int max_nr_reads;
+	unsigned int max_nr_writes;
 	struct bio *bio;
 
-	/* Try to dispatch 75% READS and 25% WRITES */
+	if (td_weight_based(tg->td)) {
+		max_nr_reads = throtl_grp_quantum;
+		max_nr_writes = 0;
+	} else {
+		/* Try to dispatch 75% READS and 25% WRITES */
+		max_nr_reads = throtl_grp_quantum * 3 / 4;
+		max_nr_writes = throtl_grp_quantum - max_nr_reads;
+	}
 
 	while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
@@ -1039,6 +1185,9 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
 		if (nr_writes >= max_nr_writes)
 			break;
 	}
+	if (nr_reads + nr_writes) {
+		tg_update_perf(tg);
+	}
 
 	return nr_reads + nr_writes;
 }
@@ -1494,6 +1643,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[index])
 			break;
+		tg_update_perf(tg);
 
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL))
@@ -1639,6 +1789,8 @@ int blk_throtl_init(struct request_queue *q)
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
+	td->service_queue.share = MAX_SHARE;
+	td->mode = MODE_NONE;
 
 	q->td = td;
 	td->queue = q;
-- 
2.6.5


* [PATCH V2 05/13] blk-throttling: detect inactive cgroup
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (3 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 04/13] blk-throttle: weight based throttling Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 06/13] blk-throttle: add per-cgroup data Shaohua Li
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

We don't have a tree to maintain active cgroups. If a cgroup is inactive
for some time, it should be excluded from the bandwidth calculation.
Otherwise we would assign only partial bandwidth to the active cgroups;
they would then dispatch less IO, the estimated queue bandwidth would
drop, and the active cgroups would dispatch even less IO.
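
As a rough illustration of the idea, here is a minimal user-space
sketch (not the kernel code; names and numbers are invented): only
groups that recently saw IO keep a non-zero acting weight, only acting
weights are summed into the parent's children_weight, and each share is
computed from that sum, so an idle group's share is redistributed to
the active ones.

#include <stdio.h>

#define MAX_SHARE (1 << 14)	/* whole-disk share, same scale as the patch */

struct group {
	const char *name;
	unsigned int weight;		/* configured weight */
	unsigned int acting_weight;	/* zeroed while the group is idle */
};

int main(void)
{
	struct group g[] = {
		{ "cg1", 100, 100 },
		{ "cg2", 400, 400 },
		{ "cg3", 500, 0 },	/* inactive: excluded from the sum */
	};
	unsigned int children_weight = 0;
	unsigned int i;

	for (i = 0; i < 3; i++)
		children_weight += g[i].acting_weight;

	for (i = 0; i < 3; i++) {
		unsigned int share = 0;

		if (g[i].acting_weight)
			share = (unsigned int)((unsigned long long)MAX_SHARE *
					g[i].acting_weight / children_weight);
		printf("%s: share %u of %u\n", g[i].name, share, MAX_SHARE);
	}
	return 0;
}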

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 74 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fafe9c2..43de1dc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,6 +18,8 @@
 #define DFT_WEIGHT (100)
 #define SHARE_SHIFT (14)
 #define MAX_SHARE (1 << SHARE_SHIFT)
+/* must be less than the interval we update bandwidth */
+#define CGCHECK_TIME (msecs_to_jiffies(100))
 
 /* Max dispatch from a group in 1 round */
 static int throtl_grp_quantum = 8;
@@ -83,8 +85,11 @@ struct throtl_service_queue {
 	struct timer_list	pending_timer;	/* fires on first_pending_disptime */
 
 	unsigned int		weight; /* this queue's weight against siblings */
+	unsigned int		acting_weight; /* actual weight of the queue */
 	unsigned int		children_weight; /* children weight */
 	unsigned int		share; /* disk bandwidth share of the queue */
+
+	unsigned long active_timestamp;
 };
 
 enum tg_state_flags {
@@ -184,6 +189,7 @@ struct throtl_data
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
 	enum run_mode mode;
+	unsigned long last_check_timestamp;
 };
 
 static bool td_weight_based(struct throtl_data *td)
@@ -444,7 +450,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
 		sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
 	sq->weight = DFT_WEIGHT;
-	sq->parent_sq->children_weight += sq->weight;
+	sq->acting_weight = 0;
 	tg->td = td;
 }
 
@@ -483,8 +489,10 @@ static void throtl_pd_free(struct blkg_policy_data *pd)
 	struct throtl_grp *tg = pd_to_tg(pd);
 	struct throtl_service_queue *sq = &tg->service_queue;
 
-	if (sq->parent_sq)
-		sq->parent_sq->children_weight -= sq->weight;
+	if (sq->acting_weight && sq->parent_sq) {
+		sq->parent_sq->children_weight -= sq->acting_weight;
+		sq->acting_weight = 0;
+	}
 
 	del_timer_sync(&tg->service_queue.pending_timer);
 	kfree(tg);
@@ -1096,16 +1104,74 @@ static void tg_update_share(struct throtl_data *td, struct throtl_grp *tg)
 		child = blkg_to_tg(blkg);
 		sq = &child->service_queue;
 
-		if (!sq->parent_sq)
+		if (!sq->parent_sq || !sq->acting_weight)
 			continue;
 
 		sq->share = max_t(unsigned int,
-			sq->parent_sq->share * sq->weight /
+			sq->parent_sq->share * sq->acting_weight /
 				sq->parent_sq->children_weight,
 			1);
 	}
 }
 
+static void tg_update_active_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	unsigned long now = jiffies;
+
+	tg = NULL;
+	while (sq->parent_sq) {
+		sq->active_timestamp = now;
+		if (!sq->acting_weight) {
+			sq->acting_weight = sq->weight;
+			sq->parent_sq->children_weight += sq->acting_weight;
+			tg = sq_to_tg(sq);
+		}
+		sq = sq->parent_sq;
+	}
+
+	if (tg)
+		tg_update_share(tg->td, tg);
+}
+
+static void detect_inactive_cg(struct throtl_grp *tg)
+{
+	struct throtl_data *td = tg->td;
+	struct throtl_service_queue *sq = &tg->service_queue;
+	unsigned long now = jiffies;
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+	bool update_share = false;
+
+	tg_update_active_time(tg);
+
+	if (time_before(now, td->last_check_timestamp + CGCHECK_TIME))
+		return;
+	td->last_check_timestamp = now;
+
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		tg = blkg_to_tg(blkg);
+		sq = &tg->service_queue;
+		/* '/' cgroup of cgroup2 */
+		if (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+		    !tg_to_blkg(tg)->parent)
+			continue;
+
+		if (sq->parent_sq &&
+		    time_before(sq->active_timestamp + CGCHECK_TIME, now) &&
+		    !(sq->nr_queued[READ] || sq->nr_queued[WRITE])) {
+			if (sq->acting_weight && sq->parent_sq) {
+				sq->parent_sq->children_weight -= sq->acting_weight;
+				sq->acting_weight = 0;
+				update_share = true;
+			}
+		}
+	}
+
+	if (update_share)
+		tg_update_share(td, NULL);
+}
+
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1186,6 +1252,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
 			break;
 	}
 	if (nr_reads + nr_writes) {
+		detect_inactive_cg(tg);
 		tg_update_perf(tg);
 	}
 
@@ -1639,6 +1706,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+	detect_inactive_cg(tg);
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[index])
@@ -1791,6 +1859,7 @@ int blk_throtl_init(struct request_queue *q)
 	throtl_service_queue_init(&td->service_queue);
 	td->service_queue.share = MAX_SHARE;
 	td->mode = MODE_NONE;
+	td->service_queue.acting_weight = DFT_WEIGHT;
 
 	q->td = td;
 	td->queue = q;
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 06/13] blk-throttle: add per-cgroup data
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (4 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 05/13] blk-throttling: detect inactive cgroup Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 07/13] blk-throttle: add interface for proportion based throttle Shaohua Li
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

Currently we only have per-cgroup per-queue data. This adds per-cgroup
data (the cgroup weight). Changing the per-cgroup weight changes all of
the per-cgroup per-queue weights.
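
As a small sketch of the relationship (plain user-space C, not the
kernel structures; the names are invented): one weight is stored per
cgroup, and it is copied into each per-cgroup per-device group whenever
the cgroup weight changes.

#include <stdio.h>

struct per_device_group {
	const char *dev;
	unsigned int weight;	/* per-cgroup per-device weight */
};

struct cgroup_data {
	unsigned int weight;	/* per-cgroup weight */
};

static void set_cgroup_weight(struct cgroup_data *cd,
			      struct per_device_group *grp, int nr,
			      unsigned int val)
{
	int i;

	cd->weight = val;
	for (i = 0; i < nr; i++)
		grp[i].weight = val;	/* every device picks up the change */
}

int main(void)
{
	struct per_device_group grp[] = { { "sda", 100 }, { "sdb", 100 } };
	struct cgroup_data cd = { 100 };

	set_cgroup_weight(&cd, grp, 2, 200);
	printf("cgroup=%u %s=%u %s=%u\n", cd.weight,
	       grp[0].dev, grp[0].weight, grp[1].dev, grp[1].weight);
	return 0;
}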

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 43de1dc..a0fd33e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -112,6 +112,7 @@ struct throtl_io_cost {
 	unsigned int io_disp[2];
 };
 
+/* per cgroup per device data */
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -155,6 +156,14 @@ struct throtl_grp {
 	unsigned long slice_end[2];
 };
 
+/* per-cgroup data */
+struct throtl_group_data {
+	/* must be the first member */
+	struct blkcg_policy_data cpd;
+
+	unsigned int weight;
+};
+
 enum run_mode {
 	MODE_NONE = 0,
 	MODE_THROTTLE = 1, /* bandwidth/iops based throttle */
@@ -246,6 +255,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static inline struct throtl_group_data *cpd_to_tgd(struct blkcg_policy_data *cpd)
+{
+	return cpd ? container_of(cpd, struct throtl_group_data, cpd) : NULL;
+}
+
+static inline struct throtl_group_data *blkcg_to_tgd(struct blkcg *blkcg)
+{
+	return cpd_to_tgd(blkcg_to_cpd(blkcg, &blkcg_policy_throtl));
+}
+
 static inline int tg_data_index(struct throtl_grp *tg, bool rw)
 {
 	if (td_weight_based(tg->td))
@@ -385,6 +404,28 @@ static struct bio *throtl_pop_queued(struct list_head *queued,
 	return bio;
 }
 
+static struct blkcg_policy_data *throtl_cpd_alloc(gfp_t gfp)
+{
+	struct throtl_group_data *tgd;
+
+	tgd = kzalloc(sizeof(*tgd), gfp);
+	if (!tgd)
+		return NULL;
+	return &tgd->cpd;
+}
+
+static void throtl_cpd_init(struct blkcg_policy_data *cpd)
+{
+	struct throtl_group_data *tgd = cpd_to_tgd(cpd);
+
+	tgd->weight = DFT_WEIGHT;
+}
+
+static void throtl_cpd_free(struct blkcg_policy_data *cpd)
+{
+	kfree(cpd_to_tgd(cpd));
+}
+
 /* init a service_queue, assumes the caller zeroed it */
 static void throtl_service_queue_init(struct throtl_service_queue *sq)
 {
@@ -449,7 +490,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	sq->parent_sq = &td->service_queue;
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
 		sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
-	sq->weight = DFT_WEIGHT;
+	sq->weight = blkcg_to_tgd(blkg->blkcg)->weight;
 	sq->acting_weight = 0;
 	tg->td = td;
 }
@@ -1677,6 +1718,10 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.dfl_cftypes		= throtl_files,
 	.legacy_cftypes		= throtl_legacy_files,
 
+	.cpd_alloc_fn		= throtl_cpd_alloc,
+	.cpd_init_fn		= throtl_cpd_init,
+	.cpd_free_fn		= throtl_cpd_free,
+
 	.pd_alloc_fn		= throtl_pd_alloc,
 	.pd_init_fn		= throtl_pd_init,
 	.pd_online_fn		= throtl_pd_online,
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 07/13] blk-throttle: add interface for proportion based throttle
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (5 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 06/13] blk-throttle: add per-cgroup data Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 08/13] blk-throttle: add cgroup2 interface Shaohua Li
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

This adds a throttle.mode_device interface. By default blk-throttle is
in NONE mode. Setting the original bps/iops limits switches the mode to
THROTTLE automatically, so the user doesn't need to configure the mode;
this keeps backward compatibility. To use proportion based throttling,
the user must configure the device to the proper mode: 'weight_bw' is
for bandwidth proportion and 'weight_iops' is for iops proportion.
Currently switching between the THROTTLE mode and the proportion modes
is prohibited. This might be changed in the future.

expected usage:
set to bandwidth based proportion mode
$echo "8:0 weight_bw" > /sys/fs/cgroup/blkio/throttle.mode_device
$mkdir /sys/fs/cgroup/blkio/test
set cgroup weight for all disks
$echo "200" > /sys/fs/cgroup/blkio/test/throttle.weight
or set cgroup weight for one disk
$echo "8:0 200" > /sys/fs/cgroup/blkio/test/throttle.weight_device
$echo $$ > /sys/fs/cgroup/blkio/test/cgroup.procs

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 223 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0fd33e..a594000 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -86,6 +86,7 @@ struct throtl_service_queue {
 
 	unsigned int		weight; /* this queue's weight against siblings */
 	unsigned int		acting_weight; /* actual weight of the queue */
+	unsigned int		new_weight; /* weight changed to */
 	unsigned int		children_weight; /* children weight */
 	unsigned int		share; /* disk bandwidth share of the queue */
 
@@ -1529,11 +1530,16 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 		v = -1;
 
 	tg = blkg_to_tg(ctx.blkg);
+	if (td_weight_based(tg->td)) {
+		ret = -EBUSY;
+		goto out_finish;
+	}
 
 	if (is_u64)
 		*(u64 *)((void *)tg + of_cft(of)->private) = v;
 	else
 		*(unsigned int *)((void *)tg + of_cft(of)->private) = v;
+	tg->td->mode = MODE_THROTTLE;
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1554,8 +1560,217 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 	return tg_set_conf(of, buf, nbytes, off, false);
 }
 
+static int tg_print_weight(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+	struct throtl_group_data *tgd = blkcg_to_tgd(blkcg);
+	unsigned int weight = 0;
+
+	if (tgd)
+		weight = tgd->weight;
+	seq_printf(sf, "%u\n", weight);
+	return 0;
+}
+
+static int tg_set_weight(struct cgroup_subsys_state *css,
+	struct cftype *cft, u64 val)
+{
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct throtl_group_data *tgd;
+	struct blkcg_gq *blkg;
+
+	if (val < MIN_WEIGHT)
+		val = MIN_WEIGHT;
+	if (val > MAX_WEIGHT)
+		val = MAX_WEIGHT;
+
+	spin_lock_irq(&blkcg->lock);
+	tgd = blkcg_to_tgd(blkcg);
+	if (!tgd) {
+		spin_unlock_irq(&blkcg->lock);
+		return -EINVAL;
+	}
+	tgd->weight = val;
+
+	hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (!tg)
+			continue;
+		/* can't hold queue->lock here, weight changing is deferred */
+		if (td_weight_based(tg->td))
+			tg->service_queue.new_weight = val;
+	}
+	spin_unlock_irq(&blkcg->lock);
+	return 0;
+}
+
+static void __tg_set_weight(struct throtl_grp *tg, unsigned int weight)
+{
+	unsigned int old_weight;
+
+	old_weight = tg->service_queue.acting_weight;
+
+	tg->service_queue.weight = weight;
+	tg->service_queue.new_weight = 0;
+	if (old_weight && tg->service_queue.parent_sq) {
+		struct throtl_service_queue *psq = tg->service_queue.parent_sq;
+		if (weight > old_weight)
+			psq->children_weight += weight - old_weight;
+		else if (weight < old_weight)
+			psq->children_weight -= old_weight - weight;
+		tg->service_queue.acting_weight = weight;
+	}
+
+	tg_update_share(tg->td, tg);
+}
+
+static ssize_t tg_set_weight_device(struct kernfs_open_file *of,
+			   char *buf, size_t nbytes, loff_t off)
+{
+	struct blkcg *blkcg = css_to_blkcg(of_css(of));
+	struct blkg_conf_ctx ctx;
+	struct throtl_grp *tg;
+	unsigned int weight;
+	int ret;
+
+	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if (sscanf(ctx.body, "%u", &weight) != 1)
+		goto out_finish;
+	if (weight < MIN_WEIGHT)
+		weight = MIN_WEIGHT;
+	if (weight > MAX_WEIGHT)
+		weight = MAX_WEIGHT;
+
+	tg = blkg_to_tg(ctx.blkg);
+	if (!td_weight_based(tg->td)) {
+		ret = -EBUSY;
+		goto out_finish;
+	}
+
+	__tg_set_weight(tg, weight);
+
+	tg_conf_updated(tg);
+	ret = 0;
+out_finish:
+	blkg_conf_finish(&ctx);
+	return ret ?: nbytes;
+}
+
+static u64 tg_prfill_mode_device(struct seq_file *sf,
+	struct blkg_policy_data *pd, int off)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+	const char *dname = blkg_dev_name(pd->blkg);
+
+	if (!dname)
+		return 0;
+	if (tg->td->mode == MODE_NONE)
+		return 0;
+	seq_printf(sf, "%s %s\n", dname, run_mode_name[tg->td->mode]);
+	return 0;
+}
+
+static int throtl_print_mode_device(struct seq_file *sf, void *v)
+{
+	int i;
+	seq_printf(sf, "available ");
+	for (i = 0; i < MAX_MODE; i++)
+		seq_printf(sf, "%s ", run_mode_name[i]);
+	seq_printf(sf, "\n");
+	seq_printf(sf, "default %s\n", run_mode_name[MODE_NONE]);
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
+		tg_prfill_mode_device,  &blkcg_policy_throtl, 0, false);
+	return 0;
+}
+
+static u64 tg_prfill_weight_uint(struct seq_file *sf,
+	struct blkg_policy_data *pd, int off)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+	struct throtl_group_data *tgd = blkcg_to_tgd(pd_to_blkg(pd)->blkcg);
+	unsigned int v = *(unsigned int *)((void *)tg + off);
+
+	if (v == tgd->weight)
+		return 0;
+	return __blkg_prfill_u64(sf, pd, v);
+}
+
+static int tg_print_weight_device(struct seq_file *sf, void *v)
+{
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_weight_uint,
+			  &blkcg_policy_throtl, seq_cft(sf)->private, false);
+	return 0;
+}
+
+static ssize_t tg_set_mode_device(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct blkcg *blkcg = css_to_blkcg(of_css(of));
+	struct blkg_conf_ctx ctx;
+	struct throtl_grp *tg;
+	int ret;
+	char mode_name[20] = "";
+	int mode;
+
+	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
+	if (ret)
+		return ret;
+
+	ret = -EINVAL;
+	if (sscanf(ctx.body, "%s", mode_name) != 1)
+		goto out_finish;
+
+	for (mode = 0; mode < MAX_MODE; mode++)
+		if (!strcmp(mode_name, run_mode_name[mode]))
+			break;
+	if (mode == MAX_MODE)
+		goto out_finish;
+
+	tg = blkg_to_tg(ctx.blkg);
+	if (tg->td->mode == mode) {
+		ret = 0;
+		goto out_finish;
+	}
+	/* Don't allow switching between throttle and weight based currently */
+	if (tg->td->mode != MODE_NONE) {
+		ret = -EBUSY;
+		goto out_finish;
+	}
+
+	tg->td->mode = mode;
+
+	ret = 0;
+out_finish:
+	blkg_conf_finish(&ctx);
+	return ret ?: nbytes;
+}
+
 static struct cftype throtl_legacy_files[] = {
 	{
+		.name = "throttle.mode_device",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.seq_show = throtl_print_mode_device,
+		.write = tg_set_mode_device,
+	},
+	{
+		.name = "throttle.weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_weight,
+		.write_u64 = tg_set_weight,
+	},
+	{
+		.name = "throttle.weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.private = offsetof(struct throtl_grp, service_queue.weight),
+		.seq_show = tg_print_weight_device,
+		.write = tg_set_weight_device,
+	},
+	{
 		.name = "throttle.read_bps_device",
 		.private = offsetof(struct throtl_grp, io_cost.bps[READ]),
 		.seq_show = tg_print_conf_u64,
@@ -1728,6 +1943,13 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static void tg_check_new_weight(struct throtl_grp *tg)
+{
+	if (!td_weight_based(tg->td) || !tg->service_queue.new_weight)
+		return;
+	__tg_set_weight(tg, tg->service_queue.new_weight);
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1751,6 +1973,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+	tg_check_new_weight(tg);
 	detect_inactive_cg(tg);
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 08/13] blk-throttle: add cgroup2 interface
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (6 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 07/13] blk-throttle: add interface for proportion based throttle Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 09/13] blk-throttle: add trace for new proportional throttle Shaohua Li
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

Example usage of the cgroup2 interface:
$echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
set bandwidth based proportion mode
$echo "8:0 weight_bw" > /sys/fs/cgroup/io.throttle.mode_device
$mkdir /sys/fs/cgroup/test
set cgroup weight
$echo "8:0 200" > /sys/fs/cgroup/test/io.throttle.weight
$echo $$ > /sys/fs/cgroup/test/cgroup.procs

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a594000..01ca04e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1912,6 +1912,35 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	return ret ?: nbytes;
 }
 
+static int tg_print_cg2_weight(struct seq_file *sf, void *v)
+{
+	struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+	struct throtl_group_data *tgd = blkcg_to_tgd(blkcg);
+
+	seq_printf(sf, "default %u\n", tgd->weight);
+	return tg_print_weight_device(sf, v);
+}
+
+static ssize_t tg_set_cg2_weight(struct kernfs_open_file *of,
+	char *buf, size_t nbytes, loff_t off)
+{
+	char *endp;
+	int ret;
+	u64 v;
+
+	buf = strim(buf);
+
+	/* "WEIGHT" or "default WEIGHT" sets the default weight */
+	v = simple_strtoull(buf, &endp, 0);
+	if (*endp == '\0' || sscanf(buf, "default %llu", &v) == 1) {
+		ret = tg_set_weight(of_css(of), of_cft(of), v);
+		return ret ?: nbytes;
+	}
+
+	/* "MAJ:MIN WEIGHT" */
+	return tg_set_weight_device(of, buf, nbytes, off);
+}
+
 static struct cftype throtl_files[] = {
 	{
 		.name = "max",
@@ -1919,6 +1948,19 @@ static struct cftype throtl_files[] = {
 		.seq_show = tg_print_max,
 		.write = tg_set_max,
 	},
+	{
+		.name = "throttle.weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.private = offsetof(struct throtl_grp, service_queue.weight),
+		.seq_show = tg_print_cg2_weight,
+		.write = tg_set_cg2_weight,
+	},
+	{
+		.name = "throttle.mode_device",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.seq_show = throtl_print_mode_device,
+		.write = tg_set_mode_device,
+	},
 	{ }	/* terminate */
 };
 
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 09/13] blk-throttle: add trace for new proportional throttle
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (7 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 08/13] blk-throttle: add cgroup2 interface Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 10/13] blk-throttle: over estimate bandwidth Shaohua Li
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

The trace log is very helpful for tracking bugs and performance issues.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 01ca04e..68e2598 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1153,6 +1153,7 @@ static void tg_update_share(struct throtl_data *td, struct throtl_grp *tg)
 			sq->parent_sq->share * sq->acting_weight /
 				sq->parent_sq->children_weight,
 			1);
+		throtl_log(sq, "new share=%u", sq->share);
 	}
 }
 
@@ -1168,6 +1169,8 @@ static void tg_update_active_time(struct throtl_grp *tg)
 			sq->acting_weight = sq->weight;
 			sq->parent_sq->children_weight += sq->acting_weight;
 			tg = sq_to_tg(sq);
+			throtl_log(sq, "active weight=%u parent_weight=%u",
+				sq->acting_weight, sq->parent_sq->children_weight);
 		}
 		sq = sq->parent_sq;
 	}
@@ -1206,6 +1209,8 @@ static void detect_inactive_cg(struct throtl_grp *tg)
 				sq->parent_sq->children_weight -= sq->acting_weight;
 				sq->acting_weight = 0;
 				update_share = true;
+				throtl_log(sq, "inactive weight=%u parent_weight=%u",
+					sq->weight, sq->parent_sq->children_weight);
 			}
 		}
 	}
@@ -1621,6 +1626,10 @@ static void __tg_set_weight(struct throtl_grp *tg, unsigned int weight)
 			psq->children_weight -= old_weight - weight;
 		tg->service_queue.acting_weight = weight;
 	}
+	throtl_log(&tg->service_queue, "weight=%d parent_weight=%d",
+		tg->service_queue.acting_weight,
+		tg->service_queue.parent_sq ?
+			tg->service_queue.parent_sq->children_weight : 0);
 
 	tg_update_share(tg->td, tg);
 }
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 10/13] blk-throttle: over estimate bandwidth
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (8 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 09/13] blk-throttle: add trace for new proportional throttle Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 11/13] blk-throttle: shrink cgroup share if its target is overestimated Shaohua Li
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

If we don't dispatch enough requests, the disk can't reach its max
bandwidth. As we don't know the max bandwidth, we over-estimate it and
dispatch more requests, so the disk slowly ramps up to the max
bandwidth. The downside is that this can introduce a fairness issue,
but since we only over-estimate by 1/8 extra bandwidth, the fairness
issue isn't big. Still, it could hurt fairness for specific workloads.
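
Here is a toy model of the ramp-up this relies on (user-space C, not
kernel code; the disk's real maximum is invented for illustration):
each round the estimate becomes the bandwidth actually achieved plus
1/8, and dispatch is capped by both the estimate and the real maximum,
so the estimate climbs until it settles roughly 1/8 above the real
maximum.

#include <stdio.h>

int main(void)
{
	/* the real maximum is unknown to the throttler; made up here */
	unsigned long long disk_max = 200ULL << 20;
	unsigned long long est = 100ULL << 20;	/* initial low estimate */
	int round;

	for (round = 0; round < 10; round++) {
		/* the workload may dispatch up to the estimated bandwidth */
		unsigned long long achieved = est < disk_max ? est : disk_max;

		/* next estimate: measured bandwidth plus 1/8 slack */
		est = achieved + (achieved >> 3);
		printf("round %d: achieved %llu MiB/s, next estimate %llu MiB/s\n",
		       round, achieved >> 20, est >> 20);
	}
	return 0;
}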

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 68e2598..f78d470 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -281,6 +281,8 @@ static inline uint64_t queue_bandwidth(struct throtl_data *td)
 	/* can't estimate bandwidth, can't do proporation control */
 	if (bw == 0)
 		bw = -1;
+	else
+		bw += bw >> 3;
 	return bw;
 }
 
@@ -292,6 +294,8 @@ static inline uint64_t queue_iops(struct throtl_data *td)
 	/* can't estimate iops, can't do proporation control */
 	if (iops == 0)
 		iops = -1;
+	else
+		iops += iops >> 3;
 	return iops;
 }
 
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 11/13] blk-throttle: shrink cgroup share if its target is overestimated
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (9 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 10/13] blk-throttle: over estimate bandwidth Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 12/13] blk-throttle: restore shrunk cgroup share Shaohua Li
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

A cgroup's share could be overestimated. For example, cg1 gets a 20%
share, cg2 gets an 80% share.

1. disk max bps 200, neither cgroup has a limitation
cg1 bps == 40, cg2 bps == 160

2. disk max bps 200, cg2 can only dispatch 10 bps
cg1 bps == 190 (not the 2.5 that strict 20:80 proportion against cg2's
10 bps would give), cg2 bps == 10
cg2 can't use its share, so we must adjust its share so that cg1 can
get the max bandwidth

3. disk max bps 200, each cgroup can only dispatch 80 bps
cg1 bps == 80 (not the 20 strict proportion would give), cg2 bps == 80
we adjust cg2's share for the same reason
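
Below is a toy version of the shrink step (user-space C, not the kernel
code; the numbers mirror example 2 above): when a child's measured
bandwidth sits clearly below what it was allotted, its target is
lowered by a quarter of the unused headroom, so over a few rounds the
target approaches what the child actually uses.

#include <stdio.h>

int main(void)
{
	unsigned long long allotted = 160;	/* cg2's nominal 80% of 200 bps */
	unsigned long long actual = 10;		/* cg2 only manages 10 bps */
	int round;

	for (round = 0; round < 8; round++) {
		unsigned long long headroom;

		/* keep 1/8 slack above the measured bandwidth */
		if (actual + (actual >> 3) >= allotted)
			break;
		headroom = allotted - actual - (actual >> 3);
		/* lower the target by a quarter of the unused headroom */
		allotted -= headroom >> 2;
		printf("round %d: cg2 target %llu bps\n", round, allotted);
	}
	return 0;
}

In the patch itself the lowered target is translated back into a
smaller acting weight under the parent, which is what actually changes
the share.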

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c       | 235 +++++++++++++++++++++++++++++++++++++++++----
 include/linux/blk-cgroup.h |  10 ++
 2 files changed, 228 insertions(+), 17 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f78d470..2271c7e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,7 +18,11 @@
 #define DFT_WEIGHT (100)
 #define SHARE_SHIFT (14)
 #define MAX_SHARE (1 << SHARE_SHIFT)
-/* must be less than the interval we update bandwidth */
+/*
+ * must be less than the interval we update bandwidth
+ * must be multiple of throtl_slice, otherwise we don't calculate cgroup
+ * bandwidth correctly
+ */
 #define CGCHECK_TIME (msecs_to_jiffies(100))
 
 /* Max dispatch from a group in 1 round */
@@ -89,8 +93,6 @@ struct throtl_service_queue {
 	unsigned int		new_weight; /* weight changed to */
 	unsigned int		children_weight; /* children weight */
 	unsigned int		share; /* disk bandwidth share of the queue */
-
-	unsigned long active_timestamp;
 };
 
 enum tg_state_flags {
@@ -111,6 +113,15 @@ struct throtl_io_cost {
 	uint64_t bytes_disp[2];
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
+
+	uint64_t bytes_disp_active;
+	unsigned int io_disp_active;
+	uint64_t act_bps;
+	unsigned int act_iops;
+	bool check_weight;
+
+	uint64_t target_bps;
+	unsigned int target_iops;
 };
 
 /* per cgroup per device data */
@@ -1005,6 +1016,9 @@ static inline void io_cost_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	/* Charge the bio to the group */
 	tg->io_cost.bytes_disp[index] += bio->bi_iter.bi_size;
 	tg->io_cost.io_disp[index]++;
+
+	tg->io_cost.bytes_disp_active += bio->bi_iter.bi_size;
+	tg->io_cost.io_disp_active++;
 }
 
 static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
@@ -1161,14 +1175,12 @@ static void tg_update_share(struct throtl_data *td, struct throtl_grp *tg)
 	}
 }
 
-static void tg_update_active_time(struct throtl_grp *tg)
+static void tg_init_acting_weight(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
-	unsigned long now = jiffies;
 
 	tg = NULL;
 	while (sq->parent_sq) {
-		sq->active_timestamp = now;
 		if (!sq->acting_weight) {
 			sq->acting_weight = sq->weight;
 			sq->parent_sq->children_weight += sq->acting_weight;
@@ -1183,6 +1195,173 @@ static void tg_update_active_time(struct throtl_grp *tg)
 		tg_update_share(tg->td, tg);
 }
 
+static void precheck_tg_activity(struct throtl_grp *tg, unsigned long elapsed_time,
+	bool *has_overrun)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	struct throtl_service_queue *psq = sq->parent_sq;
+
+	if (!psq)
+		return;
+
+	/* don't check if the queue has no IO for a long time */
+	if (elapsed_time > 4 * CGCHECK_TIME)
+		return;
+
+	if (tg->td->mode == MODE_WEIGHT_BANDWIDTH) {
+		if (tg->io_cost.bps[0] == -1)
+			return;
+
+		tg->io_cost.act_bps = tg->io_cost.bytes_disp_active * HZ;
+		do_div(tg->io_cost.act_bps, elapsed_time);
+
+		/* the cgroup dispatches significantly more IO */
+		if (tg->io_cost.act_bps > tg->io_cost.bps[0] +
+				(tg->io_cost.bps[0] >> 2)) {
+			*has_overrun = true;
+			throtl_log(sq, "excessive bps. Actual %lld, expected %lld",
+				(unsigned long long)tg->io_cost.act_bps,
+				(unsigned long long)tg->io_cost.bps[0]);
+		}
+		return;
+	}
+
+	if (tg->io_cost.iops[0] == -1)
+		return;
+
+	tg->io_cost.act_iops = tg->io_cost.io_disp_active * HZ / elapsed_time;
+	/* the cgroup dispatches significantly more IO */
+	if (tg->io_cost.act_iops > tg->io_cost.iops[0] +
+	    (tg->io_cost.iops[0] >> 2)) {
+		*has_overrun = true;
+		throtl_log(sq, "excessive iops. Actual %u, expected %u",
+			tg->io_cost.act_iops, tg->io_cost.iops[0]);
+	}
+}
+
+static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_time)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	struct throtl_service_queue *psq = tg->service_queue.parent_sq;
+	struct throtl_grp *ptg = sq_to_tg(psq);
+	struct throtl_grp *child;
+	struct blkcg_gq *blkg;
+	struct cgroup_subsys_state *pos_css;
+	uint64_t adjusted_bps = 0;
+	unsigned int adjusted_iops = 0;
+	unsigned int adjusted_weight = 0;
+	unsigned int none_scaled_weight = 0;
+	unsigned int new_children_weight;
+	uint64_t parent_bps;
+	unsigned int parent_iops;
+	uint64_t tmp_bps = 0;
+	unsigned int tmp_iops;
+	bool bandwidth_mode = tg->td->mode == MODE_WEIGHT_BANDWIDTH;
+
+	tg->io_cost.check_weight = false;
+
+	if (!psq)
+		return false;
+
+	/* don't check if the queue has no IO for a long time */
+	if (elapsed_time > 4 * CGCHECK_TIME ||
+	    sq->acting_weight == psq->children_weight)
+		return false;
+
+	blkg_for_each_child(blkg, pos_css, tg_to_blkg(ptg)) {
+		child = blkg_to_tg(blkg);
+		sq = &child->service_queue;
+		child->io_cost.check_weight = false;
+		child->io_cost.target_bps = 0;
+		child->io_cost.target_iops = 0;
+		if (sq->acting_weight == sq->weight)
+			none_scaled_weight += sq->acting_weight;
+
+		if (bandwidth_mode) {
+			if (child->io_cost.bps[0] == -1)
+				continue;
+			if (child->io_cost.act_bps +
+			    (child->io_cost.act_bps >> 3) >= child->io_cost.bps[0])
+				continue;
+
+			tmp_bps = child->io_cost.bps[0] - child->io_cost.act_bps -
+				(child->io_cost.act_bps >> 3);
+			/* this cg's target bps will drop by tmp_bps/4 */
+			adjusted_bps = tmp_bps >> 2;
+			child->io_cost.target_bps = child->io_cost.bps[0] -
+				adjusted_bps;
+		} else {
+			if (child->io_cost.iops[0] == -1)
+				continue;
+			if (child->io_cost.act_iops +
+			    (child->io_cost.act_iops >> 3) >= child->io_cost.iops[0])
+				continue;
+
+			tmp_iops = child->io_cost.iops[0] - child->io_cost.act_iops -
+				(child->io_cost.act_iops >> 3);
+			/* this cg's target iops will drop by tmp_iops/4 */
+			adjusted_iops = tmp_iops >> 2;
+			child->io_cost.target_iops = child->io_cost.iops[0] -
+				adjusted_iops;
+		}
+
+		adjusted_weight += sq->acting_weight;
+		if (sq->acting_weight == sq->weight)
+			none_scaled_weight -= sq->acting_weight;
+	}
+
+	/* don't shrink if none or all cgroups will be shrunk */
+	if ((adjusted_bps == 0 && adjusted_iops == 0) || none_scaled_weight == 0)
+		return false;
+
+	if (bandwidth_mode) {
+		uint64_t base;
+
+		parent_bps = (psq->share * queue_bandwidth(tg->td)) >>
+			SHARE_SHIFT;
+
+		tmp_bps = adjusted_bps * psq->children_weight;
+		base = adjusted_weight * parent_bps;
+		do_div(base, psq->children_weight);
+		base = parent_bps + adjusted_bps - base;
+		tmp_bps = div64_u64(tmp_bps, base);
+
+		new_children_weight = psq->children_weight - tmp_bps;
+	} else {
+		parent_iops = (psq->share * queue_iops(tg->td)) >>
+			SHARE_SHIFT;
+		tmp_iops = adjusted_iops * psq->children_weight /
+			(parent_iops + adjusted_iops -
+			 adjusted_weight * parent_iops / psq->children_weight);
+
+		new_children_weight = psq->children_weight - tmp_iops;
+	}
+
+	blkg_for_each_child(blkg, pos_css, tg_to_blkg(ptg)) {
+		unsigned int new_weight;
+
+		child = blkg_to_tg(blkg);
+		sq = &child->service_queue;
+		if (!child->io_cost.target_bps && !child->io_cost.target_iops)
+			continue;
+
+		if (bandwidth_mode) {
+			tmp_bps = child->io_cost.target_bps * new_children_weight;
+			tmp_bps = div64_u64(tmp_bps, parent_bps);
+			new_weight = tmp_bps;
+			child->io_cost.target_bps = 0;
+		} else {
+			new_weight = child->io_cost.target_iops *
+				new_children_weight / parent_iops;
+			child->io_cost.target_iops = 0;
+		}
+		psq->children_weight -= sq->acting_weight - new_weight;
+		sq->acting_weight = new_weight;
+	}
+
+	return true;
+}
+
 static void detect_inactive_cg(struct throtl_grp *tg)
 {
 	struct throtl_data *td = tg->td;
@@ -1191,11 +1370,15 @@ static void detect_inactive_cg(struct throtl_grp *tg)
 	struct cgroup_subsys_state *pos_css;
 	struct blkcg_gq *blkg;
 	bool update_share = false;
+	bool ret;
+	unsigned long elapsed_time;
+	bool has_overrun = false;
 
-	tg_update_active_time(tg);
+	tg_init_acting_weight(tg);
 
 	if (time_before(now, td->last_check_timestamp + CGCHECK_TIME))
 		return;
+	elapsed_time = now - td->last_check_timestamp;
 	td->last_check_timestamp = now;
 
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
@@ -1205,18 +1388,36 @@ static void detect_inactive_cg(struct throtl_grp *tg)
 		if (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 		    !tg_to_blkg(tg)->parent)
 			continue;
+		tg->io_cost.check_weight = true;
+		precheck_tg_activity(tg, elapsed_time, &has_overrun);
+	}
 
-		if (sq->parent_sq &&
-		    time_before(sq->active_timestamp + CGCHECK_TIME, now) &&
-		    !(sq->nr_queued[READ] || sq->nr_queued[WRITE])) {
-			if (sq->acting_weight && sq->parent_sq) {
-				sq->parent_sq->children_weight -= sq->acting_weight;
-				sq->acting_weight = 0;
-				update_share = true;
-				throtl_log(sq, "inactive weight=%u parent_weight=%u",
-					sq->weight, sq->parent_sq->children_weight);
-			}
+	blkg_for_each_descendant_pre(blkg, pos_css, td->queue->root_blkg) {
+		tg = blkg_to_tg(blkg);
+		sq = &tg->service_queue;
+		/* '/' cgroup of cgroup2 */
+		if (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+		    !tg_to_blkg(tg)->parent)
+			continue;
+
+		/*
+		 * If there are other cgroups dispatch more IO than their
+		 * limits, this cgroup might not be able to get a chance to
+		 * dispatch IO, don't shrink this cgroup's share
+		 */
+		if (has_overrun || !tg->io_cost.check_weight) {
+			tg->io_cost.bytes_disp_active = 0;
+			tg->io_cost.io_disp_active = 0;
+			continue;
 		}
+
+		ret = detect_one_inactive_cg(tg, elapsed_time);
+		update_share |= ret;
+		if (ret)
+			throtl_log(sq, "idle weight=%u parent_weight=%u",
+				sq->acting_weight, sq->parent_sq->children_weight);
+		tg->io_cost.bytes_disp_active = 0;
+		tg->io_cost.io_disp_active = 0;
 	}
 
 	if (update_share)
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c02e669..58a30c3 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -411,6 +411,16 @@ static inline void blkg_put(struct blkcg_gq *blkg)
 	css_for_each_descendant_post((pos_css), &(p_blkg)->blkcg->css)	\
 		if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css),	\
 					      (p_blkg)->q, false)))
+/**
+ * blkg_for_each_child - walk all children of a blkg
+ * @d_blkg: loop cursor pointing to the current child
+ * @pos_css: used for iteration
+ * @p_blkg: target blkg to walk children of
+ */
+#define blkg_for_each_child(d_blkg, pos_css, p_blkg)			\
+	css_for_each_child((pos_css), &(p_blkg)->blkcg->css)		\
+		if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css),	\
+					      (p_blkg)->q, false)))
 
 /**
  * blk_get_rl - get request_list to use
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 12/13] blk-throttle: restore shrunk cgroup share
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (10 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 11/13] blk-throttle: shrink cgroup share if its target is overestimated Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-22 22:01 ` [PATCH V2 13/13] blk-throttle: detect wrong shrink Shaohua Li
  2016-02-28 15:02 ` [PATCH V2 00/13] block-throttle: proportional throttle Pavel Machek
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

If a cgroup's share has been shrunk but the cgroup then hits its limit,
the shrink wasn't optimal, so we restore its original share.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2271c7e..7db7d8e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2229,6 +2229,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+again:
 	tg_check_new_weight(tg);
 	detect_inactive_cg(tg);
 	while (true) {
@@ -2238,8 +2239,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		tg_update_perf(tg);
 
 		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL))
+		if (!tg_may_dispatch(tg, bio, NULL)) {
+			/*
+			 * If the cg hits its limit but its share is shrunk, the
+			 * shrink isn't optimal
+			 */
+			if (sq->acting_weight < sq->weight) {
+				/* pretend we are changing weight */
+				sq->new_weight = sq->weight;
+				goto again;
+			}
 			break;
+		}
 
 		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH V2 13/13] blk-throttle: detect wrong shrink
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (11 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 12/13] blk-throttle: restore shrunk cgroup share Shaohua Li
@ 2016-02-22 22:01 ` Shaohua Li
  2016-02-28 15:02 ` [PATCH V2 00/13] block-throttle: proportional throttle Pavel Machek
  13 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-02-22 22:01 UTC (permalink / raw)
  To: linux-kernel, linux-block
  Cc: axboe, tj, Vivek Goyal, jmoyer @ redhat . com, Kernel-team

The last patch detects a wrong shrink in some cases, but not all. Let's
look at an example. cg1 gets a 20% share, cg2 gets an 80% share.

disk max bps 100, cgroup can only dispatch 80 bps
cg1 bps = 20, cg2 bps = 80

This should be a stable state, but we don't know that. As we assign
extra bps to the queue, we will find that cg2 doesn't use its share, so
we try to adjust it, for example by shrinking cg2's bps by 5. cg1 will
then get 25 bps. The problem is that the total disk bps is 100, so cg2
can't dispatch enough IO and its bps will drop to 75. From time to time
cg2's share will be adjusted further, and finally cg1 and cg2 will get
the same bps. The shrink is wrong.

To detect this situation, we record history when a cgroup's share is
shrunk. If we find that the cgroup's bps drops afterwards, we consider
the shrink wrong and restore the cgroup's share. To avoid fluctuation,
we also disable shrinking for a while after a wrong shrink is detected.
It's possible the shrink was actually right; in that case we will redo
it after a while.
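
A small sketch of the check described here (user-space C, not the
kernel code; the 1/32 threshold follows the patch, the numbers come
from the example above): remember the bandwidth the cgroup had when its
share was shrunk, and treat the shrink as wrong if the bandwidth
measured afterwards has dropped by more than 1/32 of that.

#include <stdbool.h>
#include <stdio.h>

static bool shrink_was_wrong(unsigned long long last_bps,
			     unsigned long long actual_bps)
{
	/* a drop bigger than last_bps/32 means the shrink hurt the cgroup */
	return actual_bps < last_bps &&
	       last_bps - actual_bps > (last_bps >> 5);
}

int main(void)
{
	/* cg2 did 80 bps when we shrank it, then only 75 afterwards */
	printf("wrong shrink: %d\n", shrink_was_wrong(80, 75));
	/* cg2 kept roughly the same bandwidth, so the shrink was fine */
	printf("wrong shrink: %d\n", shrink_was_wrong(80, 79));
	return 0;
}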

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7db7d8e..3af41bc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -122,7 +122,15 @@ struct throtl_io_cost {
 
 	uint64_t target_bps;
 	unsigned int target_iops;
+
+	uint64_t last_bps;
+	unsigned int last_iops;
+	unsigned int last_weight;
+	unsigned long last_shrink_time;
+	unsigned long last_wrong_shrink_time;
 };
+#define SHRINK_STABLE_TIME (2 * CGCHECK_TIME)
+#define WRONG_SHRINK_STABLE_TIME (16 * CGCHECK_TIME)
 
 /* per cgroup per device data */
 struct throtl_grp {
@@ -1239,6 +1247,76 @@ static void precheck_tg_activity(struct throtl_grp *tg, unsigned long elapsed_ti
 	}
 }
 
+static bool detect_wrong_shrink(struct throtl_grp *tg, unsigned long elapsed_time)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	struct throtl_service_queue *psq = tg->service_queue.parent_sq;
+	struct throtl_grp *ptg = sq_to_tg(psq);
+	struct blkcg_gq *blkg;
+	struct cgroup_subsys_state *pos_css;
+	uint64_t actual_bps;
+	unsigned int actual_iops;
+	bool ret = false;
+	unsigned long now = jiffies;
+	unsigned int new_weight;
+	bool bandwidth_mode = tg->td->mode == MODE_WEIGHT_BANDWIDTH;
+
+	blkg_for_each_child(blkg, pos_css, tg_to_blkg(ptg)) {
+		struct throtl_grp *child;
+
+		child = blkg_to_tg(blkg);
+		sq = &child->service_queue;
+
+		actual_bps = child->io_cost.act_bps;
+		actual_iops = child->io_cost.act_iops;
+		if (child->io_cost.last_shrink_time) {
+			if (time_before_eq(now, child->io_cost.last_shrink_time +
+			     SHRINK_STABLE_TIME) &&
+			   ((bandwidth_mode && actual_bps < child->io_cost.last_bps &&
+			     child->io_cost.last_bps - actual_bps >
+			      (child->io_cost.last_bps >> 5)) ||
+			    (!bandwidth_mode && actual_iops < child->io_cost.last_iops &&
+			      child->io_cost.last_iops - actual_iops >
+			       (child->io_cost.last_iops >> 5)))) {
+
+				ret = true;
+				/* the cgroup will get 1/4 more share */
+				if (4 * psq->children_weight > 5 * sq->acting_weight) {
+					new_weight = sq->acting_weight *
+					  psq->children_weight / (4 * psq->children_weight -
+					  5 * sq->acting_weight) + sq->acting_weight;
+					if (new_weight > sq->weight)
+						new_weight = sq->weight;
+				} else
+					new_weight = sq->weight;
+
+				psq->children_weight += new_weight -
+					sq->acting_weight;
+				sq->acting_weight = new_weight;
+
+				child->io_cost.last_shrink_time = 0;
+				child->io_cost.last_wrong_shrink_time = now;
+			} else if (time_after(now,
+				   child->io_cost.last_shrink_time + SHRINK_STABLE_TIME))
+				child->io_cost.last_shrink_time = 0;
+		}
+		if (child->io_cost.last_wrong_shrink_time &&
+		    time_after(now, child->io_cost.last_wrong_shrink_time +
+		     WRONG_SHRINK_STABLE_TIME))
+			child->io_cost.last_wrong_shrink_time = 0;
+	}
+	if (!ret)
+		return ret;
+
+	blkg_for_each_child(blkg, pos_css, tg_to_blkg(ptg)) {
+		struct throtl_grp *child;
+
+		child = blkg_to_tg(blkg);
+		child->io_cost.check_weight = false;
+	}
+	return true;
+}
+
 static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_time)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1268,6 +1346,9 @@ static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_
 	    sq->acting_weight == psq->children_weight)
 		return false;
 
+	if (detect_wrong_shrink(tg, elapsed_time))
+		return true;
+
 	blkg_for_each_child(blkg, pos_css, tg_to_blkg(ptg)) {
 		child = blkg_to_tg(blkg);
 		sq = &child->service_queue;
@@ -1277,9 +1358,18 @@ static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_
 		if (sq->acting_weight == sq->weight)
 			none_scaled_weight += sq->acting_weight;
 
+		/* wait until the previous adjustment is stable */
+		if ((child->io_cost.last_shrink_time && time_before_eq(jiffies,
+		     child->io_cost.last_shrink_time + SHRINK_STABLE_TIME)) ||
+		    (child->io_cost.last_wrong_shrink_time && time_before_eq(jiffies,
+		      child->io_cost.last_wrong_shrink_time +
+		      WRONG_SHRINK_STABLE_TIME)))
+			continue;
+
 		if (bandwidth_mode) {
 			if (child->io_cost.bps[0] == -1)
 				continue;
+
 			if (child->io_cost.act_bps +
 			    (child->io_cost.act_bps >> 3) >= child->io_cost.bps[0])
 				continue;
@@ -1290,6 +1380,8 @@ static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_
 			adjusted_bps = tmp_bps >> 2;
 			child->io_cost.target_bps = child->io_cost.bps[0] -
 				adjusted_bps;
+
+			child->io_cost.last_bps = child->io_cost.act_bps;
 		} else {
 			if (child->io_cost.iops[0] == -1)
 				continue;
@@ -1305,6 +1397,7 @@ static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_
 				adjusted_iops;
 		}
 
+		child->io_cost.last_weight = sq->acting_weight;
 		adjusted_weight += sq->acting_weight;
 		if (sq->acting_weight == sq->weight)
 			none_scaled_weight -= sq->acting_weight;
@@ -1357,6 +1450,8 @@ static bool detect_one_inactive_cg(struct throtl_grp *tg, unsigned long elapsed_
 		}
 		psq->children_weight -= sq->acting_weight - new_weight;
 		sq->acting_weight = new_weight;
+
+		child->io_cost.last_shrink_time = jiffies;
 	}
 
 	return true;
-- 
2.6.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 00/13] block-throttle: proportional throttle
  2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
                   ` (12 preceding siblings ...)
  2016-02-22 22:01 ` [PATCH V2 13/13] blk-throttle: detect wrong shrink Shaohua Li
@ 2016-02-28 15:02 ` Pavel Machek
  2016-03-01  5:19   ` Shaohua Li
  13 siblings, 1 reply; 16+ messages in thread
From: Pavel Machek @ 2016-02-28 15:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-block, axboe, tj, Vivek Goyal,
	jmoyer @ redhat . com, Kernel-team

Hi!

> The problem is we don't know the max bandwidth a disk can provide for a
> specific workload, which depends on the device and IO pattern. The estimated
> bandwidth by patch 1 will be always not accurate unless the disk is already in
> max bandwidth. To solve this issue, we always over estimate the bandwidth. Over
> esitmate bandwidth, workload dispatchs more IO, estimated bandwidth becomes
> higher, dispatches even more IO. The loop will run till we enter a stable
> state, in which the disk gets max bandwidth. The 'slightly adjust and run into
> stable state' is the core algorithm the patch series use. We also use it to
> detect inactive cgroup.

Ok, so you want to reach a steady state, but what if the workload
varies a lot?

Let's say random writes for ten minutes, then a linear write.

Will the linear write be severely throttled because of the previous
seeks?

Can a task get bigger bandwidth by doing some additional (useless)
work?

Like "I do bigger reads in the random read phase, so that I'm not
throttled that badly when I do the linear read"?
								Pavel

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH V2 00/13] block-throttle: proportional throttle
  2016-02-28 15:02 ` [PATCH V2 00/13] block-throttle: proportional throttle Pavel Machek
@ 2016-03-01  5:19   ` Shaohua Li
  0 siblings, 0 replies; 16+ messages in thread
From: Shaohua Li @ 2016-03-01  5:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, linux-block, axboe, tj, Vivek Goyal,
	jmoyer @ redhat . com, Kernel-team

On Sun, Feb 28, 2016 at 04:02:51PM +0100, Pavel Machek wrote:
> Hi!
> 
> > The problem is we don't know the max bandwidth a disk can provide for a
> > specific workload, which depends on the device and IO pattern. The estimated
> > bandwidth by patch 1 will be always not accurate unless the disk is already in
> > max bandwidth. To solve this issue, we always over estimate the bandwidth. Over
> > esitmate bandwidth, workload dispatchs more IO, estimated bandwidth becomes
> > higher, dispatches even more IO. The loop will run till we enter a stable
> > state, in which the disk gets max bandwidth. The 'slightly adjust and run into
> > stable state' is the core algorithm the patch series use. We also use it to
> > detect inactive cgroup.
> 
> Ok, so you want to reach a steady state, but what if the workload
> varies a lot?
> 
> Let's say random writes for ten minutes, then a linear write.
> 
> Will the linear write be severely throttled because of the previous
> seeks?

If the workload varies a lot, it's possible there is a fairness or
performance issue while the workload is changing. How big the issue is
depends on the changing interval; if the interval is short, say several
milliseconds, the issue would be big. For the case above, the linear
write will get throttled initially because the previously estimated
bandwidth is low. The workload will get throttled less soon, as the
estimated bandwidth grows. Within some time, the workload will enter
the stable state and get the highest bandwidth.

> Can a task get bigger bandwidth by doing some additional (useless)
> work?
> 
> Like "I do bigger reads in the random read phase, so that I'm not
> throttled that badly when I do the linear read"?

Yes, it's possible, but only in the stage approaching the stable state.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-03-01  5:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-22 22:01 [PATCH V2 00/13] block-throttle: proportional throttle Shaohua Li
2016-02-22 22:01 ` [PATCH V2 01/13] block: estimate disk performance Shaohua Li
2016-02-22 22:01 ` [PATCH V2 02/13] blk-throttle: cleanup io cost related stuff Shaohua Li
2016-02-22 22:01 ` [PATCH V2 03/13] blk-throttle: add abstract to index data Shaohua Li
2016-02-22 22:01 ` [PATCH V2 04/13] blk-throttle: weight based throttling Shaohua Li
2016-02-22 22:01 ` [PATCH V2 05/13] blk-throttling: detect inactive cgroup Shaohua Li
2016-02-22 22:01 ` [PATCH V2 06/13] blk-throttle: add per-cgroup data Shaohua Li
2016-02-22 22:01 ` [PATCH V2 07/13] blk-throttle: add interface for proportion based throttle Shaohua Li
2016-02-22 22:01 ` [PATCH V2 08/13] blk-throttle: add cgroup2 interface Shaohua Li
2016-02-22 22:01 ` [PATCH V2 09/13] blk-throttle: add trace for new proportional throttle Shaohua Li
2016-02-22 22:01 ` [PATCH V2 10/13] blk-throttle: over estimate bandwidth Shaohua Li
2016-02-22 22:01 ` [PATCH V2 11/13] blk-throttle: shrink cgroup share if its target is overestimated Shaohua Li
2016-02-22 22:01 ` [PATCH V2 12/13] blk-throttle: restore shrunk cgroup share Shaohua Li
2016-02-22 22:01 ` [PATCH V2 13/13] blk-throttle: detect wrong shrink Shaohua Li
2016-02-28 15:02 ` [PATCH V2 00/13] block-throttle: proportional throttle Pavel Machek
2016-03-01  5:19   ` Shaohua Li
