linux-kernel.vger.kernel.org archive mirror
* [PATCH V6 00/18] blk-throttle: add .low limit
@ 2017-01-15  3:42 Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
                   ` (18 more replies)
  0 siblings, 19 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Hi,

cgroup still lacks a good IO controller. CFQ works well for hard disks, but not
so much for SSDs. This patch set tries to add a conservative limit to blk-throttle.
It isn't proportional scheduling, but it can help prioritize cgroups. There are
several advantages to building on blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
  request based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but it's
  not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism used by CFQ/BFQ. That
  mechanism has been shown to harm performance on fast SSDs.

The patch set adds a new io.low limit to blk-throttle. It's only for cgroup2.
The existing io.max is hard-limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low, in contrast, is best-effort
throttling: cgroups with a 'low' limit can run above their 'low' limit at
appropriate times. Specifically, if all cgroups reach their 'low' limit, all
cgroups can run above their 'low' limit. If any cgroup runs under its 'low'
limit, all other cgroups will be held to their 'low' limit. So the 'low' limit
plays two roles: it lets cgroups use free bandwidth, and it protects a cgroup's
bandwidth up to its 'low' limit.

An example usage is a high prio cgroup with a high 'low' limit and a low prio
cgroup with a low 'low' limit. If the high prio cgroup isn't running, the low
prio cgroup can run above its 'low' limit, so we don't waste bandwidth. When the
high prio cgroup runs and is below its 'low' limit, the low prio cgroup will be
held to its 'low' limit. This protects the high prio cgroup's share of the
resources.
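
For illustration only, here is one way the two cgroups above might be configured
from userspace. The cgroup paths and the 8:16 device number are assumptions for
the example, not something defined by this patch set; the
"MAJ:MIN rbps=... wbps=... riops=... wiops=..." syntax follows the existing
io.max interface, which io.low reuses:

/*
 * Illustrative sketch, not part of the patches: set up a high prio and a
 * low prio cgroup on a hypothetical device 8:16.
 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	/* high prio: try to guarantee ~100MB/s of reads */
	write_str("/sys/fs/cgroup/high/io.low", "8:16 rbps=104857600");
	/* low prio: only ~10MB/s is protected, the rest is best effort */
	write_str("/sys/fs/cgroup/low/io.low", "8:16 rbps=10485760");
	/* optional hard cap for the low prio group */
	write_str("/sys/fs/cgroup/low/io.max", "8:16 rbps=209715200");
	return 0;
}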

The implementation is simple. The disk queue has a state machine with two
states, LIMIT_LOW and LIMIT_MAX. In each state, we throttle cgroups according
to the limit of that state: the io.low limit in LIMIT_LOW state, the io.max
limit in LIMIT_MAX state. The disk state can be upgraded/downgraded between
LIMIT_LOW and LIMIT_MAX according to the rule above. Initially the disk state
is LIMIT_MAX, and if no cgroup sets io.low, the disk state remains in
LIMIT_MAX. Systems with only io.max set will find nothing changed with these
patches.
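
At the core of the state machine is a small lookup when charging a bio: use the
limit for the current state, and fall back to the max limit for cgroups that
don't have a low limit configured. A simplified form of the helper added in
patch 4 (the root-cgroup special case is dropped here) looks like:

static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
{
	/* limit_index is LIMIT_LOW or LIMIT_MAX for the whole queue */
	uint64_t ret = tg->bps[rw][tg->td->limit_index];

	/* no low limit configured: throttle by the max limit instead */
	if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
		return tg->bps[rw][LIMIT_MAX];
	return ret;
}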

The first 10 patches implement the basic framework: add the interface and
handle the upgrade and downgrade logic. Patch 10 detects a special case where a
cgroup is completely idle; in that case, we ignore the cgroup's limit. Patches
11-18 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we don't
expect. For example, one cgroup's bandwidth may drop below its io.low limit
very soon after an upgrade. Patch 10 has more details about the issue.

2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough
IO. Under the above upgrade rule, the disk will remain in LIMIT_LOW state and
all other cgroups can't dispatch IO above their 'low' limit. Hence bandwidth is
wasted. Patch 11 has more details about the issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth drops
below its 'low' limit right after a transition. The smoothness means we could
waste some bandwidth during the transition, but that's the price of sharing.
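
Patch 11 contains the actual smoothing logic and isn't quoted in this message;
the snippet below is only a sketch of the general idea, under the assumption
that the effective limit is ramped from the low limit toward the max limit one
sampling window at a time after an upgrade:

/*
 * Sketch only: ramp the effective limit toward the max limit after an
 * upgrade instead of jumping to it.  The formula is an assumption for
 * illustration; the real logic lives in the "make bandwidth change
 * smooth" patch.
 */
static uint64_t tg_ramped_bps_limit(struct throtl_grp *tg, int rw)
{
	uint64_t low = tg->bps[rw][LIMIT_LOW];
	uint64_t max = tg->bps[rw][LIMIT_MAX];
	/* sampling windows elapsed since the last upgrade */
	unsigned long n = (jiffies - tg->td->low_upgrade_time) / throtl_slice;

	if (!low || max == U64_MAX)
		return max;
	/* grow the allowance by one low-limit step per window, cap at max */
	return min(max, low * (n + 1));
}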

Issue 2 is much harder. We introduce two mechanisms for it. One is 'idle
time' or 'think time', borrowed from CFQ. If a cgroup's average idle time is
high, we treat it as idle and its 'low' limit isn't respected; please see
patches 12-14 for details. The other is a 'latency target'. If a cgroup's IO
latency is low, we treat it as idle and its 'low' limit isn't respected; please
see patches 15-18 for details. Both mechanisms only kick in when a cgroup runs
below its 'low' limit.
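
Patches 12-14 aren't quoted in this message either, so the names and the
threshold below are assumptions; the sketch only illustrates the CFQ-style idea
of keeping a weighted average of the gaps between a cgroup's IOs and treating
the cgroup as idle when that average gets large:

/*
 * Illustration of the "think time" idea borrowed from CFQ.  Identifiers
 * and the threshold are assumptions, not the actual implementation.
 */
#define THINKTIME_THRESHOLD_US	1000	/* assumed value */

struct tg_idle_sample {
	u64 last_finish_time;	/* when the previous bio was seen, in ns */
	u64 avg_idletime;	/* weighted average of the gaps, in ns */
};

static void tg_update_think_time(struct tg_idle_sample *s, u64 now)
{
	u64 delta = now - s->last_finish_time;

	/* 7/8 old + 1/8 new, the usual kernel-style weighted average */
	s->avg_idletime = (s->avg_idletime * 7 + delta) >> 3;
	s->last_finish_time = now;
}

static bool tg_is_idle(struct tg_idle_sample *s)
{
	return s->avg_idletime > THINKTIME_THRESHOLD_US * NSEC_PER_USEC;
}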

The disadvantage of blk-throttle is that it exports fairly low-level knobs.
Configuration won't be easy for normal users, but it is powerful for
experienced users.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V5->V6:
- Change default setting for io.low limit. It's 0 now, which makes more sense
- The default setting for latency is still 0; the default setting for idle time
  becomes bigger. So with the default settings, cgroups have small latency but
  disk sharing could be harmed
- Addressed other issues pointed out by Tejun

V4->V5, basically address Tejun's comments:
- Change interface from 'io.high' to 'io.low' to be consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queue
- Change latency estimation method from 'line fitting' to 'bucket based
  calculation'
- Rebase and fix other problems

Issue pointed out by Tejun isn't fixed yet:
- .pd_offline_fn vs .pd_free_fn. .pd_free_fn seems too late to change states
http://marc.info/?l=linux-kernel&m=148183437022975&w=2

V3->V4:
- Add latency target for cgroup
- Fix bugs
http://marc.info/?l=linux-block&m=147916216512915&w=2

V2->V3:
- Rebase
- Fix several bugs
- Make hard disk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have to
  prioritize cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2


Shaohua Li (18):
  blk-throttle: use U64_MAX/UINT_MAX to replace -1
  blk-throttle: prepare support multiple limits
  blk-throttle: add .low interface
  blk-throttle: configure bps/iops limit for cgroup in low limit
  blk-throttle: add upgrade logic for LIMIT_LOW state
  blk-throttle: add downgrade logic
  blk-throttle: make sure expire time isn't too big
  blk-throttle: make throtl_slice tunable
  blk-throttle: choose a small throtl_slice for SSD
  blk-throttle: detect completed idle cgroup
  blk-throttle: make bandwidth change smooth
  blk-throttle: add a simple idle detection
  blk-throttle: add interface to configure idle time threshold
  blk-throttle: ignore idle cgroup limit
  blk-throttle: add interface for per-cgroup target latency
  block: track request size in blk_issue_stat
  blk-throttle: add a mechanism to estimate IO latency
  blk-throttle: add latency target support

 Documentation/block/queue-sysfs.txt |   6 +
 block/bio.c                         |   2 +
 block/blk-core.c                    |   2 +-
 block/blk-mq.c                      |   2 +-
 block/blk-stat.c                    |  11 +-
 block/blk-stat.h                    |  29 +-
 block/blk-sysfs.c                   |  12 +
 block/blk-throttle.c                | 961 +++++++++++++++++++++++++++++++++---
 block/blk-wbt.h                     |  10 +-
 block/blk.h                         |   9 +
 include/linux/blk_types.h           |  10 +-
 11 files changed, 959 insertions(+), 95 deletions(-)

-- 
2.9.3

* [PATCH V6 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 02/18] blk-throttle: prepare support multiple limits Shaohua Li
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Clean up the code to avoid using -1.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6bb4fe..e45bf50 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -334,10 +334,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = -1;
-	tg->bps[WRITE] = -1;
-	tg->iops[READ] = -1;
-	tg->iops[WRITE] = -1;
+	tg->bps[READ] = U64_MAX;
+	tg->bps[WRITE] = U64_MAX;
+	tg->iops[READ] = UINT_MAX;
+	tg->iops[WRITE] = UINT_MAX;
 
 	return &tg->pd;
 }
@@ -380,7 +380,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+			(tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -771,7 +771,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+	if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1110,7 +1110,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct throtl_grp *tg = pd_to_tg(pd);
 	u64 v = *(u64 *)((void *)tg + off);
 
-	if (v == -1)
+	if (v == U64_MAX)
 		return 0;
 	return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1121,7 +1121,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct throtl_grp *tg = pd_to_tg(pd);
 	unsigned int v = *(unsigned int *)((void *)tg + off);
 
-	if (v == -1)
+	if (v == UINT_MAX)
 		return 0;
 	return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1195,7 +1195,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	if (sscanf(ctx.body, "%llu", &v) != 1)
 		goto out_finish;
 	if (!v)
-		v = -1;
+		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
 
@@ -1270,17 +1270,17 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-	    tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+	if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
+	    tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
 		return 0;
 
-	if (tg->bps[READ] != -1)
+	if (tg->bps[READ] != U64_MAX)
 		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != -1)
+	if (tg->bps[WRITE] != U64_MAX)
 		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != -1)
+	if (tg->iops[READ] != UINT_MAX)
 		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != -1)
+	if (tg->iops[WRITE] != UINT_MAX)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
@@ -1318,7 +1318,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
-		u64 val = -1;
+		u64 val = U64_MAX;
 		int len;
 
 		if (sscanf(ctx.body, "%26s%n", tok, &len) != 1)
-- 
2.9.3

* [PATCH V6 02/18] blk-throttle: prepare support multiple limits
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 03/18] blk-throttle: add .low interface Shaohua Li
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

We are going to support low/max limits; each cgroup will have 2 limits
after that. This patch prepares for the multiple-limit change.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 110 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 70 insertions(+), 40 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e45bf50..b08369b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+	LIMIT_MAX,
+	LIMIT_CNT,
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
 	bool has_rules[2];
 
 	/* bytes per second rate limits */
-	uint64_t bps[2];
+	uint64_t bps[2][LIMIT_CNT];
 
 	/* IOPS limits */
-	unsigned int iops[2];
+	unsigned int iops[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
+	unsigned int limit_index;
+	bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -334,10 +351,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = U64_MAX;
-	tg->bps[WRITE] = U64_MAX;
-	tg->iops[READ] = UINT_MAX;
-	tg->iops[WRITE] = UINT_MAX;
+	tg->bps[READ][LIMIT_MAX] = U64_MAX;
+	tg->bps[WRITE][LIMIT_MAX] = U64_MAX;
+	tg->iops[READ][LIMIT_MAX] = UINT_MAX;
+	tg->iops[WRITE][LIMIT_MAX] = UINT_MAX;
 
 	return &tg->pd;
 }
@@ -376,11 +393,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
 	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+	struct throtl_data *td = tg->td;
 	int rw;
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-			(tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
+			(td->limit_valid[td->limit_index] &&
+			 (tg_bps_limit(tg, rw) != U64_MAX ||
+			  tg_iops_limit(tg, rw) != UINT_MAX));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +652,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!nr_slices)
 		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -682,7 +702,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -697,7 +717,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +744,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
@@ -736,7 +756,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -771,7 +791,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
+	if (tg_bps_limit(tg, rw) == U64_MAX &&
+	    tg_iops_limit(tg, rw) == UINT_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1148,8 +1169,8 @@ static void tg_conf_updated(struct throtl_grp *tg)
 
 	throtl_log(&tg->service_queue,
 		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		   tg->bps[READ], tg->bps[WRITE],
-		   tg->iops[READ], tg->iops[WRITE]);
+		   tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+		   tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));
 
 	/*
 	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
@@ -1226,25 +1247,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 static struct cftype throtl_legacy_files[] = {
 	{
 		.name = "throttle.read_bps_device",
-		.private = offsetof(struct throtl_grp, bps[READ]),
+		.private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.write_bps_device",
-		.private = offsetof(struct throtl_grp, bps[WRITE]),
+		.private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.read_iops_device",
-		.private = offsetof(struct throtl_grp, iops[READ]),
+		.private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
 	{
 		.name = "throttle.write_iops_device",
-		.private = offsetof(struct throtl_grp, iops[WRITE]),
+		.private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
@@ -1270,18 +1291,25 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
-	    tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
+
+	if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
+	    tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
+	    tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
+	    tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
 		return 0;
 
-	if (tg->bps[READ] != U64_MAX)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != U64_MAX)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != UINT_MAX)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != UINT_MAX)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+	if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+			tg->bps[READ][LIMIT_MAX]);
+	if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+			tg->bps[WRITE][LIMIT_MAX]);
+	if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u",
+			tg->iops[READ][LIMIT_MAX]);
+	if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u",
+			tg->iops[WRITE][LIMIT_MAX]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1338,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ];
-	v[1] = tg->bps[WRITE];
-	v[2] = tg->iops[READ];
-	v[3] = tg->iops[WRITE];
+	v[0] = tg->bps[READ][LIMIT_MAX];
+	v[1] = tg->bps[WRITE][LIMIT_MAX];
+	v[2] = tg->iops[READ][LIMIT_MAX];
+	v[3] = tg->iops[WRITE][LIMIT_MAX];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1350,10 +1378,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ] = v[0];
-	tg->bps[WRITE] = v[1];
-	tg->iops[READ] = v[2];
-	tg->iops[WRITE] = v[3];
+	tg->bps[READ][LIMIT_MAX] = v[0];
+	tg->bps[WRITE][LIMIT_MAX] = v[1];
+	tg->iops[READ][LIMIT_MAX] = v[2];
+	tg->iops[WRITE][LIMIT_MAX] = v[3];
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1451,8 +1479,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_disp[rw], bio->bi_iter.bi_size,
+		   tg_bps_limit(tg, rw),
+		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
@@ -1563,6 +1592,7 @@ int blk_throtl_init(struct request_queue *q)
 	q->td = td;
 	td->queue = q;
 
+	td->limit_valid[LIMIT_MAX] = true;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

* [PATCH V6 03/18] blk-throttle: add .low interface
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 02/18] blk-throttle: prepare support multiple limits Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 04/18] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Add a low limit for cgroups and the corresponding cgroup interface. To be
consistent with memcg, we allow users to configure a .low limit higher than
the .max limit. But the internal logic always assumes the .low limit is lower
than the .max limit. So we add extra bps/iops_conf fields in throtl_grp for
the userspace configuration. The old bps/iops fields in throtl_grp hold the
actual limits we use for throttling.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 142 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 114 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b08369b..d3ad43c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
+	LIMIT_LOW,
 	LIMIT_MAX,
 	LIMIT_CNT,
 };
@@ -124,11 +125,15 @@ struct throtl_grp {
 	/* are there any throtl rules between this group and td? */
 	bool has_rules[2];
 
-	/* bytes per second rate limits */
+	/* internally used bytes per second rate limits */
 	uint64_t bps[2][LIMIT_CNT];
+	/* user configured bps limits */
+	uint64_t bps_conf[2][LIMIT_CNT];
 
-	/* IOPS limits */
+	/* internally used IOPS limits */
 	unsigned int iops[2][LIMIT_CNT];
+	/* user configured IOPS limits */
+	unsigned int iops_conf[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -355,6 +360,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	tg->bps[WRITE][LIMIT_MAX] = U64_MAX;
 	tg->iops[READ][LIMIT_MAX] = UINT_MAX;
 	tg->iops[WRITE][LIMIT_MAX] = UINT_MAX;
+	tg->bps_conf[READ][LIMIT_MAX] = U64_MAX;
+	tg->bps_conf[WRITE][LIMIT_MAX] = U64_MAX;
+	tg->iops_conf[READ][LIMIT_MAX] = UINT_MAX;
+	tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
+	/* LIMIT_LOW will have default value 0 */
 
 	return &tg->pd;
 }
@@ -412,6 +422,41 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_limit_valid(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+	bool low_valid = false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
+		    tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+			low_valid = true;
+	}
+	rcu_read_unlock();
+
+	td->limit_valid[LIMIT_LOW] = low_valid;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+
+	tg->bps[READ][LIMIT_LOW] = 0;
+	tg->bps[WRITE][LIMIT_LOW] = 0;
+	tg->iops[READ][LIMIT_LOW] = 0;
+	tg->iops[WRITE][LIMIT_LOW] = 0;
+
+	blk_throtl_update_limit_valid(tg->td);
+
+	if (tg->td->limit_index == LIMIT_LOW &&
+	    !tg->td->limit_valid[LIMIT_LOW])
+		tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1282,48 +1327,58 @@ static struct cftype throtl_legacy_files[] = {
 	{ }	/* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 			 int off)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
 	const char *dname = blkg_dev_name(pd->blkg);
 	char bufs[4][21] = { "max", "max", "max", "max" };
+	u64 bps_dft;
+	unsigned int iops_dft;
 
 	if (!dname)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
-	    tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
-	    tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
-	    tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+	if (off == LIMIT_LOW) {
+		bps_dft = 0;
+		iops_dft = 0;
+	} else {
+		bps_dft = U64_MAX;
+		iops_dft = UINT_MAX;
+	}
+
+	if (tg->bps_conf[READ][off] == bps_dft &&
+	    tg->bps_conf[WRITE][off] == bps_dft &&
+	    tg->iops_conf[READ][off] == iops_dft &&
+	    tg->iops_conf[WRITE][off] == iops_dft)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+	if (tg->bps_conf[READ][off] != bps_dft)
 		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-			tg->bps[READ][LIMIT_MAX]);
-	if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+			tg->bps_conf[READ][off]);
+	if (tg->bps_conf[WRITE][off] != bps_dft)
 		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-			tg->bps[WRITE][LIMIT_MAX]);
-	if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+			tg->bps_conf[WRITE][off]);
+	if (tg->iops_conf[READ][off] != iops_dft)
 		snprintf(bufs[2], sizeof(bufs[2]), "%u",
-			tg->iops[READ][LIMIT_MAX]);
-	if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+			tg->iops_conf[READ][off]);
+	if (tg->iops_conf[WRITE][off] != iops_dft)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u",
-			tg->iops[WRITE][LIMIT_MAX]);
+			tg->iops_conf[WRITE][off]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
 	return 0;
 }
 
-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
 {
-	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
 			  &blkcg_policy_throtl, seq_cft(sf)->private, false);
 	return 0;
 }
 
-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			  char *buf, size_t nbytes, loff_t off)
 {
 	struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1331,6 +1386,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	int ret;
+	int index = of_cft(of)->private;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1338,10 +1394,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ][LIMIT_MAX];
-	v[1] = tg->bps[WRITE][LIMIT_MAX];
-	v[2] = tg->iops[READ][LIMIT_MAX];
-	v[3] = tg->iops[WRITE][LIMIT_MAX];
+	v[0] = tg->bps_conf[READ][index];
+	v[1] = tg->bps_conf[WRITE][index];
+	v[2] = tg->iops_conf[READ][index];
+	v[3] = tg->iops_conf[WRITE][index];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1378,11 +1434,31 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ][LIMIT_MAX] = v[0];
-	tg->bps[WRITE][LIMIT_MAX] = v[1];
-	tg->iops[READ][LIMIT_MAX] = v[2];
-	tg->iops[WRITE][LIMIT_MAX] = v[3];
+	tg->bps_conf[READ][index] = v[0];
+	tg->bps_conf[WRITE][index] = v[1];
+	tg->iops_conf[READ][index] = v[2];
+	tg->iops_conf[WRITE][index] = v[3];
 
+	if (index == LIMIT_MAX) {
+		tg->bps[READ][index] = v[0];
+		tg->bps[WRITE][index] = v[1];
+		tg->iops[READ][index] = v[2];
+		tg->iops[WRITE][index] = v[3];
+	}
+	tg->bps[READ][LIMIT_LOW] = min(tg->bps_conf[READ][LIMIT_LOW],
+		tg->bps_conf[READ][LIMIT_MAX]);
+	tg->bps[WRITE][LIMIT_LOW] = min(tg->bps_conf[WRITE][LIMIT_LOW],
+		tg->bps_conf[WRITE][LIMIT_MAX]);
+	tg->iops[READ][LIMIT_LOW] = min(tg->iops_conf[READ][LIMIT_LOW],
+		tg->iops_conf[READ][LIMIT_MAX]);
+	tg->iops[WRITE][LIMIT_LOW] = min(tg->iops_conf[WRITE][LIMIT_LOW],
+		tg->iops_conf[WRITE][LIMIT_MAX]);
+
+	if (index == LIMIT_LOW) {
+		blk_throtl_update_limit_valid(tg->td);
+		if (tg->td->limit_valid[LIMIT_LOW])
+			tg->td->limit_index = LIMIT_LOW;
+	}
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1392,10 +1468,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 static struct cftype throtl_files[] = {
 	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_LOW,
+	},
+	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.seq_show = tg_print_max,
-		.write = tg_set_max,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_MAX,
 	},
 	{ }	/* terminate */
 };
@@ -1414,6 +1498,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_alloc_fn		= throtl_pd_alloc,
 	.pd_init_fn		= throtl_pd_init,
 	.pd_online_fn		= throtl_pd_online,
+	.pd_offline_fn		= throtl_pd_offline,
 	.pd_free_fn		= throtl_pd_free,
 };
 
@@ -1593,6 +1678,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->queue = q;
 
 	td->limit_valid[LIMIT_MAX] = true;
+	td->limit_index = LIMIT_MAX;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

* [PATCH V6 04/18] blk-throttle: configure bps/iops limit for cgroup in low limit
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (2 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 03/18] blk-throttle: add .low interface Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 05/18] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Each queue will have a state machine. Initially the queue is in LIMIT_LOW
state, which means all cgroups will be throttled according to their low
limit. After all cgroups with a low limit cross the limit, the queue state
gets upgraded to LIMIT_MAX state.
For the max limit, a cgroup uses the limit configured by the user.
For the low limit, a cgroup uses the minimum of the low limit and the
max limit configured by the user. If that minimum is 0, meaning the
cgroup doesn't configure a low limit, we use the max limit to throttle
the cgroup and the cgroup is always ready to upgrade to LIMIT_MAX.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d3ad43c..3bc6deb 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,12 +212,28 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->bps[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	uint64_t ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return U64_MAX;
+	ret = tg->bps[rw][tg->td->limit_index];
+	if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+		return tg->bps[rw][LIMIT_MAX];
+	return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->iops[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	unsigned int ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return UINT_MAX;
+	ret = tg->iops[rw][tg->td->limit_index];
+	if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+		return tg->iops[rw][LIMIT_MAX];
+	return ret;
 }
 
 /**
-- 
2.9.3

* [PATCH V6 05/18] blk-throttle: add upgrade logic for LIMIT_LOW state
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (3 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 04/18] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 06/18] blk-throttle: add downgrade logic Shaohua Li
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

When the queue is in LIMIT_LOW state and all cgroups with a low limit cross
their bps/iops limitation, we upgrade the queue's state to LIMIT_MAX. To
determine if a cgroup exceeds its limitation, we check if the cgroup has
pending requests. Since the cgroup is throttled according to the limit, a
pending request means the cgroup reached the limit.

If a cgroup has limits set for both read and write, we consider the
combination of them for the upgrade. The reason is that read IO and write IO
can interfere with each other. If we did the upgrade based on one direction of
IO, the other direction's IO could be severely harmed.

For a cgroup hierarchy, there are two cases. The children have a lower low
limit than the parent: the parent's low limit is meaningless, and if the
children's bps/iops cross the low limit, we can upgrade the queue state. The
other case is the children have a higher low limit than the parent: the
children's low limit is meaningless, and as long as the parent's bps/iops
(which is the sum of the children's bps/iops) cross the low limit, we can
upgrade the queue state.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 96 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3bc6deb..4cc066f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -457,6 +457,7 @@ static void blk_throtl_update_limit_valid(struct throtl_data *td)
 	td->limit_valid[LIMIT_LOW] = low_valid;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -468,9 +469,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
 	blk_throtl_update_limit_valid(tg->td);
 
-	if (tg->td->limit_index == LIMIT_LOW &&
-	    !tg->td->limit_valid[LIMIT_LOW])
-		tg->td->limit_index = LIMIT_MAX;
+	if (!tg->td->limit_valid[tg->td->limit_index])
+		throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1079,6 +1079,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1105,6 +1107,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
 	int ret;
 
 	spin_lock_irq(q->queue_lock);
+	if (throtl_can_upgrade(td, NULL))
+		throtl_upgrade_state(td);
+
 again:
 	parent_sq = sq->parent_sq;
 	dispatched = false;
@@ -1518,6 +1523,87 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	bool read_limit, write_limit;
+
+	/*
+	 * if cgroup reaches low limit (if low limit is 0, the cgroup always
+	 * reaches), it's ok to upgrade to next limit
+	 */
+	read_limit = tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW];
+	write_limit = tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW];
+	if (!read_limit && !write_limit)
+		return true;
+	if (read_limit && sq->nr_queued[READ] &&
+	    (!write_limit || sq->nr_queued[WRITE]))
+		return true;
+	if (write_limit && sq->nr_queued[WRITE] &&
+	    (!read_limit || sq->nr_queued[READ]))
+		return true;
+	return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+	while (true) {
+		if (throtl_tg_can_upgrade(tg))
+			return true;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+		if (!tg || !tg_to_blkg(tg)->parent)
+			return false;
+	}
+	return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	if (td->limit_index != LIMIT_LOW)
+		return false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg == this_tg)
+			continue;
+		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+			continue;
+		if (!throtl_hierarchy_can_upgrade(tg)) {
+			rcu_read_unlock();
+			return false;
+		}
+	}
+	rcu_read_unlock();
+	return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	td->limit_index = LIMIT_MAX;
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+		struct throtl_service_queue *sq = &tg->service_queue;
+
+		tg->disptime = jiffies - 1;
+		throtl_select_dispatch(sq);
+		throtl_schedule_next_dispatch(sq, false);
+	}
+	rcu_read_unlock();
+	throtl_select_dispatch(&td->service_queue);
+	throtl_schedule_next_dispatch(&td->service_queue, false);
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1540,14 +1626,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+again:
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL))
+		if (!tg_may_dispatch(tg, bio, NULL)) {
+			if (throtl_can_upgrade(tg->td, tg)) {
+				throtl_upgrade_state(tg->td);
+				goto again;
+			}
 			break;
+		}
 
 		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
-- 
2.9.3

* [PATCH V6 06/18] blk-throttle: add downgrade logic
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (4 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 05/18] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 07/18] blk-throttle: make sure expire time isn't too big Shaohua Li
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

When the queue state machine is in LIMIT_MAX state but a cgroup has been below
its low limit for some time, the queue should be downgraded to the lower
state, as that cgroup's low limit isn't being met.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 156 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4cc066f..9dbbc94d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,13 @@ struct throtl_grp {
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
 
+	unsigned long last_low_overflow_time[2];
+
+	uint64_t last_bytes_disp[2];
+	unsigned int last_io_disp[2];
+
+	unsigned long last_check_time;
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -159,6 +166,9 @@ struct throtl_data
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
+
+	unsigned long low_upgrade_time;
+	unsigned long low_downgrade_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -898,6 +908,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	/* Charge the bio to the group */
 	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
 	tg->io_disp[rw]++;
+	tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->last_io_disp[rw]++;
 
 	/*
 	 * BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -1523,6 +1535,45 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	unsigned long rtime = jiffies, wtime = jiffies;
+
+	if (tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW])
+		rtime = tg->last_low_overflow_time[READ];
+	if (tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+		wtime = tg->last_low_overflow_time[WRITE];
+	return min(rtime, wtime);
+}
+
+/* tg should not be an intermediate node */
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *parent_sq;
+	struct throtl_grp *parent = tg;
+	unsigned long ret = __tg_last_low_overflow_time(tg);
+
+	while (true) {
+		parent_sq = parent->service_queue.parent_sq;
+		parent = sq_to_tg(parent_sq);
+		if (!parent)
+			break;
+
+		/*
+		 * The parent doesn't have low limit, it always reaches low
+		 * limit. Its overflow time is useless for children
+		 */
+		if (!parent->bps[READ][LIMIT_LOW] &&
+		    !parent->iops[READ][LIMIT_LOW] &&
+		    !parent->bps[WRITE][LIMIT_LOW] &&
+		    !parent->iops[WRITE][LIMIT_LOW])
+			continue;
+		if (time_after(__tg_last_low_overflow_time(parent), ret))
+			ret = __tg_last_low_overflow_time(parent);
+	}
+	return ret;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1566,6 +1617,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_LOW)
 		return false;
 
+	if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+		return false;
+
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1589,6 +1643,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	struct blkcg_gq *blkg;
 
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1604,6 +1659,99 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+	td->limit_index = new;
+	td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
+{
+	struct throtl_data *td = tg->td;
+	unsigned long now = jiffies;
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
+	    time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+		return true;
+	return false;
+}
+
+static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
+{
+	while (true) {
+		if (!throtl_tg_can_downgrade(tg))
+			return false;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+		if (!tg || !tg_to_blkg(tg)->parent)
+			break;
+	}
+	return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+	uint64_t bps;
+	unsigned int iops;
+	unsigned long elapsed_time;
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_MAX ||
+	    !tg->td->limit_valid[LIMIT_LOW])
+		return;
+	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+		return;
+	if (time_after(tg->last_check_time + throtl_slice, now))
+		return;
+
+	elapsed_time = now - tg->last_check_time;
+	tg->last_check_time = now;
+
+	if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+		return;
+
+	if (tg->bps[READ][LIMIT_LOW]) {
+		bps = tg->last_bytes_disp[READ] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->bps[WRITE][LIMIT_LOW]) {
+		bps = tg->last_bytes_disp[WRITE] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	if (tg->iops[READ][LIMIT_LOW]) {
+		iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+		if (iops >= tg->iops[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->iops[WRITE][LIMIT_LOW]) {
+		iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+		if (iops >= tg->iops[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (throtl_hierarchy_can_downgrade(tg))
+		throtl_downgrade_state(tg->td, LIMIT_LOW);
+
+	tg->last_bytes_disp[READ] = 0;
+	tg->last_bytes_disp[WRITE] = 0;
+	tg->last_io_disp[READ] = 0;
+	tg->last_io_disp[WRITE] = 0;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1628,12 +1776,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		if (tg->last_low_overflow_time[rw] == 0)
+			tg->last_low_overflow_time[rw] = jiffies;
+		throtl_downgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
+			tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(tg->td, tg)) {
 				throtl_upgrade_state(tg->td);
 				goto again;
@@ -1677,6 +1829,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
+	tg->last_low_overflow_time[rw] = jiffies;
+
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
@@ -1787,6 +1941,8 @@ int blk_throtl_init(struct request_queue *q)
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
+	td->low_downgrade_time = jiffies;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

* [PATCH V6 07/18] blk-throttle: make sure expire time isn't too big
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (5 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 06/18] blk-throttle: add downgrade logic Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 08/18] blk-throttle: make throtl_slice tunable Shaohua Li
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

A cgroup could be throttled to a limit, but when all cgroups cross their low
limit, the queue enters a higher state and so the cgroup should be throttled
to a higher limit. It's possible the cgroup is sleeping because of throttling
and other cgroups don't dispatch IO any more. In this case, nobody can trigger
the current downgrade/upgrade logic. To fix this issue, we could either set up
a timer to wake up the cgroup when other cgroups are idle, or make sure this
cgroup doesn't sleep too long. Setting up a timer means we must change the
timer very frequently. This patch chooses the latter. Capping the cgroup's
sleep time doesn't change its bps/iops, but could make it wake up more
frequently, which isn't a big issue because throtl_slice * 8 is already quite
big.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9dbbc94d..ce4094e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -590,6 +590,17 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
+	unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+	/*
+	 * Since we are adjusting the throttle limit dynamically, the sleep
+	 * time calculated according to previous limit might be invalid. It's
+	 * possible the cgroup sleep time is very long and no other cgroups
+	 * have IO running so notify the limit changes. Make sure the cgroup
+	 * doesn't sleep too long to avoid the missed notification.
+	 */
+	if (time_after(expires, max_expire))
+		expires = max_expire;
 	mod_timer(&sq->pending_timer, expires);
 	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
 		   expires - jiffies, jiffies);
-- 
2.9.3

* [PATCH V6 08/18] blk-throttle: make throtl_slice tunable
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (6 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 07/18] blk-throttle: make sure expire time isn't too big Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 09/18] blk-throttle: choose a small throtl_slice for SSD Shaohua Li
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

throtl_slice is important for blk-throttling. It's called a slice internally,
but it really is the time window over which blk-throttling samples data.
blk-throttling makes decisions based on those samplings. An example is
bandwidth measurement: a cgroup's bandwidth is measured over a throtl_slice
time interval.

A small throtl_slice means cgroups have smoother throughput but burn
more CPU. It has a 100ms default value, which is not appropriate for all
disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
it tunable.

Since throtl_slice isn't a time slice, the sysfs name
'throttle_sample_time' reflects its character better.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 Documentation/block/queue-sysfs.txt |  6 +++
 block/blk-sysfs.c                   | 10 +++++
 block/blk-throttle.c                | 77 ++++++++++++++++++++++++++-----------
 block/blk.h                         |  3 ++
 4 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index c0a3bb5..5afebfe 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -192,5 +192,11 @@ scaling back writes. Writing a value of '0' to this file disables the
 feature. Writing a value of '-1' to this file resets the value to the
 default setting.
 
+throttle_sample_time (RW)
+-------------------------
+This is the time window that blk-throttle samples data, in millisecond.
+blk-throttle makes decision based on the samplings. Lower time means cgroups
+have more smooth throughput, but higher CPU overhead. This exists only when
+CONFIG_BLK_DEV_THROTTLING is enabled.
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05..0e3fb2a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -690,6 +690,13 @@ static struct queue_sysfs_entry queue_wb_lat_entry = {
 	.show = queue_wb_lat_show,
 	.store = queue_wb_lat_store,
 };
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_sample_time_entry = {
+	.attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_throtl_sample_time_show,
+	.store = blk_throtl_sample_time_store,
+};
+#endif
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
@@ -724,6 +731,9 @@ static struct attribute *default_attrs[] = {
 	&queue_stats_entry.attr,
 	&queue_wb_lat_entry.attr,
 	&queue_poll_delay_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+	&throtl_sample_time_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ce4094e..49cad9a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
 static int throtl_quantum = 32;
 
 /* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -162,6 +163,8 @@ struct throtl_data
 	/* Total Number of queued bios on READ and WRITE lists */
 	unsigned int nr_queued[2];
 
+	unsigned int throtl_slice;
+
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
@@ -590,7 +593,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
-	unsigned long max_expire = jiffies + 8 * throtl_slice;
+	unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
 
 	/*
 	 * Since we are adjusting the throttle limit dynamically, the sleep
@@ -658,7 +661,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 	if (time_after_eq(start, tg->slice_start[rw]))
 		tg->slice_start[rw] = start;
 
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -670,7 +673,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -680,13 +683,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
 					unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 }
 
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 	throtl_log(&tg->service_queue,
 		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -726,19 +729,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	 * is bad because it does not allow new slice to start.
 	 */
 
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
 	time_elapsed = jiffies - tg->slice_start[rw];
 
-	nr_slices = time_elapsed / throtl_slice;
+	nr_slices = time_elapsed / tg->td->throtl_slice;
 
 	if (!nr_slices)
 		return;
-	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+	io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+		HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -753,7 +757,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	else
 		tg->io_disp[rw] = 0;
 
-	tg->slice_start[rw] += nr_slices * throtl_slice;
+	tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;
 
 	throtl_log(&tg->service_queue,
 		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -773,9 +777,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	/*
 	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -822,9 +826,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
@@ -890,8 +894,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
 		throtl_start_new_slice(tg, rw);
 	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+		if (time_before(tg->slice_end[rw],
+		    jiffies + tg->td->throtl_slice))
+			throtl_extend_slice(tg, rw,
+				jiffies + tg->td->throtl_slice);
 	}
 
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1628,7 +1634,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_LOW)
 		return false;
 
-	if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+	if (time_before(jiffies, td->low_downgrade_time + td->throtl_slice))
 		return false;
 
 	rcu_read_lock();
@@ -1685,8 +1691,9 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
-	if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
-	    time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
+	    time_after_eq(now, tg_last_low_overflow_time(tg) +
+					td->throtl_slice))
 		return true;
 	return false;
 }
@@ -1715,13 +1722,14 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 		return;
 	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
 		return;
-	if (time_after(tg->last_check_time + throtl_slice, now))
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
 		return;
 
 	elapsed_time = now - tg->last_check_time;
 	tg->last_check_time = now;
 
-	if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+	if (time_before(now, tg_last_low_overflow_time(tg) +
+			tg->td->throtl_slice))
 		return;
 
 	if (tg->bps[READ][LIMIT_LOW]) {
@@ -1949,6 +1957,7 @@ int blk_throtl_init(struct request_queue *q)
 
 	q->td = td;
 	td->queue = q;
+	td->throtl_slice = DFL_THROTL_SLICE;
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
@@ -1969,6 +1978,30 @@ void blk_throtl_exit(struct request_queue *q)
 	kfree(q->td);
 }
 
+ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
+{
+	if (!q->td)
+		return -EINVAL;
+	return sprintf(page, "%u\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+	const char *page, size_t count)
+{
+	unsigned long v;
+	unsigned long t;
+
+	if (!q->td)
+		return -EINVAL;
+	if (kstrtoul(page, 10, &v))
+		return -EINVAL;
+	t = msecs_to_jiffies(v);
+	if (t == 0 || t > MAX_THROTL_SLICE)
+		return -EINVAL;
+	q->td->throtl_slice = t;
+	return count;
+}
+
 static int __init throtl_init(void)
 {
 	kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 041185e..e83e757 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -290,6 +290,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
 extern void blk_throtl_drain(struct request_queue *q);
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+	const char *page, size_t count);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 09/18] blk-throttle: choose a small throtl_slice for SSD
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (7 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 08/18] blk-throttle: make throtl_slice tunable Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 10/18] blk-throttle: detect completed idle cgroup Shaohua Li
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

The throtl_slice is 100ms by default. That is a long time for an SSD; a lot
of IO can run in one slice. To give cgroups smoother throughput, we choose a
smaller value (20ms) for SSDs.
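
As a rough illustration (the 400MB/s figure below is only an assumption for
the example, not something measured in this series): within one slice a
cgroup may dispatch up to roughly bps_limit * throtl_slice bytes before it
has to wait, so

	400MB/s * 100ms ~= 40MB per burst with the old default
	400MB/s *  20ms ~=  8MB per burst with the SSD default

which is why the shorter slice gives visibly smoother per-cgroup throughput.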

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-sysfs.c    |  2 ++
 block/blk-throttle.c | 18 +++++++++++++++---
 block/blk.h          |  2 ++
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0e3fb2a..7285f74 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -907,6 +907,8 @@ int blk_register_queue(struct gendisk *disk)
 
 	blk_wb_init(q);
 
+	blk_throtl_register_queue(q);
+
 	if (!q->request_fn)
 		return 0;
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 49cad9a..2d05c91 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,8 +18,9 @@ static int throtl_grp_quantum = 8;
 /* Total max dispatch from all groups in one round */
 static int throtl_quantum = 32;
 
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-#define DFL_THROTL_SLICE (HZ / 10)
+/* Throttling is performed over a slice and after that slice is renewed */
+#define DFL_THROTL_SLICE_HD (HZ / 10)
+#define DFL_THROTL_SLICE_SSD (HZ / 50)
 #define MAX_THROTL_SLICE (HZ)
 
 static struct blkcg_policy blkcg_policy_throtl;
@@ -1957,7 +1958,6 @@ int blk_throtl_init(struct request_queue *q)
 
 	q->td = td;
 	td->queue = q;
-	td->throtl_slice = DFL_THROTL_SLICE;
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
@@ -1978,6 +1978,18 @@ void blk_throtl_exit(struct request_queue *q)
 	kfree(q->td);
 }
 
+void blk_throtl_register_queue(struct request_queue *q)
+{
+	struct throtl_data *td;
+
+	td = q->td;
+	BUG_ON(!td);
+	if (blk_queue_nonrot(q))
+		td->throtl_slice = DFL_THROTL_SLICE_SSD;
+	else
+		td->throtl_slice = DFL_THROTL_SLICE_HD;
+}
+
 ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
 {
 	if (!q->td)
diff --git a/block/blk.h b/block/blk.h
index e83e757..186c67d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -293,10 +293,12 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
+extern void blk_throtl_register_queue(struct request_queue *q);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_register_queue(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 10/18] blk-throttle: detect completed idle cgroup
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (8 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 09/18] blk-throttle: choose a small throtl_slice for SSD Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 11/18] blk-throttle: make bandwidth change smooth Shaohua Li
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

A cgroup could be assigned a low limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup never hits its limit, so we
can't move the state machine to the higher level and all cgroups stay
throttled to their low limits, which wastes bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can use
the bandwidth.

Please note this will be replaced by a more sophisticated algorithm in a
later patch, but it demonstrates the idea of how idle cgroups are handled,
so I keep it here.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2d05c91..b3ce176 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -149,6 +149,8 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	unsigned long last_dispatch_time[2];
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -445,11 +447,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+	struct throtl_grp *tg = pd_to_tg(pd);
 	/*
 	 * We don't want new groups to escape the limits of its ancestors.
 	 * Update has_rules[] after a new group is brought online.
 	 */
-	tg_update_has_rules(pd_to_tg(pd));
+	tg_update_has_rules(tg);
+	tg->last_dispatch_time[READ] = jiffies;
+	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1611,6 +1616,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 	if (write_limit && sq->nr_queued[WRITE] &&
 	    (!read_limit || sq->nr_queued[READ]))
 		return true;
+
+	if (time_after_eq(jiffies,
+	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+	    time_after_eq(jiffies,
+	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		return true;
 	return false;
 }
 
@@ -1688,6 +1699,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
+	if (time_after_eq(now, tg->last_dispatch_time[READ] +
+					td->throtl_slice) &&
+	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
+					td->throtl_slice))
+		return false;
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
@@ -1796,6 +1812,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 11/18] blk-throttle: make bandwidth change smooth
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (9 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 10/18] blk-throttle: detect completed idle cgroup Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 12/18] blk-throttle: add a simple idle detection Shaohua Li
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

When all cgroups reach their low limits, they are allowed to dispatch more
IO. This can let some cgroups dispatch more IO while others cannot, and some
cgroups may even dispatch less IO than their low limit. For example, cg1 has
a low limit of 10MB/s, cg2 a low limit of 80MB/s, and assume the disk's
maximum bandwidth is 120MB/s for the workload. Their bps could look
something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO
afterwards. cg1 then dispatches more IO and cg2 has no room left to dispatch
enough IO. At T2, cg2 only dispatches 60MB/s. Since we detect that cg2
dispatches less than its 80MB/s low limit, we downgrade the queue from
LIMIT_MAX to LIMIT_LOW, and all cgroups are throttled to their low limits
again (T3). cg2 ends up with bandwidth below its low limit most of the time.

The big problem here is that we don't know the workload's maximum
bandwidth, so we can't make a smart decision to avoid the situation. This
patch makes cgroup bandwidth changes smooth. After the disk upgrades from
LIMIT_LOW to LIMIT_MAX, cgroups are not allowed to use all bandwidth up to
their max limit immediately. Their bandwidth limit is increased gradually to
avoid the situation above. The example above then becomes something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

This way cgroup bandwidth stays above the low limit most of the time. It
still doesn't fully utilize disk bandwidth, but that's the price we pay for
sharing.

Scale up is linear: the limit increases by half of the .low limit every
throtl_slice after an upgrade, and stops once the adjusted limit hits the
.max limit. Scale down is exponential: we halve the scale value whenever a
cgroup doesn't hit its .low limit. Once the scale reaches 0, the queue is
fully downgraded to the LIMIT_LOW state.
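
To make the numbers concrete, the adjusted limit while the queue is in
LIMIT_MAX follows the formula added in throtl_adjusted_limit():

	adjusted = low + (low / 2) * scale

where scale grows by one roughly every throtl_slice since the upgrade and
the result is capped by the .max limit. Taking cg1's 10MB/s low limit from
the example: scale 0 -> 10MB/s, scale 1 -> 15MB/s, scale 2 -> 20MB/s, and so
on until cg1 hits its max limit.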

Note this doesn't completely prevent a cgroup from running under its low
limit. The best way to guarantee a cgroup doesn't run under its low limit is
to set a max limit on the others. For example, if we set cg1's max limit to
40MB/s, cg2 will never run under its low limit.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b3ce176..a9eb03b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -175,6 +175,8 @@ struct throtl_data
 
 	unsigned long low_upgrade_time;
 	unsigned long low_downgrade_time;
+
+	unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -226,29 +228,70 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+/*
+ * cgroup's limit in LIMIT_MAX is scaled if low limit is set. This scale is to
+ * make the IO dispatch more smooth.
+ * Scale up: linearly scale up according to lapsed time since upgrade. For
+ *           every throtl_slice, the limit scales up 1/2 .low limit till the
+ *           limit hits .max limit
+ * Scale down: exponentially scale down if a cgroup doesn't hit its .low limit
+ */
+static uint64_t throtl_adjusted_limit(uint64_t low, struct throtl_data *td)
+{
+	/* arbitrary value to avoid too big scale */
+	if (td->scale < 4096 && time_after_eq(jiffies,
+	    td->low_upgrade_time + td->scale * td->throtl_slice))
+		td->scale = (jiffies - td->low_upgrade_time) / td->throtl_slice;
+
+	return low + (low >> 1) * td->scale;
+}
+
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	uint64_t ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return U64_MAX;
-	ret = tg->bps[rw][tg->td->limit_index];
-	if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
+
+	td = tg->td;
+	ret = tg->bps[rw][td->limit_index];
+	if (ret == 0 && td->limit_index == LIMIT_LOW)
 		return tg->bps[rw][LIMIT_MAX];
+
+	if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] &&
+	    tg->bps[rw][LIMIT_LOW] != tg->bps[rw][LIMIT_MAX]) {
+		uint64_t adjusted;
+
+		adjusted = throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td);
+		ret = min(tg->bps[rw][LIMIT_MAX], adjusted);
+	}
 	return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	unsigned int ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return UINT_MAX;
-	ret = tg->iops[rw][tg->td->limit_index];
+	td = tg->td;
+	ret = tg->iops[rw][td->limit_index];
 	if (ret == 0 && tg->td->limit_index == LIMIT_LOW)
 		return tg->iops[rw][LIMIT_MAX];
+
+	if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] &&
+	    tg->iops[rw][LIMIT_LOW] != tg->iops[rw][LIMIT_MAX]) {
+		uint64_t adjusted;
+
+		adjusted = throtl_adjusted_limit(tg->iops[rw][LIMIT_LOW], td);
+		if (adjusted > UINT_MAX)
+			adjusted = UINT_MAX;
+		ret = min(tg->iops[rw][LIMIT_MAX], (unsigned int)adjusted);
+	}
 	return ret;
 }
 
@@ -1673,6 +1716,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
+	td->scale = 0;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1690,6 +1734,13 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 static void throtl_downgrade_state(struct throtl_data *td, int new)
 {
+	td->scale /= 2;
+
+	if (td->scale) {
+		td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
+		return;
+	}
+
 	td->limit_index = new;
 	td->low_downgrade_time = jiffies;
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 12/18] blk-throttle: add a simple idle detection
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (10 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 11/18] blk-throttle: make bandwidth change smooth Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 13/18] blk-throttle: add interface to configure idle time threshold Shaohua Li
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

A cgroup may get assigned a low limit but never dispatch enough IO to cross
that limit. In such a case, the queue state machine stays in the LIMIT_LOW
state and all other cgroups are throttled according to their low limits.
This is unfair to the other cgroups. We should treat the cgroup as idle and
upgrade the state machine to the LIMIT_MAX state.

We also have downgrade logic. If the state machine upgrades because a cgroup
is idle (real idle), the state machine would downgrade again soon because
that cgroup remains below its low limit. This isn't what we want. A more
complicated case is a cgroup that isn't idle while the queue is in
LIMIT_LOW, but once the queue is upgraded to LIMIT_MAX, other cgroups
dispatch more IO and this cgroup can't dispatch enough, so it falls below
its low limit and looks idle (fake idle). In this case the queue should
downgrade soon. The key to deciding whether to downgrade is detecting
whether the cgroup is truly idle.

Unfortunately it's very hard to determine whether a cgroup is really idle.
This patch uses the 'think time check' idea from CFQ for that purpose.
Please note the idea doesn't work for all workloads. For example, a workload
with IO depth 8 keeps disk utilization at 100%, hence its think time is 0,
i.e. not idle; but the same workload could achieve higher bandwidth with IO
depth 16. Compared to IO depth 16, the IO depth 8 workload is effectively
idle. We use the idea only to roughly determine whether a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by default
1ms for SSD and 100ms for HD). The idea is that think time above the
threshold starts to harm performance. HD is much slower, so a longer think
time is acceptable there.

This patch (and the later patches) uses 'unsigned long' to track time. We
convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses some precision,
which should not be a big deal.
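
Two quick numbers on this, derived from the arithmetic in the patch:
'ns >> 10' divides by 1024 rather than 1000, so the stored values are about
2.3% smaller than true microseconds. The average think time is an
exponentially weighted moving average,

	avg_idletime = (avg_idletime * 7 + (now - last_finish_time)) / 8

so roughly the last eight completions dominate the estimate.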

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-throttle.c      | 79 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 2b37502..948ebc1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 
 #include <trace/events/block.h>
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1813,6 +1814,7 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a9eb03b..974ec9d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -22,6 +22,9 @@ static int throtl_quantum = 32;
 #define DFL_THROTL_SLICE_HD (HZ / 10)
 #define DFL_THROTL_SLICE_SSD (HZ / 50)
 #define MAX_THROTL_SLICE (HZ)
+#define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */
+#define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */
+#define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -154,6 +157,11 @@ struct throtl_grp {
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
+
+	unsigned long last_finish_time; /* ns / 1024 */
+	unsigned long checked_last_finish_time; /* ns / 1024 */
+	unsigned long avg_idletime; /* ns / 1024 */
+	unsigned long idletime_threshold; /* us */
 };
 
 struct throtl_data
@@ -468,6 +476,11 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
 		sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
 	tg->td = td;
+
+	if (blk_queue_nonrot(td->queue))
+		tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+	else
+		tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
 }
 
 /*
@@ -1640,6 +1653,21 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+	/*
+	 * cgroup is idle if:
+	 * - single idle is too long, longer than a fixed value (in case user
+	 *   configure a too big threshold) or 4 times of slice
+	 * - average think time is more than threshold
+	 */
+	unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);
+
+	time = min_t(unsigned long, MAX_IDLE_TIME, time);
+	return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
+	       tg->avg_idletime > tg->idletime_threshold;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1839,6 +1867,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 	tg->last_io_disp[WRITE] = 0;
 }
 
+static void blk_throtl_update_idletime(struct throtl_grp *tg)
+{
+	unsigned long now = ktime_get_ns() >> 10;
+	unsigned long last_finish_time = tg->last_finish_time;
+
+	if (now <= last_finish_time || last_finish_time == 0 ||
+	    last_finish_time == tg->checked_last_finish_time)
+		return;
+
+	tg->avg_idletime = (tg->avg_idletime * 7 + now - last_finish_time) >> 3;
+	tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1847,6 +1888,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
+	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
@@ -1859,6 +1901,12 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
+	ret = bio_associate_current(bio);
+	if (ret == 0 || ret == -EBUSY)
+		bio->bi_cg_private = tg;
+
+	blk_throtl_update_idletime(tg);
+
 	sq = &tg->service_queue;
 
 again:
@@ -1919,7 +1967,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_low_overflow_time[rw] = jiffies;
 
-	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
@@ -1948,6 +1995,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	return throttled;
 }
 
+void blk_throtl_bio_endio(struct bio *bio)
+{
+	struct throtl_grp *tg;
+
+	tg = bio->bi_cg_private;
+	if (!tg)
+		return;
+	bio->bi_cg_private = NULL;
+
+	tg->last_finish_time = ktime_get_ns() >> 10;
+}
+
 /*
  * Dispatch all bios from all children tg's queued on @parent_sq.  On
  * return, @parent_sq is guaranteed to not have any active children tg's
@@ -2031,6 +2090,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
 	td->low_downgrade_time = jiffies;
+
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
@@ -2049,6 +2109,8 @@ void blk_throtl_exit(struct request_queue *q)
 void blk_throtl_register_queue(struct request_queue *q)
 {
 	struct throtl_data *td;
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
 
 	td = q->td;
 	BUG_ON(!td);
@@ -2056,6 +2118,21 @@ void blk_throtl_register_queue(struct request_queue *q)
 		td->throtl_slice = DFL_THROTL_SLICE_SSD;
 	else
 		td->throtl_slice = DFL_THROTL_SLICE_HD;
+
+	/*
+	 * some tg are created before queue is fully initialized, eg, nonrot
+	 * isn't initialized yet
+	 */
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (blk_queue_nonrot(q))
+			tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+		else
+			tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+	}
+	rcu_read_unlock();
 }
 
 ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
diff --git a/block/blk.h b/block/blk.h
index 186c67d..1539825 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -294,11 +294,13 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
 extern void blk_throtl_register_queue(struct request_queue *q);
+extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 static inline void blk_throtl_register_queue(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 519ea2c..01019a7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -58,6 +58,7 @@ struct bio {
 	 */
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
+	void			*bi_cg_private;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 13/18] blk-throttle: add interface to configure idle time threshold
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (11 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 12/18] blk-throttle: add a simple idle detection Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 14/18] blk-throttle: ignore idle cgroup limit Shaohua Li
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Add an interface to configure the idle time threshold. The io.low interface
will look like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

idle is in microseconds.
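
Based on the tg_prfill_limit() change below, a group configured with the
echo above would read back roughly as (unset fields default to "max"):

	8:16 rbps=2097152 wbps=max riops=max wiops=max idle=2000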

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 974ec9d..88c0d1e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -181,6 +181,8 @@ struct throtl_data
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
 
+	unsigned long dft_idletime_threshold; /* us */
+
 	unsigned long low_upgrade_time;
 	unsigned long low_downgrade_time;
 
@@ -477,10 +479,7 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 		sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
 	tg->td = td;
 
-	if (blk_queue_nonrot(td->queue))
-		tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
-	else
-		tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+	tg->idletime_threshold = td->dft_idletime_threshold;
 }
 
 /*
@@ -1447,6 +1446,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	char bufs[4][21] = { "max", "max", "max", "max" };
 	u64 bps_dft;
 	unsigned int iops_dft;
+	char idle_time[26] = "";
 
 	if (!dname)
 		return 0;
@@ -1462,7 +1462,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	if (tg->bps_conf[READ][off] == bps_dft &&
 	    tg->bps_conf[WRITE][off] == bps_dft &&
 	    tg->iops_conf[READ][off] == iops_dft &&
-	    tg->iops_conf[WRITE][off] == iops_dft)
+	    tg->iops_conf[WRITE][off] == iops_dft &&
+	    (off != LIMIT_LOW || tg->idletime_threshold ==
+				  tg->td->dft_idletime_threshold))
 		return 0;
 
 	if (tg->bps_conf[READ][off] != bps_dft)
@@ -1477,9 +1479,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	if (tg->iops_conf[WRITE][off] != iops_dft)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u",
 			tg->iops_conf[WRITE][off]);
+	if (off == LIMIT_LOW) {
+		if (tg->idletime_threshold == ULONG_MAX)
+			strcpy(idle_time, " idle=max");
+		else
+			snprintf(idle_time, sizeof(idle_time), " idle=%lu",
+				tg->idletime_threshold);
+	}
 
-	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
-		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
 	return 0;
 }
 
@@ -1497,6 +1506,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	unsigned long idle_time;
 	int ret;
 	int index = of_cft(of)->private;
 
@@ -1511,6 +1521,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	v[2] = tg->iops_conf[READ][index];
 	v[3] = tg->iops_conf[WRITE][index];
 
+	idle_time = tg->idletime_threshold;
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
@@ -1542,6 +1553,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			v[2] = min_t(u64, val, UINT_MAX);
 		else if (!strcmp(tok, "wiops"))
 			v[3] = min_t(u64, val, UINT_MAX);
+		else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+			idle_time = val;
 		else
 			goto out_finish;
 	}
@@ -1570,6 +1583,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		blk_throtl_update_limit_valid(tg->td);
 		if (tg->td->limit_valid[LIMIT_LOW])
 			tg->td->limit_index = LIMIT_LOW;
+		tg->idletime_threshold = (idle_time == ULONG_MAX) ?
+			ULONG_MAX : idle_time;
 	}
 	tg_conf_updated(tg);
 	ret = 0;
@@ -2114,10 +2129,13 @@ void blk_throtl_register_queue(struct request_queue *q)
 
 	td = q->td;
 	BUG_ON(!td);
-	if (blk_queue_nonrot(q))
+	if (blk_queue_nonrot(q)) {
 		td->throtl_slice = DFL_THROTL_SLICE_SSD;
-	else
+		td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
+	} else {
 		td->throtl_slice = DFL_THROTL_SLICE_HD;
+		td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+	}
 
 	/*
 	 * some tg are created before queue is fully initialized, eg, nonrot
@@ -2127,10 +2145,7 @@ void blk_throtl_register_queue(struct request_queue *q)
 	blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
 
-		if (blk_queue_nonrot(q))
-			tg->idletime_threshold = DFL_IDLE_THRESHOLD_SSD;
-		else
-			tg->idletime_threshold = DFL_IDLE_THRESHOLD_HD;
+		tg->idletime_threshold = td->dft_idletime_threshold;
 	}
 	rcu_read_unlock();
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 14/18] blk-throttle: ignore idle cgroup limit
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (12 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 13/18] blk-throttle: add interface to configure idle time threshold Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 15/18] blk-throttle: add interface for per-cgroup target latency Shaohua Li
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

The last patch introduced a way to detect an idle cgroup. We use it to make
the upgrade/downgrade decision. The new algorithm can also detect a
completely idle cgroup, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 88c0d1e..cf8d386 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -152,8 +152,6 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
-	unsigned long last_dispatch_time[2];
-
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -508,8 +506,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
-	tg->last_dispatch_time[READ] = jiffies;
-	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_limit_valid(struct throtl_data *td)
@@ -1704,9 +1700,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 		return true;
 
 	if (time_after_eq(jiffies,
-	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-	    time_after_eq(jiffies,
-	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+	    throtl_tg_is_idle(tg))
 		return true;
 	return false;
 }
@@ -1752,6 +1747,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_LOW)
+		return;
+
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+		return;
+
+	tg->last_check_time = now;
+
+	if (!time_after_eq(now,
+	     __tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+		return;
+
+	if (throtl_can_upgrade(tg->td, NULL))
+		throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
 	struct cgroup_subsys_state *pos_css;
@@ -1793,18 +1808,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
-	if (time_after_eq(now, tg->last_dispatch_time[READ] +
-					td->throtl_slice) &&
-	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
-					td->throtl_slice))
-		return false;
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
 	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
 	    time_after_eq(now, tg_last_low_overflow_time(tg) +
-					td->throtl_slice))
+					td->throtl_slice) &&
+	    (!throtl_tg_is_idle(tg) ||
+	     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
 		return true;
 	return false;
 }
@@ -1926,10 +1938,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
-		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
+		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 15/18] blk-throttle: add interface for per-cgroup target latency
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (13 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 14/18] blk-throttle: ignore idle cgroup limit Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 16/18] block: track request size in blk_issue_stat Shaohua Li
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Here we introduce a per-cgroup latency target. The target determines how
much latency increase a cgroup can afford. We use the target latency to
calculate a threshold and use the threshold to schedule IO among cgroups. If
a cgroup's bandwidth is below its low limit but its average latency is below
the threshold, other cgroups can safely dispatch more IO, even above their
own low limits. On the other hand, if the first cgroup's latency is above
the threshold, other cgroups are throttled to their low limits. So the
target latency determines how efficiently we utilize free disk resources
without sacrificing the workload's IO latency.

For example, assume the average 4k IO latency is 50us when the disk isn't
congested, and a cgroup sets its latency target to 30us. The cgroup can then
accept 50+30=80us IO latency. If the cgroup's average IO latency is 90us and
its bandwidth is below its low limit, other cgroups are throttled to their
low limits. If the cgroup's average IO latency is 60us, other cgroups are
allowed to dispatch more IO. As other cgroups dispatch more IO, the first
cgroup's IO latency increases; once it crosses 80us (say 81us), we throttle
the other cgroups again.

The user configures the interface in this way:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

latency is in microseconds.

By default, the latency target is 0, which means IO latency is guaranteed.
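
With the tg_prfill_limit() change below, the group configured above would
read back roughly as:

	8:16 rbps=2097152 wbps=max riops=max wiops=max idle=200 latency=100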

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cf8d386..b2fab58 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
 #define DFL_IDLE_THRESHOLD_SSD (1000L) /* 1 ms */
 #define DFL_IDLE_THRESHOLD_HD (100L * 1000) /* 100 ms */
 #define MAX_IDLE_TIME (5L * 1000 * 1000) /* 5 s */
+/* default latency target is 0, eg, guarantee IO latency by default */
+#define DFL_LATENCY_TARGET (0)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -152,6 +154,7 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	unsigned long latency_target; /* us */
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -449,6 +452,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	tg->iops_conf[WRITE][LIMIT_MAX] = UINT_MAX;
 	/* LIMIT_LOW will have default value 0 */
 
+	tg->latency_target = DFL_LATENCY_TARGET;
+
 	return &tg->pd;
 }
 
@@ -1443,6 +1448,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	u64 bps_dft;
 	unsigned int iops_dft;
 	char idle_time[26] = "";
+	char latency_time[26] = "";
 
 	if (!dname)
 		return 0;
@@ -1459,8 +1465,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	    tg->bps_conf[WRITE][off] == bps_dft &&
 	    tg->iops_conf[READ][off] == iops_dft &&
 	    tg->iops_conf[WRITE][off] == iops_dft &&
-	    (off != LIMIT_LOW || tg->idletime_threshold ==
-				  tg->td->dft_idletime_threshold))
+	    (off != LIMIT_LOW ||
+	     (tg->idletime_threshold == tg->td->dft_idletime_threshold &&
+	      tg->latency_target == DFL_LATENCY_TARGET)))
 		return 0;
 
 	if (tg->bps_conf[READ][off] != bps_dft)
@@ -1481,10 +1488,17 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 		else
 			snprintf(idle_time, sizeof(idle_time), " idle=%lu",
 				tg->idletime_threshold);
+
+		if (tg->latency_target == ULONG_MAX)
+			strcpy(latency_time, " latency=max");
+		else
+			snprintf(latency_time, sizeof(latency_time),
+				" latency=%lu", tg->latency_target);
 	}
 
-	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
-		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+		   latency_time);
 	return 0;
 }
 
@@ -1503,6 +1517,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	unsigned long idle_time;
+	unsigned long latency_time;
 	int ret;
 	int index = of_cft(of)->private;
 
@@ -1518,6 +1533,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	v[3] = tg->iops_conf[WRITE][index];
 
 	idle_time = tg->idletime_threshold;
+	latency_time = tg->latency_target;
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
@@ -1551,6 +1567,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			v[3] = min_t(u64, val, UINT_MAX);
 		else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
 			idle_time = val;
+		else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+			latency_time = val;
 		else
 			goto out_finish;
 	}
@@ -1581,6 +1599,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 		tg->idletime_threshold = (idle_time == ULONG_MAX) ?
 			ULONG_MAX : idle_time;
+		tg->latency_target = (latency_time == ULONG_MAX) ?
+			ULONG_MAX : latency_time;
 	}
 	tg_conf_updated(tg);
 	ret = 0;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 16/18] block: track request size in blk_issue_stat
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (14 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 15/18] blk-throttle: add interface for per-cgroup target latency Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 17/18] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

Currently there is no way to know the request size when the request
finishes. The next patch will need this information. We could add an extra
field to record the size, but blk_issue_stat has enough spare space to
record it, so this patch just overloads blk_issue_stat. With this, we have
49 bits to track time, which is still a very long time.
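
A quick sanity check on the new layout (3 reserved bits + 12 size bits + 49
time bits = 64):

	2^49 ns ~= 5.6 * 10^14 ns ~= 156 hours ~= 6.5 days

and 12 bits of sector count caps the recorded size at 4095 sectors (~2MB),
which is enough because the next patch accounts anything above 1M as 1M.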

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c          |  2 +-
 block/blk-mq.c            |  2 +-
 block/blk-stat.c          |  7 ++++---
 block/blk-stat.h          | 29 +++++++++++++++++++++++------
 block/blk-wbt.h           | 10 +++++-----
 include/linux/blk_types.h |  2 +-
 6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..485c32d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2459,7 +2459,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);
 
 	if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
-		blk_stat_set_issue_time(&req->issue_stat);
+		blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
 		req->rq_flags |= RQF_STATS;
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8e67a1..1ad172c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -474,7 +474,7 @@ void blk_mq_start_request(struct request *rq)
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		blk_stat_set_issue_time(&rq->issue_stat);
+		blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
 		rq->rq_flags |= RQF_STATS;
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 9b43efb..7e9df17 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -236,10 +236,11 @@ void blk_stat_clear(struct request_queue *q)
 	}
 }
 
-void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
 {
-	stat->time = (stat->time & BLK_STAT_MASK) |
-			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+	stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+		(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+		(((u64)blk_capped_size(size)) << BLK_STAT_SIZE_SHIFT);
 }
 
 /*
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a2050a0..462197f 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
 #define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
 
 /*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
  */
 #define BLK_STAT_RES_BITS	3
-#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS	12
+#define BLK_STAT_RES_SHIFT	(64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT	(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK	\
+	(((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK	(~((1ULL << BLK_STAT_RES_SHIFT) - 1))
 
 enum {
 	BLK_STAT_READ	= 0,
@@ -26,7 +33,7 @@ void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
 void blk_stat_clear(struct request_queue *);
 void blk_stat_init(struct blk_rq_stat *);
 bool blk_stat_is_current(struct blk_rq_stat *);
-void blk_stat_set_issue_time(struct blk_issue_stat *);
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size);
 bool blk_stat_enable(struct request_queue *);
 
 static inline u64 __blk_stat_time(u64 time)
@@ -36,7 +43,17 @@ static inline u64 __blk_stat_time(u64 time)
 
 static inline u64 blk_stat_time(struct blk_issue_stat *stat)
 {
-	return __blk_stat_time(stat->time);
+	return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+	return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+	return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
 }
 
 #endif
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 65f1de5..7265f1f 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {
 
 static inline void wbt_clear_state(struct blk_issue_stat *stat)
 {
-	stat->time &= BLK_STAT_TIME_MASK;
+	stat->stat &= ~BLK_STAT_RES_MASK;
 }
 
 static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
 {
-	return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+	return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
 }
 
 static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
 {
-	stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+	stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
 }
 
 static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
 }
 
 static inline bool wbt_is_read(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
 }
 
 struct rq_wait {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 01019a7..8159a6c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -257,7 +257,7 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 }
 
 struct blk_issue_stat {
-	u64 time;
+	u64 stat;
 };
 
 #define BLK_RQ_STAT_BATCH	64
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V6 17/18] blk-throttle: add a mechanism to estimate IO latency
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (15 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 16/18] block: track request size in blk_issue_stat Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-01-15  3:42 ` [PATCH V6 18/18] blk-throttle: add latency target support Shaohua Li
  2017-08-31  7:24 ` [PATCH V6 00/18] blk-throttle: add .low limit Paolo VALENTE
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

The user configures a latency target, but the latency threshold for each
request size isn't fixed. For an SSD, IO latency depends heavily on request
size. To calculate the latency threshold, we sample some data, e.g. the
average latency for request sizes 4k, 8k, 16k, 32k .. 1M. The latency
threshold for each request size is the sampled latency (I'll call it the
base latency) plus the latency target. For example, if the base latency for
4k requests is 80us and the user configures a latency target of 60us, the 4k
latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the base-2 order of the rounded-up IO sectors.
If the IO size is bigger than 1M, it is accounted as 1M. Since the
calculation rounds up, the base latency will be slightly smaller than the
actual value. Also, if no IO is dispatched for a specific IO size, we use
the base latency of the next smaller IO size for it.
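
With request_bucket_index() as added below (base-2 order of the sector
count, minus 3, clamped to [0, LATENCY_BUCKET_SIZE - 1]), the nine buckets
map as:

	 4k =    8 sectors -> order 3  -> bucket 0
	 8k =   16 sectors -> order 4  -> bucket 1
	16k =   32 sectors -> order 5  -> bucket 2
	...
	 1M = 2048 sectors -> order 11 -> bucket 8
	>1M                -> clamped  -> bucket 8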

But we shouldn't sample data at just any time. The base latency is supposed
to be the latency when the disk isn't congested, because we use the latency
threshold to schedule IO among cgroups. If the disk is congested the latency
is higher, and using it for scheduling is meaningless. Hence we only sample
while block throttling is in the LOW limit state, on the assumption that the
disk isn't congested in that state. If the assumption doesn't hold, e.g. the
low limit is too high, the calculated latency threshold will simply be
higher.

Hard disks are completely different: latency depends on spindle seek rather
than request size. Currently this feature is SSD only; we could probably use
a fixed threshold like 4ms for hard disks though.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-stat.c          |   4 ++
 block/blk-throttle.c      | 162 ++++++++++++++++++++++++++++++++++++++++++++--
 block/blk.h               |   2 +
 include/linux/blk_types.h |   9 +--
 4 files changed, 168 insertions(+), 9 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 7e9df17..073eff4 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -7,6 +7,7 @@
 #include <linux/blk-mq.h>
 
 #include "blk-stat.h"
+#include "blk.h"
 #include "blk-mq.h"
 
 static void blk_stat_flush_batch(struct blk_rq_stat *stat)
@@ -204,6 +205,9 @@ void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
 		__blk_stat_init(stat, now);
 
 	value = now - blk_stat_time(&rq->issue_stat);
+
+	blk_throtl_stat_add(rq, value);
+
 	if (value > stat->max)
 		stat->max = value;
 	if (value < stat->min)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b2fab58..c0eb100 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -28,6 +28,8 @@ static int throtl_quantum = 32;
 /* default latency target is 0, eg, guarantee IO latency by default */
 #define DFL_LATENCY_TARGET (0)
 
+#define SKIP_LATENCY (((u64)1) << BLK_STAT_RES_SHIFT)
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -165,6 +167,19 @@ struct throtl_grp {
 	unsigned long idletime_threshold; /* us */
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+	unsigned long total_latency; /* ns / 1024 */
+	int samples;
+};
+
+struct avg_latency_bucket {
+	unsigned long latency; /* ns / 1024 */
+	bool valid;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -188,6 +203,13 @@ struct throtl_data
 	unsigned long low_downgrade_time;
 
 	unsigned int scale;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	unsigned long last_calculate_time;
+
+	bool track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -306,6 +328,13 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+	int order = order_base_2(sectors);
+
+	return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1927,6 +1956,67 @@ static void blk_throtl_update_idletime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	int i, cpu;
+	unsigned long last_latency = 0;
+	unsigned long latency;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			/* this isn't race free, but ok in practice */
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			int samples = tmp->samples;
+
+			latency = tmp->total_latency;
+
+			tmp->total_latency = 0;
+			tmp->samples = 0;
+			latency /= samples;
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!avg_latency[i].latency) {
+			if (td->avg_buckets[i].latency < last_latency)
+				td->avg_buckets[i].latency = last_latency;
+			continue;
+		}
+
+		if (!td->avg_buckets[i].valid)
+			latency = avg_latency[i].latency;
+		else
+			latency = (td->avg_buckets[i].latency * 7 +
+				avg_latency[i].latency) >> 3;
+
+		td->avg_buckets[i].latency = max(latency, last_latency);
+		td->avg_buckets[i].valid = true;
+		last_latency = td->avg_buckets[i].latency;
+	}
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1935,6 +2025,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
+	struct throtl_data *td = tg->td;
 	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -1945,6 +2036,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_update_latency_buckets(td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
@@ -1952,6 +2045,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (ret == 0 || ret == -EBUSY)
 		bio->bi_cg_private = tg;
 
+	blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
+
 	blk_throtl_update_idletime(tg);
 
 	sq = &tg->service_queue;
@@ -1969,8 +2064,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
 			tg->last_low_overflow_time[rw] = jiffies;
-			if (throtl_can_upgrade(tg->td, tg)) {
-				throtl_upgrade_state(tg->td);
+			if (throtl_can_upgrade(td, tg)) {
+				throtl_upgrade_state(td);
 				goto again;
 			}
 			break;
@@ -2014,7 +2109,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_low_overflow_time[rw] = jiffies;
 
-	tg->td->nr_queued[rw]++;
+	td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
@@ -2039,19 +2134,63 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 */
 	if (!throttled)
 		bio_clear_flag(bio, BIO_THROTTLED);
+
+	if (throttled || !td->track_bio_latency)
+		bio->bi_issue_stat.stat |= SKIP_LATENCY;
 	return throttled;
 }
 
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+	int op, unsigned long time)
+{
+	struct latency_bucket *latency;
+	int index;
+
+	if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+	    !blk_queue_nonrot(td->queue))
+		return;
+
+	index = request_bucket_index(size);
+
+	latency = get_cpu_ptr(td->latency_buckets);
+	latency[index].total_latency += time;
+	latency[index].samples++;
+	put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_stat_add(struct request *rq, u64 time_ns)
+{
+	struct request_queue *q = rq->q;
+	struct throtl_data *td = q->td;
+
+	throtl_track_latency(td, blk_stat_size(&rq->issue_stat),
+		req_op(rq), time_ns >> 10);
+}
+
 void blk_throtl_bio_endio(struct bio *bio)
 {
 	struct throtl_grp *tg;
+	u64 finish_time_ns;
+	unsigned long finish_time;
+	unsigned long start_time;
+	unsigned long lat;
 
 	tg = bio->bi_cg_private;
 	if (!tg)
 		return;
 	bio->bi_cg_private = NULL;
 
-	tg->last_finish_time = ktime_get_ns() >> 10;
+	finish_time_ns = ktime_get_ns();
+	tg->last_finish_time = finish_time_ns >> 10;
+
+	start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
+	finish_time = __blk_stat_time(finish_time_ns) >> 10;
+	if (start_time && finish_time > start_time &&
+	    !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
+		lat = finish_time - start_time;
+		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
+			bio_op(bio), lat);
+	}
 }
 
 /*
@@ -2126,6 +2265,12 @@ int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2140,8 +2285,10 @@ int blk_throtl_init(struct request_queue *q)
 
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2150,6 +2297,7 @@ void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
@@ -2169,6 +2317,10 @@ void blk_throtl_register_queue(struct request_queue *q)
 		td->dft_idletime_threshold = DFL_IDLE_THRESHOLD_HD;
 	}
 
+	td->track_bio_latency = !q->mq_ops && !q->request_fn;
+	if (!td->track_bio_latency)
+		blk_stat_enable(q);
+
 	/*
 	 * some tg are created before queue is fully initialized, eg, nonrot
 	 * isn't initialized yet
diff --git a/block/blk.h b/block/blk.h
index 1539825..c71dca9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -295,12 +295,14 @@ extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
 extern void blk_throtl_register_queue(struct request_queue *q);
 extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_stat_add(struct request *rq, u64 time);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 static inline void blk_throtl_register_queue(struct request_queue *q) { }
 static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8159a6c..d88a3c0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,6 +17,10 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 
+struct blk_issue_stat {
+	u64 stat;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -59,6 +63,7 @@ struct bio {
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
 	void			*bi_cg_private;
+	struct blk_issue_stat	bi_issue_stat;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
@@ -256,10 +261,6 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
 }
 
-struct blk_issue_stat {
-	u64 stat;
-};
-
 #define BLK_RQ_STAT_BATCH	64
 
 struct blk_rq_stat {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread
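
As an illustration of the latency estimation added in the patch above,
here is a stand-alone user-space sketch of the size-to-bucket mapping
and the 7/8-weighted averaging. The LATENCY_BUCKET_SIZE of 9 follows
the "<= 4k to >= 1M" comment in the patch, order_base_2() is
reimplemented here, and the sample numbers are made up, so treat this
as a sketch of the idea rather than the kernel code.

#include <stdio.h>

#define LATENCY_BUCKET_SIZE 9

/* smallest order with (1UL << order) >= sectors, like order_base_2() */
static int order_base_2(unsigned long sectors)
{
	int order = 0;

	while ((1UL << order) < sectors)
		order++;
	return order;
}

/* 4k request = 8 sectors -> bucket 0, 1M = 2048 sectors -> bucket 8 */
static int request_bucket_index(unsigned long sectors)
{
	int idx = order_base_2(sectors) - 3;

	if (idx < 0)
		idx = 0;
	if (idx > LATENCY_BUCKET_SIZE - 1)
		idx = LATENCY_BUCKET_SIZE - 1;
	return idx;
}

int main(void)
{
	unsigned long samples[] = { 100, 120, 80, 500, 90 };	/* us */
	unsigned long avg = 0;		/* one bucket's running average */
	unsigned int i;

	printf("4k -> bucket %d, 64k -> bucket %d, 1M -> bucket %d\n",
	       request_bucket_index(8), request_bucket_index(128),
	       request_bucket_index(2048));

	/* same 7/8 old + 1/8 new smoothing as throtl_update_latency_buckets() */
	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		avg = avg ? (avg * 7 + samples[i]) >> 3 : samples[i];
		printf("sample %lu us -> smoothed avg %lu us\n",
		       samples[i], avg);
	}
	return 0;
}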

* [PATCH V6 18/18] blk-throttle: add latency target support
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (16 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 17/18] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
@ 2017-01-15  3:42 ` Shaohua Li
  2017-08-31  7:24 ` [PATCH V6 00/18] blk-throttle: add .low limit Paolo VALENTE
  18 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-01-15  3:42 UTC (permalink / raw)
  To: linux-kernel, linux-block; +Cc: Kernel-team, tj, axboe, vgoyal

One hard problem in adding the .low limit is detecting idle cgroups. If a
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to decide whether other cgroups may dispatch more IO. We added
the think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have a mechanism to calculate a latency threshold for each IO
size. For every IO dispatched from a cgroup, we compare its latency
against its threshold and record the result. If most IO latency is below
the threshold (in the code I use 75%), the cgroup can be treated as idle
and other cgroups can dispatch more IO.

Currently this latency target check is only done for SSD, as we can't
calculate the latency target for hard disks. And so far it only applies
to cgroup leaf nodes.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c0eb100..901af99 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -165,6 +165,10 @@ struct throtl_grp {
 	unsigned long checked_last_finish_time; /* ns / 1024 */
 	unsigned long avg_idletime; /* ns / 1024 */
 	unsigned long idletime_threshold; /* us */
+
+	unsigned int bio_cnt; /* total bios */
+	unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+	unsigned long bio_cnt_reset_time;
 };
 
 /* We measure latency for request size from <= 4k to >= 1M */
@@ -1720,12 +1724,15 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	 * - single idle is too long, longer than a fixed value (in case user
 	 *   configure a too big threshold) or 4 times of slice
 	 * - average think time is more than threshold
+	 * - IO latency is largely below threshold
 	 */
 	unsigned long time = jiffies_to_usecs(4 * tg->td->throtl_slice);
 
 	time = min_t(unsigned long, MAX_IDLE_TIME, time);
 	return (ktime_get_ns() >> 10) - tg->last_finish_time > time ||
-	       tg->avg_idletime > tg->idletime_threshold;
+	       tg->avg_idletime > tg->idletime_threshold ||
+	       (tg->latency_target && tg->bio_cnt &&
+		tg->bad_bio_cnt * 5 < tg->bio_cnt);
 }
 
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2185,11 +2192,35 @@ void blk_throtl_bio_endio(struct bio *bio)
 
 	start_time = blk_stat_time(&bio->bi_issue_stat) >> 10;
 	finish_time = __blk_stat_time(finish_time_ns) >> 10;
-	if (start_time && finish_time > start_time &&
-	    !(bio->bi_issue_stat.stat & SKIP_LATENCY)) {
-		lat = finish_time - start_time;
+	if (!start_time || finish_time <= start_time)
+		return;
+
+	lat = finish_time - start_time;
+	if (!(bio->bi_issue_stat.stat & SKIP_LATENCY))
 		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
 			bio_op(bio), lat);
+
+	if (tg->latency_target) {
+		int bucket;
+		unsigned int threshold;
+
+		bucket = request_bucket_index(
+			blk_stat_size(&bio->bi_issue_stat));
+		threshold = tg->td->avg_buckets[bucket].latency +
+			tg->latency_target;
+		if (lat > threshold)
+			tg->bad_bio_cnt++;
+		/*
+		 * Not race free, could get wrong count, which means cgroups
+		 * will be throttled
+		 */
+		tg->bio_cnt++;
+	}
+
+	if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+		tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+		tg->bio_cnt /= 2;
+		tg->bad_bio_cnt /= 2;
 	}
 }
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread
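
To make the new idleness test concrete: the patch treats a group with a
latency target as idle when fewer than one in five of its tracked bios
exceeded the per-bucket threshold (tg->bad_bio_cnt * 5 < tg->bio_cnt).
The stand-alone sketch below only replays that ratio test on made-up
numbers; the struct and values are illustrative, not the kernel code.

#include <stdbool.h>
#include <stdio.h>

struct demo_tg {
	unsigned long latency_target;	/* us; 0 means no target configured */
	unsigned int bio_cnt;		/* total tracked bios */
	unsigned int bad_bio_cnt;	/* bios that exceeded the threshold */
};

/* idle if a target is set, bios were seen, and fewer than 1 in 5 were bad */
static bool latency_says_idle(const struct demo_tg *tg)
{
	return tg->latency_target && tg->bio_cnt &&
	       tg->bad_bio_cnt * 5 < tg->bio_cnt;
}

int main(void)
{
	struct demo_tg quiet = { .latency_target = 200, .bio_cnt = 100,
				 .bad_bio_cnt = 15 };
	struct demo_tg noisy = { .latency_target = 200, .bio_cnt = 100,
				 .bad_bio_cnt = 40 };

	/* 15/100 bad bios -> idle, 40/100 bad bios -> not idle */
	printf("quiet group idle: %d, noisy group idle: %d\n",
	       latency_says_idle(&quiet), latency_says_idle(&noisy));
	return 0;
}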

* Re: [PATCH V6 00/18] blk-throttle: add .low limit
  2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
                   ` (17 preceding siblings ...)
  2017-01-15  3:42 ` [PATCH V6 18/18] blk-throttle: add latency target support Shaohua Li
@ 2017-08-31  7:24 ` Paolo VALENTE
  2017-09-05 21:02   ` Shaohua Li
  18 siblings, 1 reply; 24+ messages in thread
From: Paolo VALENTE @ 2017-08-31  7:24 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Linux Kernel Mailing List, linux-block, Kernel-team, tj, axboe, vgoyal


> On 15 Jan 2017, at 04:42, Shaohua Li <shli@fb.com> wrote:
> 
> Hi,
> 
> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
> much for SSD. This patch set try to add a conservative limit for blk-throttle.
> It isn't a proportional scheduling, but can help prioritize cgroups. There are
> several advantages we choose blk-throttle:
> - blk-throttle resides early in the block stack. It works for both bio and
>  request based queues.
> - blk-throttle is light weight in general. It still takes queue lock, but it's
>  not hard to implement a per-cpu cache and remove the lock contention.
> - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
>  mechanism is proved to harm performance for fast SSD.
> 
> The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
> The existing io.max is a hard limit throttling. cgroup with a max limit never
> dispatch more IO than its max limit. While io.low is a best effort throttling.
> cgroups with 'low' limit can run above their 'low' limit at appropriate time.
> Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
> their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
> will run according to their 'low' limit. So the 'low' limit could act as two
> roles, it allows cgroups using free bandwidth and it protects cgroups from
> their 'low' limit.
> 
> An example usage is we have a high prio cgroup with high 'low' limit and a low
> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
> prio can run above its 'low' limit, so we don't waste the bandwidth. When the
> high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
> under its 'low' limit. This will protect high prio cgroup to get more
> resources.
> 

Hi Shaohua,
I would like to ask you some questions, to make sure I fully
understand how the 'low' limit and the idle-group detection work in
your above scenario.  Suppose that: the drive has a random-I/O peak
rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
the low prio group has a 'low' limit of 10 MB/s.  If
- the high prio process happens to do, say, only 5 MB/s for a given
  long time
- the low prio process constantly does greedy I/O
- the idle-group detection is not being used
then the low prio process is limited to 10 MB/s during all this time
interval.  And only 10% of the device bandwidth is utilized.

To recover lost bandwidth through idle-group detection, we need to set
a target IO latency for the high-prio group.  The high prio group
should happen to be below the threshold, and thus to be detected as
idle, leaving the low prio group free to use all the bandwidth.

Here are my questions:
1) Is all I wrote above correct?
2) In particular, are there maybe other, better mechanisms to saturate
the bandwidth in the above scenario?

If what I wrote above is correct:
3) Doesn't fluctuation occur?  I mean: when the low prio group gets
full bandwidth, the latency threshold of the high prio group may be
overcome, causing the high prio group to not be considered idle any
longer, and thus the low prio group to be limited again; this in turn
will cause the threshold to not be overcome any longer, and so on.
4) Is there a way to compute an appropriate target latency of the high
prio group, if it is a generic group, for which the latency
requirements of the processes it contains are only partially known or
completely unknown?  By appropriate target latency, I mean a target
latency that enables the framework to fully utilize the device
bandwidth while the high prio group is doing less I/O than its limit.

Thanks,
Paolo


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 00/18] blk-throttle: add .low limit
  2017-08-31  7:24 ` [PATCH V6 00/18] blk-throttle: add .low limit Paolo VALENTE
@ 2017-09-05 21:02   ` Shaohua Li
  2017-09-06  1:12     ` Joseph Qi
  2017-09-22 14:29     ` Paolo Valente
  0 siblings, 2 replies; 24+ messages in thread
From: Shaohua Li @ 2017-09-05 21:02 UTC (permalink / raw)
  To: Paolo VALENTE
  Cc: Shaohua Li, Linux Kernel Mailing List, linux-block, Kernel-team,
	tj, axboe, vgoyal

On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> 
> > On 15 Jan 2017, at 04:42, Shaohua Li <shli@fb.com> wrote:
> > 
> > Hi,
> > 
> > cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
> > much for SSD. This patch set try to add a conservative limit for blk-throttle.
> > It isn't a proportional scheduling, but can help prioritize cgroups. There are
> > several advantages we choose blk-throttle:
> > - blk-throttle resides early in the block stack. It works for both bio and
> >  request based queues.
> > - blk-throttle is light weight in general. It still takes queue lock, but it's
> >  not hard to implement a per-cpu cache and remove the lock contention.
> > - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
> >  mechanism is proved to harm performance for fast SSD.
> > 
> > The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
> > The existing io.max is a hard limit throttling. cgroup with a max limit never
> > dispatch more IO than its max limit. While io.low is a best effort throttling.
> > cgroups with 'low' limit can run above their 'low' limit at appropriate time.
> > Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
> > their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
> > will run according to their 'low' limit. So the 'low' limit could act as two
> > roles, it allows cgroups using free bandwidth and it protects cgroups from
> > their 'low' limit.
> > 
> > An example usage is we have a high prio cgroup with high 'low' limit and a low
> > prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
> > prio can run above its 'low' limit, so we don't waste the bandwidth. When the
> > high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
> > under its 'low' limit. This will protect high prio cgroup to get more
> > resources.
> > 
> 
> Hi Shaohua,

Hi,

Sorry for the late response.
> I would like to ask you some questions, to make sure I fully
> understand how the 'low' limit and the idle-group detection work in
> your above scenario.  Suppose that: the drive has a random-I/O peak
> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> the low prio group has a 'low' limit of 10 MB/s.  If
> - the high prio process happens to do, say, only 5 MB/s for a given
>   long time
> - the low prio process constantly does greedy I/O
> - the idle-group detection is not being used
> then the low prio process is limited to 10 MB/s during all this time
> interval.  And only 10% of the device bandwidth is utilized.
> 
> To recover lost bandwidth through idle-group detection, we need to set
> a target IO latency for the high-prio group.  The high prio group
> should happen to be below the threshold, and thus to be detected as
> idle, leaving the low prio group free too use all the bandwidth.
> 
> Here are my questions:
> 1) Is all I wrote above correct?

Yes
> 2) In particular, maybe there are other better mechanism to saturate
> the bandwidth in the above scenario?

I assume this is answered by 4) below.
> If what I wrote above is correct:
> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> full bandwidth, the latency threshold of the high prio group may be
> overcome, causing the high prio group to not be considered idle any
> longer, and thus the low prio group to be limited again; this in turn
> will cause the threshold to not be overcome any longer, and so on.

That's true. We try to mitigate the fluctuation by increasing the low prio
cgroup's bandwidth gradually though.

> 4) Is there a way to compute an appropriate target latency of the high
> prio group, if it is a generic group, for which the latency
> requirements of the processes it contains are only partially known or
> completely unknown?  By appropriate target latency, I mean a target
> latency that enables the framework to fully utilize the device
> bandwidth while the high prio group is doing less I/O than its limit.

Not sure how we can do this. The device's max bandwidth varies based on request
size and read/write ratio, so we don't know when the max bandwidth is reached.
Also I think we must consider the case where the workloads never use the full
bandwidth of a disk, which is pretty common for SSD (at least in our
environment).

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread
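
For readers wondering what "increasing the low prio cgroup bandwidth
gradually" could look like: the sketch below is purely illustrative and
is not the formula used by the patch set (patch 11, "make bandwidth
change smooth", implements the real logic). It only shows one generic
way to ramp an effective limit over a few slices after an upgrade
instead of jumping straight to the maximum.

#include <stdio.h>

/* linear ramp from 'low' to 'max' over 'ramp_slices' time slices */
static unsigned long ramped_limit(unsigned long low, unsigned long max,
				  unsigned int slices_since_upgrade,
				  unsigned int ramp_slices)
{
	if (slices_since_upgrade >= ramp_slices)
		return max;
	return low + (max - low) * slices_since_upgrade / ramp_slices;
}

int main(void)
{
	unsigned int t;

	/* ramp a 10 MB/s low limit toward a 100 MB/s ceiling over 8 slices */
	for (t = 0; t <= 8; t++)
		printf("slice %u: %lu MB/s\n", t, ramped_limit(10, 100, t, 8));
	return 0;
}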

* Re: [PATCH V6 00/18] blk-throttle: add .low limit
  2017-09-05 21:02   ` Shaohua Li
@ 2017-09-06  1:12     ` Joseph Qi
  2017-09-06 16:05       ` Shaohua Li
  2017-09-22 14:29     ` Paolo Valente
  1 sibling, 1 reply; 24+ messages in thread
From: Joseph Qi @ 2017-09-06  1:12 UTC (permalink / raw)
  To: Shaohua Li, Paolo VALENTE
  Cc: Shaohua Li, Linux Kernel Mailing List, linux-block, Kernel-team,
	tj, axboe, vgoyal

Hi Shaohua,

On 17/9/6 05:02, Shaohua Li wrote:
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>>
>>> On 15 Jan 2017, at 04:42, Shaohua Li <shli@fb.com> wrote:
>>>
>>> Hi,
>>>
>>> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
>>> much for SSD. This patch set try to add a conservative limit for blk-throttle.
>>> It isn't a proportional scheduling, but can help prioritize cgroups. There are
>>> several advantages we choose blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both bio and
>>>  request based queues.
>>> - blk-throttle is light weight in general. It still takes queue lock, but it's
>>>  not hard to implement a per-cpu cache and remove the lock contention.
>>> - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
>>>  mechanism is proved to harm performance for fast SSD.
>>>
>>> The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
>>> The existing io.max is a hard limit throttling. cgroup with a max limit never
>>> dispatch more IO than its max limit. While io.low is a best effort throttling.
>>> cgroups with 'low' limit can run above their 'low' limit at appropriate time.
>>> Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
>>> their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
>>> will run according to their 'low' limit. So the 'low' limit could act as two
>>> roles, it allows cgroups using free bandwidth and it protects cgroups from
>>> their 'low' limit.
>>>
>>> An example usage is we have a high prio cgroup with high 'low' limit and a low
>>> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
>>> prio can run above its 'low' limit, so we don't waste the bandwidth. When the
>>> high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
>>> under its 'low' limit. This will protect high prio cgroup to get more
>>> resources.
>>>
>>
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a given
>>   long time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>>
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free too use all the bandwidth.
>>
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, maybe there are other better mechanism to saturate
>> the bandwidth in the above scenario?
> 
> Assume it's the 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth graduately though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device max bandwidth varies based on request
> size and read/write ratio. We don't know when the max bandwidth is reached.
> Also I think we must consider a case that the workloads never use the full
> bandwidth of a disk, which is pretty common for SSD (at least in our
> environment).
> 
I have a question on the base latency tracking.
From my test on SSD, write latency is much lower than read latency when
doing mixed read/write, but currently we only track read requests and then
use their average as the base latency. In other words, we don't distinguish
read and write now. As a result, every write request's latency will always
be considered good. So I think we have to track read and write latency
separately. Or am I missing something here?

Thanks,
Joseph

> Thanks,
> Shaohua
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V6 00/18] blk-throttle: add .low limit
  2017-09-06  1:12     ` Joseph Qi
@ 2017-09-06 16:05       ` Shaohua Li
  0 siblings, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2017-09-06 16:05 UTC (permalink / raw)
  To: Joseph Qi
  Cc: Paolo VALENTE, Shaohua Li, Linux Kernel Mailing List,
	linux-block, Kernel-team, tj, axboe, vgoyal

On Wed, Sep 06, 2017 at 09:12:20AM +0800, Joseph Qi wrote:
> Hi Shaohua,
> 
> On 17/9/6 05:02, Shaohua Li wrote:
> > On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
> >>
> >>> On 15 Jan 2017, at 04:42, Shaohua Li <shli@fb.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
> >>> much for SSD. This patch set try to add a conservative limit for blk-throttle.
> >>> It isn't a proportional scheduling, but can help prioritize cgroups. There are
> >>> several advantages we choose blk-throttle:
> >>> - blk-throttle resides early in the block stack. It works for both bio and
> >>>  request based queues.
> >>> - blk-throttle is light weight in general. It still takes queue lock, but it's
> >>>  not hard to implement a per-cpu cache and remove the lock contention.
> >>> - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
> >>>  mechanism is proved to harm performance for fast SSD.
> >>>
> >>> The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
> >>> The existing io.max is a hard limit throttling. cgroup with a max limit never
> >>> dispatch more IO than its max limit. While io.low is a best effort throttling.
> >>> cgroups with 'low' limit can run above their 'low' limit at appropriate time.
> >>> Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
> >>> their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
> >>> will run according to their 'low' limit. So the 'low' limit could act as two
> >>> roles, it allows cgroups using free bandwidth and it protects cgroups from
> >>> their 'low' limit.
> >>>
> >>> An example usage is we have a high prio cgroup with high 'low' limit and a low
> >>> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
> >>> prio can run above its 'low' limit, so we don't waste the bandwidth. When the
> >>> high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
> >>> under its 'low' limit. This will protect high prio cgroup to get more
> >>> resources.
> >>>
> >>
> >> Hi Shaohua,
> > 
> > Hi,
> > 
> > Sorry for the late response.
> >> I would like to ask you some questions, to make sure I fully
> >> understand how the 'low' limit and the idle-group detection work in
> >> your above scenario.  Suppose that: the drive has a random-I/O peak
> >> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
> >> the low prio group has a 'low' limit of 10 MB/s.  If
> >> - the high prio process happens to do, say, only 5 MB/s for a given
> >>   long time
> >> - the low prio process constantly does greedy I/O
> >> - the idle-group detection is not being used
> >> then the low prio process is limited to 10 MB/s during all this time
> >> interval.  And only 10% of the device bandwidth is utilized.
> >>
> >> To recover lost bandwidth through idle-group detection, we need to set
> >> a target IO latency for the high-prio group.  The high prio group
> >> should happen to be below the threshold, and thus to be detected as
> >> idle, leaving the low prio group free too use all the bandwidth.
> >>
> >> Here are my questions:
> >> 1) Is all I wrote above correct?
> > 
> > Yes
> >> 2) In particular, maybe there are other better mechanism to saturate
> >> the bandwidth in the above scenario?
> > 
> > Assume it's the 4) below.
> >> If what I wrote above is correct:
> >> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
> >> full bandwidth, the latency threshold of the high prio group may be
> >> overcome, causing the high prio group to not be considered idle any
> >> longer, and thus the low prio group to be limited again; this in turn
> >> will cause the threshold to not be overcome any longer, and so on.
> > 
> > That's true. We try to mitigate the fluctuation by increasing the low prio
> > cgroup bandwidth graduately though.
> > 
> >> 4) Is there a way to compute an appropriate target latency of the high
> >> prio group, if it is a generic group, for which the latency
> >> requirements of the processes it contains are only partially known or
> >> completely unknown?  By appropriate target latency, I mean a target
> >> latency that enables the framework to fully utilize the device
> >> bandwidth while the high prio group is doing less I/O than its limit.
> > 
> > Not sure how we can do this. The device max bandwidth varies based on request
> > size and read/write ratio. We don't know when the max bandwidth is reached.
> > Also I think we must consider a case that the workloads never use the full
> > bandwidth of a disk, which is pretty common for SSD (at least in our
> > environment).
> > 
> I have a question on the base latency tracking.
> From my test on SSD, write latency is much lower than read when doing
> mixed read/write, but currently we only track read request and then use
> it's average as base latency. In other words, we don't distinguish read
> and write now. As a result, all write request's latency will always be
> considered as good. So I think we have to track read and write latency
> separately. Or am I missing something here?

For the base latency we only consider reads, but for cgroup latency we do
consider writes. For the base latency, only using reads isn't a big problem,
because we use it as a rough baseline to check whether a cgroup's latency is
good or not. The comparison is never going to be precise.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread
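
The asymmetry discussed above (the per-size baseline is built from read
completions only, while every completed bio is judged against that
baseline plus the cgroup's latency target) can be sketched as follows;
all names and numbers are illustrative, not the kernel code.

#include <stdbool.h>
#include <stdio.h>

enum demo_op { DEMO_READ, DEMO_WRITE };

static unsigned long read_baseline_us;	/* one size bucket, for simplicity */

/* only read completions feed the baseline, as in throtl_track_latency() */
static void track_baseline(enum demo_op op, unsigned long lat_us)
{
	if (op != DEMO_READ)
		return;
	read_baseline_us = read_baseline_us ?
		(read_baseline_us * 7 + lat_us) >> 3 : lat_us;
}

/* every bio, read or write, is compared against baseline + target */
static bool bio_is_bad(unsigned long lat_us, unsigned long target_us)
{
	return lat_us > read_baseline_us + target_us;
}

int main(void)
{
	track_baseline(DEMO_READ, 400);
	track_baseline(DEMO_WRITE, 50);		/* ignored for the baseline */

	/* a fast 50us write passes easily against the 400us read baseline */
	printf("baseline %lu us, 50us write bad: %d, 900us read bad: %d\n",
	       read_baseline_us, bio_is_bad(50, 100), bio_is_bad(900, 100));
	return 0;
}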

* Re: [PATCH V6 00/18] blk-throttle: add .low limit
  2017-09-05 21:02   ` Shaohua Li
  2017-09-06  1:12     ` Joseph Qi
@ 2017-09-22 14:29     ` Paolo Valente
  1 sibling, 0 replies; 24+ messages in thread
From: Paolo Valente @ 2017-09-22 14:29 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Shaohua Li, Linux Kernel Mailing List, linux-block, Kernel-team,
	tj, axboe, vgoyal


> On 5 Sep 2017, at 23:02, Shaohua Li <shli@kernel.org> wrote:
> 
> On Thu, Aug 31, 2017 at 09:24:23AM +0200, Paolo VALENTE wrote:
>> 
>>> On 15 Jan 2017, at 04:42, Shaohua Li <shli@fb.com> wrote:
>>> 
>>> Hi,
>>> 
>>> cgroup still lacks a good iocontroller. CFQ works well for hard disk, but not
>>> much for SSD. This patch set try to add a conservative limit for blk-throttle.
>>> It isn't a proportional scheduling, but can help prioritize cgroups. There are
>>> several advantages we choose blk-throttle:
>>> - blk-throttle resides early in the block stack. It works for both bio and
>>> request based queues.
>>> - blk-throttle is light weight in general. It still takes queue lock, but it's
>>> not hard to implement a per-cpu cache and remove the lock contention.
>>> - blk-throttle doesn't use 'idle disk' mechanism, which is used by CFQ/BFQ. The
>>> mechanism is proved to harm performance for fast SSD.
>>> 
>>> The patch set add a new io.low limit for blk-throttle. It's only for cgroup2.
>>> The existing io.max is a hard limit throttling. cgroup with a max limit never
>>> dispatch more IO than its max limit. While io.low is a best effort throttling.
>>> cgroups with 'low' limit can run above their 'low' limit at appropriate time.
>>> Specifically, if all cgroups reach their 'low' limit, all cgroups can run above
>>> their 'low' limit. If any cgroup runs under its 'low' limit, all other cgroups
>>> will run according to their 'low' limit. So the 'low' limit could act as two
>>> roles, it allows cgroups using free bandwidth and it protects cgroups from
>>> their 'low' limit.
>>> 
>>> An example usage is we have a high prio cgroup with high 'low' limit and a low
>>> prio cgroup with low 'low' limit. If the high prio cgroup isn't running, the low
>>> prio can run above its 'low' limit, so we don't waste the bandwidth. When the
>>> high prio cgroup runs and is below its 'low' limit, low prio cgroup will run
>>> under its 'low' limit. This will protect high prio cgroup to get more
>>> resources.
>>> 
>> 
>> Hi Shaohua,
> 
> Hi,
> 
> Sorry for the late response.
>> I would like to ask you some questions, to make sure I fully
>> understand how the 'low' limit and the idle-group detection work in
>> your above scenario.  Suppose that: the drive has a random-I/O peak
>> rate of 100MB/s, the high prio group has a 'low' limit of 90 MB/s, and
>> the low prio group has a 'low' limit of 10 MB/s.  If
>> - the high prio process happens to do, say, only 5 MB/s for a given
>>  long time
>> - the low prio process constantly does greedy I/O
>> - the idle-group detection is not being used
>> then the low prio process is limited to 10 MB/s during all this time
>> interval.  And only 10% of the device bandwidth is utilized.
>> 
>> To recover lost bandwidth through idle-group detection, we need to set
>> a target IO latency for the high-prio group.  The high prio group
>> should happen to be below the threshold, and thus to be detected as
>> idle, leaving the low prio group free too use all the bandwidth.
>> 
>> Here are my questions:
>> 1) Is all I wrote above correct?
> 
> Yes
>> 2) In particular, maybe there are other better mechanism to saturate
>> the bandwidth in the above scenario?
> 
> Assume it's the 4) below.
>> If what I wrote above is correct:
>> 3) Doesn't fluctuation occur?  I mean: when the low prio group gets
>> full bandwidth, the latency threshold of the high prio group may be
>> overcome, causing the high prio group to not be considered idle any
>> longer, and thus the low prio group to be limited again; this in turn
>> will cause the threshold to not be overcome any longer, and so on.
> 
> That's true. We try to mitigate the fluctuation by increasing the low prio
> cgroup bandwidth graduately though.
> 
>> 4) Is there a way to compute an appropriate target latency of the high
>> prio group, if it is a generic group, for which the latency
>> requirements of the processes it contains are only partially known or
>> completely unknown?  By appropriate target latency, I mean a target
>> latency that enables the framework to fully utilize the device
>> bandwidth while the high prio group is doing less I/O than its limit.
> 
> Not sure how we can do this. The device max bandwidth varies based on request
> size and read/write ratio. We don't know when the max bandwidth is reached.
> Also I think we must consider a case that the workloads never use the full
> bandwidth of a disk, which is pretty common for SSD (at least in our
> environment).
> 

Hi Shaohua,
sorry for adding this bit so late (and of course thanks for your
previous explanations).  By fully utilizing the device bandwidth, I
(imprecisely) didn't mean reaching the peak rate, but getting close to,
and thus utilizing, the maximum throughput achievable with the workload
to serve.  But the only way to know what that maximum throughput would
be is to let each group enjoy the maximum possible bandwidth that
doesn't jeopardize the bandwidth and latency that have to be guaranteed
to the other groups.  Yet the mechanism for doing that is exactly the
one that has to be configured properly, i.e., throttling with your
extensions.  So one would have to iteratively change the involved
parameters (.low limit, target latency, ...) until reaching optimal
overall throughput, without violating service guarantees.  Such a task
may take very long to accomplish, depending on the complexity of the
system and of the I/O performed by the groups; it may even be
unfeasible in a dynamic system.

Does what I wrote above make sense to you?

Thanks,
Paolo

> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2017-01-15  3:42 [PATCH V6 00/18] blk-throttle: add .low limit Shaohua Li
2017-01-15  3:42 ` [PATCH V6 01/18] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
2017-01-15  3:42 ` [PATCH V6 02/18] blk-throttle: prepare support multiple limits Shaohua Li
2017-01-15  3:42 ` [PATCH V6 03/18] blk-throttle: add .low interface Shaohua Li
2017-01-15  3:42 ` [PATCH V6 04/18] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
2017-01-15  3:42 ` [PATCH V6 05/18] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
2017-01-15  3:42 ` [PATCH V6 06/18] blk-throttle: add downgrade logic Shaohua Li
2017-01-15  3:42 ` [PATCH V6 07/18] blk-throttle: make sure expire time isn't too big Shaohua Li
2017-01-15  3:42 ` [PATCH V6 08/18] blk-throttle: make throtl_slice tunable Shaohua Li
2017-01-15  3:42 ` [PATCH V6 09/18] blk-throttle: choose a small throtl_slice for SSD Shaohua Li
2017-01-15  3:42 ` [PATCH V6 10/18] blk-throttle: detect completed idle cgroup Shaohua Li
2017-01-15  3:42 ` [PATCH V6 11/18] blk-throttle: make bandwidth change smooth Shaohua Li
2017-01-15  3:42 ` [PATCH V6 12/18] blk-throttle: add a simple idle detection Shaohua Li
2017-01-15  3:42 ` [PATCH V6 13/18] blk-throttle: add interface to configure idle time threshold Shaohua Li
2017-01-15  3:42 ` [PATCH V6 14/18] blk-throttle: ignore idle cgroup limit Shaohua Li
2017-01-15  3:42 ` [PATCH V6 15/18] blk-throttle: add interface for per-cgroup target latency Shaohua Li
2017-01-15  3:42 ` [PATCH V6 16/18] block: track request size in blk_issue_stat Shaohua Li
2017-01-15  3:42 ` [PATCH V6 17/18] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
2017-01-15  3:42 ` [PATCH V6 18/18] blk-throttle: add latency target support Shaohua Li
2017-08-31  7:24 ` [PATCH V6 00/18] blk-throttle: add .low limit Paolo VALENTE
2017-09-05 21:02   ` Shaohua Li
2017-09-06  1:12     ` Joseph Qi
2017-09-06 16:05       ` Shaohua Li
2017-09-22 14:29     ` Paolo Valente
