linux-kernel.vger.kernel.org archive mirror
* [PATCH V5 00/17] blk-throttle: add .low limit
@ 2016-12-15 20:32 Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
                   ` (17 more replies)
  0 siblings, 18 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Hi,

cgroup still lacks a good IO controller. CFQ works well for hard disks, but
not so well for SSDs. This patch set tries to add a conservative limit to
blk-throttle. It isn't proportional scheduling, but it can help prioritize
cgroups. There are several reasons we chose blk-throttle:
- blk-throttle resides early in the block stack. It works for both bio and
  request-based queues.
- blk-throttle is lightweight in general. It still takes the queue lock, but
  it's not hard to implement a per-cpu cache and remove the lock contention.
- blk-throttle doesn't use the 'idle disk' mechanism used by CFQ/BFQ, which
  has been shown to harm performance on fast SSDs.

The patch set adds a new io.low limit to blk-throttle. It's only for cgroup2.
The existing io.max is hard-limit throttling: a cgroup with a max limit never
dispatches more IO than its max limit. io.low is best-effort throttling:
cgroups with a 'low' limit can run above their 'low' limit at appropriate
times. Specifically, if all cgroups reach their 'low' limit, all cgroups can
run above their 'low' limit. If any cgroup runs under its 'low' limit, all
other cgroups are throttled back to their 'low' limit. So the 'low' limit
plays two roles: it lets cgroups use free bandwidth, and it protects a cgroup
so it gets at least its 'low' limit.

An example usage is a high-priority cgroup with a high 'low' limit and a
low-priority cgroup with a low 'low' limit. If the high-priority cgroup isn't
running, the low-priority cgroup can run above its 'low' limit, so we don't
waste bandwidth. When the high-priority cgroup runs and is below its 'low'
limit, the low-priority cgroup is throttled back to its 'low' limit. This
protects the high-priority cgroup so it gets more resources.
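
To make this concrete, here is a purely illustrative userspace sketch of such
a configuration (the cgroup paths, device number 8:16 and byte values are all
made up; error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* write a value to a cgroup2 control file; illustrative helper only */
static int write_str(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror(path);
                return -1;
        }
        if (write(fd, val, strlen(val)) < 0)
                perror(path);
        return close(fd);
}

int main(void)
{
        /* high prio cgroup: generous low limit, e.g. 100MB/s reads */
        write_str("/sys/fs/cgroup/high_prio/io.low", "8:16 rbps=104857600");
        /* low prio cgroup: small low limit, e.g. 10MB/s reads */
        write_str("/sys/fs/cgroup/low_prio/io.low", "8:16 rbps=10485760");
        /* optional hard cap for the low prio cgroup */
        write_str("/sys/fs/cgroup/low_prio/io.max", "8:16 rbps=52428800");
        return 0;
}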

The implementation is simple. The disk queue has a state machine with two
states, LIMIT_LOW and LIMIT_MAX. In each state we throttle cgroups according
to the limit of that state: the io.low limit in LIMIT_LOW and the io.max
limit in LIMIT_MAX. The disk state is upgraded/downgraded between LIMIT_LOW
and LIMIT_MAX according to the rule above. The initial disk state is
LIMIT_MAX, and if no cgroup sets io.low, the disk state remains LIMIT_MAX.
Systems with only io.max set will see no behavior change with these patches.
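
To make the rule concrete, here is a minimal userspace model of the
upgrade/downgrade decision. It is a simplification for illustration only, not
the kernel code; the real logic also looks at queued bios, the cgroup
hierarchy and time thresholds:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum limit_state { LIMIT_LOW, LIMIT_MAX };

struct cgrp_sample {
        uint64_t low_bps;       /* configured io.low bps, 0 == no low limit */
        uint64_t cur_bps;       /* measured bps in the last sample window */
};

static enum limit_state next_state(enum limit_state state,
                                   const struct cgrp_sample *g, int nr)
{
        bool all_reached = true, any_below = false;
        int i;

        for (i = 0; i < nr; i++) {
                if (!g[i].low_bps)
                        continue;
                if (g[i].cur_bps < g[i].low_bps) {
                        all_reached = false;
                        any_below = true;
                }
        }
        /* upgrade when every cgroup with a low limit has reached it */
        if (state == LIMIT_LOW && all_reached)
                return LIMIT_MAX;
        /* downgrade when some cgroup falls back under its low limit */
        if (state == LIMIT_MAX && any_below)
                return LIMIT_LOW;
        return state;
}

int main(void)
{
        struct cgrp_sample g[] = {
                { 100 * 1024 * 1024, 120 * 1024 * 1024 }, /* above its low limit */
                {  10 * 1024 * 1024,   2 * 1024 * 1024 }, /* under its low limit */
        };

        printf("%s\n", next_state(LIMIT_MAX, g, 2) == LIMIT_LOW ?
               "downgrade to LIMIT_LOW" : "stay in LIMIT_MAX");
        return 0;
}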

The first 9 patches implement the basic framework: they add the interface
and handle the upgrade and downgrade logic. Patch 9 detects the special case
where a cgroup is completely idle; in that case we ignore the cgroup's limit.
Patches 10-17 add more heuristics.

The basic framework has 2 major issues.

1. Fluctuation. When the state is upgraded from LIMIT_LOW to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in a way we don't
expect. For example, a cgroup's bandwidth may drop below its io.low limit
very soon after an upgrade. Patch 10 has more details on this issue.

2. Idle cgroup. A cgroup with an io.low limit doesn't always dispatch enough
IO. Under the upgrade rule above, the disk then remains in LIMIT_LOW state
and all other cgroups can't dispatch IO above their 'low' limits, so
bandwidth is wasted. Patch 11 has more details on this issue.

For issue 1, we make cgroup bandwidth increase/decrease smoothly after an
upgrade/downgrade. This reduces the chance that a cgroup's bandwidth rapidly
drops under its 'low' limit. The smoothing means we may waste some bandwidth
during the transition, but that is the price of sharing.
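
As a toy illustration of the idea only (the actual heuristic is in patch 10
and differs in detail), the effective limit can be ramped from the 'low'
value toward the 'max' value over a few sample windows instead of jumping;
'windows' and 'ramp' below are made-up parameters:

#include <stdint.h>
#include <stdio.h>

/* interpolate the effective bps limit after an upgrade; illustrative only */
static uint64_t ramped_limit(uint64_t low, uint64_t max,
                             unsigned int windows, unsigned int ramp)
{
        if (max <= low || windows >= ramp)
                return max;
        return low + (max - low) * windows / ramp;
}

int main(void)
{
        unsigned int w;

        /* ramp from a 10MB/s low limit to a 100MB/s max limit over 4 windows */
        for (w = 0; w <= 4; w++)
                printf("window %u: %llu bytes/s\n", w,
                       (unsigned long long)ramped_limit(10485760, 104857600, w, 4));
        return 0;
}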

Issue 2 is very hard. We introduce two mechanisms for it. One is 'idle time'
or 'think time', borrowed from CFQ: if a cgroup's average idle time is high,
we treat it as idle and its 'low' limit isn't respected. Please see patches
11-13 for details. The other is a 'latency target': if a cgroup's IO latency
is low, we treat it as idle and its 'low' limit isn't respected. Please see
patches 14-17 for details. Both mechanisms only apply when a cgroup runs
below its 'low' limit.
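
For reference, the 'think time' idea is roughly a decaying average of the
gaps between a cgroup's IO submissions. The sketch below is a simplification,
not the code in patches 11-13, and the 1ms threshold is made up:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct idle_stat {
        uint64_t last_io_ns;    /* time of the previous IO submission */
        uint64_t avg_idle_ns;   /* decaying average of submission gaps */
};

/* update the average gap on every IO the cgroup submits */
static void account_io(struct idle_stat *s, uint64_t now_ns)
{
        if (s->last_io_ns) {
                uint64_t gap = now_ns - s->last_io_ns;

                /* weight old samples 7/8, new sample 1/8 (CFQ uses a similar decay) */
                s->avg_idle_ns = (s->avg_idle_ns * 7 + gap) / 8;
        }
        s->last_io_ns = now_ns;
}

/* a cgroup whose average gap exceeds a (made-up) 1ms threshold counts as idle */
static bool cgroup_seems_idle(const struct idle_stat *s)
{
        return s->avg_idle_ns > 1000000ULL;
}

int main(void)
{
        struct idle_stat s = { 0, 0 };
        uint64_t t;

        /* one IO every 5ms: the decaying average soon exceeds the threshold */
        for (t = 5000000ULL; t <= 100000000ULL; t += 5000000ULL)
                account_io(&s, t);
        printf("avg gap %llu ns, idle=%d\n",
               (unsigned long long)s.avg_idle_ns, cgroup_seems_idle(&s));
        return 0;
}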

The disadvantage of blk-throttle is that it exports rather low-level knobs.
Configuration won't be easy for normal users, but it is powerful for
experienced users.

More tuning is required of course, but otherwise this works well. Please
review, test and consider merging.

Thanks,
Shaohua

V4->V5, basically addressing Tejun's comments:
- Change interface from 'io.high' to 'io.low' to be consistent with memcg
- Change interface for 'idle time' and 'latency target'
- Make 'idle time' per-cgroup-disk instead of per-cgroup
- Change interface name for 'throttle slice'. It's not a real slice
- Make downgrade smooth too
- Make latency sampling work for both bio and request based queues
- Change latency estimation method from 'line fitting' to 'bucket based
  calculation'
- Rebase and fix other problems

An issue pointed out by Tejun isn't fixed yet:
- .pd_offline_fn vs .pd_free_fn. .pd_free_fn seems too late to change states

V3->V4:
- Add latency target for cgroup
- Fix bugs
http://marc.info/?l=linux-block&m=147916216512915&w=2

V2->V3:
- Rebase
- Fix several bugs
- Make the hard disk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop the io.low interface for simplicity; the interface isn't a must-have
  for prioritizing cgroups.
- Remove the 'trial' logic, which creates too much fluctuation
- Add a new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2

Shaohua Li (17):
  blk-throttle: use U64_MAX/UINT_MAX to replace -1
  blk-throttle: prepare support multiple limits
  blk-throttle: add .low interface
  blk-throttle: configure bps/iops limit for cgroup in low limit
  blk-throttle: add upgrade logic for LIMIT_LOW state
  blk-throttle: add downgrade logic
  blk-throttle: make sure expire time isn't too big
  blk-throttle: make throtl_slice tunable
  blk-throttle: detect completed idle cgroup
  blk-throttle: make bandwidth change smooth
  blk-throttle: add a simple idle detection
  blk-throttle: add interface to configure idle time threshold
  blk-throttle: ignore idle cgroup limit
  blk-throttle: add interface for per-cgroup target latency
  block: track request size in blk_issue_stat
  blk-throttle: add a mechanism to estimate IO latency
  blk-throttle: add latency target support

 Documentation/block/queue-sysfs.txt |   6 +
 block/bio.c                         |   2 +
 block/blk-core.c                    |   2 +-
 block/blk-mq.c                      |   2 +-
 block/blk-stat.c                    |  11 +-
 block/blk-stat.h                    |  29 +-
 block/blk-sysfs.c                   |  10 +
 block/blk-throttle.c                | 914 +++++++++++++++++++++++++++++++++---
 block/blk-wbt.h                     |  10 +-
 block/blk.h                         |   7 +
 include/linux/blk_types.h           |  10 +-
 11 files changed, 909 insertions(+), 94 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 02/17] blk-throttle: prepare support multiple limits Shaohua Li
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Clean up the code to avoid using -1.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6bb4fe..e45bf50 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -334,10 +334,10 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = -1;
-	tg->bps[WRITE] = -1;
-	tg->iops[READ] = -1;
-	tg->iops[WRITE] = -1;
+	tg->bps[READ] = U64_MAX;
+	tg->bps[WRITE] = U64_MAX;
+	tg->iops[READ] = UINT_MAX;
+	tg->iops[WRITE] = UINT_MAX;
 
 	return &tg->pd;
 }
@@ -380,7 +380,7 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+			(tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -771,7 +771,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+	if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1110,7 +1110,7 @@ static u64 tg_prfill_conf_u64(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct throtl_grp *tg = pd_to_tg(pd);
 	u64 v = *(u64 *)((void *)tg + off);
 
-	if (v == -1)
+	if (v == U64_MAX)
 		return 0;
 	return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1121,7 +1121,7 @@ static u64 tg_prfill_conf_uint(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct throtl_grp *tg = pd_to_tg(pd);
 	unsigned int v = *(unsigned int *)((void *)tg + off);
 
-	if (v == -1)
+	if (v == UINT_MAX)
 		return 0;
 	return __blkg_prfill_u64(sf, pd, v);
 }
@@ -1195,7 +1195,7 @@ static ssize_t tg_set_conf(struct kernfs_open_file *of,
 	if (sscanf(ctx.body, "%llu", &v) != 1)
 		goto out_finish;
 	if (!v)
-		v = -1;
+		v = U64_MAX;
 
 	tg = blkg_to_tg(ctx.blkg);
 
@@ -1270,17 +1270,17 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-	    tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+	if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
+	    tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
 		return 0;
 
-	if (tg->bps[READ] != -1)
+	if (tg->bps[READ] != U64_MAX)
 		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != -1)
+	if (tg->bps[WRITE] != U64_MAX)
 		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != -1)
+	if (tg->iops[READ] != UINT_MAX)
 		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != -1)
+	if (tg->iops[WRITE] != UINT_MAX)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
@@ -1318,7 +1318,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
-		u64 val = -1;
+		u64 val = U64_MAX;
 		int len;
 
 		if (sscanf(ctx.body, "%26s%n", tok, &len) != 1)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 02/17] blk-throttle: prepare support multiple limits
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 03/17] blk-throttle: add .low interface Shaohua Li
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

We are going to support low/max limits; each cgroup will have two limits
after that. This patch prepares for the multiple-limit change.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 114 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 73 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e45bf50..a75bfa2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+	LIMIT_MAX,
+	LIMIT_CNT,
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
 	bool has_rules[2];
 
 	/* bytes per second rate limits */
-	uint64_t bps[2];
+	uint64_t bps[2][LIMIT_CNT];
 
 	/* IOPS limits */
-	unsigned int iops[2];
+	unsigned int iops[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -147,6 +152,8 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
+	unsigned int limit_index;
+	bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -198,6 +205,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -320,7 +337,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
 	struct throtl_grp *tg;
-	int rw;
+	int rw, index;
 
 	tg = kzalloc_node(sizeof(*tg), gfp, node);
 	if (!tg)
@@ -334,10 +351,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = U64_MAX;
-	tg->bps[WRITE] = U64_MAX;
-	tg->iops[READ] = UINT_MAX;
-	tg->iops[WRITE] = UINT_MAX;
+	for (rw = READ; rw <= WRITE; rw++) {
+		for (index = 0; index < LIMIT_CNT; index++) {
+			tg->bps[rw][index] = U64_MAX;
+			tg->iops[rw][index] = UINT_MAX;
+		}
+	}
 
 	return &tg->pd;
 }
@@ -376,11 +395,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
 	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+	struct throtl_data *td = tg->td;
 	int rw;
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-			(tg->bps[rw] != U64_MAX || tg->iops[rw] != UINT_MAX);
+			(td->limit_valid[td->limit_index] &&
+			 (tg_bps_limit(tg, rw) != U64_MAX ||
+			  tg_iops_limit(tg, rw) != UINT_MAX));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -632,11 +654,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!nr_slices)
 		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -682,7 +704,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -697,7 +719,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ) / tg_iops_limit(tg, rw) + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -724,7 +746,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
@@ -736,7 +758,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -771,7 +793,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == U64_MAX && tg->iops[rw] == UINT_MAX) {
+	if (tg_bps_limit(tg, rw) == U64_MAX &&
+	    tg_iops_limit(tg, rw) == UINT_MAX) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1148,8 +1171,8 @@ static void tg_conf_updated(struct throtl_grp *tg)
 
 	throtl_log(&tg->service_queue,
 		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		   tg->bps[READ], tg->bps[WRITE],
-		   tg->iops[READ], tg->iops[WRITE]);
+		   tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+		   tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));
 
 	/*
 	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
@@ -1226,25 +1249,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 static struct cftype throtl_legacy_files[] = {
 	{
 		.name = "throttle.read_bps_device",
-		.private = offsetof(struct throtl_grp, bps[READ]),
+		.private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.write_bps_device",
-		.private = offsetof(struct throtl_grp, bps[WRITE]),
+		.private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.read_iops_device",
-		.private = offsetof(struct throtl_grp, iops[READ]),
+		.private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
 	{
 		.name = "throttle.write_iops_device",
-		.private = offsetof(struct throtl_grp, iops[WRITE]),
+		.private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
@@ -1270,18 +1293,25 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == U64_MAX && tg->bps[WRITE] == U64_MAX &&
-	    tg->iops[READ] == UINT_MAX && tg->iops[WRITE] == UINT_MAX)
+
+	if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
+	    tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
+	    tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
+	    tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
 		return 0;
 
-	if (tg->bps[READ] != U64_MAX)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != U64_MAX)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != UINT_MAX)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != UINT_MAX)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+	if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+			tg->bps[READ][LIMIT_MAX]);
+	if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+			tg->bps[WRITE][LIMIT_MAX]);
+	if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u",
+			tg->iops[READ][LIMIT_MAX]);
+	if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u",
+			tg->iops[WRITE][LIMIT_MAX]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1310,10 +1340,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ];
-	v[1] = tg->bps[WRITE];
-	v[2] = tg->iops[READ];
-	v[3] = tg->iops[WRITE];
+	v[0] = tg->bps[READ][LIMIT_MAX];
+	v[1] = tg->bps[WRITE][LIMIT_MAX];
+	v[2] = tg->iops[READ][LIMIT_MAX];
+	v[3] = tg->iops[WRITE][LIMIT_MAX];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1350,10 +1380,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ] = v[0];
-	tg->bps[WRITE] = v[1];
-	tg->iops[READ] = v[2];
-	tg->iops[WRITE] = v[3];
+	tg->bps[READ][LIMIT_MAX] = v[0];
+	tg->bps[WRITE][LIMIT_MAX] = v[1];
+	tg->iops[READ][LIMIT_MAX] = v[2];
+	tg->iops[WRITE][LIMIT_MAX] = v[3];
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1451,8 +1481,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_disp[rw], bio->bi_iter.bi_size,
+		   tg_bps_limit(tg, rw),
+		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
@@ -1563,6 +1594,7 @@ int blk_throtl_init(struct request_queue *q)
 	q->td = td;
 	td->queue = q;
 
+	td->limit_valid[LIMIT_MAX] = true;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 03/17] blk-throttle: add .low interface
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 02/17] blk-throttle: prepare support multiple limits Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2017-01-09 16:35   ` Tejun Heo
  2016-12-15 20:32 ` [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Add a low limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 134 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 106 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a75bfa2..fcc4199 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,7 @@ enum tg_state_flags {
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
+	LIMIT_LOW,
 	LIMIT_MAX,
 	LIMIT_CNT,
 };
@@ -124,11 +125,15 @@ struct throtl_grp {
 	/* are there any throtl rules between this group and td? */
 	bool has_rules[2];
 
-	/* bytes per second rate limits */
+	/* internally used bytes per second rate limits */
 	uint64_t bps[2][LIMIT_CNT];
+	/* user configured bps limits */
+	uint64_t bps_conf[2][LIMIT_CNT];
 
-	/* IOPS limits */
+	/* internally used IOPS limits */
 	unsigned int iops[2][LIMIT_CNT];
+	/* user configured IOPS limits */
+	unsigned int iops_conf[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -355,6 +360,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 		for (index = 0; index < LIMIT_CNT; index++) {
 			tg->bps[rw][index] = U64_MAX;
 			tg->iops[rw][index] = UINT_MAX;
+			tg->bps_conf[rw][index] = U64_MAX;
+			tg->iops_conf[rw][index] = UINT_MAX;
 		}
 	}
 
@@ -414,6 +421,46 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+	bool low_valid = false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+		    tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+		    tg->iops[READ][LIMIT_LOW] != UINT_MAX ||
+		    tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+			low_valid = true;
+	}
+	rcu_read_unlock();
+
+	if (low_valid)
+		td->limit_valid[LIMIT_LOW] = true;
+	else
+		td->limit_valid[LIMIT_LOW] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+
+	tg->bps[READ][LIMIT_LOW] = U64_MAX;
+	tg->bps[WRITE][LIMIT_LOW] = U64_MAX;
+	tg->iops[READ][LIMIT_LOW] = UINT_MAX;
+	tg->iops[WRITE][LIMIT_LOW] = UINT_MAX;
+
+	blk_throtl_update_valid_limit(tg->td);
+
+	if (tg->td->limit_index == LIMIT_LOW &&
+	    !tg->td->limit_valid[LIMIT_LOW])
+		tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1284,7 +1331,7 @@ static struct cftype throtl_legacy_files[] = {
 	{ }	/* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 			 int off)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1294,38 +1341,38 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 	if (!dname)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] == U64_MAX &&
-	    tg->bps[WRITE][LIMIT_MAX] == U64_MAX &&
-	    tg->iops[READ][LIMIT_MAX] == UINT_MAX &&
-	    tg->iops[WRITE][LIMIT_MAX] == UINT_MAX)
+	if (tg->bps_conf[READ][off] == U64_MAX &&
+	    tg->bps_conf[WRITE][off] == U64_MAX &&
+	    tg->iops_conf[READ][off] == UINT_MAX &&
+	    tg->iops_conf[WRITE][off] == UINT_MAX)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] != U64_MAX)
+	if (tg->bps_conf[READ][off] != U64_MAX)
 		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-			tg->bps[READ][LIMIT_MAX]);
-	if (tg->bps[WRITE][LIMIT_MAX] != U64_MAX)
+			tg->bps_conf[READ][off]);
+	if (tg->bps_conf[WRITE][off] != U64_MAX)
 		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-			tg->bps[WRITE][LIMIT_MAX]);
-	if (tg->iops[READ][LIMIT_MAX] != UINT_MAX)
+			tg->bps_conf[WRITE][off]);
+	if (tg->iops_conf[READ][off] != UINT_MAX)
 		snprintf(bufs[2], sizeof(bufs[2]), "%u",
-			tg->iops[READ][LIMIT_MAX]);
-	if (tg->iops[WRITE][LIMIT_MAX] != UINT_MAX)
+			tg->iops_conf[READ][off]);
+	if (tg->iops_conf[WRITE][off] != UINT_MAX)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u",
-			tg->iops[WRITE][LIMIT_MAX]);
+			tg->iops_conf[WRITE][off]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
 	return 0;
 }
 
-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
 {
-	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
 			  &blkcg_policy_throtl, seq_cft(sf)->private, false);
 	return 0;
 }
 
-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			  char *buf, size_t nbytes, loff_t off)
 {
 	struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1333,6 +1380,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	int ret;
+	int index = of_cft(of)->private;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1340,10 +1388,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ][LIMIT_MAX];
-	v[1] = tg->bps[WRITE][LIMIT_MAX];
-	v[2] = tg->iops[READ][LIMIT_MAX];
-	v[3] = tg->iops[WRITE][LIMIT_MAX];
+	v[0] = tg->bps_conf[READ][index];
+	v[1] = tg->bps_conf[WRITE][index];
+	v[2] = tg->iops_conf[READ][index];
+	v[3] = tg->iops_conf[WRITE][index];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1380,11 +1428,31 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ][LIMIT_MAX] = v[0];
-	tg->bps[WRITE][LIMIT_MAX] = v[1];
-	tg->iops[READ][LIMIT_MAX] = v[2];
-	tg->iops[WRITE][LIMIT_MAX] = v[3];
+	tg->bps_conf[READ][index] = v[0];
+	tg->bps_conf[WRITE][index] = v[1];
+	tg->iops_conf[READ][index] = v[2];
+	tg->iops_conf[WRITE][index] = v[3];
 
+	if (index == LIMIT_MAX) {
+		tg->bps[READ][index] = v[0];
+		tg->bps[WRITE][index] = v[1];
+		tg->iops[READ][index] = v[2];
+		tg->iops[WRITE][index] = v[3];
+	}
+	tg->bps[READ][LIMIT_LOW] = min(tg->bps_conf[READ][LIMIT_LOW],
+		tg->bps_conf[READ][LIMIT_MAX]);
+	tg->bps[WRITE][LIMIT_LOW] = min(tg->bps_conf[WRITE][LIMIT_LOW],
+		tg->bps_conf[WRITE][LIMIT_MAX]);
+	tg->iops[READ][LIMIT_LOW] = min(tg->iops_conf[READ][LIMIT_LOW],
+		tg->iops_conf[READ][LIMIT_MAX]);
+	tg->iops[WRITE][LIMIT_LOW] = min(tg->iops_conf[WRITE][LIMIT_LOW],
+		tg->iops_conf[WRITE][LIMIT_MAX]);
+
+	if (index == LIMIT_LOW) {
+		blk_throtl_update_valid_limit(tg->td);
+		if (tg->td->limit_valid[LIMIT_LOW])
+			tg->td->limit_index = LIMIT_LOW;
+	}
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1394,10 +1462,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 static struct cftype throtl_files[] = {
 	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_LOW,
+	},
+	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.seq_show = tg_print_max,
-		.write = tg_set_max,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_MAX,
 	},
 	{ }	/* terminate */
 };
@@ -1416,6 +1492,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_alloc_fn		= throtl_pd_alloc,
 	.pd_init_fn		= throtl_pd_init,
 	.pd_online_fn		= throtl_pd_online,
+	.pd_offline_fn		= throtl_pd_offline,
 	.pd_free_fn		= throtl_pd_free,
 };
 
@@ -1595,6 +1672,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->queue = q;
 
 	td->limit_valid[LIMIT_MAX] = true;
+	td->limit_index = LIMIT_MAX;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (2 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 03/17] blk-throttle: add .low interface Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2017-01-09 17:35   ` Tejun Heo
  2016-12-15 20:32 ` [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Each queue will have a state machine. Initially the queue is in LIMIT_LOW
state, which means all cgroups are throttled according to their low limits.
After all cgroups with a low limit cross that limit, the queue state gets
upgraded to LIMIT_MAX.
For the max limit, a cgroup uses the limit configured by the user.
For the low limit, a cgroup uses the minimum of the low and max limits
configured by the user. The previous patch already did the conversion.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fcc4199..e55bd36 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -212,11 +212,21 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	uint64_t ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return U64_MAX;
 	return tg->bps[rw][tg->td->limit_index];
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	unsigned int ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return UINT_MAX;
 	return tg->iops[rw][tg->td->limit_index];
 }
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (3 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2017-01-09 18:40   ` Tejun Heo
  2016-12-15 20:32 ` [PATCH V5 06/17] blk-throttle: add downgrade logic Shaohua Li
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

When the queue is in LIMIT_LOW state and all cgroups with a low limit cross
their bps/iops limits, we upgrade the queue's state to LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If the children have lower low
limits than their parent, the parent's low limit is meaningless: once the
children's bps/iops cross their low limits, we can upgrade the queue state.
The other case is children with higher low limits than their parent. Then the
children's low limits are meaningless: as long as the parent's bps/iops cross
its low limit, we can upgrade the queue state.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 97 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e55bd36..cfd74cfc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -455,6 +455,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
 		td->limit_valid[LIMIT_LOW] = false;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -466,9 +467,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
 	blk_throtl_update_valid_limit(tg->td);
 
-	if (tg->td->limit_index == LIMIT_LOW &&
-	    !tg->td->limit_valid[LIMIT_LOW])
-		tg->td->limit_index = LIMIT_MAX;
+	if (!tg->td->limit_valid[tg->td->limit_index])
+		throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1077,6 +1077,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1103,6 +1105,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
 	int ret;
 
 	spin_lock_irq(q->queue_lock);
+	if (throtl_can_upgrade(td, NULL))
+		throtl_upgrade_state(td);
+
 again:
 	parent_sq = sq->parent_sq;
 	dispatched = false;
@@ -1506,6 +1511,88 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+	bool read_limit, write_limit;
+
+	/*
+	 * if cgroup reaches low/max limit (max >= low), it's ok to next
+	 * limit
+	 */
+	read_limit = tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+		     tg->iops[READ][LIMIT_LOW] != UINT_MAX;
+	write_limit = tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+		      tg->iops[WRITE][LIMIT_LOW] != UINT_MAX;
+	if (read_limit && sq->nr_queued[READ] &&
+	    (!write_limit || sq->nr_queued[WRITE]))
+		return true;
+	if (write_limit && sq->nr_queued[WRITE] &&
+	    (!read_limit || sq->nr_queued[READ]))
+		return true;
+	return false;
+}
+
+static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
+{
+	while (true) {
+		if (throtl_tg_can_upgrade(tg))
+			return true;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+				!tg_to_blkg(tg)->parent))
+			return false;
+	}
+	return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	if (td->limit_index != LIMIT_LOW)
+		return false;
+
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg == this_tg)
+			continue;
+		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+			continue;
+		if (!throtl_hierarchy_can_upgrade(tg)) {
+			rcu_read_unlock();
+			return false;
+		}
+	}
+	rcu_read_unlock();
+	return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	td->limit_index = LIMIT_MAX;
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+		struct throtl_service_queue *sq = &tg->service_queue;
+
+		tg->disptime = jiffies - 1;
+		throtl_select_dispatch(sq);
+		throtl_schedule_next_dispatch(sq, false);
+	}
+	rcu_read_unlock();
+	throtl_select_dispatch(&td->service_queue);
+	throtl_schedule_next_dispatch(&td->service_queue, false);
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1528,14 +1615,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+again:
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL))
+		if (!tg_may_dispatch(tg, bio, NULL)) {
+			if (throtl_can_upgrade(tg->td, tg)) {
+				throtl_upgrade_state(tg->td);
+				goto again;
+			}
 			break;
+		}
 
 		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 06/17] blk-throttle: add downgrade logic
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (4 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2016-12-15 20:32 ` [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big Shaohua Li
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

When the queue state machine is in LIMIT_MAX state but a cgroup has been
below its low limit for some time, the queue should be downgraded to the
lower state, since that cgroup's low limit isn't being met.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 148 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cfd74cfc..0f65fce 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,13 @@ struct throtl_grp {
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
 
+	unsigned long last_low_overflow_time[2];
+
+	uint64_t last_bytes_disp[2];
+	unsigned int last_io_disp[2];
+
+	unsigned long last_check_time;
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -159,6 +166,9 @@ struct throtl_data
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
+
+	unsigned long low_upgrade_time;
+	unsigned long low_downgrade_time;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -896,6 +906,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	/* Charge the bio to the group */
 	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
 	tg->io_disp[rw]++;
+	tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->last_io_disp[rw]++;
 
 	/*
 	 * BIO_THROTTLED is used to prevent the same bio to be throttled
@@ -1511,6 +1523,36 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	unsigned long rtime = -1, wtime = -1;
+
+	if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
+	    tg->iops[READ][LIMIT_LOW] != UINT_MAX)
+		rtime = tg->last_low_overflow_time[READ];
+	if (tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
+	    tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
+		wtime = tg->last_low_overflow_time[WRITE];
+	return min(rtime, wtime) == -1 ? 0 : min(rtime, wtime);
+}
+
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *parent_sq;
+	struct throtl_grp *parent = tg;
+	unsigned long ret = __tg_last_low_overflow_time(tg);
+
+	while (true) {
+		parent_sq = parent->service_queue.parent_sq;
+		parent = sq_to_tg(parent_sq);
+		if (!parent)
+			break;
+		if (time_after(__tg_last_low_overflow_time(parent), ret))
+			ret = __tg_last_low_overflow_time(parent);
+	}
+	return ret;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1555,6 +1597,9 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_LOW)
 		return false;
 
+	if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+		return false;
+
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1578,6 +1623,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	struct blkcg_gq *blkg;
 
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1593,6 +1639,100 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+	td->limit_index = new;
+	td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
+{
+	struct throtl_data *td = tg->td;
+	unsigned long now = jiffies;
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
+	    time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+		return true;
+	return false;
+}
+
+static bool throtl_hierarchy_can_downgrade(struct throtl_grp *tg)
+{
+	while (true) {
+		if (!throtl_tg_can_downgrade(tg))
+			return false;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+			    !tg_to_blkg(tg)->parent))
+			break;
+	}
+	return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+	uint64_t bps;
+	unsigned int iops;
+	unsigned long elapsed_time;
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_MAX ||
+	    !tg->td->limit_valid[LIMIT_LOW])
+		return;
+	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+		return;
+	if (time_after(tg->last_check_time + throtl_slice, now))
+		return;
+
+	elapsed_time = now - tg->last_check_time;
+	tg->last_check_time = now;
+
+	if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+		return;
+
+	if (tg->bps[READ][LIMIT_LOW] != U64_MAX) {
+		bps = tg->last_bytes_disp[READ] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->bps[WRITE][LIMIT_LOW] != U64_MAX) {
+		bps = tg->last_bytes_disp[WRITE] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	if (tg->iops[READ][LIMIT_LOW] != UINT_MAX) {
+		iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+		if (iops >= tg->iops[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->iops[WRITE][LIMIT_LOW] != UINT_MAX) {
+		iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+		if (iops >= tg->iops[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (throtl_hierarchy_can_downgrade(tg))
+		throtl_downgrade_state(tg->td, LIMIT_LOW);
+
+	tg->last_bytes_disp[READ] = 0;
+	tg->last_bytes_disp[WRITE] = 0;
+	tg->last_io_disp[READ] = 0;
+	tg->last_io_disp[WRITE] = 0;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1617,12 +1757,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		if (tg->last_low_overflow_time[rw] == 0)
+			tg->last_low_overflow_time[rw] = jiffies;
+		throtl_downgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
+			tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(tg->td, tg)) {
 				throtl_upgrade_state(tg->td);
 				goto again;
@@ -1666,6 +1810,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
+	tg->last_low_overflow_time[rw] = jiffies;
+
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
@@ -1776,6 +1922,8 @@ int blk_throtl_init(struct request_queue *q)
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
+	td->low_downgrade_time = jiffies;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (5 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 06/17] blk-throttle: add downgrade logic Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2017-01-09 19:54   ` Tejun Heo
  2016-12-15 20:32 ` [PATCH V5 08/17] blk-throttle: make throtl_slice tunable Shaohua Li
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

A cgroup could be throttled to a limit, but when all cgroups cross their low
limit, the queue enters a higher state and the group should be throttled to a
higher limit. It's possible the cgroup is sleeping because of throttling and
the other cgroups no longer dispatch IO. In that case, nobody can trigger the
current downgrade/upgrade logic. To fix this, we could either set up a timer
to wake the cgroup up when other cgroups are idle, or make sure this cgroup
doesn't sleep too long. Setting up a timer means we would have to modify the
timer very frequently, so this patch chooses the latter. Capping a cgroup's
sleep time doesn't change its bps/iops, but can make it wake up more
frequently, which isn't a big issue because throtl_slice * 8 is already quite
long.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0f65fce..41ec72c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -588,6 +588,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
+	unsigned long max_expire = jiffies + 8 * throtl_slice;
+
+	if (time_after(expires, max_expire))
+		expires = max_expire;
 	mod_timer(&sq->pending_timer, expires);
 	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
 		   expires - jiffies, jiffies);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 08/17] blk-throttle: make throtl_slice tunable
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (6 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big Shaohua Li
@ 2016-12-15 20:32 ` Shaohua Li
  2017-01-09 20:08   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 09/17] blk-throttle: detect completed idle cgroup Shaohua Li
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:32 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

throtl_slice is important for blk-throttling. It's called a slice
internally, but it really is the time window over which blk-throttling
samples data. blk-throttling makes decisions based on these samples. An
example is bandwidth measurement: a cgroup's bandwidth is measured over a
throtl_slice interval.

A small throtl_slice means cgroups get smoother throughput but burn more CPU.
The default value is 100ms, which is not appropriate for all disks; a fast
SSD can dispatch a lot of IO in 100ms. This patch makes it tunable.

Since throtl_slice isn't a time slice, the sysfs name 'throttle_sample_time'
reflects its character better.
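
For example, the sample window can be shrunk to 50ms from userspace with
something like the sketch below (illustrative only; the device name is made
up, value is in milliseconds):

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/block/sda/queue/throttle_sample_time", "w");

        if (!f)
                return 1;
        fprintf(f, "50\n");     /* 50ms sample window */
        return fclose(f) ? 1 : 0;
}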

Signed-off-by: Shaohua Li <shli@fb.com>
---
 Documentation/block/queue-sysfs.txt |  6 +++
 block/blk-sysfs.c                   | 10 +++++
 block/blk-throttle.c                | 77 ++++++++++++++++++++++++++-----------
 block/blk.h                         |  3 ++
 4 files changed, 74 insertions(+), 22 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index 5164215..91f371d 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -192,5 +192,11 @@ scaling back writes. Writing a value of '0' to this file disables the
 feature. Writing a value of '-1' to this file resets the value to the
 default setting.
 
+throttle_sample_time (RW)
+-------------------------
+This is the time window that blk-throttle samples data, in millisecond.
+blk-throttle makes decision based on the samplings. Lower time means cgroups
+have more smooth throughput, but higher CPU overhead. This exists only when
+CONFIG_BLK_DEV_THROTTLING is enabled.
 
 Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05..0e3fb2a 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -690,6 +690,13 @@ static struct queue_sysfs_entry queue_wb_lat_entry = {
 	.show = queue_wb_lat_show,
 	.store = queue_wb_lat_store,
 };
+#ifdef CONFIG_BLK_DEV_THROTTLING
+static struct queue_sysfs_entry throtl_sample_time_entry = {
+	.attr = {.name = "throttle_sample_time", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_throtl_sample_time_show,
+	.store = blk_throtl_sample_time_store,
+};
+#endif
 
 static struct attribute *default_attrs[] = {
 	&queue_requests_entry.attr,
@@ -724,6 +731,9 @@ static struct attribute *default_attrs[] = {
 	&queue_stats_entry.attr,
 	&queue_wb_lat_entry.attr,
 	&queue_poll_delay_entry.attr,
+#ifdef CONFIG_BLK_DEV_THROTTLING
+	&throtl_sample_time_entry.attr,
+#endif
 	NULL,
 };
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 41ec72c..cd10c65 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -19,7 +19,8 @@ static int throtl_grp_quantum = 8;
 static int throtl_quantum = 32;
 
 /* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
+#define DFL_THROTL_SLICE (HZ / 10)
+#define MAX_THROTL_SLICE (HZ / 5)
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -162,6 +163,8 @@ struct throtl_data
 	/* Total Number of queued bios on READ and WRITE lists */
 	unsigned int nr_queued[2];
 
+	unsigned int throtl_slice;
+
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
@@ -588,7 +591,7 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
-	unsigned long max_expire = jiffies + 8 * throtl_slice;
+	unsigned long max_expire = jiffies + 8 * sq_to_tg(sq)->td->throtl_slice;
 
 	if (time_after(expires, max_expire))
 		expires = max_expire;
@@ -649,7 +652,7 @@ static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
 	if (time_after_eq(start, tg->slice_start[rw]))
 		tg->slice_start[rw] = start;
 
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -661,7 +664,7 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
+	tg->slice_end[rw] = jiffies + tg->td->throtl_slice;
 	throtl_log(&tg->service_queue,
 		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -671,13 +674,13 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
 					unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 }
 
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
+	tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
 	throtl_log(&tg->service_queue,
 		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
 		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
@@ -717,19 +720,20 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	 * is bad because it does not allow new slice to start.
 	 */
 
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+	throtl_set_slice_end(tg, rw, jiffies + tg->td->throtl_slice);
 
 	time_elapsed = jiffies - tg->slice_start[rw];
 
-	nr_slices = time_elapsed / throtl_slice;
+	nr_slices = time_elapsed / tg->td->throtl_slice;
 
 	if (!nr_slices)
 		return;
-	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * tg->td->throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices) / HZ;
+	io_trim = (tg_iops_limit(tg, rw) * tg->td->throtl_slice * nr_slices) /
+		HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -744,7 +748,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 	else
 		tg->io_disp[rw] = 0;
 
-	tg->slice_start[rw] += nr_slices * throtl_slice;
+	tg->slice_start[rw] += nr_slices * tg->td->throtl_slice;
 
 	throtl_log(&tg->service_queue,
 		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
@@ -764,9 +768,9 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	/*
 	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
@@ -813,9 +817,9 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Slice has just started. Consider one slice interval */
 	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+		jiffy_elapsed_rnd = tg->td->throtl_slice;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
 	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
@@ -881,8 +885,10 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (throtl_slice_used(tg, rw) && !(tg->service_queue.nr_queued[rw]))
 		throtl_start_new_slice(tg, rw);
 	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
+		if (time_before(tg->slice_end[rw],
+		    jiffies + tg->td->throtl_slice))
+			throtl_extend_slice(tg, rw,
+				jiffies + tg->td->throtl_slice);
 	}
 
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
@@ -1601,7 +1607,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_LOW)
 		return false;
 
-	if (time_before(jiffies, td->low_downgrade_time + throtl_slice))
+	if (time_before(jiffies, td->low_downgrade_time + td->throtl_slice))
 		return false;
 
 	rcu_read_lock();
@@ -1658,8 +1664,9 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
-	if (time_after_eq(now, td->low_upgrade_time + throtl_slice) &&
-	    time_after_eq(now, tg_last_low_overflow_time(tg) + throtl_slice))
+	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
+	    time_after_eq(now, tg_last_low_overflow_time(tg) +
+					td->throtl_slice))
 		return true;
 	return false;
 }
@@ -1689,13 +1696,14 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 		return;
 	if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
 		return;
-	if (time_after(tg->last_check_time + throtl_slice, now))
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
 		return;
 
 	elapsed_time = now - tg->last_check_time;
 	tg->last_check_time = now;
 
-	if (time_before(now, tg_last_low_overflow_time(tg) + throtl_slice))
+	if (time_before(now, tg_last_low_overflow_time(tg) +
+			tg->td->throtl_slice))
 		return;
 
 	if (tg->bps[READ][LIMIT_LOW] != U64_MAX) {
@@ -1923,6 +1931,7 @@ int blk_throtl_init(struct request_queue *q)
 
 	q->td = td;
 	td->queue = q;
+	td->throtl_slice = DFL_THROTL_SLICE;
 
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
@@ -1943,6 +1952,30 @@ void blk_throtl_exit(struct request_queue *q)
 	kfree(q->td);
 }
 
+ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page)
+{
+	if (!q->td)
+		return -EINVAL;
+	return sprintf(page, "%u\n", jiffies_to_msecs(q->td->throtl_slice));
+}
+
+ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+	const char *page, size_t count)
+{
+	unsigned long v;
+	unsigned long t;
+
+	if (!q->td)
+		return -EINVAL;
+	if (kstrtoul(page, 10, &v))
+		return -EINVAL;
+	t = msecs_to_jiffies(v);
+	if (t == 0 || t > MAX_THROTL_SLICE)
+		return -EINVAL;
+	q->td->throtl_slice = t;
+	return count;
+}
+
 static int __init throtl_init(void)
 {
 	kthrotld_workqueue = alloc_workqueue("kthrotld", WQ_MEM_RECLAIM, 0);
diff --git a/block/blk.h b/block/blk.h
index 041185e..e83e757 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -290,6 +290,9 @@ static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
 extern void blk_throtl_drain(struct request_queue *q);
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
+extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
+extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
+	const char *page, size_t count);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 09/17] blk-throttle: detect completed idle cgroup
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (7 preceding siblings ...)
  2016-12-15 20:32 ` [PATCH V5 08/17] blk-throttle: make throtl_slice tunable Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 20:13   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 10/17] blk-throttle: make bandwidth change smooth Shaohua Li
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

A cgroup could be assigned a limit but not dispatch enough IO, e.g. the
cgroup is idle. When this happens, the cgroup never hits its limit, so
we can't move the state machine to a higher level and all cgroups stay
throttled to their low limit, wasting bandwidth. Detecting an idle
cgroup is hard. This patch handles a simple case: a cgroup that doesn't
dispatch any IO. We ignore such a cgroup's limit, so other cgroups can
use the bandwidth.
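
For illustration (not part of the patch), the rule is simple: remember
when the group last dispatched a READ and a WRITE, and treat it as idle
once both timestamps are older than one throttle slice. A minimal
userspace sketch of that check, with timestamps in milliseconds and the
100ms slice value only mirroring the default:

/* sketch of the idle check, not the kernel code itself */
#include <stdbool.h>
#include <stdio.h>

#define THROTL_SLICE	100	/* assumed 100ms slice for the sketch */

struct grp {
	unsigned long last_dispatch_time[2];	/* [READ], [WRITE], in ms */
};

static bool grp_is_idle(const struct grp *g, unsigned long now)
{
	/* idle once neither direction dispatched within one slice */
	return now >= g->last_dispatch_time[0] + THROTL_SLICE &&
	       now >= g->last_dispatch_time[1] + THROTL_SLICE;
}

int main(void)
{
	struct grp g = { .last_dispatch_time = { 1000, 1050 } };

	printf("idle at t=1100? %d\n", grp_is_idle(&g, 1100));	/* 0 */
	printf("idle at t=1200? %d\n", grp_is_idle(&g, 1200));	/* 1 */
	return 0;
}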

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index cd10c65..a0ba961 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -148,6 +148,8 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	unsigned long last_dispatch_time[2];
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -437,11 +439,14 @@ static void tg_update_has_rules(struct throtl_grp *tg)
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
 {
+	struct throtl_grp *tg = pd_to_tg(pd);
 	/*
 	 * We don't want new groups to escape the limits of its ancestors.
 	 * Update has_rules[] after a new group is brought online.
 	 */
-	tg_update_has_rules(pd_to_tg(pd));
+	tg_update_has_rules(tg);
+	tg->last_dispatch_time[READ] = jiffies;
+	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1582,6 +1587,12 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 	if (write_limit && sq->nr_queued[WRITE] &&
 	    (!read_limit || sq->nr_queued[READ]))
 		return true;
+
+	if (time_after_eq(jiffies,
+	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
+	    time_after_eq(jiffies,
+	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		return true;
 	return false;
 }
 
@@ -1660,6 +1671,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
+	if (time_after_eq(now, tg->last_dispatch_time[READ] +
+					td->throtl_slice) &&
+	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
+					td->throtl_slice))
+		return false;
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
@@ -1769,6 +1785,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 10/17] blk-throttle: make bandwidth change smooth
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (8 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 09/17] blk-throttle: detect completed idle cgroup Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 20:28   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 11/17] blk-throttle: add a simple idle detection Shaohua Li
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

When all cgroups reach their low limit, cgroups can dispatch more IO. This
could make some cgroups dispatch more IO but others not, and some
cgroups could even dispatch less IO than their low limit. For example, cg1
has a low limit of 10MB/s, cg2 of 80MB/s, and assume the disk's maximum
bandwidth is 120MB/s for the workload. Their bps could look something like this:

cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80

At T1, all cgroups reach their low limit, so they can dispatch more IO later.
Then cg1 dispatches more IO and cg2 has no room to dispatch enough IO. At
T2, cg2 only dispatches 60MB/s. Since we detect that cg2 dispatches less IO
than its low limit of 80MB/s, we downgrade the queue from LIMIT_MAX to
LIMIT_LOW, and then all cgroups are throttled to their low limit (T3). cg2
will have bandwidth below its low limit most of the time.

The big problem here is that we don't know the maximum bandwidth of the
workload, so we can't make a smart decision to avoid the situation. This
patch makes the cgroup bandwidth change smooth. After the disk upgrades from
LIMIT_LOW to LIMIT_MAX, we don't allow cgroups to use all bandwidth up to
their max limit immediately. Their bandwidth limit is increased
gradually to avoid the above situation. So the above example becomes
something like:

cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
-> 45/75 -> 22/98

In this way cgroups' bandwidth will be above their low limit most of the
time. This still doesn't fully utilize the disk bandwidth, but that's
the price we pay for sharing.

Note this doesn't completely prevent a cgroup from running below its low
limit. The best way to guarantee a cgroup doesn't run below its low limit
is to set a max limit. For example, if we set cg1's max limit to 40MB/s,
cg2 will never run below its low limit.
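
As an illustration (not part of the patch), here is a minimal userspace
sketch of the ramp-up arithmetic used by tg_bps_limit()/tg_iops_limit():
after an upgrade the effective limit starts at the low limit and grows by
half of the low limit for every throtl_slice that has elapsed, capped at
the max limit. The bandwidth values are made up:

/* sketch of the gradual limit increase after an upgrade */
#include <inttypes.h>
#include <stdio.h>

static uint64_t effective_limit(uint64_t low, uint64_t max, unsigned int scale)
{
	uint64_t increase = (low >> 1) * scale;	/* low/2 per elapsed slice */
	uint64_t limit = low + increase;

	return limit < max ? limit : max;
}

int main(void)
{
	uint64_t low = 10 << 20, max = 40 << 20;	/* 10MB/s low, 40MB/s max */
	unsigned int scale;

	for (scale = 0; scale <= 8; scale++)
		printf("scale=%u limit=%" PRIu64 "MB/s\n",
		       scale, effective_limit(low, max, scale) >> 20);
	return 0;
}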

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a0ba961..6b2f365 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -174,6 +174,8 @@ struct throtl_data
 
 	unsigned long low_upgrade_time;
 	unsigned long low_downgrade_time;
+
+	unsigned int scale;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -228,21 +230,58 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	uint64_t ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return U64_MAX;
-	return tg->bps[rw][tg->td->limit_index];
+
+	td = tg->td;
+	ret = tg->bps[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] !=
+	    tg->bps[rw][LIMIT_MAX]) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->low_upgrade_time + td->scale * td->throtl_slice)) {
+			unsigned int time = jiffies - td->low_upgrade_time;
+
+			td->scale = time / td->throtl_slice;
+		}
+		increase = (tg->bps[rw][LIMIT_LOW] >> 1) * td->scale;
+		ret = min(tg->bps[rw][LIMIT_MAX],
+			tg->bps[rw][LIMIT_LOW] + increase);
+	}
+	return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
 	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	struct throtl_data *td;
 	unsigned int ret;
 
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
 		return UINT_MAX;
-	return tg->iops[rw][tg->td->limit_index];
+
+	td = tg->td;
+	ret = tg->iops[rw][td->limit_index];
+	if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] !=
+	    tg->iops[rw][LIMIT_MAX]) {
+		uint64_t increase;
+
+		if (td->scale < 4096 && time_after_eq(jiffies,
+		    td->low_upgrade_time + td->scale * td->throtl_slice)) {
+			unsigned int time = jiffies - td->low_upgrade_time;
+
+			td->scale = time / td->throtl_slice;
+		}
+
+		increase = (tg->iops[rw][LIMIT_LOW] >> 1) * td->scale;
+		ret = min(tg->iops[rw][LIMIT_MAX],
+			tg->iops[rw][LIMIT_LOW] + (unsigned int)increase);
+	}
+	return ret;
 }
 
 /**
@@ -1645,6 +1684,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
+	td->scale = 0;
 	rcu_read_lock();
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -1662,6 +1702,13 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 static void throtl_downgrade_state(struct throtl_data *td, int new)
 {
+	td->scale /= 2;
+
+	if (td->scale) {
+		td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
+		return;
+	}
+
 	td->limit_index = new;
 	td->low_downgrade_time = jiffies;
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 11/17] blk-throttle: add a simple idle detection
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (9 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 10/17] blk-throttle: make bandwidth change smooth Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 20:56   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold Shaohua Li
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such a case, the queue state machine
will remain in the LIMIT_LOW state and all other cgroups will be throttled
according to their low limit. This is unfair to the other cgroups. We should
treat the cgroup as idle and upgrade the state machine to the higher
(LIMIT_MAX) state.

We also have downgrade logic. If the state machine upgrades because the
cgroup is idle (real idle), the state machine will soon downgrade again
because the cgroup is below its low limit. This isn't what we want. A more
complicated case is a cgroup that isn't idle while the queue is in
LIMIT_LOW. But when the queue gets upgraded, other cgroups can dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup falls below
its low limit and looks idle (fake idle). In this case, the queue should
downgrade soon. The key to deciding whether we should downgrade is
detecting whether the cgroup is truly idle.

Unfortunately it's very hard to determine if a cgroup is really idle. This
patch uses the 'think time check' idea from CFQ for this purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has 100% disk utilization, hence its think time is 0, i.e.
not idle. But the workload could run at higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea only to roughly determine if a cgroup is idle.

We treat a cgroup as idle if its think time is above a threshold (by
default 50us for SSD and 1ms for HD). The idea is that think time above the
threshold starts to harm performance. HD is much slower, so a longer
think time is acceptable.
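
For illustration (not part of the patch), a minimal userspace sketch of
the bookkeeping: the think time is an exponentially weighted moving
average (7/8 old, 1/8 new) of the gap between the last completion and the
next arrival, compared against the threshold. The sample values are made
up:

/* sketch of the think-time EWMA and the idle check */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct grp {
	uint64_t last_finish_time;	/* ns */
	uint64_t avg_ttime;		/* ns */
	uint64_t idle_threshold;	/* ns */
};

static void update_ttime(struct grp *g, uint64_t now)
{
	if (g->last_finish_time && now > g->last_finish_time)
		g->avg_ttime = (g->avg_ttime * 7 +
				(now - g->last_finish_time)) >> 3;
}

static bool grp_is_idle(const struct grp *g)
{
	return g->avg_ttime > g->idle_threshold;
}

int main(void)
{
	struct grp g = { .idle_threshold = 50 * 1000 };	/* 50us SSD default */
	uint64_t now = 0;
	int i;

	for (i = 0; i < 20; i++) {
		update_ttime(&g, now);		/* a new IO arrives */
		g.last_finish_time = now + 100;	/* it completes 100ns later */
		now += 200 * 1000;		/* next IO 200us later: idle-ish */
	}
	printf("avg think time %llu ns, idle=%d\n",
	       (unsigned long long)g.avg_ttime, grp_is_idle(&g));
	return 0;
}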

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/bio.c               |  2 ++
 block/blk-throttle.c      | 65 ++++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  2 ++
 include/linux/blk_types.h |  1 +
 4 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 2b37502..948ebc1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -30,6 +30,7 @@
 #include <linux/cgroup.h>
 
 #include <trace/events/block.h>
+#include "blk.h"
 
 /*
  * Test patch to inline a certain number of bi_io_vec's inside the bio
@@ -1813,6 +1814,7 @@ void bio_endio(struct bio *bio)
 		goto again;
 	}
 
+	blk_throtl_bio_endio(bio);
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6b2f365..3cd8400 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,9 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 #define DFL_THROTL_SLICE (HZ / 10)
 #define MAX_THROTL_SLICE (HZ / 5)
+#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
+#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
+#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
 
 static struct blkcg_policy blkcg_policy_throtl;
 
@@ -153,6 +156,11 @@ struct throtl_grp {
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
+
+	u64 last_finish_time;
+	u64 checked_last_finish_time;
+	u64 avg_ttime;
+	u64 idle_ttime_threshold;
 };
 
 struct throtl_data
@@ -428,6 +436,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 			tg->iops_conf[rw][index] = UINT_MAX;
 		}
 	}
+	tg->idle_ttime_threshold = U64_MAX;
 
 	return &tg->pd;
 }
@@ -1607,6 +1616,20 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static bool throtl_tg_is_idle(struct throtl_grp *tg)
+{
+	/*
+	 * cgroup is idle if:
+	 * - single idle is too long, longer than a fixed value (in case user
+	 *   configure a too big threshold) or 4 times of slice
+	 * - average think time is more than threshold
+	 */
+	u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
+	time = min_t(u64, MAX_IDLE_TIME, time);
+	return ktime_get_ns() - tg->last_finish_time > time ||
+	       tg->avg_ttime > tg->idle_ttime_threshold;
+}
+
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1808,6 +1831,19 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 	tg->last_io_disp[WRITE] = 0;
 }
 
+static void blk_throtl_update_ttime(struct throtl_grp *tg)
+{
+	u64 now = ktime_get_ns();
+	u64 last_finish_time = tg->last_finish_time;
+
+	if (now <= last_finish_time || last_finish_time == 0 ||
+	    last_finish_time == tg->checked_last_finish_time)
+		return;
+
+	tg->avg_ttime = (tg->avg_ttime * 7 + now - last_finish_time) >> 3;
+	tg->checked_last_finish_time = last_finish_time;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1816,9 +1852,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
+	int ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
+	/* nonrot bit isn't set in queue creation */
+	if (tg->idle_ttime_threshold == U64_MAX) {
+		if (blk_queue_nonrot(q))
+			tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
+		else
+			tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+	}
+
 	/* see throtl_charge_bio() */
 	if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
 		goto out;
@@ -1828,6 +1873,12 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
+	ret = bio_associate_current(bio);
+	if (ret == 0 || ret == -EBUSY)
+		bio->bi_cg_private = tg;
+
+	blk_throtl_update_ttime(tg);
+
 	sq = &tg->service_queue;
 
 again:
@@ -1888,7 +1939,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	tg->last_low_overflow_time[rw] = jiffies;
 
-	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
@@ -1917,6 +1967,18 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	return throttled;
 }
 
+void blk_throtl_bio_endio(struct bio *bio)
+{
+	struct throtl_grp *tg;
+
+	tg = bio->bi_cg_private;
+	if (!tg)
+		return;
+	bio->bi_cg_private = NULL;
+
+	tg->last_finish_time = ktime_get_ns();
+}
+
 /*
  * Dispatch all bios from all children tg's queued on @parent_sq.  On
  * return, @parent_sq is guaranteed to not have any active children tg's
@@ -2001,6 +2063,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
 	td->low_downgrade_time = jiffies;
+
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
diff --git a/block/blk.h b/block/blk.h
index e83e757..921d04f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -293,10 +293,12 @@ extern void blk_throtl_exit(struct request_queue *q);
 extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
+extern void blk_throtl_bio_endio(struct bio *bio);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_bio_endio(struct bio *bio) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 519ea2c..01019a7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -58,6 +58,7 @@ struct bio {
 	 */
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
+	void			*bi_cg_private;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (10 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 11/17] blk-throttle: add a simple idle detection Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 20:58   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit Shaohua Li
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Add an interface to configure the idle time threshold. The io.low interface
will look like:
echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low

The idle value is in microseconds.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3cd8400..0218ff9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -180,6 +180,8 @@ struct throtl_data
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
 
+	u64 dft_idle_ttime_threshold;
+
 	unsigned long low_upgrade_time;
 	unsigned long low_downgrade_time;
 
@@ -1427,6 +1429,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	struct throtl_grp *tg = pd_to_tg(pd);
 	const char *dname = blkg_dev_name(pd->blkg);
 	char bufs[4][21] = { "max", "max", "max", "max" };
+	char idle_time[26] = "";
 
 	if (!dname)
 		return 0;
@@ -1434,7 +1437,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	if (tg->bps_conf[READ][off] == U64_MAX &&
 	    tg->bps_conf[WRITE][off] == U64_MAX &&
 	    tg->iops_conf[READ][off] == UINT_MAX &&
-	    tg->iops_conf[WRITE][off] == UINT_MAX)
+	    tg->iops_conf[WRITE][off] == UINT_MAX &&
+	    (off != LIMIT_LOW || tg->idle_ttime_threshold ==
+				  tg->td->dft_idle_ttime_threshold))
 		return 0;
 
 	if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1449,9 +1454,16 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	if (tg->iops_conf[WRITE][off] != UINT_MAX)
 		snprintf(bufs[3], sizeof(bufs[3]), "%u",
 			tg->iops_conf[WRITE][off]);
+	if (off == LIMIT_LOW) {
+		if (tg->idle_ttime_threshold == U64_MAX)
+			strcpy(idle_time, " idle=max");
+		else
+			snprintf(idle_time, sizeof(idle_time), " idle=%llu",
+				div_u64(tg->idle_ttime_threshold, 1000));
+	}
 
-	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
-		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
+	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
+		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
 	return 0;
 }
 
@@ -1469,6 +1481,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	u64 v[4];
+	u64 idle_time;
 	int ret;
 	int index = of_cft(of)->private;
 
@@ -1483,6 +1496,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	v[2] = tg->iops_conf[READ][index];
 	v[3] = tg->iops_conf[WRITE][index];
 
+	idle_time = tg->idle_ttime_threshold;
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
@@ -1514,6 +1528,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			v[2] = min_t(u64, val, UINT_MAX);
 		else if (!strcmp(tok, "wiops"))
 			v[3] = min_t(u64, val, UINT_MAX);
+		else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
+			idle_time = val;
 		else
 			goto out_finish;
 	}
@@ -1542,6 +1558,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		blk_throtl_update_valid_limit(tg->td);
 		if (tg->td->limit_valid[LIMIT_LOW])
 			tg->td->limit_index = LIMIT_LOW;
+		tg->idle_ttime_threshold = (idle_time == U64_MAX) ?
+			U64_MAX : idle_time * 1000;
 	}
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1862,6 +1880,9 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 			tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_SSD;
 		else
 			tg->idle_ttime_threshold = DFL_IDLE_THRESHOLD_HD;
+		if (tg->td->dft_idle_ttime_threshold == 0)
+			tg->td->dft_idle_ttime_threshold =
+				tg->idle_ttime_threshold;
 	}
 
 	/* see throtl_charge_bio() */
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (11 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 21:01   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency Shaohua Li
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

The last patch introduced a way to detect an idle cgroup. We use it to make
the upgrade/downgrade decision. The new algorithm can detect a completely
idle cgroup too, so we can delete the corresponding code.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0218ff9..62fe72ea 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,8 +151,6 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
-	unsigned long last_dispatch_time[2];
-
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -495,8 +493,6 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	 * Update has_rules[] after a new group is brought online.
 	 */
 	tg_update_has_rules(tg);
-	tg->last_dispatch_time[READ] = jiffies;
-	tg->last_dispatch_time[WRITE] = jiffies;
 }
 
 static void blk_throtl_update_valid_limit(struct throtl_data *td)
@@ -1669,9 +1665,8 @@ static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
 		return true;
 
 	if (time_after_eq(jiffies,
-	     tg->last_dispatch_time[READ] + tg->td->throtl_slice) &&
-	    time_after_eq(jiffies,
-	     tg->last_dispatch_time[WRITE] + tg->td->throtl_slice))
+		tg_last_low_overflow_time(tg) + tg->td->throtl_slice) &&
+	    throtl_tg_is_idle(tg))
 		return true;
 	return false;
 }
@@ -1718,6 +1713,26 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	return true;
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_LOW)
+		return;
+
+	if (time_after(tg->last_check_time + tg->td->throtl_slice, now))
+		return;
+
+	tg->last_check_time = now;
+
+	if (!time_after_eq(now,
+	     __tg_last_low_overflow_time(tg) + tg->td->throtl_slice))
+		return;
+
+	if (throtl_can_upgrade(tg->td, NULL))
+		throtl_upgrade_state(tg->td);
+}
+
 static void throtl_upgrade_state(struct throtl_data *td)
 {
 	struct cgroup_subsys_state *pos_css;
@@ -1759,18 +1774,15 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
 
-	if (time_after_eq(now, tg->last_dispatch_time[READ] +
-					td->throtl_slice) &&
-	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
-					td->throtl_slice))
-		return false;
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
 	if (time_after_eq(now, td->low_upgrade_time + td->throtl_slice) &&
 	    time_after_eq(now, tg_last_low_overflow_time(tg) +
-					td->throtl_slice))
+					td->throtl_slice) &&
+	    (!throtl_tg_is_idle(tg) ||
+	     !list_empty(&tg_to_blkg(tg)->blkcg->css.children)))
 		return true;
 	return false;
 }
@@ -1904,10 +1916,10 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
-		tg->last_dispatch_time[rw] = jiffies;
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
+		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (12 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 21:14   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 15/17] block: track request size in blk_issue_stat Shaohua Li
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Here we introduce a per-cgroup latency target. The target determines how
much extra latency a cgroup can afford. We use the target latency
to calculate a threshold and use the threshold to schedule IO for cgroups.
If a cgroup's bandwidth is below its low limit but its average latency is
below the threshold, other cgroups can safely dispatch more IO, even if
their bandwidth is higher than their low limits. On the other hand, if
the first cgroup's latency is higher than the threshold, other cgroups
are throttled to their low limits. So the target latency determines how
we efficiently utilize free disk resources without sacrificing the
workload's IO latency.

For example, assume the average 4k IO latency is 50us when the disk isn't
congested, and a cgroup sets its latency target to 30us. Then the cgroup can
accept up to 50+30=80us IO latency. If the cgroup's average IO latency is
90us and its bandwidth is below its low limit, other cgroups are throttled
to their low limit. If the cgroup's average IO latency is 60us, other
cgroups are allowed to dispatch more IO. When other cgroups dispatch
more IO, the first cgroup's IO latency will increase. If it increases to
81us, we then throttle the other cgroups.

The user configures the interface in this way:
echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low

The latency value is in microseconds.
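
For illustration (not part of the patch), a minimal userspace sketch of
the threshold rule described above: the cgroup's acceptable latency is the
measured base latency for that IO size plus the configured target. The
numbers mirror the 50us base / 30us target example and are not taken from
a real device:

/* sketch of the per-cgroup latency threshold check */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool latency_ok(uint64_t avg_lat_ns, uint64_t base_ns, uint64_t target_ns)
{
	return avg_lat_ns <= base_ns + target_ns;
}

int main(void)
{
	uint64_t base = 50 * 1000, target = 30 * 1000;	/* 50us + 30us = 80us */

	printf("60us ok? %d\n", latency_ok(60 * 1000, base, target));	/* 1 */
	printf("90us ok? %d\n", latency_ok(90 * 1000, base, target));	/* 0 */
	return 0;
}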

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 62fe72ea..3431b1d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,6 +151,7 @@ struct throtl_grp {
 
 	unsigned long last_check_time;
 
+	u64 latency_target;
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -438,6 +439,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 	tg->idle_ttime_threshold = U64_MAX;
 
+	/*
+	 * target latency default 0, eg, latency threshold is 0, which means
+	 * cgroup's latency is always higher than threshold
+	 */
+
 	return &tg->pd;
 }
 
@@ -1426,6 +1432,7 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	const char *dname = blkg_dev_name(pd->blkg);
 	char bufs[4][21] = { "max", "max", "max", "max" };
 	char idle_time[26] = "";
+	char latency_time[26] = "";
 
 	if (!dname)
 		return 0;
@@ -1434,8 +1441,9 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 	    tg->bps_conf[WRITE][off] == U64_MAX &&
 	    tg->iops_conf[READ][off] == UINT_MAX &&
 	    tg->iops_conf[WRITE][off] == UINT_MAX &&
-	    (off != LIMIT_LOW || tg->idle_ttime_threshold ==
-				  tg->td->dft_idle_ttime_threshold))
+	    (off != LIMIT_LOW ||
+	     (tg->idle_ttime_threshold == tg->td->dft_idle_ttime_threshold &&
+	      tg->latency_target == 0)))
 		return 0;
 
 	if (tg->bps_conf[READ][off] != U64_MAX)
@@ -1456,10 +1464,18 @@ static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 		else
 			snprintf(idle_time, sizeof(idle_time), " idle=%llu",
 				div_u64(tg->idle_ttime_threshold, 1000));
+
+		if (tg->latency_target == U64_MAX)
+			strcpy(latency_time, " latency=max");
+		else
+			snprintf(latency_time, sizeof(latency_time),
+				" latency=%llu",
+				div_u64(tg->latency_target, 1000));
 	}
 
-	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s\n",
-		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time);
+	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s%s%s\n",
+		   dname, bufs[0], bufs[1], bufs[2], bufs[3], idle_time,
+		   latency_time);
 	return 0;
 }
 
@@ -1478,6 +1494,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	u64 idle_time;
+	u64 latency_time;
 	int ret;
 	int index = of_cft(of)->private;
 
@@ -1493,6 +1510,7 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	v[3] = tg->iops_conf[WRITE][index];
 
 	idle_time = tg->idle_ttime_threshold;
+	latency_time = tg->latency_target;
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
 		char *p;
@@ -1526,6 +1544,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			v[3] = min_t(u64, val, UINT_MAX);
 		else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
 			idle_time = val;
+		else if (off == LIMIT_LOW && !strcmp(tok, "latency"))
+			latency_time = val;
 		else
 			goto out_finish;
 	}
@@ -1556,6 +1576,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			tg->td->limit_index = LIMIT_LOW;
 		tg->idle_ttime_threshold = (idle_time == U64_MAX) ?
 			U64_MAX : idle_time * 1000;
+		tg->latency_target = (latency_time == U64_MAX) ?
+			U64_MAX : latency_time * 1000;
 	}
 	tg_conf_updated(tg);
 	ret = 0;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 15/17] block: track request size in blk_issue_stat
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (13 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2016-12-16  2:01   ` kbuild test robot
  2017-01-09 21:17   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
                   ` (2 subsequent siblings)
  17 siblings, 2 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

Currently there is no way to know the request size when the request is
finished. The next patch will need this info, so add it to blk_issue_stat.
With this, we still have 49 bits to track time, which is still a very long
time.
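
For illustration (not part of the patch), a minimal userspace sketch of
the resulting bit layout in the single u64: 3 reserved bits at the top,
12 bits of capped request size in sectors, and 49 bits of issue time in
nanoseconds. The constants mirror the patch; the sample values are made
up:

/* sketch of packing time and size into one 64-bit word */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define RES_BITS	3
#define SIZE_BITS	12
#define RES_SHIFT	(64 - RES_BITS)			/* 61 */
#define SIZE_SHIFT	(RES_SHIFT - SIZE_BITS)		/* 49 */
#define TIME_MASK	((1ULL << SIZE_SHIFT) - 1)

static uint64_t pack(uint64_t time_ns, uint64_t sectors)
{
	uint64_t capped = sectors & ((1ULL << SIZE_BITS) - 1);

	return (time_ns & TIME_MASK) | (capped << SIZE_SHIFT);
}

int main(void)
{
	uint64_t stat = pack(123456789ULL, 2048);	/* 1MB IO = 2048 sectors */

	printf("time=%" PRIu64 " size=%" PRIu64 " sectors\n",
	       stat & TIME_MASK,
	       (stat >> SIZE_SHIFT) & ((1ULL << SIZE_BITS) - 1));
	return 0;
}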

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c          |  2 +-
 block/blk-mq.c            |  2 +-
 block/blk-stat.c          |  7 ++++---
 block/blk-stat.h          | 29 +++++++++++++++++++++++------
 block/blk-wbt.h           | 10 +++++-----
 include/linux/blk_types.h |  2 +-
 6 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..485c32d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2459,7 +2459,7 @@ void blk_start_request(struct request *req)
 	blk_dequeue_request(req);
 
 	if (test_bit(QUEUE_FLAG_STATS, &req->q->queue_flags)) {
-		blk_stat_set_issue_time(&req->issue_stat);
+		blk_stat_set_issue(&req->issue_stat, blk_rq_sectors(req));
 		req->rq_flags |= RQF_STATS;
 		wbt_issue(req->q->rq_wb, &req->issue_stat);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4bf850e..891db62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -474,7 +474,7 @@ void blk_mq_start_request(struct request *rq)
 		rq->next_rq->resid_len = blk_rq_bytes(rq->next_rq);
 
 	if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
-		blk_stat_set_issue_time(&rq->issue_stat);
+		blk_stat_set_issue(&rq->issue_stat, blk_rq_sectors(rq));
 		rq->rq_flags |= RQF_STATS;
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
diff --git a/block/blk-stat.c b/block/blk-stat.c
index 9b43efb..0469855 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -236,10 +236,11 @@ void blk_stat_clear(struct request_queue *q)
 	}
 }
 
-void blk_stat_set_issue_time(struct blk_issue_stat *stat)
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
 {
-	stat->time = (stat->time & BLK_STAT_MASK) |
-			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK);
+	stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
+		(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
+		(blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
 }
 
 /*
diff --git a/block/blk-stat.h b/block/blk-stat.h
index a2050a0..462197f 100644
--- a/block/blk-stat.h
+++ b/block/blk-stat.h
@@ -8,12 +8,19 @@
 #define BLK_STAT_NSEC_MASK	~(BLK_STAT_NSEC - 1)
 
 /*
- * Upper 3 bits can be used elsewhere
+ * from upper:
+ * 3 bits: reserved for other usage
+ * 12 bits: size
+ * 49 bits: time
  */
 #define BLK_STAT_RES_BITS	3
-#define BLK_STAT_SHIFT		(64 - BLK_STAT_RES_BITS)
-#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SHIFT) - 1)
-#define BLK_STAT_MASK		~BLK_STAT_TIME_MASK
+#define BLK_STAT_SIZE_BITS	12
+#define BLK_STAT_RES_SHIFT	(64 - BLK_STAT_RES_BITS)
+#define BLK_STAT_SIZE_SHIFT	(BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
+#define BLK_STAT_TIME_MASK	((1ULL << BLK_STAT_SIZE_SHIFT) - 1)
+#define BLK_STAT_SIZE_MASK	\
+	(((1ULL << BLK_STAT_SIZE_BITS) - 1) << BLK_STAT_SIZE_SHIFT)
+#define BLK_STAT_RES_MASK	(~((1ULL << BLK_STAT_RES_SHIFT) - 1))
 
 enum {
 	BLK_STAT_READ	= 0,
@@ -26,7 +33,7 @@ void blk_queue_stat_get(struct request_queue *, struct blk_rq_stat *);
 void blk_stat_clear(struct request_queue *);
 void blk_stat_init(struct blk_rq_stat *);
 bool blk_stat_is_current(struct blk_rq_stat *);
-void blk_stat_set_issue_time(struct blk_issue_stat *);
+void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size);
 bool blk_stat_enable(struct request_queue *);
 
 static inline u64 __blk_stat_time(u64 time)
@@ -36,7 +43,17 @@ static inline u64 __blk_stat_time(u64 time)
 
 static inline u64 blk_stat_time(struct blk_issue_stat *stat)
 {
-	return __blk_stat_time(stat->time);
+	return __blk_stat_time(stat->stat);
+}
+
+static inline sector_t blk_capped_size(sector_t size)
+{
+	return size & ((1ULL << BLK_STAT_SIZE_BITS) - 1);
+}
+
+static inline sector_t blk_stat_size(struct blk_issue_stat *stat)
+{
+	return (stat->stat & BLK_STAT_SIZE_MASK) >> BLK_STAT_SIZE_SHIFT;
 }
 
 #endif
diff --git a/block/blk-wbt.h b/block/blk-wbt.h
index 65f1de5..7265f1f 100644
--- a/block/blk-wbt.h
+++ b/block/blk-wbt.h
@@ -32,27 +32,27 @@ enum {
 
 static inline void wbt_clear_state(struct blk_issue_stat *stat)
 {
-	stat->time &= BLK_STAT_TIME_MASK;
+	stat->stat &= ~BLK_STAT_RES_MASK;
 }
 
 static inline enum wbt_flags wbt_stat_to_mask(struct blk_issue_stat *stat)
 {
-	return (stat->time & BLK_STAT_MASK) >> BLK_STAT_SHIFT;
+	return (stat->stat & BLK_STAT_RES_MASK) >> BLK_STAT_RES_SHIFT;
 }
 
 static inline void wbt_track(struct blk_issue_stat *stat, enum wbt_flags wb_acct)
 {
-	stat->time |= ((u64) wb_acct) << BLK_STAT_SHIFT;
+	stat->stat |= ((u64) wb_acct) << BLK_STAT_RES_SHIFT;
 }
 
 static inline bool wbt_is_tracked(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_TRACKED;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_TRACKED;
 }
 
 static inline bool wbt_is_read(struct blk_issue_stat *stat)
 {
-	return (stat->time >> BLK_STAT_SHIFT) & WBT_READ;
+	return (stat->stat >> BLK_STAT_RES_SHIFT) & WBT_READ;
 }
 
 struct rq_wait {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 01019a7..8159a6c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -257,7 +257,7 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 }
 
 struct blk_issue_stat {
-	u64 time;
+	u64 stat;
 };
 
 #define BLK_RQ_STAT_BATCH	64
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (14 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 15/17] block: track request size in blk_issue_stat Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 21:39   ` Tejun Heo
  2016-12-15 20:33 ` [PATCH V5 17/17] blk-throttle: add latency target support Shaohua Li
  2017-01-09 21:46 ` [PATCH V5 00/17] blk-throttle: add .low limit Tejun Heo
  17 siblings, 1 reply; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

The user configures a latency target, but the latency threshold for each
request size isn't fixed. For an SSD, the IO latency highly depends on the
request size. To calculate the latency threshold, we sample some data, e.g.
the average latency for request sizes 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sampled latency (I'll call it
the base latency) plus the latency target. For example, if the base latency
for request size 4k is 80us and the user configures a latency target of
60us, the 4k latency threshold will be 80 + 60 = 140us.

To sample data, we calculate the base-2 order of the rounded-up IO sectors.
If the IO size is bigger than 1M, it is accounted as 1M. Since the
calculation rounds up, the base latency will be slightly smaller
than the actual value. Also, if no IO is dispatched for a specific
IO size, we use the base latency of the next smaller IO size for that IO
size.

But we shouldn't sample data all the time. The base latency is supposed
to be the latency when the disk isn't congested, because we use the latency
threshold to schedule IO between cgroups. If the disk is congested, the
latency is higher and using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LIMIT_LOW state, with the
assumption that the disk isn't congested in that state. If the assumption
isn't true, e.g. the low limit is too high, the calculated latency threshold
will be higher.

Hard disks are completely different: latency depends on spindle seeks
instead of request size. Currently this feature is SSD only; we could
probably use a fixed threshold like 4ms for hard disks though.
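
For illustration (not part of the patch), a simplified userspace sketch of
the sampling: IOs are binned by request size into 9 buckets (<=4k up to
>=1M) via the base-2 order of their sector count, and each bucket keeps a
7/8-weighted moving average of the sampled latency as its base latency.
The per-cpu batching and the 32-sample minimum from the patch are omitted,
and the sample latencies are made up:

/* sketch of bucketing IO sizes and averaging their base latency */
#include <stdint.h>
#include <stdio.h>

#define LATENCY_BUCKET_SIZE 9

/* order_base_2() for the sketch: ceil(log2(sectors)) */
static int order_base_2(uint64_t v)
{
	int order = 0;

	while ((1ULL << order) < v)
		order++;
	return order;
}

static int request_bucket_index(uint64_t sectors)
{
	int order = order_base_2(sectors);

	if (order < 3)
		order = 3;				/* <=4k -> bucket 0 */
	if (order > 3 + LATENCY_BUCKET_SIZE - 1)
		order = 3 + LATENCY_BUCKET_SIZE - 1;	/* >=1M -> bucket 8 */
	return order - 3;
}

int main(void)
{
	uint64_t base[LATENCY_BUCKET_SIZE] = { 0 };
	/* {sectors, sampled latency in us}, made-up values */
	uint64_t samples[][2] = { { 8, 80 }, { 8, 90 }, { 256, 400 } };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		int b = request_bucket_index(samples[i][0]);

		base[b] = base[b] ? (base[b] * 7 + samples[i][1]) >> 3
				  : samples[i][1];
	}
	for (i = 0; i < LATENCY_BUCKET_SIZE; i++)
		printf("bucket %u base latency %lu us\n", i,
		       (unsigned long)base[i]);
	return 0;
}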

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-stat.c          |   4 ++
 block/blk-throttle.c      | 159 +++++++++++++++++++++++++++++++++++++++++++++-
 block/blk.h               |   2 +
 include/linux/blk_types.h |   9 +--
 4 files changed, 168 insertions(+), 6 deletions(-)

diff --git a/block/blk-stat.c b/block/blk-stat.c
index 0469855..15c1621 100644
--- a/block/blk-stat.c
+++ b/block/blk-stat.c
@@ -7,6 +7,7 @@
 #include <linux/blk-mq.h>
 
 #include "blk-stat.h"
+#include "blk.h"
 #include "blk-mq.h"
 
 static void blk_stat_flush_batch(struct blk_rq_stat *stat)
@@ -204,6 +205,9 @@ void blk_stat_add(struct blk_rq_stat *stat, struct request *rq)
 		__blk_stat_init(stat, now);
 
 	value = now - blk_stat_time(&rq->issue_stat);
+
+	blk_throtl_stat_add(rq, value);
+
 	if (value > stat->max)
 		stat->max = value;
 	if (value < stat->min)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3431b1d..1dc707a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,6 +25,8 @@ static int throtl_quantum = 32;
 #define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
 #define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
 
+#define SKIP_TRACK (((u64)1) << BLK_STAT_RES_SHIFT)
+
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -162,6 +164,19 @@ struct throtl_grp {
 	u64 idle_ttime_threshold;
 };
 
+/* We measure latency for request size from <= 4k to >= 1M */
+#define LATENCY_BUCKET_SIZE 9
+
+struct latency_bucket {
+	u64 total_latency;
+	int samples;
+};
+
+struct avg_latency_bucket {
+	u64 latency;
+	bool valid;
+};
+
 struct throtl_data
 {
 	/* service tree for active throtl groups */
@@ -185,6 +200,13 @@ struct throtl_data
 	unsigned long low_downgrade_time;
 
 	unsigned int scale;
+
+	struct latency_bucket tmp_buckets[LATENCY_BUCKET_SIZE];
+	struct avg_latency_bucket avg_buckets[LATENCY_BUCKET_SIZE];
+	struct latency_bucket __percpu *latency_buckets;
+	unsigned long last_calculate_time;
+
+	unsigned int track_bio_latency;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -293,6 +315,13 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	return ret;
 }
 
+static int request_bucket_index(sector_t sectors)
+{
+	int order = order_base_2(sectors);
+
+	return clamp_t(int, order - 3, 0, LATENCY_BUCKET_SIZE - 1);
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -1896,12 +1925,74 @@ static void blk_throtl_update_ttime(struct throtl_grp *tg)
 	tg->checked_last_finish_time = last_finish_time;
 }
 
+static void throtl_update_latency_buckets(struct throtl_data *td)
+{
+	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
+	int i, cpu;
+	u64 last_latency = 0;
+	u64 latency;
+
+	if (!blk_queue_nonrot(td->queue))
+		return;
+	if (time_before(jiffies, td->last_calculate_time + HZ))
+		return;
+	td->last_calculate_time = jiffies;
+
+	memset(avg_latency, 0, sizeof(avg_latency));
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		struct latency_bucket *tmp = &td->tmp_buckets[i];
+
+		for_each_possible_cpu(cpu) {
+			struct latency_bucket *bucket;
+
+			/* this isn't race free, but ok in practice */
+			bucket = per_cpu_ptr(td->latency_buckets, cpu);
+			tmp->total_latency += bucket[i].total_latency;
+			tmp->samples += bucket[i].samples;
+			bucket[i].total_latency = 0;
+			bucket[i].samples = 0;
+		}
+
+		if (tmp->samples >= 32) {
+			int samples = tmp->samples;
+
+			latency = tmp->total_latency;
+
+			tmp->total_latency = 0;
+			tmp->samples = 0;
+			do_div(latency, samples);
+			if (latency == 0)
+				continue;
+			avg_latency[i].latency = latency;
+		}
+	}
+
+	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
+		if (!avg_latency[i].latency) {
+			if (td->avg_buckets[i].latency < last_latency)
+				td->avg_buckets[i].latency = last_latency;
+			continue;
+		}
+
+		if (!td->avg_buckets[i].valid)
+			latency = avg_latency[i].latency;
+		else
+			latency = (td->avg_buckets[i].latency * 7 +
+				avg_latency[i].latency) >> 3;
+
+		td->avg_buckets[i].latency = max(latency, last_latency);
+		td->avg_buckets[i].valid = true;
+		last_latency = td->avg_buckets[i].latency;
+	}
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
 	struct throtl_qnode *qn = NULL;
 	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
 	struct throtl_service_queue *sq;
+	struct throtl_data *td;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
 	int ret;
@@ -1919,12 +2010,21 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 				tg->idle_ttime_threshold;
 	}
 
+	td = tg->td;
+	if (td->track_bio_latency == UINT_MAX) {
+		td->track_bio_latency = !q->mq_ops && !q->request_fn;
+		if (!td->track_bio_latency)
+			blk_stat_enable(q);
+	}
+
 	/* see throtl_charge_bio() */
 	if (bio_flagged(bio, BIO_THROTTLED) || !tg->has_rules[rw])
 		goto out;
 
 	spin_lock_irq(q->queue_lock);
 
+	throtl_update_latency_buckets(tg->td);
+
 	if (unlikely(blk_queue_bypass(q)))
 		goto out_unlock;
 
@@ -1932,6 +2032,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	if (ret == 0 || ret == -EBUSY)
 		bio->bi_cg_private = tg;
 
+	blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
+
 	blk_throtl_update_ttime(tg);
 
 	sq = &tg->service_queue;
@@ -2019,19 +2121,62 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	 */
 	if (!throttled)
 		bio_clear_flag(bio, BIO_THROTTLED);
+	else
+		bio->bi_issue_stat.stat |= SKIP_TRACK;
 	return throttled;
 }
 
+static void throtl_track_latency(struct throtl_data *td, sector_t size,
+	int op, u64 time)
+{
+	struct latency_bucket *latency;
+	int index;
+
+	if (!td || td->limit_index != LIMIT_LOW || op != REQ_OP_READ ||
+	    !blk_queue_nonrot(td->queue))
+		return;
+
+	index = request_bucket_index(size);
+
+	latency = get_cpu_ptr(td->latency_buckets);
+	latency[index].total_latency += time;
+	latency[index].samples++;
+	put_cpu_ptr(td->latency_buckets);
+}
+
+void blk_throtl_stat_add(struct request *rq, u64 time)
+{
+	struct request_queue *q = rq->q;
+	struct throtl_data *td = q->td;
+
+	throtl_track_latency(td, blk_stat_size(&rq->issue_stat),
+		req_op(rq), time);
+}
+
 void blk_throtl_bio_endio(struct bio *bio)
 {
 	struct throtl_grp *tg;
+	u64 finish_time;
+	u64 start_time;
+	u64 lat;
 
 	tg = bio->bi_cg_private;
 	if (!tg)
 		return;
 	bio->bi_cg_private = NULL;
 
-	tg->last_finish_time = ktime_get_ns();
+	finish_time = ktime_get_ns();
+	tg->last_finish_time = finish_time;
+
+	start_time = blk_stat_time(&bio->bi_issue_stat);
+	finish_time = __blk_stat_time(finish_time);
+	if (start_time && finish_time > start_time &&
+	    tg->td->track_bio_latency == 1 &&
+	    !(bio->bi_issue_stat.stat & SKIP_TRACK)) {
+		lat = finish_time - start_time;
+		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
+			bio_op(bio), lat);
+	}
 }
 
 /*
@@ -2106,6 +2251,12 @@ int blk_throtl_init(struct request_queue *q)
 	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
 	if (!td)
 		return -ENOMEM;
+	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
+		LATENCY_BUCKET_SIZE, __alignof__(u64));
+	if (!td->latency_buckets) {
+		kfree(td);
+		return -ENOMEM;
+	}
 
 	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue);
@@ -2119,10 +2270,13 @@ int blk_throtl_init(struct request_queue *q)
 	td->low_upgrade_time = jiffies;
 	td->low_downgrade_time = jiffies;
 
+	td->track_bio_latency = UINT_MAX;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
-	if (ret)
+	if (ret) {
+		free_percpu(td->latency_buckets);
 		kfree(td);
+	}
 	return ret;
 }
 
@@ -2131,6 +2285,7 @@ void blk_throtl_exit(struct request_queue *q)
 	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
 	blkcg_deactivate_policy(q, &blkcg_policy_throtl);
+	free_percpu(q->td->latency_buckets);
 	kfree(q->td);
 }
 
diff --git a/block/blk.h b/block/blk.h
index 921d04f..dfb44e6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -294,11 +294,13 @@ extern ssize_t blk_throtl_sample_time_show(struct request_queue *q, char *page);
 extern ssize_t blk_throtl_sample_time_store(struct request_queue *q,
 	const char *page, size_t count);
 extern void blk_throtl_bio_endio(struct bio *bio);
+extern void blk_throtl_stat_add(struct request *rq, u64 time);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 static inline void blk_throtl_bio_endio(struct bio *bio) { }
+static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8159a6c..d88a3c0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -17,6 +17,10 @@ struct io_context;
 struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *);
 
+struct blk_issue_stat {
+	u64 stat;
+};
+
 /*
  * main unit of I/O for the block layer and lower layers (ie drivers and
  * stacking drivers)
@@ -59,6 +63,7 @@ struct bio {
 	struct io_context	*bi_ioc;
 	struct cgroup_subsys_state *bi_css;
 	void			*bi_cg_private;
+	struct blk_issue_stat	bi_issue_stat;
 #endif
 	union {
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
@@ -256,10 +261,6 @@ static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
 	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
 }
 
-struct blk_issue_stat {
-	u64 stat;
-};
-
 #define BLK_RQ_STAT_BATCH	64
 
 struct blk_rq_stat {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH V5 17/17] blk-throttle: add latency target support
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (15 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
@ 2016-12-15 20:33 ` Shaohua Li
  2017-01-09 21:46 ` [PATCH V5 00/17] blk-throttle: add .low limit Tejun Heo
  17 siblings, 0 replies; 35+ messages in thread
From: Shaohua Li @ 2016-12-15 20:33 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: kernel-team, axboe, tj, vgoyal

One hard problem in adding the .low limit is detecting an idle cgroup. If a
cgroup doesn't dispatch enough IO against its low limit, we must have a
mechanism to decide whether other cgroups can dispatch more IO. We added the
think time detection mechanism before, but it doesn't work for all
workloads. Here we add a latency based approach.

We already have a mechanism to calculate a latency threshold for each IO
size. For every IO dispatched from a cgroup, we compare its latency
against the threshold and record the result. If most IO latencies are below
the threshold (in the code I use 75%), the cgroup can be treated as idle and
other cgroups can dispatch more IO.

Currently this latency target check is only for SSDs, as we can't
calculate the latency target for hard disks. And this is only for cgroup
leaf nodes so far.
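
For illustration (not part of the patch), a minimal userspace sketch of
the rule: count total IOs and IOs whose latency exceeded the threshold,
treat the cgroup as idle while bad_bio_cnt * 5 < bio_cnt, and halve both
counters so old history fades. The time-based reset from the patch is
simplified to a count trigger, and the latency samples are made up:

/* sketch of the "mostly under threshold" latency idle check */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct grp {
	unsigned int bio_cnt;
	unsigned int bad_bio_cnt;
};

static void account_bio(struct grp *g, uint64_t lat_ns, uint64_t threshold_ns)
{
	if (lat_ns > threshold_ns)
		g->bad_bio_cnt++;
	g->bio_cnt++;
	if (g->bio_cnt > 1024) {	/* decay old history */
		g->bio_cnt /= 2;
		g->bad_bio_cnt /= 2;
	}
}

static bool latency_idle(const struct grp *g)
{
	return g->bio_cnt && g->bad_bio_cnt * 5 < g->bio_cnt;
}

int main(void)
{
	struct grp g = { 0 };
	int i;

	/* 90% of IOs at 70us, 10% at 200us, threshold 140us */
	for (i = 0; i < 100; i++)
		account_bio(&g, i < 90 ? 70000 : 200000, 140000);
	printf("bad=%u total=%u idle=%d\n", g.bad_bio_cnt, g.bio_cnt,
	       latency_idle(&g));
	return 0;
}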

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 41 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1dc707a..915ebf5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -162,6 +162,10 @@ struct throtl_grp {
 	u64 checked_last_finish_time;
 	u64 avg_ttime;
 	u64 idle_ttime_threshold;
+
+	unsigned int bio_cnt; /* total bios */
+	unsigned int bad_bio_cnt; /* bios exceeding latency threshold */
+	unsigned long bio_cnt_reset_time;
 };
 
 /* We measure latency for request size from <= 4k to >= 1M */
@@ -1688,11 +1692,14 @@ static bool throtl_tg_is_idle(struct throtl_grp *tg)
 	 * - single idle is too long, longer than a fixed value (in case user
 	 *   configure a too big threshold) or 4 times of slice
 	 * - average think time is more than threshold
+	 * - IO latency is largely below threshold
 	 */
 	u64 time = (u64)jiffies_to_usecs(4 * tg->td->throtl_slice) * 1000;
 	time = min_t(u64, MAX_IDLE_TIME, time);
 	return ktime_get_ns() - tg->last_finish_time > time ||
-	       tg->avg_ttime > tg->idle_ttime_threshold;
+	       tg->avg_ttime > tg->idle_ttime_threshold ||
+	       (tg->latency_target && tg->bio_cnt &&
+		tg->bad_bio_cnt * 5 < tg->bio_cnt);
 }
 
 static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
@@ -2170,12 +2177,36 @@ void blk_throtl_bio_endio(struct bio *bio)
 
 	start_time = blk_stat_time(&bio->bi_issue_stat);
 	finish_time = __blk_stat_time(finish_time);
-	if (start_time && finish_time > start_time &&
-	    tg->td->track_bio_latency == 1 &&
-	    !(bio->bi_issue_stat.stat & SKIP_TRACK)) {
-		lat = finish_time - start_time;
+	if (!start_time || finish_time <= start_time)
+		return;
+
+	lat = finish_time - start_time;
+	if (tg->td->track_bio_latency == 1 &&
+	    !(bio->bi_issue_stat.stat & SKIP_TRACK))
 		throtl_track_latency(tg->td, blk_stat_size(&bio->bi_issue_stat),
 			bio_op(bio), lat);
+
+	if (tg->latency_target) {
+		int bucket;
+		unsigned int threshold;
+
+		bucket = request_bucket_index(
+			blk_stat_size(&bio->bi_issue_stat));
+		threshold = tg->td->avg_buckets[bucket].latency +
+			tg->latency_target;
+		if (lat > threshold)
+			tg->bad_bio_cnt++;
+		/*
+		 * Not race free, could get wrong count, which means cgroups
+		 * will be throttled
+		 */
+		tg->bio_cnt++;
+	}
+
+	if (time_after(jiffies, tg->bio_cnt_reset_time) || tg->bio_cnt > 1024) {
+		tg->bio_cnt_reset_time = tg->td->throtl_slice + jiffies;
+		tg->bio_cnt /= 2;
+		tg->bad_bio_cnt /= 2;
 	}
 }
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH V5 15/17] block: track request size in blk_issue_stat
  2016-12-15 20:33 ` [PATCH V5 15/17] block: track request size in blk_issue_stat Shaohua Li
@ 2016-12-16  2:01   ` kbuild test robot
  2017-01-09 21:17   ` Tejun Heo
  1 sibling, 0 replies; 35+ messages in thread
From: kbuild test robot @ 2016-12-16  2:01 UTC (permalink / raw)
  To: Shaohua Li
  Cc: kbuild-all, linux-block, linux-kernel, kernel-team, axboe, tj, vgoyal

[-- Attachment #1: Type: text/plain, Size: 1924 bytes --]

Hi Shaohua,

[auto build test WARNING on block/for-next]
[also build test WARNING on next-20161215]
[cannot apply to v4.9]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Shaohua-Li/blk-throttle-add-low-limit/20161216-093257
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
config: i386-randconfig-s0-201650 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   block/blk-stat.c: In function 'blk_stat_set_issue':
>> block/blk-stat.c:243:26: warning: left shift count >= width of type [-Wshift-count-overflow]
      (blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
                             ^~

vim +243 block/blk-stat.c

   227			queue_for_each_hw_ctx(q, hctx, i) {
   228				hctx_for_each_ctx(hctx, ctx, j) {
   229					blk_stat_init(&ctx->stat[BLK_STAT_READ]);
   230					blk_stat_init(&ctx->stat[BLK_STAT_WRITE]);
   231				}
   232			}
   233		} else {
   234			blk_stat_init(&q->rq_stats[BLK_STAT_READ]);
   235			blk_stat_init(&q->rq_stats[BLK_STAT_WRITE]);
   236		}
   237	}
   238	
   239	void blk_stat_set_issue(struct blk_issue_stat *stat, sector_t size)
   240	{
   241		stat->stat = (stat->stat & BLK_STAT_RES_MASK) |
   242			(ktime_to_ns(ktime_get()) & BLK_STAT_TIME_MASK) |
 > 243			(blk_capped_size(size) << BLK_STAT_SIZE_SHIFT);
   244	}
   245	
   246	/*
   247	 * Enable stat tracking, return whether it was enabled
   248	 */
   249	bool blk_stat_enable(struct request_queue *q)
   250	{
   251		if (!test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28621 bytes --]


* Re: [PATCH V5 03/17] blk-throttle: add .low interface
  2016-12-15 20:32 ` [PATCH V5 03/17] blk-throttle: add .low interface Shaohua Li
@ 2017-01-09 16:35   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 16:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Happy new year, Shaohua.

Sorry about the long delay.

On Thu, Dec 15, 2016 at 12:32:54PM -0800, Shaohua Li wrote:
> Add low limit for cgroup and corresponding cgroup interface.

It'd be nice to explain why we're adding separate _conf fields.

> +static void blk_throtl_update_valid_limit(struct throtl_data *td)

I think blk_throtl_update_limit_valid() would be more consistent.

> +{
> +	struct cgroup_subsys_state *pos_css;
> +	struct blkcg_gq *blkg;
> +	bool low_valid = false;
> +
> +	rcu_read_lock();
> +	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
> +		struct throtl_grp *tg = blkg_to_tg(blkg);
> +
> +		if (tg->bps[READ][LIMIT_LOW] != U64_MAX ||
> +		    tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
> +		    tg->iops[READ][LIMIT_LOW] != UINT_MAX ||
> +		    tg->iops[WRITE][LIMIT_LOW] != UINT_MAX)
> +			low_valid = true;

It's weird that it's defaulting to MAX.  Shouldn't it default to 0?
When we enable these limits on a cgroup, we want it to not affect the
operation without further configuration.  For the max limit, MAX is
that no-op default, since being over the limit is what changes the
behavior.  For low, it's the other way around: we enforce the latency
target while cgroups are under the low limit, and thus 0 should be the
no-op default value, the same as in memcg.

> +	}
> +	rcu_read_unlock();
> +
> +	if (low_valid)
> +		td->limit_valid[LIMIT_LOW] = true;
> +	else
> +		td->limit_valid[LIMIT_LOW] = false;

Maybe
	td->limit_valid[LIMIT_LOW] = low_valid;

Thanks.

-- 
tejun


* Re: [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit
  2016-12-15 20:32 ` [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
@ 2017-01-09 17:35   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 17:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:32:55PM -0800, Shaohua Li wrote:
> each queue will have a state machine. Initially queue is in LIMIT_LOW
> state, which means all cgroups will be throttled according to their low
> limit. After all cgroups with low limit cross the limit, the queue state
> gets upgraded to LIMIT_MAX state.

As with the previous patch, I don't think it's correct to default to
MAX for LIMIT_LOW and then enforce it.  Without specific
configuration, a cgroup should always be above low (so, no latency
requirement) and below max (no hard limiting).

Thanks.

-- 
tejun


* Re: [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state
  2016-12-15 20:32 ` [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
@ 2017-01-09 18:40   ` Tejun Heo
  2017-01-09 19:46     ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 18:40 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello, Shaohua.

On Thu, Dec 15, 2016 at 12:32:56PM -0800, Shaohua Li wrote:
> For a cgroup hierarchy, there are two cases. Children have a lower low
> limit than parent. Parent's low limit is meaningless. If children's
> bps/iops cross low limit, we can upgrade queue state. The other case is
> children have a higher low limit than parent. Children's low limit is
> meaningless. As long as parent's bps/iops cross low limit, we can
> upgrade queue state.

The above isn't completely accurate as the parent should consider the
sum of what's currently being used in the children.

> +static bool throtl_tg_can_upgrade(struct throtl_grp *tg)
> +{
> +	struct throtl_service_queue *sq = &tg->service_queue;
> +	bool read_limit, write_limit;
> +
> +	/*
> +	 * if cgroup reaches low/max limit (max >= low), it's ok to next
> +	 * limit
> +	 */
> +	read_limit = tg->bps[READ][LIMIT_LOW] != U64_MAX ||
> +		     tg->iops[READ][LIMIT_LOW] != UINT_MAX;
> +	write_limit = tg->bps[WRITE][LIMIT_LOW] != U64_MAX ||
> +		      tg->iops[WRITE][LIMIT_LOW] != UINT_MAX;
> +	if (read_limit && sq->nr_queued[READ] &&
> +	    (!write_limit || sq->nr_queued[WRITE]))
> +		return true;
> +	if (write_limit && sq->nr_queued[WRITE] &&
> +	    (!read_limit || sq->nr_queued[READ]))
> +		return true;

I think it'd be great to explain the above.  It was a bit difficult
for me to follow.  It's also interesting because we're tying state
transitions for both read and write together.  blk-throtl has been
handling reads and writes independently; now the mode switching from
low to max is shared across reads and writes.  I suppose it could be
fine, but would it be complex to separate them out?  It's weird to make
this one state shared across reads and writes while not doing so for
the others. Or was this sharing intentional?

> +	return false;
> +}
> +
> +static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
> +{
> +	while (true) {
> +		if (throtl_tg_can_upgrade(tg))
> +			return true;
> +		tg = sq_to_tg(tg->service_queue.parent_sq);
> +		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
> +				!tg_to_blkg(tg)->parent))
> +			return false;

Isn't the low limit v2 only?  Do we need the on_dfl test this deep?

> +	}
> +	return false;
> +}
> +
> +static bool throtl_can_upgrade(struct throtl_data *td,
> +	struct throtl_grp *this_tg)
> +{
> +	struct cgroup_subsys_state *pos_css;
> +	struct blkcg_gq *blkg;
> +
> +	if (td->limit_index != LIMIT_LOW)
> +		return false;
> +
> +	rcu_read_lock();
> +	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
> +		struct throtl_grp *tg = blkg_to_tg(blkg);
> +
> +		if (tg == this_tg)
> +			continue;
> +		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
> +			continue;
> +		if (!throtl_hierarchy_can_upgrade(tg)) {
> +			rcu_read_unlock();
> +			return false;
> +		}
> +	}
> +	rcu_read_unlock();
> +	return true;
> +}

So, if all cgroups with a low limit are over their limits (have commands
queued in the delay queue), the state can be upgraded, right?  Yeah, that
seems correct to me.  The patch description didn't seem to match that,
though.  Can you please update the description accordingly?

Thanks.

-- 
tejun


* Re: [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state
  2017-01-09 18:40   ` Tejun Heo
@ 2017-01-09 19:46     ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 19:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello, again.

On Mon, Jan 09, 2017 at 01:40:53PM -0500, Tejun Heo wrote:
> I think it'd be great to explain the above.  It was a bit difficult
> for me to follow.  It's also interesting because we're tying state
> transitions for both read and write together.  blk-throtl has been
> handling reads and writes independently, now the mode switching from
> low to max is shared across reads and writes.  I suppose it could be
> fine but would it be complex to separate them out?  It's weird to make
> this one state shared across reads and writes while not for others or
> was this sharing intentional?

I thought more about it and as the low limit is regulated by latency,
it makes sense to make the state shared across reads and writes;
otherwise, IOs in one direction could easily mess up the other
direction.  Can you please document that this is an intentional design
and explain the rationale tho?

Thanks.

-- 
tejun


* Re: [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big
  2016-12-15 20:32 ` [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big Shaohua Li
@ 2017-01-09 19:54   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 19:54 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Thu, Dec 15, 2016 at 12:32:58PM -0800, Shaohua Li wrote:
> cgroup could be throttled to a limit but when all cgroups cross high
> limit, queue enters a higher state and so the group should be throttled
> to a higher limit. It's possible the cgroup is sleeping because of
> throttle and other cgroups don't dispatch IO any more. In this case,
> nobody can trigger current downgrade/upgrade logic. To fix this issue,
> we could either set up a timer to wakeup the cgroup if other cgroups are
> idle or make sure this cgroup doesn't sleep too long. Setting up a timer
> means we must change the timer very frequently. This patch chooses the
> latter. Capping the cgroup's sleep time wouldn't change its bps/iops,
> but could make it wake up more frequently, which isn't a big
> issue because throtl_slice * 8 is already quite big.
> 
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  block/blk-throttle.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 0f65fce..41ec72c 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -588,6 +588,10 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
>  static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
>  					  unsigned long expires)
>  {
> +	unsigned long max_expire = jiffies + 8 * throtl_slice;
> +
> +	if (time_after(expires, max_expire))
> +		expires = max_expire;

A comment explaining why we need this would be nice.
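
Something along these lines, say (the wording is just a suggestion,
lifted from the patch description):

	/*
	 * Upgrade/downgrade decisions are only made when IO is dispatched
	 * or completed, so a cgroup sleeping on its throttle timer could
	 * sleep for a very long time if every other cgroup goes idle.
	 * Cap the sleep time so the cgroup wakes up and the state gets
	 * re-evaluated; this doesn't change its bps/iops, it only makes
	 * the cgroup wake up more often.
	 */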

Thanks.

-- 
tejun


* Re: [PATCH V5 08/17] blk-throttle: make throtl_slice tunable
  2016-12-15 20:32 ` [PATCH V5 08/17] blk-throttle: make throtl_slice tunable Shaohua Li
@ 2017-01-09 20:08   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 20:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:32:59PM -0800, Shaohua Li wrote:
> throtl_slice is important for blk-throttling. It's called slice
> internally but it really is a time window blk-throttling samples data.
> blk-throttling will make decision based on the samplings. An example is
> bandwidth measurement. A cgroup's bandwidth is measured in the time
> interval of throtl_slice.
> 
> A small throtl_slice means cgroups have smoother throughput but burn
> more CPU. The default value is 100ms, which is not appropriate for all
> disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
> it tunable.
> 
> Since throtl_slice isn't a time slice, the sysfs name
> 'throttle_sample_time' reflects its character better.

Generally looks good.  I think some documentation would be great.
Also, do we want to set it to a lower number if blk_queue_nonrot()?
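
Just as a sketch of what I mean (DFL_THROTL_SLICE_SSD is a made-up name
and the value is only an example; this would have to run somewhere the
rotational flag is already known):

	if (blk_queue_nonrot(q))
		td->throtl_slice = DFL_THROTL_SLICE_SSD;	/* e.g. HZ / 50 */
	else
		td->throtl_slice = DFL_THROTL_SLICE;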

Thanks.

-- 
tejun


* Re: [PATCH V5 09/17] blk-throttle: detect completed idle cgroup
  2016-12-15 20:33 ` [PATCH V5 09/17] blk-throttle: detect completed idle cgroup Shaohua Li
@ 2017-01-09 20:13   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 20:13 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:33:00PM -0800, Shaohua Li wrote:
> @@ -1660,6 +1671,11 @@ static bool throtl_tg_can_downgrade(struct throtl_grp *tg)
>  	struct throtl_data *td = tg->td;
>  	unsigned long now = jiffies;
>  
> +	if (time_after_eq(now, tg->last_dispatch_time[READ] +
> +					td->throtl_slice) &&
> +	    time_after_eq(now, tg->last_dispatch_time[WRITE] +
> +					td->throtl_slice))
> +		return false;

So, the duration used here is gonna be made explicitly configurable by
a future patch, right?  Might be worthwhile to note that in the
description.

Thanks.

-- 
tejun


* Re: [PATCH V5 10/17] blk-throttle: make bandwidth change smooth
  2016-12-15 20:33 ` [PATCH V5 10/17] blk-throttle: make bandwidth change smooth Shaohua Li
@ 2017-01-09 20:28   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 20:28 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:33:01PM -0800, Shaohua Li wrote:
>  static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
>  {
>  	struct blkcg_gq *blkg = tg_to_blkg(tg);
> +	struct throtl_data *td;
>  	uint64_t ret;
>  
>  	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
>  		return U64_MAX;
> -	return tg->bps[rw][tg->td->limit_index];
> +
> +	td = tg->td;
> +	ret = tg->bps[rw][td->limit_index];
> +	if (td->limit_index == LIMIT_MAX && tg->bps[rw][LIMIT_LOW] !=
> +	    tg->bps[rw][LIMIT_MAX]) {
> +		uint64_t increase;
> +
> +		if (td->scale < 4096 && time_after_eq(jiffies,

Hmm... why do we need to limit scale to 4096?  As 4096 is a big
number, this is only theoretical, but it means that if max is more
than 2048 times low, it will never be reached, right?

> +		    td->low_upgrade_time + td->scale * td->throtl_slice)) {
> +			unsigned int time = jiffies - td->low_upgrade_time;
> +
> +			td->scale = time / td->throtl_slice;
> +		}
> +		increase = (tg->bps[rw][LIMIT_LOW] >> 1) * td->scale;
> +		ret = min(tg->bps[rw][LIMIT_MAX],
> +			tg->bps[rw][LIMIT_LOW] + increase);
> +	}
> +	return ret;
>  }

I think the code can use some comments.

>  static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
>  {
>  	struct blkcg_gq *blkg = tg_to_blkg(tg);
> +	struct throtl_data *td;
>  	unsigned int ret;
>  
>  	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
>  		return UINT_MAX;
> -	return tg->iops[rw][tg->td->limit_index];
> +
> +	td = tg->td;
> +	ret = tg->iops[rw][td->limit_index];
> +	if (td->limit_index == LIMIT_MAX && tg->iops[rw][LIMIT_LOW] !=
> +	    tg->iops[rw][LIMIT_MAX]) {
> +		uint64_t increase;
> +
> +		if (td->scale < 4096 && time_after_eq(jiffies,
> +		    td->low_upgrade_time + td->scale * td->throtl_slice)) {
> +			unsigned int time = jiffies - td->low_upgrade_time;
> +
> +			td->scale = time / td->throtl_slice;
> +		}
> +
> +		increase = (tg->iops[rw][LIMIT_LOW] >> 1) * td->scale;
> +		ret = min(tg->iops[rw][LIMIT_MAX],
> +			tg->iops[rw][LIMIT_LOW] + (unsigned int)increase);

Would it be worthwhile to factor the common part into a helper?
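
Something like this, as a rough sketch with throtl_adjusted_limit() as
a made-up name:

	static u64 throtl_adjusted_limit(u64 low, struct throtl_data *td)
	{
		/* same scale update as in the two copies above */
		if (td->scale < 4096 && time_after_eq(jiffies,
		    td->low_upgrade_time + td->scale * td->throtl_slice))
			td->scale = (jiffies - td->low_upgrade_time) /
				td->throtl_slice;

		return low + (low >> 1) * td->scale;
	}

and then tg_bps_limit() would just do

	ret = min(tg->bps[rw][LIMIT_MAX],
		  throtl_adjusted_limit(tg->bps[rw][LIMIT_LOW], td));

with the iops side casting the result back to unsigned int.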

> @@ -1662,6 +1702,13 @@ static void throtl_upgrade_state(struct throtl_data *td)
>  
>  static void throtl_downgrade_state(struct throtl_data *td, int new)
>  {
> +	td->scale /= 2;
> +
> +	if (td->scale) {
> +		td->low_upgrade_time = jiffies - td->scale * td->throtl_slice;
> +		return;
> +	}

Cool, so linear increase and exponential backdown.  Yeah, that makes
sense to me but let's please document it.
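
Maybe a comment block above tg_bps_limit()/tg_iops_limit() roughly like
this (again, only a suggestion):

	/*
	 * While the queue is in LIMIT_MAX state, grow each cgroup's
	 * effective limit linearly: every throtl_slice since the last
	 * upgrade adds another low/2 on top of the low limit, capped by
	 * the max limit.  On downgrade the scale is halved, so a cgroup
	 * that keeps bouncing between the two states backs off
	 * exponentially.
	 */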

Thanks.

-- 
tejun


* Re: [PATCH V5 11/17] blk-throttle: add a simple idle detection
  2016-12-15 20:33 ` [PATCH V5 11/17] blk-throttle: add a simple idle detection Shaohua Li
@ 2017-01-09 20:56   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 20:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Thu, Dec 15, 2016 at 12:33:02PM -0800, Shaohua Li wrote:
>  /* Throttling is performed over 100ms slice and after that slice is renewed */
>  #define DFL_THROTL_SLICE (HZ / 10)
>  #define MAX_THROTL_SLICE (HZ / 5)
> +#define DFL_IDLE_THRESHOLD_SSD (50 * 1000) /* 50 us */
> +#define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
> +#define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */

Hmm... why are we capping idle time so low?  This is a value to be
configured by userland.  Does it make sense to cap it this low?  Also,
wouldn't it make sense to start with a higher default value, given that
the user has to explicitly enable the low limit for it to be effective
and is thus explicitly requesting the best-effort latency target which
will be added later?

I'm really uncomfortable with pitting these two knobs against each
other in similar time ranges.  It's really difficult to tell what a
latency target of 25us means, and to predict its behavior, when the
idle timeout is 50us.  It's fine if some people fiddle with them but
it'd be great if the defaults clearly indicate that they're operating
in mostly separate time scales.

Thanks.

-- 
tejun


* Re: [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold
  2016-12-15 20:33 ` [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold Shaohua Li
@ 2017-01-09 20:58   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 20:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Thu, Dec 15, 2016 at 12:33:03PM -0800, Shaohua Li wrote:
> @@ -180,6 +180,8 @@ struct throtl_data
>  	unsigned int limit_index;
>  	bool limit_valid[LIMIT_CNT];
>  
> +	u64 dft_idle_ttime_threshold;

BTW, wouldn't idle_time be a better name for these?  Currently, it's
"idle thinktime" and I'm not sure what that means.

Thanks.

-- 
tejun


* Re: [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit
  2016-12-15 20:33 ` [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit Shaohua Li
@ 2017-01-09 21:01   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 21:01 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Thu, Dec 15, 2016 at 12:33:04PM -0800, Shaohua Li wrote:
> Last patch introduces a way to detect idle cgroup. We use it to make
> upgrade/downgrade decision. And the new algorithm can detect completely
> idle cgroup too, so we can delete the corresponding code.

Ah, okay, the slice-based idle detection gets completely removed here.
It might have been better not to add it in the first place, but no
biggie either way.

Thanks.

-- 
tejun


* Re: [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency
  2016-12-15 20:33 ` [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency Shaohua Li
@ 2017-01-09 21:14   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 21:14 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:33:05PM -0800, Shaohua Li wrote:
> @@ -438,6 +439,11 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
>  	}
>  	tg->idle_ttime_threshold = U64_MAX;
>  
> +	/*
> +	 * target latency default 0, eg, latency threshold is 0, which means
> +	 * cgroup's latency is always higher than threshold
> +	 */
> +
>  	return &tg->pd;
>  }

So, this is something that bothers me about the default settings.
I suspect the reason the earlier patch went for a tight idle time is
that we're setting the default latency to zero, so to achieve good
utilization, the idle timeout must be shortened so that it neutralizes
the 0 latency target here.

I don't think this is a good default configuration.  The latency
target should be the mechanism that determines how shareable an active
cgroup under its low limit is.  That's the only thing it can do anyway.
The idle time mechanism should serve a different purpose, not an
overlapping one.

If we want to default to latency guarantee, we can go for 0 latency
and a long idle timeout, even infinity.  If we want to default to good
utilization, we should pick a reasonable latency target (which is tied
to the device latency) with a reasonable idle timeout (which is tied
to how a human perceives something to be idle).

Please note that it's kinda clear we're misconfiguring it in the
previous patch, in that we're altering the idle timeout based on the
device type.  The idle timeout is about application behavior, which
isn't really decided by request completion latency.  On the other hand,
the latency target is the parameter that is device-dependent.  The fact
that the previous patch picks a different idle time depending on the
device type means that the roles of idle timeout and latency target are
overlapping.  They shouldn't be.  It gets really confusing.

Thanks.

-- 
tejun


* Re: [PATCH V5 15/17] block: track request size in blk_issue_stat
  2016-12-15 20:33 ` [PATCH V5 15/17] block: track request size in blk_issue_stat Shaohua Li
  2016-12-16  2:01   ` kbuild test robot
@ 2017-01-09 21:17   ` Tejun Heo
  1 sibling, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 21:17 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Thu, Dec 15, 2016 at 12:33:06PM -0800, Shaohua Li wrote:
> Currently there is no way to know the request size when the request is
> finished. The next patch will need this info, so add it to blk_issue_stat.
> With this, we will have 49 bits to track time, which is still a very long time.

Not necessarily an objection but do we really need to overload the
size field?  Would a normal extra field hurt too much?

Thanks.

-- 
tejun


* Re: [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency
  2016-12-15 20:33 ` [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
@ 2017-01-09 21:39   ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 21:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

On Thu, Dec 15, 2016 at 12:33:07PM -0800, Shaohua Li wrote:
> User configures latency target, but the latency threshold for each
> request size isn't fixed. For a SSD, the IO latency highly depends on
> request size. To calculate latency threshold, we sample some data, eg,
> average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
> threshold of each request size will be the sample latency (I'll call it
> base latency) plus latency target. For example, the base latency for
> request size 4k is 80us and user configures latency target 60us. The 4k
> latency threshold will be 80 + 60 = 140us.

Ah okay, the user configures the extra latency.  Yeah, this is way
better than treating what the user configures as the target latency
for 4k IOs.

> @@ -25,6 +25,8 @@ static int throtl_quantum = 32;
>  #define DFL_IDLE_THRESHOLD_HD (1000 * 1000) /* 1 ms */
>  #define MAX_IDLE_TIME (500L * 1000 * 1000) /* 500 ms */
>  
> +#define SKIP_TRACK (((u64)1) << BLK_STAT_RES_SHIFT)

SKIP_LATENCY?

> +static void throtl_update_latency_buckets(struct throtl_data *td)
> +{
> +	struct avg_latency_bucket avg_latency[LATENCY_BUCKET_SIZE];
> +	int i, cpu;
> +	u64 last_latency = 0;
> +	u64 latency;
> +
> +	if (!blk_queue_nonrot(td->queue))
> +		return;
> +	if (time_before(jiffies, td->last_calculate_time + HZ))
> +		return;
> +	td->last_calculate_time = jiffies;
> +
> +	memset(avg_latency, 0, sizeof(avg_latency));
> +	for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> +		struct latency_bucket *tmp = &td->tmp_buckets[i];
> +
> +		for_each_possible_cpu(cpu) {
> +			struct latency_bucket *bucket;
> +
> +			/* this isn't race free, but ok in practice */
> +			bucket = per_cpu_ptr(td->latency_buckets, cpu);
> +			tmp->total_latency += bucket[i].total_latency;
> +			tmp->samples += bucket[i].samples;

Heh, this *can* lead to surprising results (like reading zero for a
value larger than 2^32) on 32bit machines due to split updates, and if
we're using nanosecs, those surprises have a chance, albeit low, of
happening every four secs, which is a bit unsettling.  If we have to
use nanosecs, let's please use u64_stats_sync.  If we're okay with
microsecs, ulongs should be fine.
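
Roughly what I have in mind, as a sketch (field types are guessed from
the usage above; only the u64_stats_sync member is new, and the
irq-safe handling is omitted here):

	struct latency_bucket {
		u64 total_latency;
		int samples;
		struct u64_stats_sync syncp;
	};

	/* writer side, on the per-cpu bucket in the completion path */
	u64_stats_update_begin(&bucket->syncp);
	bucket->total_latency += lat;
	bucket->samples++;
	u64_stats_update_end(&bucket->syncp);

	/* reader side, when folding the per-cpu buckets */
	unsigned int start;
	u64 total;
	int samples;

	do {
		start = u64_stats_fetch_begin(&bucket->syncp);
		total = bucket->total_latency;
		samples = bucket->samples;
	} while (u64_stats_fetch_retry(&bucket->syncp, start));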

>  void blk_throtl_bio_endio(struct bio *bio)
>  {
>  	struct throtl_grp *tg;
> +	u64 finish_time;
> +	u64 start_time;
> +	u64 lat;
>  
>  	tg = bio->bi_cg_private;
>  	if (!tg)
>  		return;
>  	bio->bi_cg_private = NULL;
>  
> -	tg->last_finish_time = ktime_get_ns();
> +	finish_time = ktime_get_ns();
> +	tg->last_finish_time = finish_time;
> +
> +	start_time = blk_stat_time(&bio->bi_issue_stat);
> +	finish_time = __blk_stat_time(finish_time);
> +	if (start_time && finish_time > start_time &&
> +	    tg->td->track_bio_latency == 1 &&
> +	    !(bio->bi_issue_stat.stat & SKIP_TRACK)) {

Heh, can't we collapse some of the conditions?  e.g. flip SKIP_TRACK
to TRACK_LATENCY and set it iff the td has track_bio_latency set and
also the bio has start time set?
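
i.e. something like the below, with the flag set on the submission side
when the issue timestamp is recorded (just a sketch):

	if (tg->td->track_bio_latency == 1 &&
	    blk_stat_time(&bio->bi_issue_stat))
		bio->bi_issue_stat.stat |= TRACK_LATENCY;

and then blk_throtl_bio_endio() only needs

	if ((bio->bi_issue_stat.stat & TRACK_LATENCY) &&
	    finish_time > start_time) {
		lat = finish_time - start_time;
		throtl_track_latency(tg->td,
				     blk_stat_size(&bio->bi_issue_stat),
				     bio_op(bio), lat);
	}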

> @@ -2106,6 +2251,12 @@ int blk_throtl_init(struct request_queue *q)
>  	td = kzalloc_node(sizeof(*td), GFP_KERNEL, q->node);
>  	if (!td)
>  		return -ENOMEM;
> +	td->latency_buckets = __alloc_percpu(sizeof(struct latency_bucket) *
> +		LATENCY_BUCKET_SIZE, __alignof__(u64));
> +	if (!td->latency_buckets) {
> +		kfree(td);
> +		return -ENOMEM;
> +	}
>  
>  	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
>  	throtl_service_queue_init(&td->service_queue);
> @@ -2119,10 +2270,13 @@ int blk_throtl_init(struct request_queue *q)
>  	td->low_upgrade_time = jiffies;
>  	td->low_downgrade_time = jiffies;
>  
> +	td->track_bio_latency = UINT_MAX;

I don't think using 0, 1, UINT_MAX as enums is good for readability.
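
e.g. something like the following (names are just suggestions):

	enum {
		LATENCY_TRACK_UNDECIDED,	/* today's UINT_MAX */
		LATENCY_TRACK_DISABLED,		/* today's 0 */
		LATENCY_TRACK_ENABLED,		/* today's 1 */
	};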

Thanks.

-- 
tejun


* Re: [PATCH V5 00/17] blk-throttle: add .low limit
  2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
                   ` (16 preceding siblings ...)
  2016-12-15 20:33 ` [PATCH V5 17/17] blk-throttle: add latency target support Shaohua Li
@ 2017-01-09 21:46 ` Tejun Heo
  2017-01-09 22:27   ` Shaohua Li
  17 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2017-01-09 21:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

Hello,

Sorry about the long delay.  Generally looks good to me.  Overall,
there are only a few things that I think should be addressed.

* Low limit should default to zero.

* The default values and roles of idle timeout vs. latency target.

I don't think either would be difficult to address.  Oh and a bit more
documentation overall would be nice.

Thanks a lot for the great work and happy new year!

-- 
tejun


* Re: [PATCH V5 00/17] blk-throttle: add .low limit
  2017-01-09 21:46 ` [PATCH V5 00/17] blk-throttle: add .low limit Tejun Heo
@ 2017-01-09 22:27   ` Shaohua Li
  0 siblings, 0 replies; 35+ messages in thread
From: Shaohua Li @ 2017-01-09 22:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-block, linux-kernel, kernel-team, axboe, vgoyal

On Mon, Jan 09, 2017 at 04:46:35PM -0500, Tejun Heo wrote:
> Hello,
> 
> Sorry about the long delay.  Generally looks good to me.  Overall,
> there are only a few things that I think should be addressed.

Thanks for your time!
 
> * Low limit should default to zero.

I forgot to change it after changing the name from high to low

> * The default values and roles of idle timeout vs. latency target.
> 
> I don't think either would be difficult to address.  Oh and a bit more
> documentation overall would be nice.

Will fix these.

> Thanks a lot for the great work and happy new year!

Thanks and happy new year!

Best regards,
Shaohua


end of thread, other threads:[~2017-01-09 22:27 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
2016-12-15 20:32 [PATCH V5 00/17] blk-throttle: add .low limit Shaohua Li
2016-12-15 20:32 ` [PATCH V5 01/17] blk-throttle: use U64_MAX/UINT_MAX to replace -1 Shaohua Li
2016-12-15 20:32 ` [PATCH V5 02/17] blk-throttle: prepare support multiple limits Shaohua Li
2016-12-15 20:32 ` [PATCH V5 03/17] blk-throttle: add .low interface Shaohua Li
2017-01-09 16:35   ` Tejun Heo
2016-12-15 20:32 ` [PATCH V5 04/17] blk-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
2017-01-09 17:35   ` Tejun Heo
2016-12-15 20:32 ` [PATCH V5 05/17] blk-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
2017-01-09 18:40   ` Tejun Heo
2017-01-09 19:46     ` Tejun Heo
2016-12-15 20:32 ` [PATCH V5 06/17] blk-throttle: add downgrade logic Shaohua Li
2016-12-15 20:32 ` [PATCH V5 07/17] blk-throttle: make sure expire time isn't too big Shaohua Li
2017-01-09 19:54   ` Tejun Heo
2016-12-15 20:32 ` [PATCH V5 08/17] blk-throttle: make throtl_slice tunable Shaohua Li
2017-01-09 20:08   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 09/17] blk-throttle: detect completed idle cgroup Shaohua Li
2017-01-09 20:13   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 10/17] blk-throttle: make bandwidth change smooth Shaohua Li
2017-01-09 20:28   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 11/17] blk-throttle: add a simple idle detection Shaohua Li
2017-01-09 20:56   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 12/17] blk-throttle: add interface to configure idle time threshold Shaohua Li
2017-01-09 20:58   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 13/17] blk-throttle: ignore idle cgroup limit Shaohua Li
2017-01-09 21:01   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 14/17] blk-throttle: add interface for per-cgroup target latency Shaohua Li
2017-01-09 21:14   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 15/17] block: track request size in blk_issue_stat Shaohua Li
2016-12-16  2:01   ` kbuild test robot
2017-01-09 21:17   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 16/17] blk-throttle: add a mechanism to estimate IO latency Shaohua Li
2017-01-09 21:39   ` Tejun Heo
2016-12-15 20:33 ` [PATCH V5 17/17] blk-throttle: add latency target support Shaohua Li
2017-01-09 21:46 ` [PATCH V5 00/17] blk-throttle: add .low limit Tejun Heo
2017-01-09 22:27   ` Shaohua Li
