* [PATCH 00/10] block-throttle: add low/high limit
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Hi,

This patch set adds low and high limits for blk-throttle cgroups. The
interfaces are io.low and io.high.
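
Both files use the same one-line-per-device, key=value format as the
existing io.max interface. An illustrative io.low entry (device number and
values made up for the example, not taken from the patches) could look
like:

  8:0 rbps=10485760 wbps=5242880 riops=0 wiops=0

A write only needs the keys being changed (e.g. "8:0 rbps=10485760");
unset fields read back as 0 in io.low (no protection) and as "max" in
io.high (no limit).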

The low limit implements best-effort bandwidth/iops protection. If one
cgroup hasn't reached its low limit, no other cgroup can use more
bandwidth/iops than its own low limit. A cgroup without a low limit is not
protected: as long as some cgroup with a low limit hasn't reached that
limit yet, cgroups without a low limit are throttled to a very low
bandwidth/iops.

The high limit implements a best-effort limitation. A cgroup with a high
limit can use more than its high limit bandwidth/iops only if all cgroups
use at least their high limit bandwidth/iops. If one cgroup is below its
high limit, no cgroup can use more bandwidth/iops than its own high limit.
If some cgroups have a high limit and others don't, the cgroups without a
high limit use their max limit as the high limit.

The disk queue has a state machine with 3 states: LIMIT_LOW, LIMIT_HIGH
and LIMIT_MAX. In each state, cgroups are throttled up to the limit that
corresponds to the state: the low limit in LIMIT_LOW, the high limit in
LIMIT_HIGH and the max limit in LIMIT_MAX. When certain conditions are
met, the queue can upgrade to a higher state or downgrade to a lower
state. For example, if the queue is in the LIMIT_LOW state and all cgroups
reach their low limit, the queue is upgraded to LIMIT_HIGH. Conversely, if
the queue is in the LIMIT_MAX state but one cgroup is below its high
limit, the queue is downgraded to LIMIT_HIGH. If no cgroup has a limit
configured for a specific state, that state is invalid and is skipped when
upgrading/downgrading. Initially the queue state is LIMIT_MAX until some
cgroup gets a low/high limit set, which maintains backward compatibility
for users with only max limits set.
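
For reference, here is a minimal standalone sketch of the state
bookkeeping described above. The enum and the limit_valid[] array mirror
what the patches add; the helper that picks the next valid state is only
an illustration of the "skip invalid states" rule, not the kernel code
itself:

#include <stdbool.h>

enum {
	LIMIT_LOW,	/* throttle to the low limits */
	LIMIT_HIGH,	/* throttle to the high limits */
	LIMIT_MAX,	/* throttle to the max limits, always valid */
	LIMIT_CNT,
};

struct limit_state {
	int limit_index;		/* current queue state */
	bool limit_valid[LIMIT_CNT];	/* some cgroup set a limit for this state */
};

/* Upgrade target: the next higher state with at least one limit configured. */
static int next_upgrade_target(const struct limit_state *s)
{
	int i;

	for (i = s->limit_index + 1; i < LIMIT_CNT; i++)
		if (s->limit_valid[i])
			return i;
	return LIMIT_MAX;
}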

If downgrades/upgrades happened only according to the limits, we would
have a performance issue. For example, if one cgroup has a low limit set
but never dispatches enough IO to reach it, the queue state would remain
LIMIT_LOW; other cgroups would be throttled and the whole disk utilization
would be low. To solve this, if a cgroup stays below its limit for a long
time, we treat the cgroup as idle and its corresponding limit is ignored
by the upgrade/downgrade logic. The idle-based upgrade introduces a
dilemma though, since we also downgrade when a cgroup is below its limit
(i.e. looks idle). For example, if a cgroup is below its low limit for a
long time, the queue is upgraded to the HIGH state. The cgroup continues
to be below its low limit, so the queue is downgraded back to the LOW
state. In this example, the queue keeps switching between the LOW and HIGH
states.

The key to avoiding unnecessary state switching is detecting whether a
cgroup is truly idle, which unfortunately is a hard problem. There are two
kinds of idle. In the first, the cgroup simply doesn't intend to dispatch
enough IO (real idle); in this case we should upgrade quickly and not
downgrade. In the second, other cgroups dispatch too much IO and use all
the bandwidth, so the cgroup can't dispatch enough IO and merely looks
idle (fake idle); in this case we should downgrade quickly and never
upgrade.

Distinguishing the two kinds of idle is impossible for a high queue depth
disk as far as I can tell. This patch set doesn't try to precisely detect
idle. Instead we record a history of upgrades. If the queue upgraded
because cgroups hit their limits, a future downgrade is likely due to fake
idle, hence future upgrades should run slowly and future downgrades should
run quickly. Otherwise a future downgrade is likely due to real idle,
hence future upgrades should run quickly and future downgrades should run
slowly. The adaptive upgrade/downgrade intervals mean that downgrades
during real idle and upgrades during fake idle happen rarely. This doesn't
completely avoid repeated state switching though. Please see patch 6 for
details.
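
To make the interval arithmetic concrete, here is a small standalone
sketch that mirrors throtl_calculate_low_interval() from patch 6; the
100ms base interval and the sample history value are illustrative:

#include <stdio.h>

int main(void)
{
	/*
	 * 8-bit upgrade history: a 1 bit records an upgrade that happened
	 * because cgroups hit their limits, a 0 bit an upgrade that
	 * happened because a cgroup looked idle.
	 */
	unsigned char history = 0xFC;	/* 6 "hit limit" vs 2 "idle" upgrades */
	unsigned int base = 100;	/* base check interval, in ms */
	unsigned int ubits = __builtin_popcount(history);
	unsigned int dbits = 8 - ubits;
	unsigned int upgrade_interval, downgrade_interval;

	if (ubits == 0)
		ubits = 1;
	if (dbits == 0)
		dbits = 1;

	if (ubits >= dbits) {	/* mostly real limit hits: upgrade slowly */
		upgrade_interval = ubits / dbits * base;
		downgrade_interval = base;
	} else {		/* mostly idle upgrades: downgrade slowly */
		upgrade_interval = base;
		downgrade_interval = dbits / ubits * base;
	}
	/* prints: upgrade interval 300ms, downgrade interval 100ms */
	printf("upgrade interval %ums, downgrade interval %ums\n",
	       upgrade_interval, downgrade_interval);
	return 0;
}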

Users must set the limits carefully; an improper setting may end up being
ignored. For example, say the disk's max bandwidth is 100M/s, one cgroup
has a low limit of 60M/s and another has 50M/s. When the first cgroup runs
at 60M/s, only 40M/s of bandwidth remains, so the second cgroup will never
reach 50M/s; it will be treated as idle and its limit will effectively be
ignored.

Comments and benchmarks are welcome!

Thanks,
Shaohua

Shaohua Li (10):
  block-throttle: prepare support multiple limits
  block-throttle: add .low interface
  block-throttle: configure bps/iops limit for cgroup in low limit
  block-throttle: add upgrade logic for LIMIT_LOW state
  block-throttle: add downgrade logic
  block-throttle: idle detection
  block-throttle: add .high interface
  block-throttle: handle high limit
  blk-throttle: make sure expire time isn't too big
  blk-throttle: add trace log

 block/blk-throttle.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 764 insertions(+), 49 deletions(-)

-- 
2.8.0.rc2


* [PATCH 01/10] block-throttle: prepare support multiple limits
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

We are going to support low/high limits, after which each cgroup will have
3 limits. This patch prepares for the multiple-limits change.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 109 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 68 insertions(+), 41 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2149a1d..162d54c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -83,6 +83,11 @@ enum tg_state_flags {
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
+enum {
+	LIMIT_MAX = 0,
+	LIMIT_CNT = 1,
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -120,10 +125,10 @@ struct throtl_grp {
 	bool has_rules[2];
 
 	/* bytes per second rate limits */
-	uint64_t bps[2];
+	uint64_t bps[2][LIMIT_CNT];
 
 	/* IOPS limits */
-	unsigned int iops[2];
+	unsigned int iops[2][LIMIT_CNT];
 
 	/* Number of bytes disptached in current slice */
 	uint64_t bytes_disp[2];
@@ -152,6 +157,8 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct work_struct dispatch_work;
+	unsigned int limit_index;
+	bool limit_valid[LIMIT_CNT];
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -203,6 +210,16 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 		return container_of(sq, struct throtl_data, service_queue);
 }
 
+static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->bps[rw][tg->td->limit_index];
+}
+
+static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
+{
+	return tg->iops[rw][tg->td->limit_index];
+}
+
 /**
  * throtl_log - log debug message via blktrace
  * @sq: the service_queue being reported
@@ -326,7 +343,7 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq)
 static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 {
 	struct throtl_grp *tg;
-	int rw;
+	int rw, index;
 
 	tg = kzalloc_node(sizeof(*tg), gfp, node);
 	if (!tg)
@@ -340,10 +357,12 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->bps[READ] = -1;
-	tg->bps[WRITE] = -1;
-	tg->iops[READ] = -1;
-	tg->iops[WRITE] = -1;
+	for (rw = READ; rw <= WRITE; rw++) {
+		for (index = 0; index < LIMIT_CNT; index++) {
+			tg->bps[rw][index] = -1;
+			tg->iops[rw][index] = -1;
+		}
+	}
 
 	return &tg->pd;
 }
@@ -382,11 +401,14 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
 static void tg_update_has_rules(struct throtl_grp *tg)
 {
 	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+	struct throtl_data *td = tg->td;
 	int rw;
 
 	for (rw = READ; rw <= WRITE; rw++)
 		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
-				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+			(td->limit_valid[td->limit_index] &&
+			 (tg_bps_limit(tg, rw) != -1 ||
+			  tg_iops_limit(tg, rw) != -1));
 }
 
 static void throtl_pd_online(struct blkg_policy_data *pd)
@@ -638,11 +660,11 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	if (!nr_slices)
 		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
+	tmp = tg_bps_limit(tg, rw) * throtl_slice * nr_slices;
 	do_div(tmp, HZ);
 	bytes_trim = tmp;
 
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
+	io_trim = (tg_iops_limit(tg, rw) * throtl_slice * nr_slices)/HZ;
 
 	if (!bytes_trim && !io_trim)
 		return;
@@ -688,7 +710,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	 * have been trimmed.
 	 */
 
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
+	tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 
 	if (tmp > UINT_MAX)
@@ -703,7 +725,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
+	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg_iops_limit(tg, rw) + 1;
 
 	if (jiffy_wait > jiffy_elapsed)
 		jiffy_wait = jiffy_wait - jiffy_elapsed;
@@ -730,7 +752,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
+	tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
@@ -742,7 +764,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 
 	/* Calc approx time to dispatch */
 	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+	jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
 
 	if (!jiffy_wait)
 		jiffy_wait = 1;
@@ -777,7 +799,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
+	if (tg_bps_limit(tg, rw) == -1 && tg_iops_limit(tg, rw) == -1) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1152,8 +1174,8 @@ static void tg_conf_updated(struct throtl_grp *tg)
 
 	throtl_log(&tg->service_queue,
 		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		   tg->bps[READ], tg->bps[WRITE],
-		   tg->iops[READ], tg->iops[WRITE]);
+		   tg_bps_limit(tg, READ), tg_bps_limit(tg, WRITE),
+		   tg_iops_limit(tg, READ), tg_iops_limit(tg, WRITE));
 
 	/*
 	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
@@ -1230,25 +1252,25 @@ static ssize_t tg_set_conf_uint(struct kernfs_open_file *of,
 static struct cftype throtl_legacy_files[] = {
 	{
 		.name = "throttle.read_bps_device",
-		.private = offsetof(struct throtl_grp, bps[READ]),
+		.private = offsetof(struct throtl_grp, bps[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.write_bps_device",
-		.private = offsetof(struct throtl_grp, bps[WRITE]),
+		.private = offsetof(struct throtl_grp, bps[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_u64,
 		.write = tg_set_conf_u64,
 	},
 	{
 		.name = "throttle.read_iops_device",
-		.private = offsetof(struct throtl_grp, iops[READ]),
+		.private = offsetof(struct throtl_grp, iops[READ][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
 	{
 		.name = "throttle.write_iops_device",
-		.private = offsetof(struct throtl_grp, iops[WRITE]),
+		.private = offsetof(struct throtl_grp, iops[WRITE][LIMIT_MAX]),
 		.seq_show = tg_print_conf_uint,
 		.write = tg_set_conf_uint,
 	},
@@ -1274,18 +1296,22 @@ static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ] == -1 && tg->bps[WRITE] == -1 &&
-	    tg->iops[READ] == -1 && tg->iops[WRITE] == -1)
+	if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
+	    tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
 		return 0;
 
-	if (tg->bps[READ] != -1)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ]);
-	if (tg->bps[WRITE] != -1)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE]);
-	if (tg->iops[READ] != -1)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ]);
-	if (tg->iops[WRITE] != -1)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE]);
+	if (tg->bps[READ][LIMIT_MAX] != -1)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
+			tg->bps[READ][LIMIT_MAX]);
+	if (tg->bps[WRITE][LIMIT_MAX] != -1)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
+			tg->bps[WRITE][LIMIT_MAX]);
+	if (tg->iops[READ][LIMIT_MAX] != -1)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u",
+			tg->iops[READ][LIMIT_MAX]);
+	if (tg->iops[WRITE][LIMIT_MAX] != -1)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u",
+			tg->iops[WRITE][LIMIT_MAX]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
@@ -1314,10 +1340,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ];
-	v[1] = tg->bps[WRITE];
-	v[2] = tg->iops[READ];
-	v[3] = tg->iops[WRITE];
+	v[0] = tg->bps[READ][LIMIT_MAX];
+	v[1] = tg->bps[WRITE][LIMIT_MAX];
+	v[2] = tg->iops[READ][LIMIT_MAX];
+	v[3] = tg->iops[WRITE][LIMIT_MAX];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1354,10 +1380,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ] = v[0];
-	tg->bps[WRITE] = v[1];
-	tg->iops[READ] = v[2];
-	tg->iops[WRITE] = v[3];
+	tg->bps[READ][LIMIT_MAX] = v[0];
+	tg->bps[WRITE][LIMIT_MAX] = v[1];
+	tg->iops[READ][LIMIT_MAX] = v[2];
+	tg->iops[WRITE][LIMIT_MAX] = v[3];
 
 	tg_conf_updated(tg);
 	ret = 0;
@@ -1455,8 +1481,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_disp[rw], bio->bi_iter.bi_size, tg_bps_limit(tg, rw),
+		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
@@ -1567,6 +1593,7 @@ int blk_throtl_init(struct request_queue *q)
 	q->td = td;
 	td->queue = q;
 
+	td->limit_valid[LIMIT_MAX] = true;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.8.0.rc2


* [PATCH 02/10] block-throttle: add .low interface
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Add a low limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 142 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 110 insertions(+), 32 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 162d54c..e69a3db 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,8 +84,9 @@ enum tg_state_flags {
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
 enum {
-	LIMIT_MAX = 0,
-	LIMIT_CNT = 1,
+	LIMIT_LOW = 0,
+	LIMIT_MAX = 1,
+	LIMIT_CNT = 2,
 };
 
 struct throtl_grp {
@@ -358,7 +359,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 
 	RB_CLEAR_NODE(&tg->rb_node);
 	for (rw = READ; rw <= WRITE; rw++) {
-		for (index = 0; index < LIMIT_CNT; index++) {
+		for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
 			tg->bps[rw][index] = -1;
 			tg->iops[rw][index] = -1;
 		}
@@ -420,6 +421,44 @@ static void throtl_pd_online(struct blkg_policy_data *pd)
 	tg_update_has_rules(pd_to_tg(pd));
 }
 
+static void blk_throtl_update_valid_limit(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+	bool low_valid = false;
+
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg->bps[READ][LIMIT_LOW] ||
+		    tg->bps[WRITE][LIMIT_LOW] ||
+		    tg->iops[READ][LIMIT_LOW] ||
+		    tg->iops[WRITE][LIMIT_LOW])
+			low_valid = true;
+	}
+
+	if (low_valid)
+		td->limit_valid[LIMIT_LOW] = true;
+	else
+		td->limit_valid[LIMIT_LOW] = false;
+}
+
+static void throtl_pd_offline(struct blkg_policy_data *pd)
+{
+	struct throtl_grp *tg = pd_to_tg(pd);
+
+	tg->bps[READ][LIMIT_LOW] = 0;
+	tg->bps[WRITE][LIMIT_LOW] = 0;
+	tg->iops[READ][LIMIT_LOW] = 0;
+	tg->iops[WRITE][LIMIT_LOW] = 0;
+
+	blk_throtl_update_valid_limit(tg->td);
+
+	if (tg->td->limit_index == LIMIT_LOW &&
+	    !tg->td->limit_valid[LIMIT_LOW])
+		tg->td->limit_index = LIMIT_MAX;
+}
+
 static void throtl_pd_free(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -1287,45 +1326,50 @@ static struct cftype throtl_legacy_files[] = {
 	{ }	/* terminate */
 };
 
-static u64 tg_prfill_max(struct seq_file *sf, struct blkg_policy_data *pd,
+static u64 tg_prfill_limit(struct seq_file *sf, struct blkg_policy_data *pd,
 			 int off)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
 	const char *dname = blkg_dev_name(pd->blkg);
 	char bufs[4][21] = { "max", "max", "max", "max" };
+	uint64_t target = -1;
 
 	if (!dname)
 		return 0;
-	if (tg->bps[READ][LIMIT_MAX] == -1 && tg->bps[WRITE][LIMIT_MAX] == -1 &&
-	    tg->iops[READ][LIMIT_MAX] == -1 && tg->iops[WRITE][LIMIT_MAX] == -1)
+	if (off == LIMIT_LOW) {
+		int i;
+		for (i = 0; i < 4; i++)
+			strcpy(bufs[i], "0");
+		target = 0;
+	}
+
+	if (tg->bps[READ][off] == target && tg->bps[WRITE][off] == target &&
+	    tg->iops[READ][off] == (unsigned int)target &&
+	    tg->iops[WRITE][off] == (unsigned int)target)
 		return 0;
 
-	if (tg->bps[READ][LIMIT_MAX] != -1)
-		snprintf(bufs[0], sizeof(bufs[0]), "%llu",
-			tg->bps[READ][LIMIT_MAX]);
-	if (tg->bps[WRITE][LIMIT_MAX] != -1)
-		snprintf(bufs[1], sizeof(bufs[1]), "%llu",
-			tg->bps[WRITE][LIMIT_MAX]);
-	if (tg->iops[READ][LIMIT_MAX] != -1)
-		snprintf(bufs[2], sizeof(bufs[2]), "%u",
-			tg->iops[READ][LIMIT_MAX]);
-	if (tg->iops[WRITE][LIMIT_MAX] != -1)
-		snprintf(bufs[3], sizeof(bufs[3]), "%u",
-			tg->iops[WRITE][LIMIT_MAX]);
+	if (tg->bps[READ][off] != target)
+		snprintf(bufs[0], sizeof(bufs[0]), "%llu", tg->bps[READ][off]);
+	if (tg->bps[WRITE][off] != target)
+		snprintf(bufs[1], sizeof(bufs[1]), "%llu", tg->bps[WRITE][off]);
+	if (tg->iops[READ][off] != (unsigned int)target)
+		snprintf(bufs[2], sizeof(bufs[2]), "%u", tg->iops[READ][off]);
+	if (tg->iops[WRITE][off] != (unsigned int)target)
+		snprintf(bufs[3], sizeof(bufs[3]), "%u", tg->iops[WRITE][off]);
 
 	seq_printf(sf, "%s rbps=%s wbps=%s riops=%s wiops=%s\n",
 		   dname, bufs[0], bufs[1], bufs[2], bufs[3]);
 	return 0;
 }
 
-static int tg_print_max(struct seq_file *sf, void *v)
+static int tg_print_limit(struct seq_file *sf, void *v)
 {
-	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_max,
+	blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)), tg_prfill_limit,
 			  &blkcg_policy_throtl, seq_cft(sf)->private, false);
 	return 0;
 }
 
-static ssize_t tg_set_max(struct kernfs_open_file *of,
+static ssize_t tg_set_limit(struct kernfs_open_file *of,
 			  char *buf, size_t nbytes, loff_t off)
 {
 	struct blkcg *blkcg = css_to_blkcg(of_css(of));
@@ -1333,6 +1377,7 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 	struct throtl_grp *tg;
 	u64 v[4];
 	int ret;
+	int index = of_cft(of)->private;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
 	if (ret)
@@ -1340,10 +1385,10 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 	tg = blkg_to_tg(ctx.blkg);
 
-	v[0] = tg->bps[READ][LIMIT_MAX];
-	v[1] = tg->bps[WRITE][LIMIT_MAX];
-	v[2] = tg->iops[READ][LIMIT_MAX];
-	v[3] = tg->iops[WRITE][LIMIT_MAX];
+	v[0] = tg->bps[READ][index];
+	v[1] = tg->bps[WRITE][index];
+	v[2] = tg->iops[READ][index];
+	v[3] = tg->iops[WRITE][index];
 
 	while (true) {
 		char tok[27];	/* wiops=18446744073709551616 */
@@ -1380,11 +1425,33 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 			goto out_finish;
 	}
 
-	tg->bps[READ][LIMIT_MAX] = v[0];
-	tg->bps[WRITE][LIMIT_MAX] = v[1];
-	tg->iops[READ][LIMIT_MAX] = v[2];
-	tg->iops[WRITE][LIMIT_MAX] = v[3];
-
+	if (index == LIMIT_MAX) {
+		if (v[0] < tg->bps[READ][LIMIT_LOW] ||
+		    v[1] < tg->bps[WRITE][LIMIT_LOW] ||
+		    v[2] < tg->iops[READ][LIMIT_LOW] ||
+		    v[3] < tg->iops[WRITE][LIMIT_LOW]) {
+			ret = -EINVAL;
+			goto out_finish;
+		}
+	} else if (index == LIMIT_LOW) {
+		if (v[0] > tg->bps[READ][LIMIT_MAX] ||
+		    v[1] > tg->bps[WRITE][LIMIT_MAX] ||
+		    v[2] > tg->iops[READ][LIMIT_MAX] ||
+		    v[3] > tg->iops[WRITE][LIMIT_MAX]) {
+			ret = -EINVAL;
+			goto out_finish;
+		}
+	}
+	tg->bps[READ][index] = v[0];
+	tg->bps[WRITE][index] = v[1];
+	tg->iops[READ][index] = v[2];
+	tg->iops[WRITE][index] = v[3];
+
+	if (index == LIMIT_LOW) {
+		blk_throtl_update_valid_limit(tg->td);
+		if (tg->td->limit_valid[LIMIT_LOW])
+			tg->td->limit_index = LIMIT_LOW;
+	}
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1394,10 +1461,18 @@ static ssize_t tg_set_max(struct kernfs_open_file *of,
 
 static struct cftype throtl_files[] = {
 	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_LOW,
+	},
+	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.seq_show = tg_print_max,
-		.write = tg_set_max,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_MAX,
 	},
 	{ }	/* terminate */
 };
@@ -1416,6 +1491,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_alloc_fn		= throtl_pd_alloc,
 	.pd_init_fn		= throtl_pd_init,
 	.pd_online_fn		= throtl_pd_online,
+	.pd_offline_fn		= throtl_pd_offline,
 	.pd_free_fn		= throtl_pd_free,
 };
 
@@ -1593,7 +1669,9 @@ int blk_throtl_init(struct request_queue *q)
 	q->td = td;
 	td->queue = q;
 
+	td->limit_valid[LIMIT_LOW] = false;
 	td->limit_valid[LIMIT_MAX] = true;
+	td->limit_index = LIMIT_MAX;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.8.0.rc2


* [PATCH 03/10] block-throttle: configure bps/iops limit for cgroup in low limit
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Each queue will have a state machine. Once low limits are configured, the
queue is in the LIMIT_LOW state, which means all cgroups are throttled
according to their low limit. After all cgroups with a low limit cross
that limit, the queue state is upgraded to the high/max state. This
guarantees that cgroups with a low limit get at least their low-limit
bandwidth/iops before other cgroups can use the disk.
Cgroups without a low limit are assigned a small bps/iops to avoid being
completely stalled.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e69a3db..bdcf1b7 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -213,12 +213,37 @@ static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
 
 static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->bps[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	uint64_t ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return -1;
+	ret = tg->bps[rw][tg->td->limit_index];
+	if (ret == 0 && tg->td->limit_index == LIMIT_LOW) {
+		if (tg->iops[rw][LIMIT_LOW])
+			return -1;
+		/* assign a small default */
+		return 64 * 1024;
+	}
+
+	return ret;
 }
 
 static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 {
-	return tg->iops[rw][tg->td->limit_index];
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
+	unsigned int ret;
+
+	if (cgroup_subsys_on_dfl(io_cgrp_subsys) && !blkg->parent)
+		return -1;
+	ret = tg->iops[rw][tg->td->limit_index];
+	if (ret == 0 && tg->td->limit_index == LIMIT_LOW) {
+		if (tg->bps[rw][LIMIT_LOW])
+			return -1;
+		/* assign a small default */
+		return 16;
+	}
+	return ret;
 }
 
 /**
-- 
2.8.0.rc2


* [PATCH 04/10] block-throttle: add upgrade logic for LIMIT_LOW state
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

When the queue is in the LIMIT_LOW state and all cgroups with a low limit
cross their bps/iops limit, we upgrade the queue's state to
LIMIT_HIGH/LIMIT_MAX.

For a cgroup hierarchy, there are two cases. If the children have a lower
low limit than the parent, the parent's low limit is meaningless: once the
children's bps/iops cross their low limit, we can upgrade the queue state.
The other case is that the children have a higher low limit than the
parent; then the children's low limit is meaningless, and as long as the
parent's bps/iops cross its low limit we can upgrade the queue state.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bdcf1b7..df9cd13e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -468,6 +468,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
 		td->limit_valid[LIMIT_LOW] = false;
 }
 
+static void throtl_upgrade_state(struct throtl_data *td);
 static void throtl_pd_offline(struct blkg_policy_data *pd)
 {
 	struct throtl_grp *tg = pd_to_tg(pd);
@@ -479,9 +480,8 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 
 	blk_throtl_update_valid_limit(tg->td);
 
-	if (tg->td->limit_index == LIMIT_LOW &&
-	    !tg->td->limit_valid[LIMIT_LOW])
-		tg->td->limit_index = LIMIT_MAX;
+	if (!tg->td->limit_valid[tg->td->limit_index])
+		throtl_upgrade_state(tg->td);
 }
 
 static void throtl_pd_free(struct blkg_policy_data *pd)
@@ -1087,6 +1087,8 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg);
 /**
  * throtl_pending_timer_fn - timer function for service_queue->pending_timer
  * @arg: the throtl_service_queue being serviced
@@ -1113,6 +1115,9 @@ static void throtl_pending_timer_fn(unsigned long arg)
 	int ret;
 
 	spin_lock_irq(q->queue_lock);
+	if (throtl_can_upgrade(td, NULL))
+		throtl_upgrade_state(td);
+
 again:
 	parent_sq = sq->parent_sq;
 	dispatched = false;
@@ -1520,6 +1525,77 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static bool throtl_upgrade_check_one(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *sq = &tg->service_queue;
+
+	if (tg->bps[READ][LIMIT_LOW] != 0 && !sq->nr_queued[READ])
+		return false;
+	if (tg->bps[WRITE][LIMIT_LOW] != 0 && !sq->nr_queued[WRITE])
+		return false;
+	if (tg->iops[READ][LIMIT_LOW] != 0 && !sq->nr_queued[READ])
+		return false;
+	if (tg->iops[WRITE][LIMIT_LOW] != 0 && !sq->nr_queued[WRITE])
+		return false;
+	return true;
+}
+
+static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg)
+{
+	if (throtl_upgrade_check_one(tg))
+		return true;
+	while (true) {
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+				!tg_to_blkg(tg)->parent))
+			return false;
+		if (throtl_upgrade_check_one(tg))
+			return true;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+	}
+	return false;
+}
+
+static bool throtl_can_upgrade(struct throtl_data *td,
+	struct throtl_grp *this_tg)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	if (td->limit_index != LIMIT_LOW)
+		return false;
+
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+
+		if (tg == this_tg)
+			continue;
+		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
+			continue;
+		if (!throtl_upgrade_check_hierarchy(tg))
+			return false;
+	}
+	return true;
+}
+
+static void throtl_upgrade_state(struct throtl_data *td)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	td->limit_index = LIMIT_MAX;
+	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+		struct throtl_service_queue *sq = &tg->service_queue;
+
+		tg->disptime = jiffies - 1;
+		throtl_select_dispatch(sq);
+		throtl_schedule_next_dispatch(sq, false);
+	}
+	throtl_select_dispatch(&td->service_queue);
+	throtl_schedule_next_dispatch(&td->service_queue, false);
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1542,14 +1618,20 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 	sq = &tg->service_queue;
 
+again:
 	while (true) {
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
-		if (!tg_may_dispatch(tg, bio, NULL))
+		if (!tg_may_dispatch(tg, bio, NULL)) {
+			if (throtl_can_upgrade(tg->td, tg)) {
+				throtl_upgrade_state(tg->td);
+				goto again;
+			}
 			break;
+		}
 
 		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
-- 
2.8.0.rc2


* [PATCH 05/10] block-throttle: add downgrade logic
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

When the queue state machine is in a higher state, say LIMIT_MAX (the
LIMIT_HIGH state doesn't exist yet at this point of the series), but a
cgroup stays below its low limit for some time, the queue should be
downgraded to a lower state since the cgroup's low limit isn't met.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 160 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index df9cd13e..5806507 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,7 @@ static int throtl_quantum = 32;
 /* Throttling is performed over 100ms slice and after that slice is renewed */
 static unsigned long throtl_slice = HZ/10;	/* 100 ms */
 
+static unsigned long cg_check_time = HZ/10;
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -136,6 +137,13 @@ struct throtl_grp {
 	/* Number of bio's dispatched in current slice */
 	unsigned int io_disp[2];
 
+	unsigned long last_low_overflow_time[2];
+
+	uint64_t last_bytes_disp[2];
+	unsigned int last_io_disp[2];
+
+	unsigned long last_check_time;
+
 	/* When did we start a new slice */
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
@@ -160,6 +168,11 @@ struct throtl_data
 	struct work_struct dispatch_work;
 	unsigned int limit_index;
 	bool limit_valid[LIMIT_CNT];
+
+	unsigned long low_upgrade_time;
+	unsigned long low_downgrade_time;
+	unsigned int low_upgrade_interval;
+	unsigned int low_downgrade_interval;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -906,6 +919,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	/* Charge the bio to the group */
 	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
 	tg->io_disp[rw]++;
+	tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->last_io_disp[rw]++;
 
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1525,6 +1540,38 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };
 
+static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	unsigned long rtime = -1, wtime = -1;
+	if (tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW])
+		rtime = tg->last_low_overflow_time[READ];
+	if (tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
+		wtime = tg->last_low_overflow_time[WRITE];
+	return min(rtime, wtime);
+}
+
+static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *parent_sq;
+	struct throtl_grp *parent = tg;
+	unsigned long ret = __tg_last_low_overflow_time(tg);
+
+	while (true) {
+		parent_sq = parent->service_queue.parent_sq;
+		parent = sq_to_tg(parent_sq);
+		if (!parent)
+			break;
+		if (parent->bps[READ][LIMIT_LOW] > tg->bps[READ][LIMIT_LOW] &&
+		    parent->bps[WRITE][LIMIT_LOW] > tg->bps[WRITE][LIMIT_LOW] &&
+		    parent->iops[READ][LIMIT_LOW] > tg->iops[READ][LIMIT_LOW] &&
+		    parent->iops[WRITE][LIMIT_LOW] > tg->iops[WRITE][LIMIT_LOW])
+			break;
+		if (time_after(__tg_last_low_overflow_time(parent), ret))
+			ret = __tg_last_low_overflow_time(parent);
+	}
+	return ret;
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1564,6 +1611,10 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	if (td->limit_index != LIMIT_LOW)
 		return false;
 
+	if (td->limit_index == LIMIT_LOW && time_before(jiffies,
+	    td->low_downgrade_time + td->low_upgrade_interval))
+		return false;
+
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
 
@@ -1583,6 +1634,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	struct blkcg_gq *blkg;
 
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
 		struct throtl_service_queue *sq = &tg->service_queue;
@@ -1596,6 +1648,104 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_downgrade_state(struct throtl_data *td, int new)
+{
+	td->limit_index = new;
+	td->low_downgrade_time = jiffies;
+}
+
+static bool throtl_downgrade_check_one(struct throtl_grp *tg)
+{
+	struct throtl_data *td = tg->td;
+	unsigned long now = jiffies;
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (time_after(now,
+	     td->low_upgrade_time + td->low_downgrade_interval) &&
+	    time_after(now,
+	     tg_last_low_overflow_time(tg) + td->low_downgrade_interval))
+		return true;
+	return false;
+}
+
+static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg)
+{
+	if (!throtl_downgrade_check_one(tg))
+		return false;
+	while (true) {
+		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
+			    !tg_to_blkg(tg)->parent))
+			break;
+
+		if (!throtl_downgrade_check_one(tg))
+			return false;
+		tg = sq_to_tg(tg->service_queue.parent_sq);
+	}
+	return true;
+}
+
+static void throtl_downgrade_check(struct throtl_grp *tg)
+{
+	uint64_t bps;
+	unsigned int iops;
+	unsigned long elapsed_time;
+	unsigned long now = jiffies;
+
+	if (tg->td->limit_index != LIMIT_MAX)
+		return;
+	if (!(tg->bps[READ][LIMIT_LOW] ||
+	      tg->bps[WRITE][LIMIT_LOW] ||
+	      tg->iops[WRITE][LIMIT_LOW] ||
+	      tg->iops[READ][LIMIT_LOW]))
+		return;
+
+	if (time_after(tg->last_check_time + throtl_slice, now))
+		return;
+	elapsed_time = now - tg->last_check_time;
+	tg->last_check_time = now;
+
+	if (tg->bps[READ][LIMIT_LOW]) {
+		bps = tg->last_bytes_disp[READ] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->bps[WRITE][LIMIT_LOW]) {
+		bps = tg->last_bytes_disp[WRITE] * HZ;
+		do_div(bps, elapsed_time);
+		if (bps >= tg->bps[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	if (tg->iops[READ][LIMIT_LOW]) {
+		iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+		if (iops >= tg->iops[READ][LIMIT_LOW])
+			tg->last_low_overflow_time[READ] = now;
+	}
+
+	if (tg->iops[WRITE][LIMIT_LOW]) {
+		iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+		if (iops >= tg->iops[WRITE][LIMIT_LOW])
+			tg->last_low_overflow_time[WRITE] = now;
+	}
+
+	/*
+	 * If cgroup is below low limit, consider downgrade and throttle other
+	 * cgroups
+	 */
+	if (throtl_downgrade_check_hierarchy(tg))
+		throtl_downgrade_state(tg->td, LIMIT_LOW);
+
+	tg->last_bytes_disp[READ] = 0;
+	tg->last_bytes_disp[WRITE] = 0;
+	tg->last_io_disp[READ] = 0;
+	tg->last_io_disp[WRITE] = 0;
+}
+
 bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
@@ -1620,12 +1770,16 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 
 again:
 	while (true) {
+		if (tg->last_low_overflow_time[rw] == 0)
+			tg->last_low_overflow_time[rw] = jiffies;
+		throtl_downgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
 
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
+			tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(tg->td, tg)) {
 				throtl_upgrade_state(tg->td);
 				goto again;
@@ -1668,6 +1822,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		   tg->io_disp[rw], tg_iops_limit(tg, rw),
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
+	tg->last_low_overflow_time[rw] = jiffies;
+
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, qn, tg);
@@ -1779,6 +1935,10 @@ int blk_throtl_init(struct request_queue *q)
 	td->limit_valid[LIMIT_LOW] = false;
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
+	td->low_upgrade_time = jiffies;
+	td->low_downgrade_time = jiffies;
+	td->low_upgrade_interval = cg_check_time;
+	td->low_downgrade_interval = cg_check_time;
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.8.0.rc2


* [PATCH 06/10] block-throttle: idle detection
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

A cgroup can be assigned a low limit but never dispatch enough IO to cross
that limit. In such a case, the queue state machine would remain in the
LIMIT_LOW state and all other cgroups would be throttled according to
their low limit, which is unfair to them. We will treat such a cgroup as
idle and upgrade the state machine to a higher state.

We also have downgrade logic. If the state machine upgrades because a
cgroup is idle (real idle), the state machine will downgrade soon
afterwards because the cgroup is below its low limit. This isn't what we
want. A more complicated case is that the cgroup isn't idle while the
queue is in LIMIT_LOW, but once the queue gets upgraded to a higher state,
other cgroups dispatch more IO and this cgroup can't dispatch enough IO,
so it falls below its low limit and looks idle (fake idle). In this case,
the queue should downgrade soon. The key to deciding whether to downgrade
is detecting whether the cgroup is truly idle.

Unfortunately I can't find a good way to distinguish the two kinds of
idle. One possible way is the think time check of CFQ: CFQ compares the
request submit time with the last request completion time, and if the
difference (the think time) is positive, the cgroup is idle. This
technique doesn't work for a high queue depth disk. For example, a
workload with IO depth 8 can have 100% disk utilization, hence a think
time of 0, i.e. not idle, yet the same workload can reach higher bandwidth
with IO depth 16. Compared to IO depth 16, the IO depth 8 workload is
idle, but think time can't detect that. Another possible way is detecting
whether the disk has reached its maximum bandwidth (then we could detect
fake idle). But detecting maximum bandwidth is hard since it isn't fixed
for a specific workload; we could only use a feedback system to estimate
it, which isn't suitable for limit-based scheduling.

This patch doesn't try to precisely detect idle, because if we detect idle
wrongly the queue will never downgrade/upgrade, and we would either fail
to guarantee the low limit or sacrifice performance. If a cgroup is below
its low limit, the queue state machine will upgrade/downgrade
continuously, but we make the upgrade/downgrade time interval adaptive. We
maintain a history of disk upgrades. If the queue upgraded because cgroups
hit their low limit, a future downgrade is likely due to fake idle, hence
future upgrades should run slowly and future downgrades should run
quickly. Otherwise a future downgrade is likely due to real idle, hence
future upgrades should run quickly and future downgrades should run
slowly. The adaptive upgrade/downgrade intervals mean that downgrades
during real idle and upgrades during fake idle happen rarely. But we will
still see cgroup throughput jump up and down if some cgroups run below
their low limit.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5806507..a462e2f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -12,6 +12,7 @@
 #include <linux/blk-cgroup.h>
 #include "blk.h"
 
+#define DEFAULT_HISTORY (0xAA) /* 0/1 bits are equal */
 /* Max dispatch from a group in 1 round */
 static int throtl_grp_quantum = 8;
 
@@ -171,6 +172,7 @@ struct throtl_data
 
 	unsigned long low_upgrade_time;
 	unsigned long low_downgrade_time;
+	unsigned char low_history;
 	unsigned int low_upgrade_interval;
 	unsigned int low_downgrade_interval;
 };
@@ -1572,10 +1574,40 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
-static bool throtl_upgrade_check_one(struct throtl_grp *tg)
+static void throtl_calculate_low_interval(struct throtl_data *td)
+{
+	unsigned long history = td->low_history;
+	unsigned int ubits = bitmap_weight(&history,
+		sizeof(td->low_history) * 8);
+	unsigned int dbits = sizeof(td->low_history) * 8 - ubits;
+
+	ubits = max(1U, ubits);
+	dbits = max(1U, dbits);
+
+	if (ubits >= dbits) {
+		td->low_upgrade_interval = ubits / dbits * cg_check_time;
+		td->low_downgrade_interval = cg_check_time;
+	} else {
+		td->low_upgrade_interval = cg_check_time;
+		td->low_downgrade_interval = dbits / ubits * cg_check_time;
+	}
+}
+
+static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 
+	if (!tg->bps[READ][LIMIT_LOW] && !tg->bps[WRITE][LIMIT_LOW] &&
+	    !tg->iops[READ][LIMIT_LOW] && !tg->iops[WRITE][LIMIT_LOW])
+		return true;
+
+	/* if cgroup is below low limit for a long time, consider it idle */
+	if (time_after(jiffies,
+	    tg_last_low_overflow_time(tg) + tg->td->low_upgrade_interval)) {
+		*idle = true;
+		return true;
+	}
+
 	if (tg->bps[READ][LIMIT_LOW] != 0 && !sq->nr_queued[READ])
 		return false;
 	if (tg->bps[WRITE][LIMIT_LOW] != 0 && !sq->nr_queued[WRITE])
@@ -1587,15 +1619,15 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg)
 	return true;
 }
 
-static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg)
+static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg, bool *idle)
 {
-	if (throtl_upgrade_check_one(tg))
+	if (throtl_upgrade_check_one(tg, idle))
 		return true;
 	while (true) {
 		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 				!tg_to_blkg(tg)->parent))
 			return false;
-		if (throtl_upgrade_check_one(tg))
+		if (throtl_upgrade_check_one(tg, idle))
 			return true;
 		tg = sq_to_tg(tg->service_queue.parent_sq);
 	}
@@ -1607,6 +1639,7 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 {
 	struct cgroup_subsys_state *pos_css;
 	struct blkcg_gq *blkg;
+	bool idle = false;
 
 	if (td->limit_index != LIMIT_LOW)
 		return false;
@@ -1622,9 +1655,15 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 			continue;
 		if (!list_empty(&tg_to_blkg(tg)->blkcg->css.children))
 			continue;
-		if (!throtl_upgrade_check_hierarchy(tg))
+		if (!throtl_upgrade_check_hierarchy(tg, &idle))
 			return false;
 	}
+	if (td->limit_index == LIMIT_LOW) {
+		td->low_history <<= 1;
+		if (!idle)
+			td->low_history |= 1;
+		throtl_calculate_low_interval(td);
+	}
 	return true;
 }
 
@@ -1648,6 +1687,21 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
+static void throtl_upgrade_check(struct throtl_grp *tg)
+{
+	if (tg->td->limit_index != LIMIT_LOW)
+		return;
+
+	if (!(tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
+	      tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) ||
+	    !time_after(jiffies,
+	     tg_last_low_overflow_time(tg) + tg->td->low_upgrade_interval))
+		return;
+
+	if (throtl_can_upgrade(tg->td, NULL))
+		throtl_upgrade_state(tg->td);
+}
+
 static void throtl_downgrade_state(struct throtl_data *td, int new)
 {
 	td->limit_index = new;
@@ -1773,6 +1827,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
+		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
@@ -1937,8 +1992,8 @@ int blk_throtl_init(struct request_queue *q)
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
 	td->low_downgrade_time = jiffies;
-	td->low_upgrade_interval = cg_check_time;
-	td->low_downgrade_interval = cg_check_time;
+	td->low_history = DEFAULT_HISTORY;
+	throtl_calculate_low_interval(td);
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.8.0.rc2


* [PATCH 07/10] block-throttle: add .high interface
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Add a high limit for cgroups and the corresponding cgroup interface.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 69 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a462e2f..5736d1b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -87,8 +87,9 @@ enum tg_state_flags {
 
 enum {
 	LIMIT_LOW = 0,
-	LIMIT_MAX = 1,
-	LIMIT_CNT = 2,
+	LIMIT_HIGH = 1,
+	LIMIT_MAX = 2,
+	LIMIT_CNT = 3,
 };
 
 struct throtl_grp {
@@ -240,6 +241,8 @@ static uint64_t tg_bps_limit(struct throtl_grp *tg, int rw)
 		/* assign a small default */
 		return 64 * 1024;
 	}
+	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+		return tg->bps[rw][LIMIT_MAX];
 
 	return ret;
 }
@@ -258,6 +261,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 		/* assign a small default */
 		return 16;
 	}
+	if (ret == -1 && tg->td->limit_index == LIMIT_HIGH)
+		return tg->iops[rw][LIMIT_MAX];
 	return ret;
 }
 
@@ -399,7 +404,7 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
 
 	RB_CLEAR_NODE(&tg->rb_node);
 	for (rw = READ; rw <= WRITE; rw++) {
-		for (index = LIMIT_MAX; index < LIMIT_CNT; index++) {
+		for (index = LIMIT_HIGH; index < LIMIT_CNT; index++) {
 			tg->bps[rw][index] = -1;
 			tg->iops[rw][index] = -1;
 		}
@@ -466,6 +471,7 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
 	struct cgroup_subsys_state *pos_css;
 	struct blkcg_gq *blkg;
 	bool low_valid = false;
+	bool high_valid = false;
 
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -475,12 +481,21 @@ static void blk_throtl_update_valid_limit(struct throtl_data *td)
 		    tg->iops[READ][LIMIT_LOW] ||
 		    tg->iops[WRITE][LIMIT_LOW])
 			low_valid = true;
+		if (tg->bps[READ][LIMIT_HIGH] != -1 ||
+		    tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+		    tg->iops[READ][LIMIT_HIGH] != -1 ||
+		    tg->iops[WRITE][LIMIT_HIGH] != -1)
+			high_valid = true;
 	}
 
 	if (low_valid)
 		td->limit_valid[LIMIT_LOW] = true;
 	else
 		td->limit_valid[LIMIT_LOW] = false;
+	if (high_valid)
+		td->limit_valid[LIMIT_HIGH] = true;
+	else
+		td->limit_valid[LIMIT_HIGH] = false;
 }
 
 static void throtl_upgrade_state(struct throtl_data *td);
@@ -492,6 +507,10 @@ static void throtl_pd_offline(struct blkg_policy_data *pd)
 	tg->bps[WRITE][LIMIT_LOW] = 0;
 	tg->iops[READ][LIMIT_LOW] = 0;
 	tg->iops[WRITE][LIMIT_LOW] = 0;
+	tg->bps[READ][LIMIT_HIGH] = -1;
+	tg->bps[WRITE][LIMIT_HIGH] = -1;
+	tg->iops[READ][LIMIT_HIGH] = -1;
+	tg->iops[WRITE][LIMIT_HIGH] = -1;
 
 	blk_throtl_update_valid_limit(tg->td);
 
@@ -1476,7 +1495,15 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		if (v[0] < tg->bps[READ][LIMIT_LOW] ||
 		    v[1] < tg->bps[WRITE][LIMIT_LOW] ||
 		    v[2] < tg->iops[READ][LIMIT_LOW] ||
-		    v[3] < tg->iops[WRITE][LIMIT_LOW]) {
+		    v[3] < tg->iops[WRITE][LIMIT_LOW] ||
+		    (tg->bps[READ][LIMIT_HIGH] != -1 &&
+		     v[0] < tg->bps[READ][LIMIT_HIGH]) ||
+		    (tg->bps[WRITE][LIMIT_HIGH] != -1 &&
+		     v[1] < tg->bps[WRITE][LIMIT_HIGH]) ||
+		    (tg->iops[READ][LIMIT_HIGH] != -1 &&
+		     v[2] < tg->iops[READ][LIMIT_HIGH]) ||
+		    (tg->iops[WRITE][LIMIT_HIGH] != -1 &&
+		     v[3] < tg->iops[WRITE][LIMIT_HIGH])) {
 			ret = -EINVAL;
 			goto out_finish;
 		}
@@ -1484,7 +1511,27 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		if (v[0] > tg->bps[READ][LIMIT_MAX] ||
 		    v[1] > tg->bps[WRITE][LIMIT_MAX] ||
 		    v[2] > tg->iops[READ][LIMIT_MAX] ||
-		    v[3] > tg->iops[WRITE][LIMIT_MAX]) {
+		    v[3] > tg->iops[WRITE][LIMIT_MAX] ||
+		    (tg->bps[READ][LIMIT_HIGH] != -1 &&
+		     v[0] > tg->bps[READ][LIMIT_HIGH]) ||
+		    (tg->bps[WRITE][LIMIT_HIGH] != -1 &&
+		     v[1] > tg->bps[WRITE][LIMIT_HIGH]) ||
+		    (tg->iops[READ][LIMIT_HIGH] != -1 &&
+		     v[2] > tg->iops[READ][LIMIT_HIGH]) ||
+		    (tg->iops[WRITE][LIMIT_HIGH] != -1 &&
+		     v[3] > tg->iops[WRITE][LIMIT_HIGH])) {
+			ret = -EINVAL;
+			goto out_finish;
+		}
+	} else if (index == LIMIT_HIGH) {
+		if ((v[0] != -1 && (v[0] < tg->bps[READ][LIMIT_LOW] ||
+				    v[0] > tg->bps[READ][LIMIT_MAX])) ||
+		    (v[1] != -1 && (v[1] < tg->bps[WRITE][LIMIT_LOW] ||
+				    v[1] > tg->bps[WRITE][LIMIT_MAX])) ||
+		    (v[2] != -1 && (v[2] < tg->iops[READ][LIMIT_LOW] ||
+				    v[2] > tg->iops[READ][LIMIT_MAX])) ||
+		    (v[3] != -1 && (v[3] < tg->iops[WRITE][LIMIT_LOW] ||
+				    v[3] > tg->iops[WRITE][LIMIT_MAX]))) {
 			ret = -EINVAL;
 			goto out_finish;
 		}
@@ -1499,6 +1546,15 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 		if (tg->td->limit_valid[LIMIT_LOW])
 			tg->td->limit_index = LIMIT_LOW;
 	}
+	if (index == LIMIT_HIGH) {
+		blk_throtl_update_valid_limit(tg->td);
+		if (tg->td->limit_valid[LIMIT_HIGH] &&
+		    tg->td->limit_index == LIMIT_MAX)
+			tg->td->limit_index = LIMIT_HIGH;
+	}
+	if (index == LIMIT_MAX && tg->td->limit_index == LIMIT_MAX &&
+	    tg->td->limit_valid[LIMIT_HIGH])
+		tg->td->limit_index = LIMIT_HIGH;
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1515,6 +1571,13 @@ static struct cftype throtl_files[] = {
 		.private = LIMIT_LOW,
 	},
 	{
+		.name = "high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = tg_print_limit,
+		.write = tg_set_limit,
+		.private = LIMIT_HIGH,
+	},
+	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = tg_print_limit,
@@ -1988,6 +2051,7 @@ int blk_throtl_init(struct request_queue *q)
 	td->queue = q;
 
 	td->limit_valid[LIMIT_LOW] = false;
+	td->limit_valid[LIMIT_HIGH] = false;
 	td->limit_valid[LIMIT_MAX] = true;
 	td->limit_index = LIMIT_MAX;
 	td->low_upgrade_time = jiffies;
-- 
2.8.0.rc2


* [PATCH 08/10] block-throttle: handle high limit
From: Shaohua Li @ 2016-05-11  0:16 UTC
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Handle the high limit like we handle the low limit, including the
downgrade/upgrade/idle detection logic. If a cgroup has a high limit, its
throttling limit is the high limit; otherwise the throttling limit is the
max limit.

The queue downgrades from LIMIT_HIGH/LIMIT_MAX to LIMIT_LOW if a cgroup is
below its low limit. The queue upgrades from LIMIT_HIGH to LIMIT_MAX if
all cgroups reach their high limit (or their max limit if no high limit is
set). The queue downgrades from LIMIT_MAX to LIMIT_HIGH if a cgroup is
below its high limit.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 278 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 230 insertions(+), 48 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5736d1b..0aed049 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -140,6 +140,7 @@ struct throtl_grp {
 	unsigned int io_disp[2];
 
 	unsigned long last_low_overflow_time[2];
+	unsigned long last_high_overflow_time[2];
 
 	uint64_t last_bytes_disp[2];
 	unsigned int last_io_disp[2];
@@ -176,6 +177,12 @@ struct throtl_data
 	unsigned char low_history;
 	unsigned int low_upgrade_interval;
 	unsigned int low_downgrade_interval;
+
+	unsigned long high_upgrade_time;
+	unsigned long high_downgrade_time;
+	unsigned char high_history;
+	unsigned int high_upgrade_interval;
+	unsigned int high_downgrade_interval;
 };
 
 static void throtl_pending_timer_fn(unsigned long arg);
@@ -1637,6 +1644,52 @@ static unsigned long tg_last_low_overflow_time(struct throtl_grp *tg)
 	return ret;
 }
 
+static unsigned long __tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+	unsigned long rtime = -1, wtime = -1;
+	if (tg->bps[READ][LIMIT_HIGH] != -1 || tg->iops[READ][LIMIT_HIGH] != -1 ||
+	    tg->bps[READ][LIMIT_MAX] != -1 || tg->iops[READ][LIMIT_MAX] != -1)
+		rtime = tg->last_high_overflow_time[READ];
+	if (tg->bps[WRITE][LIMIT_HIGH] != -1 || tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+	    tg->bps[WRITE][LIMIT_MAX] != -1 || tg->iops[WRITE][LIMIT_MAX] != -1)
+		wtime = tg->last_high_overflow_time[WRITE];
+	return min(rtime, wtime);
+}
+
+static unsigned long tg_last_high_overflow_time(struct throtl_grp *tg)
+{
+	struct throtl_service_queue *parent_sq;
+	struct throtl_grp *parent = tg;
+	unsigned long ret = __tg_last_high_overflow_time(tg);
+
+	while (true) {
+		parent_sq = parent->service_queue.parent_sq;
+		parent = sq_to_tg(parent_sq);
+		if (!parent)
+			break;
+		if (((parent->bps[READ][LIMIT_HIGH] != -1 &&
+		      parent->bps[READ][LIMIT_HIGH] > tg->bps[READ][LIMIT_HIGH]) ||
+		     (parent->bps[READ][LIMIT_HIGH] == -1 &&
+		      parent->bps[READ][LIMIT_MAX] > tg->bps[READ][LIMIT_HIGH])) &&
+		    ((parent->bps[WRITE][LIMIT_HIGH] != -1 &&
+		      parent->bps[WRITE][LIMIT_HIGH] > tg->bps[WRITE][LIMIT_HIGH]) ||
+		     (parent->bps[WRITE][LIMIT_HIGH] == -1 &&
+		      parent->bps[WRITE][LIMIT_MAX] > tg->bps[WRITE][LIMIT_HIGH])) &&
+		    ((parent->iops[READ][LIMIT_HIGH] != -1 &&
+		      parent->iops[READ][LIMIT_HIGH] > tg->iops[READ][LIMIT_HIGH]) ||
+		     (parent->iops[READ][LIMIT_HIGH] == -1 &&
+		      parent->iops[READ][LIMIT_MAX] > tg->iops[READ][LIMIT_HIGH])) &&
+		    ((parent->iops[WRITE][LIMIT_HIGH] != -1 &&
+		      parent->iops[WRITE][LIMIT_HIGH] > tg->iops[WRITE][LIMIT_HIGH]) ||
+		     (parent->iops[WRITE][LIMIT_HIGH] == -1 &&
+		      parent->iops[WRITE][LIMIT_MAX] > tg->iops[WRITE][LIMIT_HIGH])))
+			break;
+		if (time_after(__tg_last_high_overflow_time(parent), ret))
+			ret = __tg_last_high_overflow_time(parent);
+	}
+	return ret;
+}
+
 static void throtl_calculate_low_interval(struct throtl_data *td)
 {
 	unsigned long history = td->low_history;
@@ -1656,10 +1709,32 @@ static void throtl_calculate_low_interval(struct throtl_data *td)
 	}
 }
 
+static void throtl_calculate_high_interval(struct throtl_data *td)
+{
+	unsigned long history = td->high_history;
+	unsigned int ubits = bitmap_weight(&history,
+		sizeof(td->high_history) * 8);
+	unsigned int dbits = sizeof(td->high_history) * 8 - ubits;
+
+	ubits = max(1U, ubits);
+	dbits = max(1U, dbits);
+
+	if (ubits >= dbits) {
+		td->high_upgrade_interval = ubits / dbits * cg_check_time;
+		td->high_downgrade_interval = cg_check_time;
+	} else {
+		td->high_upgrade_interval = cg_check_time;
+		td->high_downgrade_interval = dbits / ubits * cg_check_time;
+	}
+}
+
 static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 
+	if (tg->td->limit_index == LIMIT_HIGH)
+		goto check_high;
+
 	if (!tg->bps[READ][LIMIT_LOW] && !tg->bps[WRITE][LIMIT_LOW] &&
 	    !tg->iops[READ][LIMIT_LOW] && !tg->iops[WRITE][LIMIT_LOW])
 		return true;
@@ -1680,6 +1755,18 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
 	if (tg->iops[WRITE][LIMIT_LOW] != 0 && !sq->nr_queued[WRITE])
 		return false;
 	return true;
+check_high:
+	/* if cgroup is below high limit for a long time, consider it idle */
+	if (time_after(jiffies,
+	    tg_last_high_overflow_time(tg) + tg->td->high_upgrade_interval)) {
+		*idle = true;
+		return true;
+	}
+
+	/* if cgroup reaches high/max limit, it's ok to next limit */
+	if (sq->nr_queued[READ] || sq->nr_queued[WRITE])
+		return true;
+	return false;
 }
 
 static bool throtl_upgrade_check_hierarchy(struct throtl_grp *tg, bool *idle)
@@ -1704,11 +1791,15 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 	struct blkcg_gq *blkg;
 	bool idle = false;
 
-	if (td->limit_index != LIMIT_LOW)
+	if (td->limit_index != LIMIT_LOW && td->limit_index != LIMIT_HIGH)
 		return false;
 
-	if (td->limit_index == LIMIT_LOW && time_before(jiffies,
-	    td->low_downgrade_time + td->low_upgrade_interval))
+	if ((td->limit_index == LIMIT_LOW &&
+	     time_before(jiffies,
+	      td->low_downgrade_time + td->low_upgrade_interval)) ||
+	    (td->limit_index == LIMIT_HIGH &&
+	     time_before(jiffies,
+	      td->high_downgrade_time + td->high_upgrade_interval)))
 		return false;
 
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
@@ -1726,6 +1817,11 @@ static bool throtl_can_upgrade(struct throtl_data *td,
 		if (!idle)
 			td->low_history |= 1;
 		throtl_calculate_low_interval(td);
+	} else {
+		td->high_history <<= 1;
+		if (!idle)
+			td->high_history |= 1;
+		throtl_calculate_high_interval(td);
 	}
 	return true;
 }
@@ -1734,9 +1830,21 @@ static void throtl_upgrade_state(struct throtl_data *td)
 {
 	struct cgroup_subsys_state *pos_css;
 	struct blkcg_gq *blkg;
+	int old = td->limit_index;
 
-	td->limit_index = LIMIT_MAX;
+	td->limit_index++;
+	while (!td->limit_valid[td->limit_index])
+		td->limit_index++;
 	td->low_upgrade_time = jiffies;
+	if (td->limit_index == LIMIT_HIGH)
+		td->high_downgrade_time = jiffies;
+	if (td->limit_index >= LIMIT_HIGH)
+		td->high_upgrade_time = jiffies;
+	/* high to max */
+	if (td->limit_index == LIMIT_MAX && old == LIMIT_HIGH) {
+		td->low_history = DEFAULT_HISTORY;
+		throtl_calculate_low_interval(td);
+	}
 	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
 		struct throtl_grp *tg = blkg_to_tg(blkg);
 		struct throtl_service_queue *sq = &tg->service_queue;
@@ -1752,13 +1860,18 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
 static void throtl_upgrade_check(struct throtl_grp *tg)
 {
-	if (tg->td->limit_index != LIMIT_LOW)
+	if (tg->td->limit_index != LIMIT_LOW &&
+	    tg->td->limit_index != LIMIT_HIGH)
 		return;
 
-	if (!(tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
-	      tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) ||
-	    !time_after(jiffies,
-	     tg_last_low_overflow_time(tg) + tg->td->low_upgrade_interval))
+	if ((tg->td->limit_index == LIMIT_LOW &&
+	     (!(tg->bps[READ][LIMIT_LOW] || tg->bps[WRITE][LIMIT_LOW] ||
+	        tg->iops[READ][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW]) ||
+	      !time_after(jiffies,
+	       tg_last_low_overflow_time(tg) + tg->td->low_upgrade_interval))) ||
+	    (tg->td->limit_index == LIMIT_HIGH &&
+	     !time_after(jiffies,
+	      tg_last_high_overflow_time(tg) + tg->td->high_upgrade_interval)))
 		return;
 
 	if (throtl_can_upgrade(tg->td, NULL))
@@ -1767,11 +1880,32 @@ static void throtl_upgrade_check(struct throtl_grp *tg)
 
 static void throtl_downgrade_state(struct throtl_data *td, int new)
 {
+	int old = td->limit_index;
+
 	td->limit_index = new;
+	/* max crosses high to low */
+	if (new == LIMIT_LOW && old == LIMIT_MAX && td->limit_valid[LIMIT_HIGH]) {
+		td->low_downgrade_time = jiffies;
+		td->low_upgrade_time = jiffies;
+		td->low_history = 0xFF; /* do less upgrade later */
+		throtl_calculate_low_interval(td);
+
+		td->high_downgrade_time = jiffies;
+		td->high_upgrade_time = jiffies;
+		td->high_history = 0xFF; /* do less upgrade later */
+		throtl_calculate_high_interval(td);
+		return;
+	}
+	/* max to high */
+	if (new == LIMIT_HIGH) {
+		td->high_downgrade_time = jiffies;
+		return;
+	}
+
 	td->low_downgrade_time = jiffies;
 }
 
-static bool throtl_downgrade_check_one(struct throtl_grp *tg)
+static bool throtl_downgrade_check_one(struct throtl_grp *tg, bool check_low)
 {
 	struct throtl_data *td = tg->td;
 	unsigned long now = jiffies;
@@ -1780,24 +1914,30 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg)
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
-	if (time_after(now,
-	     td->low_upgrade_time + td->low_downgrade_interval) &&
-	    time_after(now,
-	     tg_last_low_overflow_time(tg) + td->low_downgrade_interval))
+	if ((check_low &&
+	     time_after(now,
+	      td->low_upgrade_time + td->low_downgrade_interval) &&
+	     time_after(now,
+	      tg_last_low_overflow_time(tg) + td->low_downgrade_interval)) ||
+	    (!check_low &&
+	     time_after(now,
+	      td->high_upgrade_time + td->high_downgrade_interval) &&
+	     time_after(now,
+	      tg_last_high_overflow_time(tg) + td->high_downgrade_interval)))
 		return true;
 	return false;
 }
 
-static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg)
+static bool throtl_downgrade_check_hierarchy(struct throtl_grp *tg, bool check_low)
 {
-	if (!throtl_downgrade_check_one(tg))
+	if (!throtl_downgrade_check_one(tg, check_low))
 		return false;
 	while (true) {
 		if (!tg || (cgroup_subsys_on_dfl(io_cgrp_subsys) &&
 			    !tg_to_blkg(tg)->parent))
 			break;
 
-		if (!throtl_downgrade_check_one(tg))
+		if (!throtl_downgrade_check_one(tg, check_low))
 			return false;
 		tg = sq_to_tg(tg->service_queue.parent_sq);
 	}
@@ -1810,52 +1950,84 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 	unsigned int iops;
 	unsigned long elapsed_time;
 	unsigned long now = jiffies;
+	bool check_low;
+	bool check_high;
 
-	if (tg->td->limit_index != LIMIT_MAX)
+	if (tg->td->limit_index == LIMIT_LOW)
 		return;
-	if (!(tg->bps[READ][LIMIT_LOW] ||
-	      tg->bps[WRITE][LIMIT_LOW] ||
-	      tg->iops[WRITE][LIMIT_LOW] ||
-	      tg->iops[READ][LIMIT_LOW]))
+	if (!tg->td->limit_valid[LIMIT_LOW] && !tg->td->limit_valid[LIMIT_HIGH])
 		return;
-
 	if (time_after(tg->last_check_time + throtl_slice, now))
 		return;
+	check_low = tg->bps[READ][LIMIT_LOW] ||
+		    tg->bps[WRITE][LIMIT_LOW] ||
+		    tg->iops[READ][LIMIT_LOW] ||
+		    tg->iops[WRITE][LIMIT_LOW];
+	check_high = tg->bps[READ][LIMIT_HIGH] != -1 ||
+		     tg->bps[WRITE][LIMIT_HIGH] != -1 ||
+		     tg->iops[READ][LIMIT_HIGH] != -1 ||
+		     tg->iops[WRITE][LIMIT_HIGH] != -1 ||
+		     (tg->td->limit_valid[LIMIT_HIGH] &&
+		      (tg->bps[READ][LIMIT_MAX] != -1 ||
+		       tg->bps[WRITE][LIMIT_MAX] != -1 ||
+		       tg->iops[READ][LIMIT_MAX] != -1 ||
+		       tg->iops[WRITE][LIMIT_MAX] != -1) &&
+		      time_before(now, tg_last_high_overflow_time(tg) +
+		       tg->td->high_downgrade_interval));
+
 	elapsed_time = now - tg->last_check_time;
 	tg->last_check_time = now;
 
-	if (tg->bps[READ][LIMIT_LOW]) {
-		bps = tg->last_bytes_disp[READ] * HZ;
-		do_div(bps, elapsed_time);
-		if (bps >= tg->bps[READ][LIMIT_LOW])
-			tg->last_low_overflow_time[READ] = now;
-	}
-
-	if (tg->bps[WRITE][LIMIT_LOW]) {
-		bps = tg->last_bytes_disp[WRITE] * HZ;
-		do_div(bps, elapsed_time);
-		if (bps >= tg->bps[WRITE][LIMIT_LOW])
-			tg->last_low_overflow_time[WRITE] = now;
-	}
-
-	if (tg->iops[READ][LIMIT_LOW]) {
-		iops = tg->last_io_disp[READ] * HZ / elapsed_time;
-		if (iops >= tg->iops[READ][LIMIT_LOW])
-			tg->last_low_overflow_time[READ] = now;
-	}
+	if (!check_low && !check_high)
+		return;
 
-	if (tg->iops[WRITE][LIMIT_LOW]) {
-		iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
-		if (iops >= tg->iops[WRITE][LIMIT_LOW])
-			tg->last_low_overflow_time[WRITE] = now;
-	}
+	bps = tg->last_bytes_disp[READ] * HZ;
+	do_div(bps, elapsed_time);
+	if (tg->bps[READ][LIMIT_LOW] != 0 &&
+	    bps >= tg->bps[READ][LIMIT_LOW])
+		tg->last_low_overflow_time[READ] = now;
+	if ((tg->bps[READ][LIMIT_HIGH] != -1 &&
+	     bps >= tg->bps[READ][LIMIT_HIGH]) ||
+	    bps >= tg->bps[READ][LIMIT_MAX])
+		tg->last_high_overflow_time[READ] = now;
+
+	bps = tg->last_bytes_disp[WRITE] * HZ;
+	do_div(bps, elapsed_time);
+	if (tg->bps[WRITE][LIMIT_LOW] != 0 &&
+	    bps >= tg->bps[WRITE][LIMIT_LOW])
+		tg->last_low_overflow_time[WRITE] = now;
+	if ((tg->bps[WRITE][LIMIT_HIGH] != -1 &&
+	     bps >= tg->bps[WRITE][LIMIT_HIGH]) ||
+	    bps >= tg->bps[WRITE][LIMIT_MAX])
+		tg->last_high_overflow_time[WRITE] = now;
+
+	iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+	if (tg->iops[READ][LIMIT_LOW] != 0 &&
+	    iops >= tg->iops[READ][LIMIT_LOW])
+		tg->last_low_overflow_time[READ] = now;
+	if ((tg->iops[READ][LIMIT_HIGH] != -1 &&
+	     iops >= tg->iops[READ][LIMIT_HIGH]) ||
+	    iops >= tg->iops[READ][LIMIT_MAX])
+		tg->last_high_overflow_time[READ] = now;
+
+	iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+	if (tg->iops[WRITE][LIMIT_LOW] != 0 &&
+	    iops >= tg->iops[WRITE][LIMIT_LOW])
+		tg->last_low_overflow_time[WRITE] = now;
+	if ((tg->iops[WRITE][LIMIT_HIGH] != -1 &&
+	     iops >= tg->iops[WRITE][LIMIT_HIGH]) ||
+	    iops >= tg->iops[WRITE][LIMIT_MAX])
+		tg->last_high_overflow_time[WRITE] = now;
 
 	/*
 	 * If cgroup is below low limit, consider downgrade and throttle other
 	 * cgroups
 	 */
-	if (throtl_downgrade_check_hierarchy(tg))
+	if (check_low && throtl_downgrade_check_hierarchy(tg, true))
 		throtl_downgrade_state(tg->td, LIMIT_LOW);
+	else if (tg->td->limit_index == LIMIT_MAX && check_high &&
+		   throtl_downgrade_check_hierarchy(tg, false))
+		throtl_downgrade_state(tg->td, LIMIT_HIGH);
 
 	tg->last_bytes_disp[READ] = 0;
 	tg->last_bytes_disp[WRITE] = 0;
@@ -1889,6 +2061,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	while (true) {
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
+		if (tg->last_high_overflow_time[rw] == 0)
+			tg->last_high_overflow_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
 		/* throtl is FIFO - if bios are already queued, should queue */
@@ -1898,6 +2072,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
 			tg->last_low_overflow_time[rw] = jiffies;
+			if (tg->td->limit_index >= LIMIT_HIGH)
+				tg->last_high_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(tg->td, tg)) {
 				throtl_upgrade_state(tg->td);
 				goto again;
@@ -1941,6 +2117,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	tg->last_low_overflow_time[rw] = jiffies;
+	if (tg->td->limit_index >= LIMIT_HIGH)
+		tg->last_high_overflow_time[rw] = jiffies;
 
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
@@ -2058,6 +2236,10 @@ int blk_throtl_init(struct request_queue *q)
 	td->low_downgrade_time = jiffies;
 	td->low_history = DEFAULT_HISTORY;
 	throtl_calculate_low_interval(td);
+	td->high_upgrade_time = jiffies;
+	td->high_downgrade_time = jiffies;
+	td->high_history = DEFAULT_HISTORY;
+	throtl_calculate_high_interval(td);
 	/* activate policy */
 	ret = blkcg_activate_policy(q, &blkcg_policy_throtl);
 	if (ret)
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 09/10] blk-throttle: make sure expire time isn't too big
  2016-05-11  0:16 [PATCH 00/10]block-throttle: add low/high limit Shaohua Li
                   ` (7 preceding siblings ...)
  2016-05-11  0:16 ` [PATCH 08/10] block-throttle: handle high limit Shaohua Li
@ 2016-05-11  0:16 ` Shaohua Li
  2016-05-11  0:16 ` [PATCH 10/10] blk-throttle: add trace log Shaohua Li
  2016-05-13 19:12 ` [PATCH 00/10]block-throttle: add low/high limit Vivek Goyal
  10 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2016-05-11  0:16 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

A cgroup could be throttled at one limit, but when other cgroups go
idle the queue enters a higher state and the cgroup should then be
throttled at a higher limit. It's possible the cgroup is sleeping
because of throttling while the other cgroups no longer dispatch IO. In
this case, nobody can trigger the current downgrade/upgrade logic. To
fix this, we could either set up a timer to wake up the cgroup when
other cgroups are idle, or make sure this cgroup doesn't sleep too
long. Setting up a timer means we must rearm the timer very frequently,
so this patch chooses the latter. Capping the cgroup's sleep time
doesn't change its bps/iops, but can make it wake up more frequently,
which isn't a big issue because cg_check_time * 8 is already quite
long.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0aed049..a5f3435 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -632,6 +632,9 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 					  unsigned long expires)
 {
+	unsigned long max_expire = jiffies + 8 * cg_check_time;
+	if (time_after(expires, max_expire))
+		expires = max_expire;
 	mod_timer(&sq->pending_timer, expires);
 	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
 		   expires - jiffies, jiffies);
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/10] blk-throttle: add trace log
  2016-05-11  0:16 [PATCH 00/10]block-throttle: add low/high limit Shaohua Li
                   ` (8 preceding siblings ...)
  2016-05-11  0:16 ` [PATCH 09/10] blk-throttle: make sure expire time isn't too big Shaohua Li
@ 2016-05-11  0:16 ` Shaohua Li
  2016-05-13 19:12 ` [PATCH 00/10]block-throttle: add low/high limit Vivek Goyal
  10 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2016-05-11  0:16 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: tj, vgoyal, axboe, jmoyer, Kernel-team

Add trace logs to the new low/high logic to help diagnose issues.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a5f3435..d35bbf1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2,6 +2,7 @@
  * Interface for controlling IO bandwidth on a request queue
  *
  * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
+ * Shaohua Li <shli@fb.com> adds low/high limit
  */
 
 #include <linux/module.h>
@@ -92,6 +93,12 @@ enum {
 	LIMIT_CNT = 3,
 };
 
+static char *limit_name[LIMIT_CNT] = {
+	[LIMIT_LOW] = "low",
+	[LIMIT_HIGH] = "high",
+	[LIMIT_MAX] = "max",
+};
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -1565,6 +1572,8 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
 	if (index == LIMIT_MAX && tg->td->limit_index == LIMIT_MAX &&
 	    tg->td->limit_valid[LIMIT_HIGH])
 		tg->td->limit_index = LIMIT_HIGH;
+	throtl_log(&tg->td->service_queue, "switch state to %s",
+		limit_name[tg->td->limit_index]);
 	tg_conf_updated(tg);
 	ret = 0;
 out_finish:
@@ -1710,6 +1719,9 @@ static void throtl_calculate_low_interval(struct throtl_data *td)
 		td->low_upgrade_interval = cg_check_time;
 		td->low_downgrade_interval = dbits / ubits * cg_check_time;
 	}
+	throtl_log(&td->service_queue,
+		"low upgrade interval=%u, downgrade interval=%u",
+		td->low_upgrade_interval, td->low_downgrade_interval);
 }
 
 static void throtl_calculate_high_interval(struct throtl_data *td)
@@ -1729,6 +1741,9 @@ static void throtl_calculate_high_interval(struct throtl_data *td)
 		td->high_upgrade_interval = cg_check_time;
 		td->high_downgrade_interval = dbits / ubits * cg_check_time;
 	}
+	throtl_log(&td->service_queue,
+		"high upgrade interval=%u, downgrade interval=%u",
+		td->high_upgrade_interval, td->high_downgrade_interval);
 }
 
 static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
@@ -1745,6 +1760,8 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
 	/* if cgroup is below low limit for a long time, consider it idle */
 	if (time_after(jiffies,
 	    tg_last_low_overflow_time(tg) + tg->td->low_upgrade_interval)) {
+		throtl_log(&tg->service_queue, "idle upgrade, hit low time=%lu jiffies=%lu",
+			tg_last_low_overflow_time(tg), jiffies);
 		*idle = true;
 		return true;
 	}
@@ -1757,18 +1774,23 @@ static bool throtl_upgrade_check_one(struct throtl_grp *tg, bool *idle)
 		return false;
 	if (tg->iops[WRITE][LIMIT_LOW] != 0 && !sq->nr_queued[WRITE])
 		return false;
+	throtl_log(&tg->service_queue, "reach low limit upgrade");
 	return true;
 check_high:
 	/* if cgroup is below high limit for a long time, consider it idle */
 	if (time_after(jiffies,
 	    tg_last_high_overflow_time(tg) + tg->td->high_upgrade_interval)) {
+		throtl_log(&tg->service_queue, "idle upgrade, hit high time=%lu jiffies=%lu",
+			tg_last_high_overflow_time(tg), jiffies);
 		*idle = true;
 		return true;
 	}
 
 	/* if cgroup reaches high/max limit, it's ok to next limit */
-	if (sq->nr_queued[READ] || sq->nr_queued[WRITE])
+	if (sq->nr_queued[READ] || sq->nr_queued[WRITE]) {
+		throtl_log(&tg->service_queue, "reach high/max limit upgrade");
 		return true;
+	}
 	return false;
 }
 
@@ -1838,6 +1860,8 @@ static void throtl_upgrade_state(struct throtl_data *td)
 	td->limit_index++;
 	while (!td->limit_valid[td->limit_index])
 		td->limit_index++;
+	throtl_log(&td->service_queue, "upgrade state to %s",
+		limit_name[td->limit_index]);
 	td->low_upgrade_time = jiffies;
 	if (td->limit_index == LIMIT_HIGH)
 		td->high_downgrade_time = jiffies;
@@ -1886,6 +1910,8 @@ static void throtl_downgrade_state(struct throtl_data *td, int new)
 	int old = td->limit_index;
 
 	td->limit_index = new;
+	throtl_log(&td->service_queue, "downgrade state to %s",
+		limit_name[td->limit_index]);
 	/* max crosses high to low */
 	if (new == LIMIT_LOW && old == LIMIT_MAX && td->limit_valid[LIMIT_HIGH]) {
 		td->low_downgrade_time = jiffies;
@@ -1926,8 +1952,17 @@ static bool throtl_downgrade_check_one(struct throtl_grp *tg, bool check_low)
 	     time_after(now,
 	      td->high_upgrade_time + td->high_downgrade_interval) &&
 	     time_after(now,
-	      tg_last_high_overflow_time(tg) + td->high_downgrade_interval)))
+	      tg_last_high_overflow_time(tg) + td->high_downgrade_interval))) {
+		throtl_log(&tg->service_queue,
+			"%s idle downgrade, last hit limit time=%lu upgrade time=%lu jiffies=%lu",
+			check_low ? "low" : "high",
+			check_low ? tg_last_low_overflow_time(tg) :
+				tg_last_high_overflow_time(tg),
+			check_low ? td->low_upgrade_time :
+				td->high_upgrade_time,
+			jiffies);
 		return true;
+	}
 	return false;
 }
 
@@ -1986,6 +2021,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 
 	bps = tg->last_bytes_disp[READ] * HZ;
 	do_div(bps, elapsed_time);
+	throtl_log(&tg->service_queue, "rbps=%llu", bps);
 	if (tg->bps[READ][LIMIT_LOW] != 0 &&
 	    bps >= tg->bps[READ][LIMIT_LOW])
 		tg->last_low_overflow_time[READ] = now;
@@ -1996,6 +2032,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 
 	bps = tg->last_bytes_disp[WRITE] * HZ;
 	do_div(bps, elapsed_time);
+	throtl_log(&tg->service_queue, "wbps=%llu", bps);
 	if (tg->bps[WRITE][LIMIT_LOW] != 0 &&
 	    bps >= tg->bps[WRITE][LIMIT_LOW])
 		tg->last_low_overflow_time[WRITE] = now;
@@ -2005,6 +2042,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 		tg->last_high_overflow_time[WRITE] = now;
 
 	iops = tg->last_io_disp[READ] * HZ / elapsed_time;
+	throtl_log(&tg->service_queue, "riops=%u", iops);
 	if (tg->iops[READ][LIMIT_LOW] != 0 &&
 	    iops >= tg->iops[READ][LIMIT_LOW])
 		tg->last_low_overflow_time[READ] = now;
@@ -2014,6 +2052,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 		tg->last_high_overflow_time[READ] = now;
 
 	iops = tg->last_io_disp[WRITE] * HZ / elapsed_time;
+	throtl_log(&tg->service_queue, "wiops=%u", iops);
 	if (tg->iops[WRITE][LIMIT_LOW] != 0 &&
 	    iops >= tg->iops[WRITE][LIMIT_LOW])
 		tg->last_low_overflow_time[WRITE] = now;
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/10]block-throttle: add low/high limit
  2016-05-11  0:16 [PATCH 00/10]block-throttle: add low/high limit Shaohua Li
                   ` (9 preceding siblings ...)
  2016-05-11  0:16 ` [PATCH 10/10] blk-throttle: add trace log Shaohua Li
@ 2016-05-13 19:12 ` Vivek Goyal
  2016-05-13 22:59   ` Shaohua Li
  10 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2016-05-13 19:12 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, tj, axboe, jmoyer, Kernel-team

On Tue, May 10, 2016 at 05:16:30PM -0700, Shaohua Li wrote:
> Hi,
> 
> This patch set adds low/high limit for blk-throttle cgroup. The interface is
> io.low and io.high.
> 
> low limit implements best effort bandwidth/iops protection. If one cgroup
> doesn't reach its low limit, no other cgroups can use more bandwidth/iops than
> their low limit. cgroup without low limit is not protected. If there is cgroup
> with low limit but the cgroup doesn't reach low limit yet, the cgroup without
> low limit will be throttled to very low bandwidth/iops.

Hi Shaohua,

Can you please describe a little what problem you are solving and how
it is not solved by what we have right now?

Are you trying to guarantee minimum bandwidth to a cgroup? The approach
seems to be to specify the minimum bandwidth required by a cgroup in
io.low, and if the cgroup does not get that bandwidth, other cgroups
will be automatically throttled and will not get more than their io.low
limit BW.

I am wondering how one would configure the io.low limit. How would an
application know what the device's IO capability is and what part of
that bandwidth the application requires? IOW, proportional control
using absolute limits is very tricky as it requires one to know the
device's IO rate capabilities. To make it more complex, device
throughput is not fixed and varies with the workload. That means io.low
also somehow needs to adjust accordingly. And to me that means a notion
of prio/weight works better than absolute limits.

In general you seem to want to implement proportional control outside
CFQ so that it can be used with other block devices. I think your
previous idea of assigning weights to cgroups and translating them
automatically to some sort of control (number of tokens) was better
than absolute limits.

Having said that, it required knowing the cost of IO and I am not sure
whether we reached any conclusion at LSF about this.

On the other hand, all these algorithms only control how much IO can be
dispatched from a cgroup. Given the deep queue depths of devices, we
will not gain much if the device does not implement some sort of
priority mechanism where one IO in the queue is preferred over another.

To me the biggest problem with IO has been writes overwhelming the
device and killing read latencies. CFQ handled this to an extent but
soon became obsolete for faster devices. So now Jens's patches for
controlling background writes might help here.

I am not sure how proportional control at the block layer will help
with devices with deep queue depths and without any notion of request
priority. Writes can easily fill up the queue, and when
latency-sensitive IO comes in, it will still suffer. So we probably
need proportional control along with some sort of prioritization
implemented in the device.

Thanks
Vivek

> 
> high limit implements best effort limitation. cgroup with high limit can use
> more than high limit bandwidth/iops if all cgroups use at least high limit
> bandwidth/iops. If one cgroup is below its high limit, all cgroups can't use
> more bandwidth/iops than their high limit. If some cgroups have high limit and
> the others haven't, the cgroups without high limit will use max limit as their
> high limit.
> 
> The disk queue has a state machine. We have 3 states LIMIT_LOW, LIMIT_HIGH and
> LIMIT_MAX. In each state, we throttle cgroups up to a limit according to their
> state limit. LIMIT_LOW state limit is low limit, LIMIT_HIGH high limit and
> LIMIT_MAX max limit. In a state, if condition meets, queue can upgrade to
> higher level state or downgrade to lower level state. For example, queue is in
> LIMIT_LOW state and all cgroups reach their low limit, the queue will be
> upgraded to LIMIT_HIGH. In another example, queue is in LIMIT_MAX state, but
> one cgroup is below its high limit, the queue will be downgraded to LIMIT_HIGH.
> If all cgroups don't have limit for specific state, the state will be invalid.
> We will skip invalid state for upgrading/downgrading. Initially queue state is
> LIMIT_MAX till some cgroup gets low/high limit set, so this will maintain
> backward compatibility for users with only max limist set.
> 
> If downgrade/upgrade only happens according to limit, we will have performance
> issue. For example, if one cgroup has low limit set but the cgroup never
> dispatch enough IO to reach low limit, the queue state will remain in
> LIMIT_LOW. Other cgroups will be throttled and the whole disk utilization will
> be low. To solve this issue, if cgroup is below limit for a long time, we treat
> the cgroup idle and its corresponding limit will be ignored for
> upgrade/downgrade logic. The idle based upgrade could introduce a dilemma
> though, since we will do downgrade if cgroup is below its limit (eg idle). For
> example, if a cgroup is below its low limit for a long time, queue is upgraded
> to HIGH state. The cgroup continues to be below its low limit, the queue will
> be downgraded to LOW state. In this example, the queue will keep switching
> state between LOW and HIGH.
> 
> The key to avoid unnecessary state switching is to detect if cgroup is truly
> idle, which is a hard problem unfortunately. There are two kinds of idle. One
> is cgroup intends to not dispatch enough IO (real idle). In this case, we
> should do upgrade quickly and don't do downgrade. The other is other cgroups
> dispatch too many IO and use all bandwidth, the cgroup can't dispatch enough IO
> and looks idle (fake idle). In this case, we should do downgrade quickly and
> never do upgrade.
> 
> Destinguishing the two kinds of idle is impossible for a high queue depth disk
> as far as I can tell. This patch set doesn't try to precisely detect idle.
> Instead we record history of upgrade. If queue upgrades because cgroup hits
> limit, future downgrade is likely because of fake idle, hence future upgrade
> should run slowly and future downgrade should run quickly. Otherwise future
> downgrade is likely because of real idle, hence future upgrade should run
> quickly and future downgrade should run slowly. The adaptive upgrade/downgrade
> time means disk downgrade in real idle happens rarely and disk upgrade in fake
> idle happens rarely. This doesn't avoid repeatedly state switching though.
> Please see patch 6 for details.
> 
> User must carefully set the limits. Inproper setting could be ignored. For
> example, disk max bandwidth is 100M/s. One cgroup has low limit 60M/s, the
> other 50M/s. When the first cgroup runs in 60M/s, there is only 40M/s bandwidth
> remaining. The second cgroup will never reach 50M/s, so the cgroup will be
> treated idle and its limit will be literally ignored.
> 
> Comments and benchmarks are welcome!
> 
> Thanks,
> Shaohua
> 
> Shaohua Li (10):
>   block-throttle: prepare support multiple limits
>   block-throttle: add .low interface
>   block-throttle: configure bps/iops limit for cgroup in low limit
>   block-throttle: add upgrade logic for LIMIT_LOW state
>   block-throttle: add downgrade logic
>   block-throttle: idle detection
>   block-throttle: add .high interface
>   block-throttle: handle high limit
>   blk-throttle: make sure expire time isn't too big
>   blk-throttle: add trace log
> 
>  block/blk-throttle.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 764 insertions(+), 49 deletions(-)
> 
> -- 
> 2.8.0.rc2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/10]block-throttle: add low/high limit
  2016-05-13 19:12 ` [PATCH 00/10]block-throttle: add low/high limit Vivek Goyal
@ 2016-05-13 22:59   ` Shaohua Li
  2016-05-18 19:29     ` Vivek Goyal
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2016-05-13 22:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-block, linux-kernel, tj, axboe, jmoyer, Kernel-team

On Fri, May 13, 2016 at 03:12:45PM -0400, Vivek Goyal wrote:
> On Tue, May 10, 2016 at 05:16:30PM -0700, Shaohua Li wrote:
> > Hi,
> > 
> > This patch set adds low/high limit for blk-throttle cgroup. The interface is
> > io.low and io.high.
> > 
> > low limit implements best effort bandwidth/iops protection. If one cgroup
> > doesn't reach its low limit, no other cgroups can use more bandwidth/iops than
> > their low limit. cgroup without low limit is not protected. If there is cgroup
> > with low limit but the cgroup doesn't reach low limit yet, the cgroup without
> > low limit will be throttled to very low bandwidth/iops.
> 
> Hi Shaohua,
> 
> Can you please describe a little what problem are you solving and how
> it is not solved with what we have right now.

The goal is to implement a best effort limit. io.max is a hard limit,
which means a cgroup can't use more bandwidth than max even if there is
no IO pressure. If we set a high io.max limit for a low priority
cgroup, the high priority cgroup will get harmed and dispatch less IO.
If we set a low io.max limit, total disk bandwidth can't be fully used
by the low priority cgroup when the high priority cgroup doesn't run.
Neither is good. This is exactly what io.high tries to solve. io.high
is a soft limit; a cgroup can exceed the limit if there is no IO
pressure. So in the above example, the low priority cgroup can use more
than its io.high IO if the high priority cgroup isn't running, and up
to its io.high IO otherwise.
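
To make that concrete, here is a small standalone sketch (not the patch
code) of how the effective throttling limit follows the queue state.
The LIMIT_* indices mirror the series and (uint64_t)-1 stands for "high
limit not configured", following the same convention as the patches;
everything else is simplified, and the real code keeps separate
READ/WRITE arrays and handles iops the same way.

#include <stdint.h>

enum { LIMIT_LOW, LIMIT_HIGH, LIMIT_MAX, LIMIT_CNT };

static uint64_t effective_bps(const uint64_t bps[LIMIT_CNT], int queue_state)
{
	switch (queue_state) {
	case LIMIT_LOW:
		return bps[LIMIT_LOW];
	case LIMIT_HIGH:
		/* a cgroup without a high limit falls back to its max limit */
		return bps[LIMIT_HIGH] != (uint64_t)-1 ?
		       bps[LIMIT_HIGH] : bps[LIMIT_MAX];
	default:
		return bps[LIMIT_MAX];
	}
}

So while the queue sits in LIMIT_MAX (nobody is under IO pressure), a
low priority cgroup is only bounded by io.max, and it drops back to
io.high once the queue downgrades.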

> Are you trying to guarantee minimum bandwidth to a cgroup? And approach
> seems to be that specify minimum bandwidth required by a cgroup in
> io.low and if cgroup does not get that bandwidth, other cgroups will
> be automatically throttled and will not get more than their io.low
> limit BW.

This is exactly what io.low tries to do: protect the high priority cgroup.

> I am wondering how would one configure io.low limit? How would
> application know what's the device IO capability and what part of
> that bandwidth application requires.

I agree configuring io.low/high limits isn't easy. We have the same
problem for any limit based scheduling, including io.max. I don't have
a good answer yet for the configuration; those limits can only be found
after a lot of testing/benchmarking.

> IOW, proportional control using
> absolute limits is very tricky as it requires one to know device's
> IO rate capabilities. To make it more complex, device throughput
> is not fixed and varies based on badndwith. That mean, io.low also
> somehow needs to adjust accorginly. And to me that means using a
> notion of prio/weight works best instead of absolute limits.
>
> In general you seem to be wanting to implement proportional control
> outside CFQ so that it can be used with other block devices. I think
> your previous idea of assigning weights to cgroup and translating
> it automatically to some sort of control (number of tokens) was
> better than absolute limits.
> 
> Having said that, it required knowing cost of IO and I am not sure
> if we reached some conclusion at LSF about this.

So this patch set only tries to extend the current blk-throttle; it
isn't related to the proportional control I was working on before.

As for proportional control, I think it is much better than limit based
control, as it's easy to configure and adaptive. The problem is we
don't have a good way to measure IO cost, so my original proportional
control patches use either bandwidth or IOPS, and neither is precise.
Tejun has concerns about this. According to him, if we can't precisely
measure IO cost, we shouldn't do proportional control. This is
debatable though, and I won't give up on the proportional patches. This
patch set gives us a temporary solution to prioritize cgroups given
that proportional control is controversial. The io.low/io.high limits
also match memcg behavior, which has the same interfaces.

> On the other hand, all these algorithms only control how much IO
> can be dispatched from a cgroup. Given deep queue depths of devices,
> we will not gain much if device is not implementing some sort of
> priority mechanism where one IO in queue is preferred over other.

We can't solve this issue without hardware support, since hardware can
freely reschedule any IO. Limit based control can only do big picture
scheduling. Tejun used to think about adding logic to throttle cgroups
based on IO latency, but the big problem is that if latency increases
we don't know which cgroup made the IO latency increase. It could be
the cgroup itself dispatching some IO or it could be any other cgroup,
so we don't know which cgroup should be throttled further.

> To me biggest problem with IO has been writes overwhelming the device
> and killing read latencies. CFQ did it to an extent but soon became
> obsolete for faster devices. So now Jens's patch of controlling
> background write might help here.
> 
> Not sure how proportional control at block layer will help with devices
> of deep queue depths and without having any notion of priority of request.
> Writes can easily fill up the queue and when latency sensitive IO comes
> in, it will still suffer. So we probably need something proportional
> control along with some sort of prioritization implemented in device.

I agree. Proportional control is still the ultimate goal. Deep queue
depth makes the problem very hard. The CFQ way (idling the disk) is not
an option for fast devices though.

Thanks,
Shaohua

> > 
> > high limit implements best effort limitation. cgroup with high limit can use
> > more than high limit bandwidth/iops if all cgroups use at least high limit
> > bandwidth/iops. If one cgroup is below its high limit, all cgroups can't use
> > more bandwidth/iops than their high limit. If some cgroups have high limit and
> > the others haven't, the cgroups without high limit will use max limit as their
> > high limit.
> > 
> > The disk queue has a state machine. We have 3 states LIMIT_LOW, LIMIT_HIGH and
> > LIMIT_MAX. In each state, we throttle cgroups up to a limit according to their
> > state limit. LIMIT_LOW state limit is low limit, LIMIT_HIGH high limit and
> > LIMIT_MAX max limit. In a state, if condition meets, queue can upgrade to
> > higher level state or downgrade to lower level state. For example, queue is in
> > LIMIT_LOW state and all cgroups reach their low limit, the queue will be
> > upgraded to LIMIT_HIGH. In another example, queue is in LIMIT_MAX state, but
> > one cgroup is below its high limit, the queue will be downgraded to LIMIT_HIGH.
> > If all cgroups don't have limit for specific state, the state will be invalid.
> > We will skip invalid state for upgrading/downgrading. Initially queue state is
> > LIMIT_MAX till some cgroup gets low/high limit set, so this will maintain
> > backward compatibility for users with only max limist set.
> > 
> > If downgrade/upgrade only happens according to limit, we will have performance
> > issue. For example, if one cgroup has low limit set but the cgroup never
> > dispatch enough IO to reach low limit, the queue state will remain in
> > LIMIT_LOW. Other cgroups will be throttled and the whole disk utilization will
> > be low. To solve this issue, if cgroup is below limit for a long time, we treat
> > the cgroup idle and its corresponding limit will be ignored for
> > upgrade/downgrade logic. The idle based upgrade could introduce a dilemma
> > though, since we will do downgrade if cgroup is below its limit (eg idle). For
> > example, if a cgroup is below its low limit for a long time, queue is upgraded
> > to HIGH state. The cgroup continues to be below its low limit, the queue will
> > be downgraded to LOW state. In this example, the queue will keep switching
> > state between LOW and HIGH.
> > 
> > The key to avoid unnecessary state switching is to detect if cgroup is truly
> > idle, which is a hard problem unfortunately. There are two kinds of idle. One
> > is cgroup intends to not dispatch enough IO (real idle). In this case, we
> > should do upgrade quickly and don't do downgrade. The other is other cgroups
> > dispatch too many IO and use all bandwidth, the cgroup can't dispatch enough IO
> > and looks idle (fake idle). In this case, we should do downgrade quickly and
> > never do upgrade.
> > 
> > Destinguishing the two kinds of idle is impossible for a high queue depth disk
> > as far as I can tell. This patch set doesn't try to precisely detect idle.
> > Instead we record history of upgrade. If queue upgrades because cgroup hits
> > limit, future downgrade is likely because of fake idle, hence future upgrade
> > should run slowly and future downgrade should run quickly. Otherwise future
> > downgrade is likely because of real idle, hence future upgrade should run
> > quickly and future downgrade should run slowly. The adaptive upgrade/downgrade
> > time means disk downgrade in real idle happens rarely and disk upgrade in fake
> > idle happens rarely. This doesn't avoid repeatedly state switching though.
> > Please see patch 6 for details.
> > 
> > User must carefully set the limits. Inproper setting could be ignored. For
> > example, disk max bandwidth is 100M/s. One cgroup has low limit 60M/s, the
> > other 50M/s. When the first cgroup runs in 60M/s, there is only 40M/s bandwidth
> > remaining. The second cgroup will never reach 50M/s, so the cgroup will be
> > treated idle and its limit will be literally ignored.
> > 
> > Comments and benchmarks are welcome!
> > 
> > Thanks,
> > Shaohua
> > 
> > Shaohua Li (10):
> >   block-throttle: prepare support multiple limits
> >   block-throttle: add .low interface
> >   block-throttle: configure bps/iops limit for cgroup in low limit
> >   block-throttle: add upgrade logic for LIMIT_LOW state
> >   block-throttle: add downgrade logic
> >   block-throttle: idle detection
> >   block-throttle: add .high interface
> >   block-throttle: handle high limit
> >   blk-throttle: make sure expire time isn't too big
> >   blk-throttle: add trace log
> > 
> >  block/blk-throttle.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 764 insertions(+), 49 deletions(-)
> > 
> > -- 
> > 2.8.0.rc2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/10]block-throttle: add low/high limit
  2016-05-13 22:59   ` Shaohua Li
@ 2016-05-18 19:29     ` Vivek Goyal
  2016-05-25 21:38       ` Shaohua Li
  0 siblings, 1 reply; 15+ messages in thread
From: Vivek Goyal @ 2016-05-18 19:29 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, tj, axboe, jmoyer, Kernel-team

On Fri, May 13, 2016 at 03:59:50PM -0700, Shaohua Li wrote:
> On Fri, May 13, 2016 at 03:12:45PM -0400, Vivek Goyal wrote:
> > On Tue, May 10, 2016 at 05:16:30PM -0700, Shaohua Li wrote:
> > > Hi,
> > > 
> > > This patch set adds low/high limit for blk-throttle cgroup. The interface is
> > > io.low and io.high.
> > > 
> > > low limit implements best effort bandwidth/iops protection. If one cgroup
> > > doesn't reach its low limit, no other cgroups can use more bandwidth/iops than
> > > their low limit. cgroup without low limit is not protected. If there is cgroup
> > > with low limit but the cgroup doesn't reach low limit yet, the cgroup without
> > > low limit will be throttled to very low bandwidth/iops.
> > 
> > Hi Shaohua,
> > 
> > Can you please describe a little what problem are you solving and how
> > it is not solved with what we have right now.
> 
> The goal is to implement a best effort limit. io.max is a hard limit,
> which means cgroup can't use more bandwidth than max even there is no IO
> pressure. If we set a high io.max limit for a low priority cgroup, high
> priority cgroup will get harmed and dispatch less IO. If we set a low
> io.max limit, total disk bandwidth can't be fully used by low priority
> cgroup if high priority cgroup doesn't run. Either isn't good. This is
> exactly what io.high tries to solve. The io.high is a soft limit, cgroup
> could exceed the limit if there is no IO pressure. So in above example,
> low priority cgroup can use more than io.high IO if high priority cgroup
> isn't running and use up to io.high IO otherwise.

The io.max stuff was not designed to optimize disk usage. It was more
for the cloud scenario, where one does not get a faster IO rate if one
has not paid for that kind of service (despite the fact that there is
plenty of bandwidth available in the backend).

> 
> > Are you trying to guarantee minimum bandwidth to a cgroup? And approach
> > seems to be that specify minimum bandwidth required by a cgroup in
> > io.low and if cgroup does not get that bandwidth, other cgroups will
> > be automatically throttled and will not get more than their io.low
> > limit BW.
> 
> This is exactly what io.low tries to do, protect high priority cgroup.
> 
> > I am wondering how would one configure io.low limit? How would
> > application know what's the device IO capability and what part of
> > that bandwidth application requires.
> 
> I agree configure io.low/high limit isn't easy. We have the same problem
> for any limit based scheduling including io.max. I don't have good
> answer yet for the configuration, but those limits can only be found
> after a lot of testing/benchmarking.
> 
> > IOW, proportional control using
> > absolute limits is very tricky as it requires one to know device's
> > IO rate capabilities. To make it more complex, device throughput
> > is not fixed and varies based on badndwith. That mean, io.low also
> > somehow needs to adjust accorginly. And to me that means using a
> > notion of prio/weight works best instead of absolute limits.
> >
> > In general you seem to be wanting to implement proportional control
> > outside CFQ so that it can be used with other block devices. I think
> > your previous idea of assigning weights to cgroup and translating
> > it automatically to some sort of control (number of tokens) was
> > better than absolute limits.
> > 
> > Having said that, it required knowing cost of IO and I am not sure
> > if we reached some conclusion at LSF about this.
> 
> So this patch set only tries to extend current blk-throttle, it isn't
> related to the proportional control which I was working on before.

I think practically you are trying to achieve proportional control.
Proportional control gives everybody their fair share, and a low prio
application can do more IO if there is no IO pressure. (This is what
io.high seems to be implementing.)

And if there is IO pressure (lots of cgroups are doing IO), then
everybody will be limited to their fair share of IO bandwidth and
minimum bandwidth is guaranteed based on that fair share. (And this is
what io.low seems to be implementing.)

So to me you are trying to achieve what proportional control gives. The
difference is that proportional control does it with a single knob (say
weight) and you have split it into two knobs. Also, proportional
control adjusts itself dynamically and is easy to configure, while the
same can't be said for these absolute limits (io.low, io.high).

IOW, why will io.low give me a better minimum bandwidth guarantee
compared to proportional logic? I think at the end of the day you will
run into the same issue of deciding whether to allow a writer to fill
up the device queue or not.

For example, say two cgroups A and B are doing IO. A is a high prio
cgroup which primarily does reads (maybe dependent reads) and B is a
cgroup which does tons of big WRITEs. Now say you configured io.low for
A as 1MB/s and for B also as 1MB/s. Say that for a period of a few
seconds, A did not do any IO. Then you will think that the high prio
cgroup is not doing any IO, which means all the cgroups have met their
minimum bandwidth requirements, and hence allow cgroup B to dispatch IO
up to io.max. That will fill up the device queue. Now A does some
reads, which will still be stuck behind tons of WRITEs in the device
queue.

IOW, I think you are still trying to implement a proportional control
mechanism, just with two knobs instead of one, and it will have more or
less the same issues with device queue depth as you have with a weight
based proportional scheme.

> 
> As for proportional control, I think proportional control is much better
> than a limit based control, as it's easy to configure and adaptive. The
> problem is we don't have a good way to measure IO cost, so my original
> proportional control patches use either bandwidth or IOPS, none is
> precise. Tejun has concerns on this. According to him, if we can't
> precisely measure IO cost, we shouldn't do proportional control. This is
> debatable though, I'll not give up the proportional patches. This patch
> set gives us a temporary solution to prioritize cgroups giving the
> proportional control is controversial. The io.low/io.high limit also
> matches memcg behavior, which has the same interfaces.

It might make sense for memory control, as memory is an absolute
resource, there is no notion of proportional control as such, and most
of the time memory is viewed in terms of absolute amounts.

For IO, IMHO, proportional control makes more sense. If proportional
control is the ultimate goal, I think we should somehow try to get that
right instead of creating intermediate interfaces like io.low/io.high.

> 
> > On the other hand, all these algorithms only control how much IO
> > can be dispatched from a cgroup. Given deep queue depths of devices,
> > we will not gain much if device is not implementing some sort of
> > priority mechanism where one IO in queue is preferred over other.
> 
> We can't solve this issue without hardware support, hardware can freely
> reschedule any IO. The limit based control can only have a big picture
> scheduling. Tejun used to think about adding logic to throttle cgroup
> based on IO latency, but the big problem is if latency increases we
> don't know which cgorup makes the IO latency increase. It could be the
> cgroup itself dispatch some IO or could be any other cgroup. And so we
> don't know which cgroup should be throttled further.

I understand that without help from the device it is a very hard
problem to solve, and we somehow need to reduce the queue depth
intelligently.

I don't have any good answers, but I feel we should still look into
trying to make proportional control work (if we really have to). The
biggest problem with proportional control has been WRITEs, and Jens's
patches might help reduce the pressure of background writes, drive a
smaller queue depth and improve the latency of a higher prio, low
traffic cgroup.

If latency is the goal, would it make sense to allow configuring a max
latency for each cgroup, and if any cgroup is missing its latency
target, start throttling other cgroups until all cgroups start meeting
their max latency targets? I think this is similar to your io.low
proposal; the only difference is that the limits are in terms of
latency and not BW/iops. Again, this will only work if both the high
prio cgroup and the low prio cgroups are continuously backlogged, which
is rarely the case. Reads are latency sensitive, often depend on
previous reads, and are not continuously backlogged.
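
Purely as an illustration of that idea (these names are hypothetical
and do not exist in the kernel, and the measurement side is
hand-waved), the control step could look something like:

#include <stdbool.h>

/* Hypothetical per-cgroup latency-target state, for illustration only. */
struct cg_lat {
	unsigned long target_us;    /* configured max latency          */
	unsigned long observed_us;  /* measured completion latency     */
	bool throttle;              /* should this cgroup be limited?  */
};

/*
 * If any cgroup misses its latency target, throttle the cgroups that
 * are still meeting theirs until everyone meets the target again.
 */
static void latency_step(struct cg_lat *cgs, int nr)
{
	bool someone_missing = false;
	int i;

	for (i = 0; i < nr; i++)
		if (cgs[i].observed_us > cgs[i].target_us)
			someone_missing = true;

	for (i = 0; i < nr; i++)
		cgs[i].throttle = someone_missing &&
				  cgs[i].observed_us <= cgs[i].target_us;
}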

Thanks
Vivek

> 
> > To me biggest problem with IO has been writes overwhelming the device
> > and killing read latencies. CFQ did it to an extent but soon became
> > obsolete for faster devices. So now Jens's patch of controlling
> > background write might help here.
> > 
> > Not sure how proportional control at block layer will help with devices
> > of deep queue depths and without having any notion of priority of request.
> > Writes can easily fill up the queue and when latency sensitive IO comes
> > in, it will still suffer. So we probably need something proportional
> > control along with some sort of prioritization implemented in device.
> 
> I agree. proportional control is still the ultimate goal. deep queue
> depth makes the problem very hard. The CFQ way (idle disk) is not a
> choice for fast devices though.
> 
> Thanks,
> Shaohua
> 
> > > 
> > > high limit implements best effort limitation. cgroup with high limit can use
> > > more than high limit bandwidth/iops if all cgroups use at least high limit
> > > bandwidth/iops. If one cgroup is below its high limit, all cgroups can't use
> > > more bandwidth/iops than their high limit. If some cgroups have high limit and
> > > the others haven't, the cgroups without high limit will use max limit as their
> > > high limit.
> > > 
> > > The disk queue has a state machine. We have 3 states LIMIT_LOW, LIMIT_HIGH and
> > > LIMIT_MAX. In each state, we throttle cgroups up to a limit according to their
> > > state limit. LIMIT_LOW state limit is low limit, LIMIT_HIGH high limit and
> > > LIMIT_MAX max limit. In a state, if condition meets, queue can upgrade to
> > > higher level state or downgrade to lower level state. For example, queue is in
> > > LIMIT_LOW state and all cgroups reach their low limit, the queue will be
> > > upgraded to LIMIT_HIGH. In another example, queue is in LIMIT_MAX state, but
> > > one cgroup is below its high limit, the queue will be downgraded to LIMIT_HIGH.
> > > If all cgroups don't have limit for specific state, the state will be invalid.
> > > We will skip invalid state for upgrading/downgrading. Initially queue state is
> > > LIMIT_MAX till some cgroup gets low/high limit set, so this will maintain
> > > backward compatibility for users with only max limist set.
> > > 
> > > If downgrade/upgrade only happens according to limit, we will have performance
> > > issue. For example, if one cgroup has low limit set but the cgroup never
> > > dispatch enough IO to reach low limit, the queue state will remain in
> > > LIMIT_LOW. Other cgroups will be throttled and the whole disk utilization will
> > > be low. To solve this issue, if cgroup is below limit for a long time, we treat
> > > the cgroup idle and its corresponding limit will be ignored for
> > > upgrade/downgrade logic. The idle based upgrade could introduce a dilemma
> > > though, since we will do downgrade if cgroup is below its limit (eg idle). For
> > > example, if a cgroup is below its low limit for a long time, queue is upgraded
> > > to HIGH state. The cgroup continues to be below its low limit, the queue will
> > > be downgraded to LOW state. In this example, the queue will keep switching
> > > state between LOW and HIGH.
> > > 
> > > The key to avoid unnecessary state switching is to detect if cgroup is truly
> > > idle, which is a hard problem unfortunately. There are two kinds of idle. One
> > > is cgroup intends to not dispatch enough IO (real idle). In this case, we
> > > should do upgrade quickly and don't do downgrade. The other is other cgroups
> > > dispatch too many IO and use all bandwidth, the cgroup can't dispatch enough IO
> > > and looks idle (fake idle). In this case, we should do downgrade quickly and
> > > never do upgrade.
> > > 
> > > Destinguishing the two kinds of idle is impossible for a high queue depth disk
> > > as far as I can tell. This patch set doesn't try to precisely detect idle.
> > > Instead we record history of upgrade. If queue upgrades because cgroup hits
> > > limit, future downgrade is likely because of fake idle, hence future upgrade
> > > should run slowly and future downgrade should run quickly. Otherwise future
> > > downgrade is likely because of real idle, hence future upgrade should run
> > > quickly and future downgrade should run slowly. The adaptive upgrade/downgrade
> > > time means disk downgrade in real idle happens rarely and disk upgrade in fake
> > > idle happens rarely. This doesn't avoid repeatedly state switching though.
> > > Please see patch 6 for details.
> > > 
> > > User must carefully set the limits. Inproper setting could be ignored. For
> > > example, disk max bandwidth is 100M/s. One cgroup has low limit 60M/s, the
> > > other 50M/s. When the first cgroup runs in 60M/s, there is only 40M/s bandwidth
> > > remaining. The second cgroup will never reach 50M/s, so the cgroup will be
> > > treated idle and its limit will be literally ignored.
> > > 
> > > Comments and benchmarks are welcome!
> > > 
> > > Thanks,
> > > Shaohua
> > > 
> > > Shaohua Li (10):
> > >   block-throttle: prepare support multiple limits
> > >   block-throttle: add .low interface
> > >   block-throttle: configure bps/iops limit for cgroup in low limit
> > >   block-throttle: add upgrade logic for LIMIT_LOW state
> > >   block-throttle: add downgrade logic
> > >   block-throttle: idle detection
> > >   block-throttle: add .high interface
> > >   block-throttle: handle high limit
> > >   blk-throttle: make sure expire time isn't too big
> > >   blk-throttle: add trace log
> > > 
> > >  block/blk-throttle.c | 813 +++++++++++++++++++++++++++++++++++++++++++++++----
> > >  1 file changed, 764 insertions(+), 49 deletions(-)
> > > 
> > > -- 
> > > 2.8.0.rc2

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/10]block-throttle: add low/high limit
  2016-05-18 19:29     ` Vivek Goyal
@ 2016-05-25 21:38       ` Shaohua Li
  0 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2016-05-25 21:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-block, linux-kernel, tj, axboe, jmoyer, Kernel-team

Sorry for the late reply.

On Wed, May 18, 2016 at 03:29:55PM -0400, Vivek Goyal wrote:
> On Fri, May 13, 2016 at 03:59:50PM -0700, Shaohua Li wrote:
> > On Fri, May 13, 2016 at 03:12:45PM -0400, Vivek Goyal wrote:
> > > On Tue, May 10, 2016 at 05:16:30PM -0700, Shaohua Li wrote:
> > > > Hi,
> > > > 
> > > > This patch set adds low/high limit for blk-throttle cgroup. The interface is
> > > > io.low and io.high.
> > > > 
> > > > low limit implements best effort bandwidth/iops protection. If one cgroup
> > > > doesn't reach its low limit, no other cgroups can use more bandwidth/iops than
> > > > their low limit. cgroup without low limit is not protected. If there is cgroup
> > > > with low limit but the cgroup doesn't reach low limit yet, the cgroup without
> > > > low limit will be throttled to very low bandwidth/iops.
> > > 
> > > Hi Shaohua,
> > > 
> > > Can you please describe a little what problem are you solving and how
> > > it is not solved with what we have right now.
> > 
> > The goal is to implement a best-effort limit. io.max is a hard limit,
> > which means a cgroup can't use more bandwidth than max even when there is
> > no IO pressure. If we set a high io.max limit for a low priority cgroup,
> > the high priority cgroup gets harmed and dispatches less IO. If we set a
> > low io.max limit, total disk bandwidth can't be fully used by the low
> > priority cgroup when the high priority cgroup isn't running. Neither is
> > good. This is exactly what io.high tries to solve. io.high is a soft
> > limit: a cgroup can exceed it if there is no IO pressure. So in the above
> > example, the low priority cgroup can use more than io.high IO when the
> > high priority cgroup isn't running, and up to io.high IO otherwise.
> 
> io.max stuff was not designed to optimize disk usage. It was more for the
> cloud scenario where one does not get a faster IO rate if one has not
> paid for that kind of service (despite the fact that there is plenty of
> bandwidth available in the backend).
> 
> > 
> > > Are you trying to guarantee minimum bandwidth to a cgroup? The approach
> > > seems to be to specify the minimum bandwidth required by a cgroup in
> > > io.low, and if the cgroup does not get that bandwidth, other cgroups will
> > > be automatically throttled and will not get more than their io.low
> > > limit BW.
> > 
> > This is exactly what io.low tries to do, protect high priority cgroup.
> > 
> > > I am wondering how one would configure the io.low limit. How would an
> > > application know the device's IO capability and what part of
> > > that bandwidth the application requires?
> > 
> > I agree configuring the io.low/high limits isn't easy. We have the same
> > problem for any limit-based scheduling, including io.max. I don't have a
> > good answer yet for the configuration; those limits can only be found
> > after a lot of testing/benchmarking.
> > 
> > > IOW, proportional control using
> > > absolute limits is very tricky as it requires one to know device's
> > > IO rate capabilities. To make it more complex, device throughput
> > > is not fixed and varies based on bandwidth. That means io.low also
> > > somehow needs to adjust accordingly. And to me that means using a
> > > notion of prio/weight works best instead of absolute limits.
> > >
> > > In general you seem to want to implement proportional control
> > > outside CFQ so that it can be used with other block devices. I think
> > > your previous idea of assigning weights to cgroup and translating
> > > it automatically to some sort of control (number of tokens) was
> > > better than absolute limits.
> > > 
> > > Having said that, it required knowing cost of IO and I am not sure
> > > if we reached some conclusion at LSF about this.
> > 
> > So this patch set only tries to extend the current blk-throttle; it isn't
> > related to the proportional control which I was working on before.
> 
> I think practically you are trying to achieve proportional control.
> Proportional control gives everybody fair share and a low prio application
> can do higher IO if there is no IO pressure. (This is what io.high seems
> to be implementing).
> 
> And if there is IO pressure (a lot of cgroups are doing IO), then everybody
> will be limited to their fair share of IO bandwidth and minimum bandwidth
> is guaranteed based on their fair share. (And this is what io.low seems
> to be implementing).
> 
> So to me you are trying to achieve what proportional control gives. The
> difference is that proportional control does it with a single knob (say,
> weight) and you have split it into two knobs. Also, proportional control
> adjusts itself dynamically and is easy to configure, while the same can't
> be said for these absolute limits (io.low, io.high).

Hmm, io.low/io.high can prioritize cgroups much like proportional control does,
so you can think of it as a kind of proportional control. I agree that a
proportional configuration is easier.
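
To make this concrete, here is a minimal userspace sketch of how two cgroups
might be configured to express priority with the proposed knobs. It assumes
cgroup2 is mounted at /sys/fs/cgroup, that "high" and "low" cgroups already
exist, and that io.low/io.high accept the same "MAJ:MIN key=value" syntax as
io.max; all paths and numbers below are made up for illustration:

#include <stdio.h>
#include <stdlib.h>

/* Write one value into a cgroup control file, bail out on error. */
static void set_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	/* 8:0 is the throttled disk; the limits are hypothetical */
	set_knob("/sys/fs/cgroup/high/io.low",  "8:0 rbps=62914560 wbps=62914560");
	set_knob("/sys/fs/cgroup/high/io.high", "8:0 rbps=104857600 wbps=104857600");
	set_knob("/sys/fs/cgroup/low/io.low",   "8:0 rbps=10485760 wbps=10485760");
	set_knob("/sys/fs/cgroup/low/io.high",  "8:0 rbps=31457280 wbps=31457280");
	return 0;
}

The intent is that "high" is protected up to 60M/s while "low" is squeezed
toward 10M/s whenever "high" is below its low limit, which is roughly what a
higher weight would buy under proportional control.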

> IOW, why would io.low give me a better minimum bandwidth guarantee compared
> to proportional logic? I think at the end of the day you will run into the
> same issue of deciding whether to allow a writer to fill up the device
> queue or not.

No, I didn't say io.low gives a better minimum bandwidth guarantee.
io.low/io.high is not better than proportional control, but it's simpler.

> For example, say two cgroups A and B are doing IO. A is a high prio cgroup
> which primarily does reads (maybe dependent reads) and B is a cgroup
> which does tons of big WRITES. Now say you configured io.low for A as
> 1MB/s and for B also as 1MB/s. Say for a period of a few seconds, A did
> not do any IO. Then you will think that the high prio cgroup is not doing
> any IO, which means all the cgroups have met their minimum bandwidth
> requirements, and hence you allow cgroup B to dispatch IO up to io.max. And
> that will fill up the device queue. Now A does some reads which will
> still be stuck behind tons of WRITEs in the device queue.

Limit-based control can only guarantee bandwidth over an interval, not at any
specific point in time. Short of idling the disk, I don't know of any method
that fixes the stuck-behind-writes issue.
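
To illustrate the "over an interval" point, here is a toy sketch of slice-based
bps accounting in the spirit of blk-throttle (this is not the kernel code; the
numbers are invented). The budget accrues over the slice, so a burst can be
rejected at one instant even though the per-second average would allow it:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy slice-based accounting: the limit only holds on average over a slice. */
struct tg {
	uint64_t bps;          /* allowed bytes per second */
	uint64_t slice_ms;     /* accounting window */
	uint64_t disp_bytes;   /* bytes dispatched in the current slice */
	uint64_t slice_start;  /* ms timestamp of the slice start */
};

static bool may_dispatch(struct tg *tg, uint64_t now_ms, uint64_t bio_bytes)
{
	if (now_ms - tg->slice_start >= tg->slice_ms) {
		/* new slice: reset the budget */
		tg->slice_start = now_ms;
		tg->disp_bytes = 0;
	}
	/* budget earned so far in this slice (+1 ms so t=0 isn't zero) */
	uint64_t budget = tg->bps * (now_ms - tg->slice_start + 1) / 1000;

	return tg->disp_bytes + bio_bytes <= budget;
}

int main(void)
{
	struct tg tg = { .bps = 10 << 20, .slice_ms = 100 };

	/* a 4MB burst right at t=0 is rejected (prints 0) ... */
	printf("%d\n", may_dispatch(&tg, 0, 4 << 20));
	/* ... while 512KB late in the slice fits the accrued budget (prints 1) */
	printf("%d\n", may_dispatch(&tg, 99, 512 << 10));
	return 0;
}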

> IOW, I think you are still trying to implement a proportional control
> mechanism; instead of one knob you are using two knobs, and it will have
> more or less the same issues with device queue depth as you have with a
> weight based proportional scheme.
> 
> > 
> > As for proportional control, I think proportional control is much better
> > than limit based control, as it's easy to configure and adaptive. The
> > problem is we don't have a good way to measure IO cost, so my original
> > proportional control patches use either bandwidth or IOPS, and neither is
> > precise. Tejun has concerns about this. According to him, if we can't
> > precisely measure IO cost, we shouldn't do proportional control. This is
> > debatable though, and I'll not give up on the proportional patches. This
> > patch set gives us a temporary solution to prioritize cgroups given that
> > proportional control is controversial. The io.low/io.high limit also
> > matches memcg behavior, which has the same interfaces.
> 
> It might make sense for memory control, as memory is an absolute resource,
> there is no notion of proportional control as such, and most of the
> time memory is viewed in terms of an absolute resource.
> 
> For IO, IMHO, proportional control makes more sense. If proportional
> control is the ultimate goal, I think we should somehow try to get that
> right instead of creating intermediate interfaces like io.low/io.high.

I agree proportional control makes more sense. The problem is that it's hard to
implement. For my previous proportional control patches, we didn't have a good
approach to measure IO cost, so I used bandwidth/iops. The concern is that
neither bandwidth nor iops is a precise measure of IO cost, so it doesn't work
well for some workloads. Another concern is that we must add a new interface to
choose between bandwidth and iops for IO cost measurement, and that interface
is considered not good.

The reason I pursue io.low/io.high is that it's relatively easy to implement
(though it has its own hard issues) and can prioritize cgroups. If you have a
good idea for implementing proportional control, I'm happy to try it.
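
As a rough illustration of the measurement problem, one could model per-IO cost
as a fixed per-request overhead plus a per-byte transfer cost; pure iops
accounting is the per_kb = 0 case and pure bandwidth accounting is the
per_io = 0 case. The sketch and its coefficients are invented, and calibrating
them per device and per read/write mix is exactly the hard part:

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical linear cost model: cost = per-IO overhead + per-KB cost.
 * Neither pure iops nor pure bandwidth accounting is right for mixed request
 * sizes, and the real coefficients vary by device and by read/write mix.
 */
struct cost_model {
	uint64_t per_io_units;   /* cost charged to every request */
	uint64_t per_kb_units;   /* cost charged per KB transferred */
};

static uint64_t io_cost(const struct cost_model *m, uint64_t bytes)
{
	return m->per_io_units + m->per_kb_units * (bytes >> 10);
}

int main(void)
{
	struct cost_model m = { .per_io_units = 100, .per_kb_units = 4 };

	/* a 4KB random read and a 1MB sequential write look very different */
	printf("4KB IO cost: %llu\n", (unsigned long long)io_cost(&m, 4096));
	printf("1MB IO cost: %llu\n", (unsigned long long)io_cost(&m, 1 << 20));
	return 0;
}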

> > 
> > > On the other hand, all these algorithms only control how much IO
> > > can be dispatched from a cgroup. Given the deep queue depths of devices,
> > > we will not gain much if the device does not implement some sort of
> > > priority mechanism where one IO in the queue is preferred over another.
> > 
> > We can't solve this issue without hardware support; the hardware can freely
> > reschedule any IO. Limit based control can only do big-picture
> > scheduling. Tejun used to think about adding logic to throttle cgroups
> > based on IO latency, but the big problem is that when latency increases we
> > don't know which cgroup made the IO latency increase. It could be the
> > cgroup itself dispatching some IO or it could be any other cgroup, so we
> > don't know which cgroup should be throttled further.
> 
> I understand that without the help of device, it is very hard problem
> to solve and we somehow need to reduce the queue depth intelligently.
> 
> I don't have any good answers but I feel we should still look into
> trying to make proportional control work (if we really have to). The biggest
> problem with proportional control has been WRITEs, and Jens's patches
> might help reduce the pressure of background writes, drive a smaller
> queue depth, and improve the latency of a higher prio, low traffic cgroup.

cgroups are not just trying to reduce latency caused by WRITEs. Any cgroup's
reads/writes can impact the latency of other cgroups' reads/writes, so it's
much harder than writeback throttling. The writeback throttling also only
controls minimum latency, which is less sensitive. For cgroups, we probably
must control average latency or outlier latency.
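
For example, tracking per-cgroup average latency could be as simple as an EWMA
over completed requests. The sketch below is illustrative only (the 1/8 weight
and the sample values are made up), and it doesn't answer the hard question of
what to do once a cgroup's average degrades:

#include <stdint.h>
#include <stdio.h>

/* Sketch: exponentially weighted moving average of completion latency (us). */
struct lat_track {
	uint64_t avg_us;
};

static void lat_update(struct lat_track *lt, uint64_t sample_us)
{
	if (!lt->avg_us)
		lt->avg_us = sample_us;
	else                        /* roughly avg += (sample - avg) / 8 */
		lt->avg_us = lt->avg_us - (lt->avg_us >> 3) + (sample_us >> 3);
}

int main(void)
{
	struct lat_track lt = { 0 };
	uint64_t samples[] = { 200, 220, 180, 5000, 240 };  /* one outlier */

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		lat_update(&lt, samples[i]);
	printf("avg latency: %lluus\n", (unsigned long long)lt.avg_us);
	return 0;
}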

> If latency is the goal, would it make sense to allow configuring a
> max latency for each cgroup, and if any of the cgroups is missing
> its latency target, then start throttling other cgroups till all
> cgroups start meeting their max latency targets? I think this is
> similar to your io.low proposal and the only difference is that limits are
> in terms of latency and not BW/iops. Again this will only work
> if both the high prio cgroup and low prio cgroups are continuously
> backlogged, which is rarely the case. Reads are latency sensitive,
> are often dependent on previous reads, and are not
> continuously backlogged.

We did consider this option. Configuring latency for a cgroup would be very
hard. A big latency target means we do less throttling and harm fairness; a low
latency target means we do more throttling and harm throughput. The latency
target would be very sensitive and should adapt to different disks. When one
cgroup misses its latency target, choosing which cgroups should be throttled is
another hard problem, because the increased latency could be caused by any
cgroup.
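
If one did go down the latency-target route, the control loop would look
roughly like the sketch below. Everything here (field names, the 1/4 and 1/8
step sizes) is hypothetical, and the unsolved part is exactly the throttling
decision: we simply squeeze every cgroup that is not itself missing a target,
without knowing which one actually caused the latency increase:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct cg {
	uint64_t target_us;   /* configured max latency, 0 = no target */
	uint64_t avg_us;      /* measured average latency */
	uint64_t bps_limit;   /* currently applied throttle */
};

static void adjust(struct cg *cgs, size_t n)
{
	bool missed = false;

	for (size_t i = 0; i < n; i++)
		if (cgs[i].target_us && cgs[i].avg_us > cgs[i].target_us)
			missed = true;

	for (size_t i = 0; i < n; i++) {
		bool missing = cgs[i].target_us &&
			       cgs[i].avg_us > cgs[i].target_us;

		if (missed && !missing)
			cgs[i].bps_limit -= cgs[i].bps_limit / 4;  /* squeeze */
		else
			cgs[i].bps_limit += cgs[i].bps_limit / 8;  /* relax */
	}
}

int main(void)
{
	struct cg cgs[] = {
		{ .target_us = 500, .avg_us = 800, .bps_limit = 100 << 20 },
		{ .target_us = 0,   .avg_us = 300, .bps_limit = 100 << 20 },
	};

	adjust(cgs, 2);
	printf("missing its target: %llu, best effort: %llu\n",
	       (unsigned long long)cgs[0].bps_limit,
	       (unsigned long long)cgs[1].bps_limit);
	return 0;
}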

Thanks,
Shaohua

> > > To me the biggest problem with IO has been writes overwhelming the device
> > > and killing read latencies. CFQ handled this to an extent but soon became
> > > obsolete for faster devices. So now Jens's patch for controlling
> > > background writes might help here.
> > > 
> > > Not sure how proportional control at the block layer will help with devices
> > > of deep queue depths and without any notion of request priority.
> > > Writes can easily fill up the queue and when latency sensitive IO comes
> > > in, it will still suffer. So we probably need proportional
> > > control along with some sort of prioritization implemented in the device.
> > 
> > I agree, proportional control is still the ultimate goal. Deep queue
> > depths make the problem very hard. The CFQ way (idling the disk) is not an
> > option for fast devices though.
> > 
> > Thanks,
> > Shaohua
> > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-05-25 21:39 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-11  0:16 [PATCH 00/10]block-throttle: add low/high limit Shaohua Li
2016-05-11  0:16 ` [PATCH 01/10] block-throttle: prepare support multiple limits Shaohua Li
2016-05-11  0:16 ` [PATCH 02/10] block-throttle: add .low interface Shaohua Li
2016-05-11  0:16 ` [PATCH 03/10] block-throttle: configure bps/iops limit for cgroup in low limit Shaohua Li
2016-05-11  0:16 ` [PATCH 04/10] block-throttle: add upgrade logic for LIMIT_LOW state Shaohua Li
2016-05-11  0:16 ` [PATCH 05/10] block-throttle: add downgrade logic Shaohua Li
2016-05-11  0:16 ` [PATCH 06/10] block-throttle: idle detection Shaohua Li
2016-05-11  0:16 ` [PATCH 07/10] block-throttle: add .high interface Shaohua Li
2016-05-11  0:16 ` [PATCH 08/10] block-throttle: handle high limit Shaohua Li
2016-05-11  0:16 ` [PATCH 09/10] blk-throttle: make sure expire time isn't too big Shaohua Li
2016-05-11  0:16 ` [PATCH 10/10] blk-throttle: add trace log Shaohua Li
2016-05-13 19:12 ` [PATCH 00/10]block-throttle: add low/high limit Vivek Goyal
2016-05-13 22:59   ` Shaohua Li
2016-05-18 19:29     ` Vivek Goyal
2016-05-25 21:38       ` Shaohua Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).