[PATCH v2] blk-throttle: simplify logic by token bucket algorithm

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
       [not found] <1381574794-7639-1-git-send-email-zhiguohong@tencent.com>
@ 2013-10-14  9:09 ` Hong Zhiguo
  2013-10-14 13:36   ` Tejun Heo
  2013-10-15 17:32   ` Vivek Goyal
  2013-10-17 12:17 ` [PATCH v3] " Hong Zhiguo
  1 sibling, 2 replies; 25+ messages in thread
From: Hong Zhiguo @ 2013-10-14  9:09 UTC (permalink / raw)
  To: axboe; +Cc: vgoyal, cgroups, tj, linux-kernel, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
is very simple and widely used for rate limiting and shaping.
It's well suitable for blk-throttle. And it natually works for
hierarchical cgroups. So I took it to replace the original time
_slicing_|_trimming_|_extending_ logic.

The advantage is simplicity, reliability and less fluctuation.

About 300 lines of code for time-slicing is replaced with 60 lines of
code for token bucket in blk-throttle.c.

I've tested this patch by fio with rw=randread, rw=randrw. It
behaves almost the same with original time-slicing implementation,
and with more accuracy.

For example, create 2 cgroups and set blkio.throttle.read_bps_device
to 51200(50K) and 204800(200K), start fio with rw=randread. The
Average and Standard Deviation of bandwidth report:
----------------------------------------------
Time Slicing         | 50K group | 200K group
----------------------------------------------
Average:             | 50.1176   | 200.4470
Standard Deviation:  | 1.2977    | 3.6272
----------------------------------------------

----------------------------------------------
Token Bucket         | 50K group | 200K group
----------------------------------------------
Average:             | 50.1647   | 200.4705
Standard Deviation:  | 0.5498    | 1.0080
----------------------------------------------

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Tested-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 block/blk-throttle.c | 284 ++++++++-------------------------------------------
 1 file changed, 43 insertions(+), 241 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8331aba..ec26555 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -18,9 +18,6 @@ static int throtl_grp_quantum = 8;
 /* Total max dispatch from all groups in one round */
 static int throtl_quantum = 32;
 
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
-
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -91,6 +88,11 @@ struct tg_stats_cpu {
 	struct blkg_rwstat		serviced;
 };
 
+/* Depth of iops token bucket */
+#define THROTL_BURST_IO 8
+/* Depth of bps token bucket */
+#define THROTL_BURST_BYTES (4096 * 8)
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -133,14 +135,13 @@ struct throtl_grp {
 	/* IOPS limits */
 	unsigned int iops[2];
 
-	/* Number of bytes disptached in current slice */
-	uint64_t bytes_disp[2];
-	/* Number of bio's dispatched in current slice */
-	unsigned int io_disp[2];
+	/* Token for throttling by bps */
+	uint64_t bytes_token[2];
+	/* Token for throttling by iops */
+	unsigned int io_token[2];
 
-	/* When did we start a new slice */
-	unsigned long slice_start[2];
-	unsigned long slice_end[2];
+	/* Time check-point */
+	unsigned long t_c[2];
 
 	/* Per cpu stats pointer */
 	struct tg_stats_cpu __percpu *stats_cpu;
@@ -680,171 +681,26 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
 	return false;
 }
 
-static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
-		bool rw, unsigned long start)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-
-	/*
-	 * Previous slice has expired. We must have trimmed it after last
-	 * bio dispatch. That means since start of last slice, we never used
-	 * that bandwidth. Do try to make use of that bandwidth while giving
-	 * credit.
-	 */
-	if (time_after_eq(start, tg->slice_start[rw]))
-		tg->slice_start[rw] = start;
-
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
-					unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-}
-
-static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
-				       unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log(&tg->service_queue,
-		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-/* Determine if previously allocated or extended slice is complete or not */
-static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
-{
-	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
-		return 0;
-
-	return 1;
-}
-
-/* Trim the used slices and adjust slice start accordingly */
-static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
-{
-	unsigned long nr_slices, time_elapsed, io_trim;
-	u64 bytes_trim, tmp;
-
-	BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
-
-	/*
-	 * If bps are unlimited (-1), then time slice don't get
-	 * renewed. Don't try to trim the slice if slice is used. A new
-	 * slice will start when appropriate.
-	 */
-	if (throtl_slice_used(tg, rw))
-		return;
-
-	/*
-	 * A bio has been dispatched. Also adjust slice_end. It might happen
-	 * that initially cgroup limit was very low resulting in high
-	 * slice_end, but later limit was bumped up and bio was dispached
-	 * sooner, then we need to reduce slice_end. A high bogus slice_end
-	 * is bad because it does not allow new slice to start.
-	 */
-
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
-
-	time_elapsed = jiffies - tg->slice_start[rw];
-
-	nr_slices = time_elapsed / throtl_slice;
-
-	if (!nr_slices)
-		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
-	do_div(tmp, HZ);
-	bytes_trim = tmp;
-
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
-
-	if (!bytes_trim && !io_trim)
-		return;
-
-	if (tg->bytes_disp[rw] >= bytes_trim)
-		tg->bytes_disp[rw] -= bytes_trim;
-	else
-		tg->bytes_disp[rw] = 0;
-
-	if (tg->io_disp[rw] >= io_trim)
-		tg->io_disp[rw] -= io_trim;
-	else
-		tg->io_disp[rw] = 0;
-
-	tg->slice_start[rw] += nr_slices * throtl_slice;
-
-	throtl_log(&tg->service_queue,
-		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
-		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
-}
-
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	unsigned int io_allowed;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-	u64 tmp;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
-
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	u64 token;
 
-	/*
-	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
-	 * 1 then at max jiffy elapsed should be equivalent of 1 second as we
-	 * will allow dispatch after 1 second and after that slice should
-	 * have been trimmed.
-	 */
-
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
+	token = (u64)tg->iops[rw] * (jiffies - tg->t_c[rw]);
+	do_div(token, HZ);
+	token += tg->io_token[rw];
 
-	if (tmp > UINT_MAX)
-		io_allowed = UINT_MAX;
-	else
-		io_allowed = tmp;
-
-	if (tg->io_disp[rw] + 1 <= io_allowed) {
+	if (token) {
+		tg->io_token[rw] = token;
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
-
-	if (jiffy_wait > jiffy_elapsed)
-		jiffy_wait = jiffy_wait - jiffy_elapsed;
-	else
-		jiffy_wait = 1;
-
 	if (wait)
-		*wait = jiffy_wait;
+		*wait = HZ / tg->iops[rw] ?: 1;
 	return 0;
 }
 
@@ -852,41 +708,26 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	u64 bytes_allowed, extra_bytes, tmp;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
-
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	u64 extra_bytes, token;
+	unsigned long jiffy_wait;
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
-	bytes_allowed = tmp;
+	token = (u64)tg->bps[rw] * (jiffies - tg->t_c[rw]);
+	do_div(token, HZ);
+	token += tg->bytes_token[rw];
 
-	if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
+	if (bio->bi_size <= token) {
+		tg->bytes_token[rw] = token;
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
-
-	if (!jiffy_wait)
-		jiffy_wait = 1;
-
-	/*
-	 * This wait time is without taking into consideration the rounding
-	 * up we did. Add that time also.
-	 */
-	jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
-	if (wait)
-		*wait = jiffy_wait;
+	if (wait) {
+		extra_bytes = bio->bi_size - token;
+		jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+		*wait = jiffy_wait ?: 1;
+	}
 	return 0;
 }
 
@@ -916,18 +757,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 		return 1;
 	}
 
-	/*
-	 * If previous slice expired, start a new one otherwise renew/extend
-	 * existing slice to make sure it is at least throtl_slice interval
-	 * long since now.
-	 */
-	if (throtl_slice_used(tg, rw))
-		throtl_start_new_slice(tg, rw);
-	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
-	}
-
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
 	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
 		if (wait)
@@ -940,9 +769,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (wait)
 		*wait = max_wait;
 
-	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(tg, rw, jiffies + max_wait);
-
 	return 0;
 }
 
@@ -976,10 +802,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
 
-	/* Charge the bio to the group */
-	tg->bytes_disp[rw] += bio->bi_size;
-	tg->io_disp[rw]++;
+	/* Charge the bio to the group and trim the bucket */
+	tg->bytes_token[rw] -= bio->bi_size;
+	if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
+		tg->bytes_token[rw] = THROTL_BURST_BYTES;
 
+	tg->io_token[rw]--;
+	if (tg->io_token[rw] > THROTL_BURST_IO)
+		tg->io_token[rw] = THROTL_BURST_IO;
+
+	tg->t_c[rw] = jiffies;
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
 	 * more than once as a throttled bio will go through blk-throtl the
@@ -1055,16 +887,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
-static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
-					struct throtl_grp *parent_tg, bool rw)
-{
-	if (throtl_slice_used(parent_tg, rw)) {
-		throtl_start_new_slice_with_credit(parent_tg, rw,
-				child_tg->slice_start[rw]);
-	}
-
-}
-
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1093,7 +915,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	 */
 	if (parent_tg) {
 		throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
-		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
 		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
 				     &parent_sq->queued[rw]);
@@ -1101,8 +922,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 		tg->td->nr_queued[rw]--;
 	}
 
-	throtl_trim_slice(tg, rw);
-
 	if (tg_to_put)
 		blkg_put(tg_to_blkg(tg_to_put));
 }
@@ -1386,12 +1205,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
 	 * We're already holding queue_lock and know @tg is valid.  Let's
 	 * apply the new config directly.
 	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
+	 * Update dispatch time.
 	 */
-	throtl_start_new_slice(tg, 0);
-	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1527,19 +1342,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		throtl_charge_bio(tg, bio);
 
 		/*
-		 * We need to trim slice even when bios are not being queued
-		 * otherwise it might happen that a bio is not queued for
-		 * a long time and slice keeps on extending and trim is not
-		 * called for a long time. Now if limits are reduced suddenly
-		 * we take into account all the IO dispatched so far at new
-		 * low rate and * newly queued IO gets a really long dispatch
-		 * time.
-		 *
-		 * So keep on trimming slice even if bio is not queued.
-		 */
-		throtl_trim_slice(tg, rw);
-
-		/*
 		 * @bio passed through this layer without being throttled.
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
@@ -1552,10 +1354,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	}
 
 	/* out-of-limit, queue to @tg */
-	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+	throtl_log(sq, "[%c] bio. btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_token[rw], bio->bi_size, tg->bps[rw],
+		   tg->io_token[rw], tg->iops[rw],
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14  9:09 ` [PATCH v2] blk-throttle: simplify logic by token bucket algorithm Hong Zhiguo
@ 2013-10-14 13:36   ` Tejun Heo
  2013-10-14 13:47     ` Hong zhi guo
                       ` (2 more replies)
  2013-10-15 17:32   ` Vivek Goyal
  1 sibling, 3 replies; 25+ messages in thread
From: Tejun Heo @ 2013-10-14 13:36 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: axboe, vgoyal, cgroups, linux-kernel, Hong Zhiguo

Hello,

On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote:
> From: Hong Zhiguo <zhiguohong@tencent.com>
> 
> Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
> is very simple and widely used for rate limiting and shaping.
> It's well suitable for blk-throttle. And it natually works for
> hierarchical cgroups. So I took it to replace the original time
> _slicing_|_trimming_|_extending_ logic.
> 
> The advantage is simplicity, reliability and less fluctuation.
> 
> About 300 lines of code for time-slicing is replaced with 60 lines of
> code for token bucket in blk-throttle.c.
> 
> I've tested this patch by fio with rw=randread, rw=randrw. It
> behaves almost the same with original time-slicing implementation,
> and with more accuracy.

Yes, this definitely is the direction we wanna take it.  I'll wait for
Vivek to chime in but have you also tested hierarchical setup?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14 13:36   ` Tejun Heo
@ 2013-10-14 13:47     ` Hong zhi guo
  2013-10-14 13:53     ` Hong zhi guo
  2013-10-15 13:03     ` Vivek Goyal
  2 siblings, 0 replies; 25+ messages in thread
From: Hong zhi guo @ 2013-10-14 13:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, vgoyal, cgroups, linux-kernel, Hong Zhiguo

Hi, Tejun,

I've not tested hierarchical setup yet. I'll do it tomorrow.
BTW, what kind of setup do you expect? Is 2 levels of hierarchical enough?

Zhiguo

On Mon, Oct 14, 2013 at 9:36 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Yes, this definitely is the direction we wanna take it.  I'll wait for
> Vivek to chime in but have you also tested hierarchical setup?
>
> Thanks.
>
> --
> tejun



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14 13:36   ` Tejun Heo
  2013-10-14 13:47     ` Hong zhi guo
@ 2013-10-14 13:53     ` Hong zhi guo
  2013-10-14 13:59       ` Tejun Heo
  2013-10-15 13:03     ` Vivek Goyal
  2 siblings, 1 reply; 25+ messages in thread
From: Hong zhi guo @ 2013-10-14 13:53 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, vgoyal, cgroups, linux-kernel, Hong Zhiguo

Hi, Tejun,

I've not tested hierarchical setup yet. I'll do it tomorrow.
BTW, what kind of setup do you expect? Is hierarchy of 2 levels enough?

Zhiguo

On Mon, Oct 14, 2013 at 9:36 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Yes, this definitely is the direction we wanna take it.  I'll wait for
> Vivek to chime in but have you also tested hierarchical setup?
>
> Thanks.
>
> --
> tejun



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14 13:53     ` Hong zhi guo
@ 2013-10-14 13:59       ` Tejun Heo
  2013-10-15 12:35         ` Hong zhi guo
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2013-10-14 13:59 UTC (permalink / raw)
  To: Hong zhi guo; +Cc: Jens Axboe, vgoyal, cgroups, linux-kernel, Hong Zhiguo

On Mon, Oct 14, 2013 at 09:53:54PM +0800, Hong zhi guo wrote:
> I've not tested hierarchical setup yet. I'll do it tomorrow.
> BTW, what kind of setup do you expect? Is hierarchy of 2 levels enough?

That should cover most of the functionalities but going for 3 levels
can't hurt, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14 13:59       ` Tejun Heo
@ 2013-10-15 12:35         ` Hong zhi guo
  2013-10-15 16:19           ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Hong zhi guo @ 2013-10-15 12:35 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, vgoyal, cgroups, linux-kernel, Hong Zhiguo

Hi, Tejun,

I did the test for 3 levels hierarchy. It works.

Preparation
============
1) mount subsys blkio with "__DEVEL__sane_behavior"
2) Create 3 levels of directories under the blkio mount point:
    mkdir 1
    mkdir 1/2
    mkdir 1/2/3
3) start 3 bash sessions, write their PIDs into:
    1/cgroup.procs
    1/2/cgroup.procs
    1/2/3/cgroup.procs
4) prepare 3 10MB files on sdb(ext4 fs)

Note: in below hierarchy graph:
    "[50k]" means configured value for read_bps_device is 50kB/s
    "(50k)" means bandwidth reported by dd is 50kB/s

Test A: 1 process throttled by ancestor group
=============================================
Hierarchy set-up:
    (echo "8:16 204800" > 1/blkio.throttle.read_bps_device)
    1 [200k]
    `-- 2 [-]
        `-- 3 [-]

dd within group 3:
    (drop cache then: dd if=10M-file-3 of=/dev/null)
Result:
    206kB/s (I did same test without the token-bucket patch, The
result is 205kB/s)

dd within group 2:
    (drop cache then: dd if=10M-file-2 of=/dev/null)
Result:
    205kB/s


Test B: 3 processes in 3 levels of hierarchy
=============================================
Hierarchy set-up:
    echo "8:16 204800" > 1/blkio.throttle.read_bps_device
    echo "8:16 102400" > 1/2/blkio.throttle.read_bps_device
    echo "8:16 51200"  > 1/2/3/blkio.throttle.read_bps_device
    1 [200k]
    `-- 2 [100k]
        `-- 3 [50k]
start 3 dd processes from 3 bash sessions
    (dd if=10M-file-x of=/dev/null)
Result:
    1 (103k)
    `-- 2 (51.9k)
        `-- 3 (51.4k)


best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-15 12:35         ` Hong zhi guo
@ 2013-10-15 16:19           ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2013-10-15 16:19 UTC (permalink / raw)
  To: Hong zhi guo; +Cc: Tejun Heo, vgoyal, cgroups, linux-kernel, Hong Zhiguo

On Tue, Oct 15 2013, Hong zhi guo wrote:
> Hi, Tejun,
> 
> I did the test for 3 levels hierarchy. It works.
> 
> Preparation
> ============
> 1) mount subsys blkio with "__DEVEL__sane_behavior"
> 2) Create 3 levels of directories under the blkio mount point:
>     mkdir 1
>     mkdir 1/2
>     mkdir 1/2/3
> 3) start 3 bash sessions, write their PIDs into:
>     1/cgroup.procs
>     1/2/cgroup.procs
>     1/2/3/cgroup.procs
> 4) prepare 3 10MB files on sdb(ext4 fs)
> 
> Note: in below hierarchy graph:
>     "[50k]" means configured value for read_bps_device is 50kB/s
>     "(50k)" means bandwidth reported by dd is 50kB/s
> 
> Test A: 1 process throttled by ancestor group
> =============================================
> Hierarchy set-up:
>     (echo "8:16 204800" > 1/blkio.throttle.read_bps_device)
>     1 [200k]
>     `-- 2 [-]
>         `-- 3 [-]
> 
> dd within group 3:
>     (drop cache then: dd if=10M-file-3 of=/dev/null)
> Result:
>     206kB/s (I did same test without the token-bucket patch, The
> result is 205kB/s)
> 
> dd within group 2:
>     (drop cache then: dd if=10M-file-2 of=/dev/null)
> Result:
>     205kB/s
> 
> 
> Test B: 3 processes in 3 levels of hierarchy
> =============================================
> Hierarchy set-up:
>     echo "8:16 204800" > 1/blkio.throttle.read_bps_device
>     echo "8:16 102400" > 1/2/blkio.throttle.read_bps_device
>     echo "8:16 51200"  > 1/2/3/blkio.throttle.read_bps_device
>     1 [200k]
>     `-- 2 [100k]
>         `-- 3 [50k]
> start 3 dd processes from 3 bash sessions
>     (dd if=10M-file-x of=/dev/null)
> Result:
>     1 (103k)
>     `-- 2 (51.9k)
>         `-- 3 (51.4k)

Thanks for doing this testing, and the work to move to a token based
system. It makes a lot of sense. I've been investigating a token based
scheduler for blk-mq as well, since that could feasibly be made to scale
as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14 13:36   ` Tejun Heo
  2013-10-14 13:47     ` Hong zhi guo
  2013-10-14 13:53     ` Hong zhi guo
@ 2013-10-15 13:03     ` Vivek Goyal
  2 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-10-15 13:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Hong Zhiguo, axboe, cgroups, linux-kernel, Hong Zhiguo

On Mon, Oct 14, 2013 at 09:36:20AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote:
> > From: Hong Zhiguo <zhiguohong@tencent.com>
> > 
> > Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
> > is very simple and widely used for rate limiting and shaping.
> > It's well suitable for blk-throttle. And it natually works for
> > hierarchical cgroups. So I took it to replace the original time
> > _slicing_|_trimming_|_extending_ logic.
> > 
> > The advantage is simplicity, reliability and less fluctuation.
> > 
> > About 300 lines of code for time-slicing is replaced with 60 lines of
> > code for token bucket in blk-throttle.c.
> > 
> > I've tested this patch by fio with rw=randread, rw=randrw. It
> > behaves almost the same with original time-slicing implementation,
> > and with more accuracy.
> 
> Yes, this definitely is the direction we wanna take it.  I'll wait for
> Vivek to chime in but have you also tested hierarchical setup?

Hi Tejun,

I will have a look at it either today or tomorrow.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-14  9:09 ` [PATCH v2] blk-throttle: simplify logic by token bucket algorithm Hong Zhiguo
  2013-10-14 13:36   ` Tejun Heo
@ 2013-10-15 17:32   ` Vivek Goyal
  2013-10-16  6:09     ` Hong zhi guo
  1 sibling, 1 reply; 25+ messages in thread
From: Vivek Goyal @ 2013-10-15 17:32 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: axboe, cgroups, tj, linux-kernel, Hong Zhiguo

On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote:

[..]

Hi Hong,

Thanks for the token based throttling implementation. In general it looks
good and it simplies the logic. Couple of comments/concerns below.

> @@ -133,14 +135,13 @@ struct throtl_grp {
>  	/* IOPS limits */
>  	unsigned int iops[2];
>  
> -	/* Number of bytes disptached in current slice */
> -	uint64_t bytes_disp[2];
> -	/* Number of bio's dispatched in current slice */
> -	unsigned int io_disp[2];
> +	/* Token for throttling by bps */
> +	uint64_t bytes_token[2];
> +	/* Token for throttling by iops */
> +	unsigned int io_token[2];
>  
> -	/* When did we start a new slice */
> -	unsigned long slice_start[2];
> -	unsigned long slice_end[2];
> +	/* Time check-point */
> +	unsigned long t_c[2];

Can we rename this variable to say last_dispatch[2].

[..]
>  
> @@ -852,41 +708,26 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  				 unsigned long *wait)
>  {
>  	bool rw = bio_data_dir(bio);
> -	u64 bytes_allowed, extra_bytes, tmp;
> -	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> -
> -	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> -
> -	/* Slice has just started. Consider one slice interval */
> -	if (!jiffy_elapsed)
> -		jiffy_elapsed_rnd = throtl_slice;
> -
> -	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
> +	u64 extra_bytes, token;
> +	unsigned long jiffy_wait;
>  
> -	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
> -	do_div(tmp, HZ);
> -	bytes_allowed = tmp;
> +	token = (u64)tg->bps[rw] * (jiffies - tg->t_c[rw]);

So here we seem to be calculating how many tokens a group is eligible
for. This is done based on since last time check. And last time check
was updated in last dispatch from the group.

What happens if no IO happens in a group for some time, say for 5 minutes,
and then one big IO comes in. IIUC, we will allow a big burst of IO for
the first IO (tokens worth of full 5 minutes will be given to this group).
I think we should have a mechanism where we don't allow too big a burst
for the first IO. (Possibly limit the burst to THROTL_BURST_IO and
THROTL_BURST_BYTES until and unless a bigger IO is already queued and
we are waiting for it).

In previous time slice logic, I took care of it by extend time slice
by 100ms and if no IO happens in the group for 100ms, time slice will
expire and next IO will get at max of 100ms of worth of credit (equivalent
of burst limits).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-15 17:32   ` Vivek Goyal
@ 2013-10-16  6:09     ` Hong zhi guo
  2013-10-16 14:14       ` Vivek Goyal
  0 siblings, 1 reply; 25+ messages in thread
From: Hong zhi guo @ 2013-10-16  6:09 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, cgroups, Tejun Heo, linux-kernel, Hong Zhiguo

Hi, Vivek,

Thanks for your careful review. I'll rename t_c to last_dispatch, it's
much better.

For the big burst issue, I've different opinion. Let's discuss it.

Any time a big IO means a big burst. Even if it's throttled at first
time, queued in
the service_queue, and then waited for a while, when it's delivered
out, it IS still
a big burst. So throttling the bio for another while makes no sense.

If a group has been idle for 5 minutes, then it owns the credit to
deliver a big IO
(within 300 * bps bytes). And the extra credit will be cut off after
the delivery.

Waiting for your comments.

Zhiguo


On Wed, Oct 16, 2013 at 1:32 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Oct 14, 2013 at 05:09:17PM +0800, Hong Zhiguo wrote:
>
> [..]
>
> Hi Hong,
>
> Thanks for the token based throttling implementation. In general it looks
> good and it simplies the logic. Couple of comments/concerns below.
>
>> @@ -133,14 +135,13 @@ struct throtl_grp {
>>       /* IOPS limits */
>>       unsigned int iops[2];
>>
>> -     /* Number of bytes disptached in current slice */
>> -     uint64_t bytes_disp[2];
>> -     /* Number of bio's dispatched in current slice */
>> -     unsigned int io_disp[2];
>> +     /* Token for throttling by bps */
>> +     uint64_t bytes_token[2];
>> +     /* Token for throttling by iops */
>> +     unsigned int io_token[2];
>>
>> -     /* When did we start a new slice */
>> -     unsigned long slice_start[2];
>> -     unsigned long slice_end[2];
>> +     /* Time check-point */
>> +     unsigned long t_c[2];
>
> Can we rename this variable to say last_dispatch[2].
>
> [..]
>>
>> @@ -852,41 +708,26 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>>                                unsigned long *wait)
>>  {
>>       bool rw = bio_data_dir(bio);
>> -     u64 bytes_allowed, extra_bytes, tmp;
>> -     unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
>> -
>> -     jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
>> -
>> -     /* Slice has just started. Consider one slice interval */
>> -     if (!jiffy_elapsed)
>> -             jiffy_elapsed_rnd = throtl_slice;
>> -
>> -     jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
>> +     u64 extra_bytes, token;
>> +     unsigned long jiffy_wait;
>>
>> -     tmp = tg->bps[rw] * jiffy_elapsed_rnd;
>> -     do_div(tmp, HZ);
>> -     bytes_allowed = tmp;
>> +     token = (u64)tg->bps[rw] * (jiffies - tg->t_c[rw]);
>
> So here we seem to be calculating how many tokens a group is eligible
> for. This is done based on since last time check. And last time check
> was updated in last dispatch from the group.
>
> What happens if no IO happens in a group for some time, say for 5 minutes,
> and then one big IO comes in. IIUC, we will allow a big burst of IO for
> the first IO (tokens worth of full 5 minutes will be given to this group).
> I think we should have a mechanism where we don't allow too big a burst
> for the first IO. (Possibly limit the burst to THROTL_BURST_IO and
> THROTL_BURST_BYTES until and unless a bigger IO is already queued and
> we are waiting for it).
>
> In previous time slice logic, I took care of it by extend time slice
> by 100ms and if no IO happens in the group for 100ms, time slice will
> expire and next IO will get at max of 100ms of worth of credit (equivalent
> of burst limits).
>
> Thanks
> Vivek



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-16  6:09     ` Hong zhi guo
@ 2013-10-16 14:14       ` Vivek Goyal
  2013-10-16 15:47         ` Hong zhi guo
  2013-10-16 15:53         ` Tejun Heo
  0 siblings, 2 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-10-16 14:14 UTC (permalink / raw)
  To: Hong zhi guo; +Cc: Jens Axboe, cgroups, Tejun Heo, linux-kernel, Hong Zhiguo

On Wed, Oct 16, 2013 at 02:09:40PM +0800, Hong zhi guo wrote:
> Hi, Vivek,
> 
> Thanks for your careful review. I'll rename t_c to last_dispatch, it's
> much better.
> 
> For the big burst issue, I've different opinion. Let's discuss it.
> 
> Any time a big IO means a big burst. Even if it's throttled at first
> time, queued in
> the service_queue, and then waited for a while, when it's delivered
> out, it IS still
> a big burst. So throttling the bio for another while makes no sense.

If a malicious application is creating a big BIO and sending it out, then
effective IO rate as seen by application will be much higher than
throttling limit.

So yes, a burst is anyway going to happen when big IO is dispatched
to disk, but the question is when should that burst be allowed. What's
the effective IO rate application should see.

> 
> If a group has been idle for 5 minutes, then it owns the credit to
> deliver a big IO
> (within 300 * bps bytes). And the extra credit will be cut off after
> the delivery.

I think there are couple of issues here.

- First of all, if you think that a group is entitiled for tokens even
  when it is not doing IO, then why are you truncating the tokens after
  dispatch of a BIO.

- Second in general it does not seem right that a group is entitiled to
  tokens even when no IO is happening or group is not backlogged. That
  would mean a group will not do IO for 10 hours and then be entitiled
  to those tokens suddenly after 10 hours with a huge burst.

So I think you also agree that a group should not be entitiled to
tokens when group is not backlogged and that's why you seem to be
truncating extra tokens after dispatch of a BIO. If that's the case,
then even for first BIO, ideally a group should not be given tokens
for idle time.

Just think that an application has a huge BIO, say size 2MB. And group
has limit of say 8KB per second. Now if group has been idling long enough,
this BIO will be dispatched immediately. And effective rate a group
will be is much higher than 8KB/s. Which is not right, IMO.

If you argue that token entitilement for idle groups is not right and
doing it for first BIO in a batch is exception for simplicity reasons,
that still might be fine. But right now that does not seem to be the
case.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-16 14:14       ` Vivek Goyal
@ 2013-10-16 15:47         ` Hong zhi guo
  2013-10-16 15:53         ` Tejun Heo
  1 sibling, 0 replies; 25+ messages in thread
From: Hong zhi guo @ 2013-10-16 15:47 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, cgroups, Tejun Heo, linux-kernel, Hong Zhiguo

Hi, Vivek,

Thanks for your elaboration. I got your points. I'll update the patch
to have such logic.

Do you think adding below logic in tg_with_in_bps_limit enough?
if (!sq->nr_queued[rw]) {
    trim the token to bucket depth;
}

Thanks

On Wed, Oct 16, 2013 at 10:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Oct 16, 2013 at 02:09:40PM +0800, Hong zhi guo wrote:
>> Hi, Vivek,
>>
>> Thanks for your careful review. I'll rename t_c to last_dispatch, it's
>> much better.
>>
>> For the big burst issue, I've different opinion. Let's discuss it.
>>
>> Any time a big IO means a big burst. Even if it's throttled at first
>> time, queued in
>> the service_queue, and then waited for a while, when it's delivered
>> out, it IS still
>> a big burst. So throttling the bio for another while makes no sense.
>
> If a malicious application is creating a big BIO and sending it out, then
> effective IO rate as seen by application will be much higher than
> throttling limit.
>
> So yes, a burst is anyway going to happen when big IO is dispatched
> to disk, but the question is when should that burst be allowed. What's
> the effective IO rate application should see.
>
>>
>> If a group has been idle for 5 minutes, then it owns the credit to
>> deliver a big IO
>> (within 300 * bps bytes). And the extra credit will be cut off after
>> the delivery.
>
> I think there are couple of issues here.
>
> - First of all, if you think that a group is entitiled for tokens even
>   when it is not doing IO, then why are you truncating the tokens after
>   dispatch of a BIO.
>
> - Second in general it does not seem right that a group is entitiled to
>   tokens even when no IO is happening or group is not backlogged. That
>   would mean a group will not do IO for 10 hours and then be entitiled
>   to those tokens suddenly after 10 hours with a huge burst.
>
> So I think you also agree that a group should not be entitiled to
> tokens when group is not backlogged and that's why you seem to be
> truncating extra tokens after dispatch of a BIO. If that's the case,
> then even for first BIO, ideally a group should not be given tokens
> for idle time.
>
> Just think that an application has a huge BIO, say size 2MB. And group
> has limit of say 8KB per second. Now if group has been idling long enough,
> this BIO will be dispatched immediately. And effective rate a group
> will be is much higher than 8KB/s. Which is not right, IMO.
>
> If you argue that token entitilement for idle groups is not right and
> doing it for first BIO in a batch is exception for simplicity reasons,
> that still might be fine. But right now that does not seem to be the
> case.
>
> Thanks
> Vivek



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-16 14:14       ` Vivek Goyal
  2013-10-16 15:47         ` Hong zhi guo
@ 2013-10-16 15:53         ` Tejun Heo
  2013-10-16 16:22           ` Vivek Goyal
  1 sibling, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2013-10-16 15:53 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Hong zhi guo, Jens Axboe, cgroups, linux-kernel, Hong Zhiguo

Hello,

On Wed, Oct 16, 2013 at 10:14:06AM -0400, Vivek Goyal wrote:
> - First of all, if you think that a group is entitiled for tokens even
>   when it is not doing IO, then why are you truncating the tokens after
>   dispatch of a BIO.
> 
> - Second in general it does not seem right that a group is entitiled to
>   tokens even when no IO is happening or group is not backlogged. That
>   would mean a group will not do IO for 10 hours and then be entitiled
>   to those tokens suddenly after 10 hours with a huge burst.
> 
> So I think you also agree that a group should not be entitiled to
> tokens when group is not backlogged and that's why you seem to be
> truncating extra tokens after dispatch of a BIO. If that's the case,
> then even for first BIO, ideally a group should not be given tokens
> for idle time.

Without going into details, having token reserve is an important part
of token based implementation.  The large the reserve could be
debatable but that's what provides "smoothing" of allocation.  e.g. if
you trim bucket as soon as the queue becomes empty, a queue with
sequential access pattern can easily get disadvantaged.  Another way
to look at it is to consider as though the IO has been issued some
time before than actual and waited for the token - it is the same to
external observers.

So, while how large the reserve should be is definitely debatable,
bucket scheduling *needs* idle reserve.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-16 15:53         ` Tejun Heo
@ 2013-10-16 16:22           ` Vivek Goyal
  0 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-10-16 16:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Hong zhi guo, Jens Axboe, cgroups, linux-kernel, Hong Zhiguo

On Wed, Oct 16, 2013 at 11:53:44AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 16, 2013 at 10:14:06AM -0400, Vivek Goyal wrote:
> > - First of all, if you think that a group is entitiled for tokens even
> >   when it is not doing IO, then why are you truncating the tokens after
> >   dispatch of a BIO.
> > 
> > - Second in general it does not seem right that a group is entitiled to
> >   tokens even when no IO is happening or group is not backlogged. That
> >   would mean a group will not do IO for 10 hours and then be entitiled
> >   to those tokens suddenly after 10 hours with a huge burst.
> > 
> > So I think you also agree that a group should not be entitiled to
> > tokens when group is not backlogged and that's why you seem to be
> > truncating extra tokens after dispatch of a BIO. If that's the case,
> > then even for first BIO, ideally a group should not be given tokens
> > for idle time.
> 
> Without going into details, having token reserve is an important part
> of token based implementation.  The large the reserve could be
> debatable but that's what provides "smoothing" of allocation.  e.g. if
> you trim bucket as soon as the queue becomes empty, a queue with
> sequential access pattern can easily get disadvantaged.  Another way
> to look at it is to consider as though the IO has been issued some
> time before than actual and waited for the token - it is the same to
> external observers.
> 
> So, while how large the reserve should be is definitely debatable,
> bucket scheduling *needs* idle reserve.

Hi Tejun,

Agreed. We need some kind of smoothing and allow burst up to a limit. I
am only questioning *unlimited* tokens for the first bio in a group which
has been idle for a long time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3] blk-throttle: simplify logic by token bucket algorithm
       [not found] <1381574794-7639-1-git-send-email-zhiguohong@tencent.com>
  2013-10-14  9:09 ` [PATCH v2] blk-throttle: simplify logic by token bucket algorithm Hong Zhiguo
@ 2013-10-17 12:17 ` Hong Zhiguo
  2013-10-18 15:55   ` Vivek Goyal
  1 sibling, 1 reply; 25+ messages in thread
From: Hong Zhiguo @ 2013-10-17 12:17 UTC (permalink / raw)
  To: tj; +Cc: vgoyal, cgroups, axboe, linux-kernel, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
is very simple and widely used for rate limiting and shaping.
It's well suitable for blk-throttle. And it natually works for
hierarchical cgroups. So I took it to replace the original time
_slicing_|_trimming_|_extending_ logic.

The advantage is simplicity, reliability and less fluctuation.

About 300 lines of code for time-slicing is replaced with 60 lines
of code for token bucket in blk-throttle.c.

I've tested this patch by fio with rw=randread, rw=randrw. It
behaves almost the same with original time-slicing implementation,
and with less fluctuation. It's also tested on hierarchical setup.

v3:
- rename "t_c" to "last_dispatch", suggested by Vivek
- limit the token generated during idle period, suggested by Vivek

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Tested-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 block/blk-throttle.c | 288 +++++++++------------------------------------------
 1 file changed, 48 insertions(+), 240 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8331aba..e09db55 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2,6 +2,8 @@
  * Interface for controlling IO bandwidth on a request queue
  *
  * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
+ *
+ * Simplify the logic with token bucket, Hong Zhiguo <zhiguohong@tencent.com>
  */
 
 #include <linux/module.h>
@@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
 /* Total max dispatch from all groups in one round */
 static int throtl_quantum = 32;
 
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
-
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -91,6 +90,11 @@ struct tg_stats_cpu {
 	struct blkg_rwstat		serviced;
 };
 
+/* Depth of iops token bucket */
+#define THROTL_BURST_IO 8
+/* Depth of bps token bucket */
+#define THROTL_BURST_BYTES (4096 * 8)
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -133,14 +137,13 @@ struct throtl_grp {
 	/* IOPS limits */
 	unsigned int iops[2];
 
-	/* Number of bytes disptached in current slice */
-	uint64_t bytes_disp[2];
-	/* Number of bio's dispatched in current slice */
-	unsigned int io_disp[2];
+	/* Token for throttling by bps */
+	uint64_t bytes_token[2];
+	/* Token for throttling by iops */
+	unsigned int io_token[2];
 
-	/* When did we start a new slice */
-	unsigned long slice_start[2];
-	unsigned long slice_end[2];
+	/* Time check-point */
+	unsigned long last_dispatch[2];
 
 	/* Per cpu stats pointer */
 	struct tg_stats_cpu __percpu *stats_cpu;
@@ -680,171 +683,26 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
 	return false;
 }
 
-static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
-		bool rw, unsigned long start)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-
-	/*
-	 * Previous slice has expired. We must have trimmed it after last
-	 * bio dispatch. That means since start of last slice, we never used
-	 * that bandwidth. Do try to make use of that bandwidth while giving
-	 * credit.
-	 */
-	if (time_after_eq(start, tg->slice_start[rw]))
-		tg->slice_start[rw] = start;
-
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
-					unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-}
-
-static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
-				       unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log(&tg->service_queue,
-		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-/* Determine if previously allocated or extended slice is complete or not */
-static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
-{
-	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
-		return 0;
-
-	return 1;
-}
-
-/* Trim the used slices and adjust slice start accordingly */
-static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
-{
-	unsigned long nr_slices, time_elapsed, io_trim;
-	u64 bytes_trim, tmp;
-
-	BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
-
-	/*
-	 * If bps are unlimited (-1), then time slice don't get
-	 * renewed. Don't try to trim the slice if slice is used. A new
-	 * slice will start when appropriate.
-	 */
-	if (throtl_slice_used(tg, rw))
-		return;
-
-	/*
-	 * A bio has been dispatched. Also adjust slice_end. It might happen
-	 * that initially cgroup limit was very low resulting in high
-	 * slice_end, but later limit was bumped up and bio was dispached
-	 * sooner, then we need to reduce slice_end. A high bogus slice_end
-	 * is bad because it does not allow new slice to start.
-	 */
-
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
-
-	time_elapsed = jiffies - tg->slice_start[rw];
-
-	nr_slices = time_elapsed / throtl_slice;
-
-	if (!nr_slices)
-		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
-	do_div(tmp, HZ);
-	bytes_trim = tmp;
-
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
-
-	if (!bytes_trim && !io_trim)
-		return;
-
-	if (tg->bytes_disp[rw] >= bytes_trim)
-		tg->bytes_disp[rw] -= bytes_trim;
-	else
-		tg->bytes_disp[rw] = 0;
-
-	if (tg->io_disp[rw] >= io_trim)
-		tg->io_disp[rw] -= io_trim;
-	else
-		tg->io_disp[rw] = 0;
-
-	tg->slice_start[rw] += nr_slices * throtl_slice;
-
-	throtl_log(&tg->service_queue,
-		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
-		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
-}
-
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	unsigned int io_allowed;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-	u64 tmp;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+	u64 token;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	token = (u64)tg->iops[rw] * (jiffies - tg->last_dispatch[rw]);
+	do_div(token, HZ);
+	token += tg->io_token[rw];
 
-	/*
-	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
-	 * 1 then at max jiffy elapsed should be equivalent of 1 second as we
-	 * will allow dispatch after 1 second and after that slice should
-	 * have been trimmed.
-	 */
-
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
-
-	if (tmp > UINT_MAX)
-		io_allowed = UINT_MAX;
-	else
-		io_allowed = tmp;
-
-	if (tg->io_disp[rw] + 1 <= io_allowed) {
+	if (token) {
+		tg->io_token[rw] = token;
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
-
-	if (jiffy_wait > jiffy_elapsed)
-		jiffy_wait = jiffy_wait - jiffy_elapsed;
-	else
-		jiffy_wait = 1;
-
 	if (wait)
-		*wait = jiffy_wait;
+		*wait = HZ / tg->iops[rw] ?: 1;
 	return 0;
 }
 
@@ -852,41 +710,30 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	u64 bytes_allowed, extra_bytes, tmp;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
+	u64 extra_bytes, token;
+	unsigned long jiffy_wait;
 
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
+	token = (u64)tg->bps[rw] * (jiffies - tg->last_dispatch[rw]);
+	do_div(token, HZ);
+	token += tg->bytes_token[rw];
 
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
-	bytes_allowed = tmp;
+	/* trim the token if the group is idle */
+	if (!tg->service_queue.nr_queued[rw] && token > THROTL_BURST_BYTES)
+		token = THROTL_BURST_BYTES;
 
-	if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
+	if (bio->bi_size <= token) {
+		tg->bytes_token[rw] = token;
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
-
-	if (!jiffy_wait)
-		jiffy_wait = 1;
-
-	/*
-	 * This wait time is without taking into consideration the rounding
-	 * up we did. Add that time also.
-	 */
-	jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
-	if (wait)
-		*wait = jiffy_wait;
+	if (wait) {
+		extra_bytes = bio->bi_size - token;
+		jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+		*wait = jiffy_wait ?: 1;
+	}
 	return 0;
 }
 
@@ -916,18 +763,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 		return 1;
 	}
 
-	/*
-	 * If previous slice expired, start a new one otherwise renew/extend
-	 * existing slice to make sure it is at least throtl_slice interval
-	 * long since now.
-	 */
-	if (throtl_slice_used(tg, rw))
-		throtl_start_new_slice(tg, rw);
-	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
-	}
-
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
 	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
 		if (wait)
@@ -940,9 +775,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (wait)
 		*wait = max_wait;
 
-	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(tg, rw, jiffies + max_wait);
-
 	return 0;
 }
 
@@ -976,10 +808,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
 
-	/* Charge the bio to the group */
-	tg->bytes_disp[rw] += bio->bi_size;
-	tg->io_disp[rw]++;
+	/* Charge the bio to the group and trim the bucket */
+	tg->bytes_token[rw] -= bio->bi_size;
+	if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
+		tg->bytes_token[rw] = THROTL_BURST_BYTES;
 
+	tg->io_token[rw]--;
+	if (tg->io_token[rw] > THROTL_BURST_IO)
+		tg->io_token[rw] = THROTL_BURST_IO;
+
+	tg->last_dispatch[rw] = jiffies;
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
 	 * more than once as a throttled bio will go through blk-throtl the
@@ -1055,16 +893,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
-static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
-					struct throtl_grp *parent_tg, bool rw)
-{
-	if (throtl_slice_used(parent_tg, rw)) {
-		throtl_start_new_slice_with_credit(parent_tg, rw,
-				child_tg->slice_start[rw]);
-	}
-
-}
-
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1093,7 +921,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	 */
 	if (parent_tg) {
 		throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
-		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
 		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
 				     &parent_sq->queued[rw]);
@@ -1101,8 +928,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 		tg->td->nr_queued[rw]--;
 	}
 
-	throtl_trim_slice(tg, rw);
-
 	if (tg_to_put)
 		blkg_put(tg_to_blkg(tg_to_put));
 }
@@ -1386,12 +1211,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
 	 * We're already holding queue_lock and know @tg is valid.  Let's
 	 * apply the new config directly.
 	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
+	 * Update dispatch time.
 	 */
-	throtl_start_new_slice(tg, 0);
-	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1527,19 +1348,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		throtl_charge_bio(tg, bio);
 
 		/*
-		 * We need to trim slice even when bios are not being queued
-		 * otherwise it might happen that a bio is not queued for
-		 * a long time and slice keeps on extending and trim is not
-		 * called for a long time. Now if limits are reduced suddenly
-		 * we take into account all the IO dispatched so far at new
-		 * low rate and * newly queued IO gets a really long dispatch
-		 * time.
-		 *
-		 * So keep on trimming slice even if bio is not queued.
-		 */
-		throtl_trim_slice(tg, rw);
-
-		/*
 		 * @bio passed through this layer without being throttled.
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
@@ -1552,10 +1360,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	}
 
 	/* out-of-limit, queue to @tg */
-	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+	throtl_log(sq, "[%c] bio. btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_token[rw], bio->bi_size, tg->bps[rw],
+		   tg->io_token[rw], tg->iops[rw],
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] blk-throttle: simplify logic by token bucket algorithm
  2013-10-17 12:17 ` [PATCH v3] " Hong Zhiguo
@ 2013-10-18 15:55   ` Vivek Goyal
  2013-10-20 12:08     ` Hong zhi guo
  2013-10-20 12:11     ` [PATCH v4 0/2] " Hong Zhiguo
  0 siblings, 2 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-10-18 15:55 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: tj, cgroups, axboe, linux-kernel, Hong Zhiguo

On Thu, Oct 17, 2013 at 08:17:52PM +0800, Hong Zhiguo wrote:

[..]
> @@ -852,41 +710,30 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>  				 unsigned long *wait)
>  {
>  	bool rw = bio_data_dir(bio);
> -	u64 bytes_allowed, extra_bytes, tmp;
> -	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> -
> -	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> -
> -	/* Slice has just started. Consider one slice interval */
> -	if (!jiffy_elapsed)
> -		jiffy_elapsed_rnd = throtl_slice;
> +	u64 extra_bytes, token;
> +	unsigned long jiffy_wait;
>  
> -	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
> +	token = (u64)tg->bps[rw] * (jiffies - tg->last_dispatch[rw]);
> +	do_div(token, HZ);
> +	token += tg->bytes_token[rw];
>  
> -	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
> -	do_div(tmp, HZ);
> -	bytes_allowed = tmp;
> +	/* trim the token if the group is idle */
> +	if (!tg->service_queue.nr_queued[rw] && token > THROTL_BURST_BYTES)
> +		token = THROTL_BURST_BYTES;

I think same logic need to be applied to iops too?

Also, how does it work in case of hierarchy? IIUC, when bio is being
propogated upwards, we first add it to the parent, and then later
update dispatch time which will in turn call tg_with_in_bps_limit().

So by the time you come here after bio propogation, bio is already
on service queue and it will be entitiled to *unlimited* idle tokens.

On the flip side, we can't do blind truncation of idle tokens in
parent group. Reason being that once bio got queued in child group,
effectively it should be subjected to parent's limits too right now
and not necessarily when it climbs up the tree. 

To solve this problem I had put following patch.

commit 32ee5bc4787dfbdb280b4d81a338dcdd55918c1e
Author: Vivek Goyal <vgoyal@redhat.com>
Date:   Tue May 14 13:52:38 2013 -0700

    blk-throttle: Account for child group's start time in parent while bio
climb

Above solution was not perfect but seemed like reasonable approximation.
We might have to do something similar. I think we can keep track of time
when an bio starts waiting in a group (gets to head of list) and then
when that bio is dispatched, pass that time to parent group. Now
parent can give more tokens to bio based on wait time start as passed
by children.

For example say Twaitstart is the time a bio started waiting on the head
of queue of a group. Say Tidle is time when parent became idle. Now when
child passes this bio to parent, it will also pass Twaitstart to parent.
When it comes time for token calculation, parent can do following.

If (time_after(Twaitstart, Tidle) 
	start_time = Twaitstart;
else
	start_time = Tidle;

token_eligibility = (jiffies - start_time) * rate;

This is far from perfect, but it should work reasonably.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3] blk-throttle: simplify logic by token bucket algorithm
  2013-10-18 15:55   ` Vivek Goyal
@ 2013-10-20 12:08     ` Hong zhi guo
  2013-10-20 12:11     ` [PATCH v4 0/2] " Hong Zhiguo
  1 sibling, 0 replies; 25+ messages in thread
From: Hong zhi guo @ 2013-10-20 12:08 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, cgroups, Jens Axboe, linux-kernel, Hong Zhiguo

Hi, Vivek,

Thanks for your comments. I didn't realize the ancestor over-trim issue before.

Trimming of iops token is not necessary. Since a bio always costs
_one_ iops token. Trimming it won't change the fact that current iops
token is zero or not.

For hierarchical triming, as you pointed out, there's 2 issues in my v3 patch.
1) When bio climbs up, the parent group is not trimmed.
2) When bio climbs up, we should not trim groups level by level,
   that would block the bio too much time than expected.

You proposed time keeping method helps.
But how about some bios waiting on multiple child group, and climb up
in different order than they get queued on clild group? Some token is
still missed.
How does this works for multiple levels of ancestors?

I got another idea and implementation for the issues. The basic idea is:
When the bio is queued on child group, all it's ancestor groups
becomes non-idle. They should start to reserve tokens for that bio
from this moment. And tokens they reserved before should be trimmed at
this moment.

I'll also send the implementation later.

Please help to review the idea and code. Thanks.

On Fri, Oct 18, 2013 at 11:55 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Oct 17, 2013 at 08:17:52PM +0800, Hong Zhiguo wrote:
>
> I think same logic need to be applied to iops too?
>
> Also, how does it work in case of hierarchy? IIUC, when bio is being
> propogated upwards, we first add it to the parent, and then later
> update dispatch time which will in turn call tg_with_in_bps_limit().
>
> So by the time you come here after bio propogation, bio is already
> on service queue and it will be entitiled to *unlimited* idle tokens.
>
> On the flip side, we can't do blind truncation of idle tokens in
> parent group. Reason being that once bio got queued in child group,
> effectively it should be subjected to parent's limits too right now
> and not necessarily when it climbs up the tree.
>
> To solve this problem I had put following patch.
>
> commit 32ee5bc4787dfbdb280b4d81a338dcdd55918c1e
> Author: Vivek Goyal <vgoyal@redhat.com>
> Date:   Tue May 14 13:52:38 2013 -0700
>
>     blk-throttle: Account for child group's start time in parent while bio
> climb
>
>
> Above solution was not perfect but seemed like reasonable approximation.
> We might have to do something similar. I think we can keep track of time
> when an bio starts waiting in a group (gets to head of list) and then
> when that bio is dispatched, pass that time to parent group. Now
> parent can give more tokens to bio based on wait time start as passed
> by children.
>
> For example say Twaitstart is the time a bio started waiting on the head
> of queue of a group. Say Tidle is time when parent became idle. Now when
> child passes this bio to parent, it will also pass Twaitstart to parent.
> When it comes time for token calculation, parent can do following.
>
> If (time_after(Twaitstart, Tidle)
>         start_time = Twaitstart;
> else
>         start_time = Tidle;
>
> token_eligibility = (jiffies - start_time) * rate;
>
> This is far from perfect, but it should work reasonably.
>
> Thanks
> Vivek

-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 0/2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-18 15:55   ` Vivek Goyal
  2013-10-20 12:08     ` Hong zhi guo
@ 2013-10-20 12:11     ` Hong Zhiguo
  2013-10-20 12:11       ` [PATCH v4 1/2] " Hong Zhiguo
  2013-10-20 12:11       ` [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree Hong Zhiguo
  1 sibling, 2 replies; 25+ messages in thread
From: Hong Zhiguo @ 2013-10-20 12:11 UTC (permalink / raw)
  To: tj, vgoyal; +Cc: cgroups, axboe, linux-kernel, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

Based on the discussion with Vivek, Tejun and Jens.

Patch 1/2 implements basic token bucket rate limiting for blk-throttle.
Patch 2/2 adjusts some behavior for the cases pointed out by Vivek:
	- A huge bio may be allowed immediately after a long time of idle
	- When token is trimmed for above case, should not trim ancestors
	  when bio climbs up, instead, trim ancestors when bio is queued
	  on lowest child group.

Trimming of iops token is not necessary. Since a bio always costs _one_ 
iops token. Trimming it won't change the fact that current iops token is
zero or not.

I renamed "last_dispatch" back to "t_c"(Time Checkpoint, naming from network
code) because now it's not only updated when a bio is dispatched. See patch
2/2.

Thanks Vivek for the ancestor over-trim issue and the proposed solution.
I took another method inspired by Vivek's comments but different. Please
help to review it.

Hong Zhiguo (2):
  blk-throttle: simplify logic by token bucket algorithm
  blk-throttle: trim tokens generated for an idle tree

 block/blk-throttle.c | 321 ++++++++++++++-------------------------------------
 1 file changed, 87 insertions(+), 234 deletions(-)

-- 
1.8.1.2


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 1/2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-20 12:11     ` [PATCH v4 0/2] " Hong Zhiguo
@ 2013-10-20 12:11       ` Hong Zhiguo
  2014-04-10 10:07         ` Hong zhi guo
  2013-10-20 12:11       ` [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree Hong Zhiguo
  1 sibling, 1 reply; 25+ messages in thread
From: Hong Zhiguo @ 2013-10-20 12:11 UTC (permalink / raw)
  To: tj, vgoyal; +Cc: cgroups, axboe, linux-kernel, Hong Zhiguo

Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
is very simple and widely used for rate limiting and shaping.
It's well suitable for blk-throttle. And it naturally works for
hierarchical cgroups. So I took it to replace the original time
_slicing_|_trimming_|_extending_ logic.

The advantage is simplicity, reliability and less fluctuation.

About 300 lines of code for time-slicing is replaced with 60 lines
of code for token bucket in blk-throttle.c.

I've tested this patch by fio with rw=randread, rw=randrw. It
behaves almost the same with original time-slicing implementation,
and with less fluctuation. It's also tested on hierarchical setup.

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
Tested-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 block/blk-throttle.c | 285 +++++++++------------------------------------------
 1 file changed, 50 insertions(+), 235 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8331aba..45d4d91 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2,6 +2,8 @@
  * Interface for controlling IO bandwidth on a request queue
  *
  * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
+ *
+ * Simplify the logic with token bucket, Hong Zhiguo <zhiguohong@tencent.com>
  */
 
 #include <linux/module.h>
@@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
 /* Total max dispatch from all groups in one round */
 static int throtl_quantum = 32;
 
-/* Throttling is performed over 100ms slice and after that slice is renewed */
-static unsigned long throtl_slice = HZ/10;	/* 100 ms */
-
 static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
@@ -91,6 +90,11 @@ struct tg_stats_cpu {
 	struct blkg_rwstat		serviced;
 };
 
+/* Depth of iops token bucket */
+#define THROTL_BURST_IO 8
+/* Depth of bps token bucket */
+#define THROTL_BURST_BYTES (4096 * 8)
+
 struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
@@ -133,14 +137,13 @@ struct throtl_grp {
 	/* IOPS limits */
 	unsigned int iops[2];
 
-	/* Number of bytes disptached in current slice */
-	uint64_t bytes_disp[2];
-	/* Number of bio's dispatched in current slice */
-	unsigned int io_disp[2];
+	/* Token for throttling by bps */
+	uint64_t bytes_token[2];
+	/* Token for throttling by iops */
+	unsigned int io_token[2];
 
-	/* When did we start a new slice */
-	unsigned long slice_start[2];
-	unsigned long slice_end[2];
+	/* Time check-point */
+	unsigned long t_c[2];
 
 	/* Per cpu stats pointer */
 	struct tg_stats_cpu __percpu *stats_cpu;
@@ -680,171 +683,39 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
 	return false;
 }
 
-static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
-		bool rw, unsigned long start)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-
-	/*
-	 * Previous slice has expired. We must have trimmed it after last
-	 * bio dispatch. That means since start of last slice, we never used
-	 * that bandwidth. Do try to make use of that bandwidth while giving
-	 * credit.
-	 */
-	if (time_after_eq(start, tg->slice_start[rw]))
-		tg->slice_start[rw] = start;
-
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
-{
-	tg->bytes_disp[rw] = 0;
-	tg->io_disp[rw] = 0;
-	tg->slice_start[rw] = jiffies;
-	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log(&tg->service_queue,
-		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
-					unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-}
-
-static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
-				       unsigned long jiffy_end)
-{
-	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log(&tg->service_queue,
-		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-		   tg->slice_end[rw], jiffies);
-}
-
-/* Determine if previously allocated or extended slice is complete or not */
-static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
-{
-	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
-		return 0;
-
-	return 1;
-}
-
-/* Trim the used slices and adjust slice start accordingly */
-static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
+static void tg_update_token(struct throtl_grp *tg, bool rw)
 {
-	unsigned long nr_slices, time_elapsed, io_trim;
-	u64 bytes_trim, tmp;
+	unsigned long tdiff = jiffies - tg->t_c[rw];
+	u64 token;
 
-	BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
-
-	/*
-	 * If bps are unlimited (-1), then time slice don't get
-	 * renewed. Don't try to trim the slice if slice is used. A new
-	 * slice will start when appropriate.
-	 */
-	if (throtl_slice_used(tg, rw))
+	if (!tdiff)
 		return;
 
-	/*
-	 * A bio has been dispatched. Also adjust slice_end. It might happen
-	 * that initially cgroup limit was very low resulting in high
-	 * slice_end, but later limit was bumped up and bio was dispached
-	 * sooner, then we need to reduce slice_end. A high bogus slice_end
-	 * is bad because it does not allow new slice to start.
-	 */
+	token = (u64)tg->iops[rw] * tdiff;
+	do_div(token, HZ);
+	tg->io_token[rw] += token;
 
-	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
+	token = (u64)tg->bps[rw] * tdiff;
+	do_div(token, HZ);
+	tg->bytes_token[rw] += token;
 
-	time_elapsed = jiffies - tg->slice_start[rw];
-
-	nr_slices = time_elapsed / throtl_slice;
-
-	if (!nr_slices)
-		return;
-	tmp = tg->bps[rw] * throtl_slice * nr_slices;
-	do_div(tmp, HZ);
-	bytes_trim = tmp;
-
-	io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
-
-	if (!bytes_trim && !io_trim)
-		return;
-
-	if (tg->bytes_disp[rw] >= bytes_trim)
-		tg->bytes_disp[rw] -= bytes_trim;
-	else
-		tg->bytes_disp[rw] = 0;
-
-	if (tg->io_disp[rw] >= io_trim)
-		tg->io_disp[rw] -= io_trim;
-	else
-		tg->io_disp[rw] = 0;
-
-	tg->slice_start[rw] += nr_slices * throtl_slice;
-
-	throtl_log(&tg->service_queue,
-		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
-		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
-		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
+	tg->t_c[rw] = jiffies;
 }
 
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
 				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	unsigned int io_allowed;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-	u64 tmp;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
-
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
-
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
 
-	/*
-	 * jiffy_elapsed_rnd should not be a big value as minimum iops can be
-	 * 1 then at max jiffy elapsed should be equivalent of 1 second as we
-	 * will allow dispatch after 1 second and after that slice should
-	 * have been trimmed.
-	 */
-
-	tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
-
-	if (tmp > UINT_MAX)
-		io_allowed = UINT_MAX;
-	else
-		io_allowed = tmp;
-
-	if (tg->io_disp[rw] + 1 <= io_allowed) {
+	if (tg->io_token[rw]) {
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
-
-	if (jiffy_wait > jiffy_elapsed)
-		jiffy_wait = jiffy_wait - jiffy_elapsed;
-	else
-		jiffy_wait = 1;
-
 	if (wait)
-		*wait = jiffy_wait;
+		*wait = HZ / tg->iops[rw] ?: 1;
 	return 0;
 }
 
@@ -852,41 +723,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
-	u64 bytes_allowed, extra_bytes, tmp;
-	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
-
-	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
+	u64 extra_bytes;
+	unsigned long jiffy_wait;
 
-	/* Slice has just started. Consider one slice interval */
-	if (!jiffy_elapsed)
-		jiffy_elapsed_rnd = throtl_slice;
-
-	jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
-
-	tmp = tg->bps[rw] * jiffy_elapsed_rnd;
-	do_div(tmp, HZ);
-	bytes_allowed = tmp;
-
-	if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
+	if (bio->bi_size <= tg->bytes_token[rw]) {
 		if (wait)
 			*wait = 0;
 		return 1;
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
-	jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
-
-	if (!jiffy_wait)
-		jiffy_wait = 1;
-
-	/*
-	 * This wait time is without taking into consideration the rounding
-	 * up we did. Add that time also.
-	 */
-	jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
-	if (wait)
-		*wait = jiffy_wait;
+	if (wait) {
+		extra_bytes = bio->bi_size - tg->bytes_token[rw];
+		jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
+		*wait = jiffy_wait ?: 1;
+	}
 	return 0;
 }
 
@@ -916,18 +767,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 		return 1;
 	}
 
-	/*
-	 * If previous slice expired, start a new one otherwise renew/extend
-	 * existing slice to make sure it is at least throtl_slice interval
-	 * long since now.
-	 */
-	if (throtl_slice_used(tg, rw))
-		throtl_start_new_slice(tg, rw);
-	else {
-		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
-	}
-
+	tg_update_token(tg, rw);
 	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
 	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
 		if (wait)
@@ -940,9 +780,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	if (wait)
 		*wait = max_wait;
 
-	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(tg, rw, jiffies + max_wait);
-
 	return 0;
 }
 
@@ -976,9 +813,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
 
-	/* Charge the bio to the group */
-	tg->bytes_disp[rw] += bio->bi_size;
-	tg->io_disp[rw]++;
+	tg_update_token(tg, rw);
+
+	/* Charge the bio to the group and trim the bucket */
+	tg->bytes_token[rw] -= bio->bi_size;
+	if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
+		tg->bytes_token[rw] = THROTL_BURST_BYTES;
+
+	tg->io_token[rw]--;
+	if (tg->io_token[rw] > THROTL_BURST_IO)
+		tg->io_token[rw] = THROTL_BURST_IO;
 
 	/*
 	 * REQ_THROTTLED is used to prevent the same bio to be throttled
@@ -1055,16 +899,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
-static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
-					struct throtl_grp *parent_tg, bool rw)
-{
-	if (throtl_slice_used(parent_tg, rw)) {
-		throtl_start_new_slice_with_credit(parent_tg, rw,
-				child_tg->slice_start[rw]);
-	}
-
-}
-
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -1093,7 +927,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	 */
 	if (parent_tg) {
 		throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
-		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
 		throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
 				     &parent_sq->queued[rw]);
@@ -1101,8 +934,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 		tg->td->nr_queued[rw]--;
 	}
 
-	throtl_trim_slice(tg, rw);
-
 	if (tg_to_put)
 		blkg_put(tg_to_blkg(tg_to_put));
 }
@@ -1386,12 +1217,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
 	 * We're already holding queue_lock and know @tg is valid.  Let's
 	 * apply the new config directly.
 	 *
-	 * Restart the slices for both READ and WRITES. It might happen
-	 * that a group's limit are dropped suddenly and we don't want to
-	 * account recently dispatched IO with new low rate.
+	 * Update dispatch time.
 	 */
-	throtl_start_new_slice(tg, 0);
-	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
@@ -1527,19 +1354,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		throtl_charge_bio(tg, bio);
 
 		/*
-		 * We need to trim slice even when bios are not being queued
-		 * otherwise it might happen that a bio is not queued for
-		 * a long time and slice keeps on extending and trim is not
-		 * called for a long time. Now if limits are reduced suddenly
-		 * we take into account all the IO dispatched so far at new
-		 * low rate and * newly queued IO gets a really long dispatch
-		 * time.
-		 *
-		 * So keep on trimming slice even if bio is not queued.
-		 */
-		throtl_trim_slice(tg, rw);
-
-		/*
 		 * @bio passed through this layer without being throttled.
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
@@ -1552,10 +1366,11 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	}
 
 	/* out-of-limit, queue to @tg */
-	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+	throtl_log(sq,
+		   "[%c] btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
-		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-		   tg->io_disp[rw], tg->iops[rw],
+		   tg->bytes_token[rw], bio->bi_size, tg->bps[rw],
+		   tg->io_token[rw], tg->iops[rw],
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] blk-throttle: simplify logic by token bucket algorithm
  2013-10-20 12:11       ` [PATCH v4 1/2] " Hong Zhiguo
@ 2014-04-10 10:07         ` Hong zhi guo
  2014-04-10 13:32           ` Vivek Goyal
  0 siblings, 1 reply; 25+ messages in thread
From: Hong zhi guo @ 2014-04-10 10:07 UTC (permalink / raw)
  To: Tejun Heo, Vivek Goyal; +Cc: cgroups, Jens Axboe, linux-kernel, Hong Zhiguo

Hi, Tejun, Vivek and Jens,

I did tests and you affirmed the idea, and Vivek said he'll review the
last version of the patch. But it seems he left blkio area more than
half year. What next should I do to make progress ?

BRs
Zhiguo

On Sun, Oct 20, 2013 at 8:11 PM, Hong Zhiguo <honkiko@gmail.com> wrote:
> Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
> is very simple and widely used for rate limiting and shaping.
> It's well suitable for blk-throttle. And it naturally works for
> hierarchical cgroups. So I took it to replace the original time
> _slicing_|_trimming_|_extending_ logic.
>
> The advantage is simplicity, reliability and less fluctuation.
>
> About 300 lines of code for time-slicing is replaced with 60 lines
> of code for token bucket in blk-throttle.c.
>
> I've tested this patch by fio with rw=randread, rw=randrw. It
> behaves almost the same with original time-slicing implementation,
> and with less fluctuation. It's also tested on hierarchical setup.
>
> Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
> Tested-by: Hong Zhiguo <zhiguohong@tencent.com>
> ---
>  block/blk-throttle.c | 285 +++++++++------------------------------------------
>  1 file changed, 50 insertions(+), 235 deletions(-)
>
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8331aba..45d4d91 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2,6 +2,8 @@
>   * Interface for controlling IO bandwidth on a request queue
>   *
>   * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
> + *
> + * Simplify the logic with token bucket, Hong Zhiguo <zhiguohong@tencent.com>
>   */
>
>  #include <linux/module.h>
> @@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
>  /* Total max dispatch from all groups in one round */
>  static int throtl_quantum = 32;
>
> -/* Throttling is performed over 100ms slice and after that slice is renewed */
> -static unsigned long throtl_slice = HZ/10;     /* 100 ms */
> -
>  static struct blkcg_policy blkcg_policy_throtl;
>
>  /* A workqueue to queue throttle related work */
> @@ -91,6 +90,11 @@ struct tg_stats_cpu {
>         struct blkg_rwstat              serviced;
>  };
>
> +/* Depth of iops token bucket */
> +#define THROTL_BURST_IO 8
> +/* Depth of bps token bucket */
> +#define THROTL_BURST_BYTES (4096 * 8)
> +
>  struct throtl_grp {
>         /* must be the first member */
>         struct blkg_policy_data pd;
> @@ -133,14 +137,13 @@ struct throtl_grp {
>         /* IOPS limits */
>         unsigned int iops[2];
>
> -       /* Number of bytes disptached in current slice */
> -       uint64_t bytes_disp[2];
> -       /* Number of bio's dispatched in current slice */
> -       unsigned int io_disp[2];
> +       /* Token for throttling by bps */
> +       uint64_t bytes_token[2];
> +       /* Token for throttling by iops */
> +       unsigned int io_token[2];
>
> -       /* When did we start a new slice */
> -       unsigned long slice_start[2];
> -       unsigned long slice_end[2];
> +       /* Time check-point */
> +       unsigned long t_c[2];
>
>         /* Per cpu stats pointer */
>         struct tg_stats_cpu __percpu *stats_cpu;
> @@ -680,171 +683,39 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
>         return false;
>  }
>
> -static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
> -               bool rw, unsigned long start)
> -{
> -       tg->bytes_disp[rw] = 0;
> -       tg->io_disp[rw] = 0;
> -
> -       /*
> -        * Previous slice has expired. We must have trimmed it after last
> -        * bio dispatch. That means since start of last slice, we never used
> -        * that bandwidth. Do try to make use of that bandwidth while giving
> -        * credit.
> -        */
> -       if (time_after_eq(start, tg->slice_start[rw]))
> -               tg->slice_start[rw] = start;
> -
> -       tg->slice_end[rw] = jiffies + throtl_slice;
> -       throtl_log(&tg->service_queue,
> -                  "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
> -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> -                  tg->slice_end[rw], jiffies);
> -}
> -
> -static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
> -{
> -       tg->bytes_disp[rw] = 0;
> -       tg->io_disp[rw] = 0;
> -       tg->slice_start[rw] = jiffies;
> -       tg->slice_end[rw] = jiffies + throtl_slice;
> -       throtl_log(&tg->service_queue,
> -                  "[%c] new slice start=%lu end=%lu jiffies=%lu",
> -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> -                  tg->slice_end[rw], jiffies);
> -}
> -
> -static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
> -                                       unsigned long jiffy_end)
> -{
> -       tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> -}
> -
> -static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
> -                                      unsigned long jiffy_end)
> -{
> -       tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> -       throtl_log(&tg->service_queue,
> -                  "[%c] extend slice start=%lu end=%lu jiffies=%lu",
> -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> -                  tg->slice_end[rw], jiffies);
> -}
> -
> -/* Determine if previously allocated or extended slice is complete or not */
> -static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
> -{
> -       if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
> -               return 0;
> -
> -       return 1;
> -}
> -
> -/* Trim the used slices and adjust slice start accordingly */
> -static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
> +static void tg_update_token(struct throtl_grp *tg, bool rw)
>  {
> -       unsigned long nr_slices, time_elapsed, io_trim;
> -       u64 bytes_trim, tmp;
> +       unsigned long tdiff = jiffies - tg->t_c[rw];
> +       u64 token;
>
> -       BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
> -
> -       /*
> -        * If bps are unlimited (-1), then time slice don't get
> -        * renewed. Don't try to trim the slice if slice is used. A new
> -        * slice will start when appropriate.
> -        */
> -       if (throtl_slice_used(tg, rw))
> +       if (!tdiff)
>                 return;
>
> -       /*
> -        * A bio has been dispatched. Also adjust slice_end. It might happen
> -        * that initially cgroup limit was very low resulting in high
> -        * slice_end, but later limit was bumped up and bio was dispached
> -        * sooner, then we need to reduce slice_end. A high bogus slice_end
> -        * is bad because it does not allow new slice to start.
> -        */
> +       token = (u64)tg->iops[rw] * tdiff;
> +       do_div(token, HZ);
> +       tg->io_token[rw] += token;
>
> -       throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
> +       token = (u64)tg->bps[rw] * tdiff;
> +       do_div(token, HZ);
> +       tg->bytes_token[rw] += token;
>
> -       time_elapsed = jiffies - tg->slice_start[rw];
> -
> -       nr_slices = time_elapsed / throtl_slice;
> -
> -       if (!nr_slices)
> -               return;
> -       tmp = tg->bps[rw] * throtl_slice * nr_slices;
> -       do_div(tmp, HZ);
> -       bytes_trim = tmp;
> -
> -       io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
> -
> -       if (!bytes_trim && !io_trim)
> -               return;
> -
> -       if (tg->bytes_disp[rw] >= bytes_trim)
> -               tg->bytes_disp[rw] -= bytes_trim;
> -       else
> -               tg->bytes_disp[rw] = 0;
> -
> -       if (tg->io_disp[rw] >= io_trim)
> -               tg->io_disp[rw] -= io_trim;
> -       else
> -               tg->io_disp[rw] = 0;
> -
> -       tg->slice_start[rw] += nr_slices * throtl_slice;
> -
> -       throtl_log(&tg->service_queue,
> -                  "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
> -                  rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
> -                  tg->slice_start[rw], tg->slice_end[rw], jiffies);
> +       tg->t_c[rw] = jiffies;
>  }
>
>  static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
>                                   unsigned long *wait)
>  {
>         bool rw = bio_data_dir(bio);
> -       unsigned int io_allowed;
> -       unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> -       u64 tmp;
> -
> -       jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> -
> -       /* Slice has just started. Consider one slice interval */
> -       if (!jiffy_elapsed)
> -               jiffy_elapsed_rnd = throtl_slice;
> -
> -       jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
>
> -       /*
> -        * jiffy_elapsed_rnd should not be a big value as minimum iops can be
> -        * 1 then at max jiffy elapsed should be equivalent of 1 second as we
> -        * will allow dispatch after 1 second and after that slice should
> -        * have been trimmed.
> -        */
> -
> -       tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
> -       do_div(tmp, HZ);
> -
> -       if (tmp > UINT_MAX)
> -               io_allowed = UINT_MAX;
> -       else
> -               io_allowed = tmp;
> -
> -       if (tg->io_disp[rw] + 1 <= io_allowed) {
> +       if (tg->io_token[rw]) {
>                 if (wait)
>                         *wait = 0;
>                 return 1;
>         }
>
>         /* Calc approx time to dispatch */
> -       jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
> -
> -       if (jiffy_wait > jiffy_elapsed)
> -               jiffy_wait = jiffy_wait - jiffy_elapsed;
> -       else
> -               jiffy_wait = 1;
> -
>         if (wait)
> -               *wait = jiffy_wait;
> +               *wait = HZ / tg->iops[rw] ?: 1;
>         return 0;
>  }
>
> @@ -852,41 +723,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
>                                  unsigned long *wait)
>  {
>         bool rw = bio_data_dir(bio);
> -       u64 bytes_allowed, extra_bytes, tmp;
> -       unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> -
> -       jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> +       u64 extra_bytes;
> +       unsigned long jiffy_wait;
>
> -       /* Slice has just started. Consider one slice interval */
> -       if (!jiffy_elapsed)
> -               jiffy_elapsed_rnd = throtl_slice;
> -
> -       jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
> -
> -       tmp = tg->bps[rw] * jiffy_elapsed_rnd;
> -       do_div(tmp, HZ);
> -       bytes_allowed = tmp;
> -
> -       if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
> +       if (bio->bi_size <= tg->bytes_token[rw]) {
>                 if (wait)
>                         *wait = 0;
>                 return 1;
>         }
>
>         /* Calc approx time to dispatch */
> -       extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
> -       jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> -
> -       if (!jiffy_wait)
> -               jiffy_wait = 1;
> -
> -       /*
> -        * This wait time is without taking into consideration the rounding
> -        * up we did. Add that time also.
> -        */
> -       jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
> -       if (wait)
> -               *wait = jiffy_wait;
> +       if (wait) {
> +               extra_bytes = bio->bi_size - tg->bytes_token[rw];
> +               jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> +               *wait = jiffy_wait ?: 1;
> +       }
>         return 0;
>  }
>
> @@ -916,18 +767,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>                 return 1;
>         }
>
> -       /*
> -        * If previous slice expired, start a new one otherwise renew/extend
> -        * existing slice to make sure it is at least throtl_slice interval
> -        * long since now.
> -        */
> -       if (throtl_slice_used(tg, rw))
> -               throtl_start_new_slice(tg, rw);
> -       else {
> -               if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
> -                       throtl_extend_slice(tg, rw, jiffies + throtl_slice);
> -       }
> -
> +       tg_update_token(tg, rw);
>         if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
>             tg_with_in_iops_limit(tg, bio, &iops_wait)) {
>                 if (wait)
> @@ -940,9 +780,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
>         if (wait)
>                 *wait = max_wait;
>
> -       if (time_before(tg->slice_end[rw], jiffies + max_wait))
> -               throtl_extend_slice(tg, rw, jiffies + max_wait);
> -
>         return 0;
>  }
>
> @@ -976,9 +813,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
>  {
>         bool rw = bio_data_dir(bio);
>
> -       /* Charge the bio to the group */
> -       tg->bytes_disp[rw] += bio->bi_size;
> -       tg->io_disp[rw]++;
> +       tg_update_token(tg, rw);
> +
> +       /* Charge the bio to the group and trim the bucket */
> +       tg->bytes_token[rw] -= bio->bi_size;
> +       if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
> +               tg->bytes_token[rw] = THROTL_BURST_BYTES;
> +
> +       tg->io_token[rw]--;
> +       if (tg->io_token[rw] > THROTL_BURST_IO)
> +               tg->io_token[rw] = THROTL_BURST_IO;
>
>         /*
>          * REQ_THROTTLED is used to prevent the same bio to be throttled
> @@ -1055,16 +899,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
>         tg->flags &= ~THROTL_TG_WAS_EMPTY;
>  }
>
> -static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
> -                                       struct throtl_grp *parent_tg, bool rw)
> -{
> -       if (throtl_slice_used(parent_tg, rw)) {
> -               throtl_start_new_slice_with_credit(parent_tg, rw,
> -                               child_tg->slice_start[rw]);
> -       }
> -
> -}
> -
>  static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
>  {
>         struct throtl_service_queue *sq = &tg->service_queue;
> @@ -1093,7 +927,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
>          */
>         if (parent_tg) {
>                 throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
> -               start_parent_slice_with_credit(tg, parent_tg, rw);
>         } else {
>                 throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
>                                      &parent_sq->queued[rw]);
> @@ -1101,8 +934,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
>                 tg->td->nr_queued[rw]--;
>         }
>
> -       throtl_trim_slice(tg, rw);
> -
>         if (tg_to_put)
>                 blkg_put(tg_to_blkg(tg_to_put));
>  }
> @@ -1386,12 +1217,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
>          * We're already holding queue_lock and know @tg is valid.  Let's
>          * apply the new config directly.
>          *
> -        * Restart the slices for both READ and WRITES. It might happen
> -        * that a group's limit are dropped suddenly and we don't want to
> -        * account recently dispatched IO with new low rate.
> +        * Update dispatch time.
>          */
> -       throtl_start_new_slice(tg, 0);
> -       throtl_start_new_slice(tg, 1);
>
>         if (tg->flags & THROTL_TG_PENDING) {
>                 tg_update_disptime(tg);
> @@ -1527,19 +1354,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
>                 throtl_charge_bio(tg, bio);
>
>                 /*
> -                * We need to trim slice even when bios are not being queued
> -                * otherwise it might happen that a bio is not queued for
> -                * a long time and slice keeps on extending and trim is not
> -                * called for a long time. Now if limits are reduced suddenly
> -                * we take into account all the IO dispatched so far at new
> -                * low rate and * newly queued IO gets a really long dispatch
> -                * time.
> -                *
> -                * So keep on trimming slice even if bio is not queued.
> -                */
> -               throtl_trim_slice(tg, rw);
> -
> -               /*
>                  * @bio passed through this layer without being throttled.
>                  * Climb up the ladder.  If we''re already at the top, it
>                  * can be executed directly.
> @@ -1552,10 +1366,11 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
>         }
>
>         /* out-of-limit, queue to @tg */
> -       throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
> +       throtl_log(sq,
> +                  "[%c] btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
>                    rw == READ ? 'R' : 'W',
> -                  tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
> -                  tg->io_disp[rw], tg->iops[rw],
> +                  tg->bytes_token[rw], bio->bi_size, tg->bps[rw],
> +                  tg->io_token[rw], tg->iops[rw],
>                    sq->nr_queued[READ], sq->nr_queued[WRITE]);
>
>         bio_associate_current(bio);
> --
> 1.8.1.2
>



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 1/2] blk-throttle: simplify logic by token bucket algorithm
  2014-04-10 10:07         ` Hong zhi guo
@ 2014-04-10 13:32           ` Vivek Goyal
  0 siblings, 0 replies; 25+ messages in thread
From: Vivek Goyal @ 2014-04-10 13:32 UTC (permalink / raw)
  To: Hong zhi guo; +Cc: Tejun Heo, cgroups, Jens Axboe, linux-kernel, Hong Zhiguo

On Thu, Apr 10, 2014 at 06:07:05PM +0800, Hong zhi guo wrote:
> Hi, Tejun, Vivek and Jens,
> 
> I did tests and you affirmed the idea, and Vivek said he'll review the
> last version of the patch. But it seems he left blkio area more than
> half year. What next should I do to make progress ?

Hong,

I would say post the patches again with your testing information. I will
review these.

Thanks
Vivek

> 
> BRs
> Zhiguo
> 
> On Sun, Oct 20, 2013 at 8:11 PM, Hong Zhiguo <honkiko@gmail.com> wrote:
> > Token bucket algorithm(http://en.wikipedia.org/wiki/Token_bucket)
> > is very simple and widely used for rate limiting and shaping.
> > It's well suitable for blk-throttle. And it naturally works for
> > hierarchical cgroups. So I took it to replace the original time
> > _slicing_|_trimming_|_extending_ logic.
> >
> > The advantage is simplicity, reliability and less fluctuation.
> >
> > About 300 lines of code for time-slicing is replaced with 60 lines
> > of code for token bucket in blk-throttle.c.
> >
> > I've tested this patch by fio with rw=randread, rw=randrw. It
> > behaves almost the same with original time-slicing implementation,
> > and with less fluctuation. It's also tested on hierarchical setup.
> >
> > Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
> > Tested-by: Hong Zhiguo <zhiguohong@tencent.com>
> > ---
> >  block/blk-throttle.c | 285 +++++++++------------------------------------------
> >  1 file changed, 50 insertions(+), 235 deletions(-)
> >
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 8331aba..45d4d91 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -2,6 +2,8 @@
> >   * Interface for controlling IO bandwidth on a request queue
> >   *
> >   * Copyright (C) 2010 Vivek Goyal <vgoyal@redhat.com>
> > + *
> > + * Simplify the logic with token bucket, Hong Zhiguo <zhiguohong@tencent.com>
> >   */
> >
> >  #include <linux/module.h>
> > @@ -18,9 +20,6 @@ static int throtl_grp_quantum = 8;
> >  /* Total max dispatch from all groups in one round */
> >  static int throtl_quantum = 32;
> >
> > -/* Throttling is performed over 100ms slice and after that slice is renewed */
> > -static unsigned long throtl_slice = HZ/10;     /* 100 ms */
> > -
> >  static struct blkcg_policy blkcg_policy_throtl;
> >
> >  /* A workqueue to queue throttle related work */
> > @@ -91,6 +90,11 @@ struct tg_stats_cpu {
> >         struct blkg_rwstat              serviced;
> >  };
> >
> > +/* Depth of iops token bucket */
> > +#define THROTL_BURST_IO 8
> > +/* Depth of bps token bucket */
> > +#define THROTL_BURST_BYTES (4096 * 8)
> > +
> >  struct throtl_grp {
> >         /* must be the first member */
> >         struct blkg_policy_data pd;
> > @@ -133,14 +137,13 @@ struct throtl_grp {
> >         /* IOPS limits */
> >         unsigned int iops[2];
> >
> > -       /* Number of bytes disptached in current slice */
> > -       uint64_t bytes_disp[2];
> > -       /* Number of bio's dispatched in current slice */
> > -       unsigned int io_disp[2];
> > +       /* Token for throttling by bps */
> > +       uint64_t bytes_token[2];
> > +       /* Token for throttling by iops */
> > +       unsigned int io_token[2];
> >
> > -       /* When did we start a new slice */
> > -       unsigned long slice_start[2];
> > -       unsigned long slice_end[2];
> > +       /* Time check-point */
> > +       unsigned long t_c[2];
> >
> >         /* Per cpu stats pointer */
> >         struct tg_stats_cpu __percpu *stats_cpu;
> > @@ -680,171 +683,39 @@ static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
> >         return false;
> >  }
> >
> > -static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
> > -               bool rw, unsigned long start)
> > -{
> > -       tg->bytes_disp[rw] = 0;
> > -       tg->io_disp[rw] = 0;
> > -
> > -       /*
> > -        * Previous slice has expired. We must have trimmed it after last
> > -        * bio dispatch. That means since start of last slice, we never used
> > -        * that bandwidth. Do try to make use of that bandwidth while giving
> > -        * credit.
> > -        */
> > -       if (time_after_eq(start, tg->slice_start[rw]))
> > -               tg->slice_start[rw] = start;
> > -
> > -       tg->slice_end[rw] = jiffies + throtl_slice;
> > -       throtl_log(&tg->service_queue,
> > -                  "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
> > -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> > -                  tg->slice_end[rw], jiffies);
> > -}
> > -
> > -static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
> > -{
> > -       tg->bytes_disp[rw] = 0;
> > -       tg->io_disp[rw] = 0;
> > -       tg->slice_start[rw] = jiffies;
> > -       tg->slice_end[rw] = jiffies + throtl_slice;
> > -       throtl_log(&tg->service_queue,
> > -                  "[%c] new slice start=%lu end=%lu jiffies=%lu",
> > -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> > -                  tg->slice_end[rw], jiffies);
> > -}
> > -
> > -static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
> > -                                       unsigned long jiffy_end)
> > -{
> > -       tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> > -}
> > -
> > -static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
> > -                                      unsigned long jiffy_end)
> > -{
> > -       tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
> > -       throtl_log(&tg->service_queue,
> > -                  "[%c] extend slice start=%lu end=%lu jiffies=%lu",
> > -                  rw == READ ? 'R' : 'W', tg->slice_start[rw],
> > -                  tg->slice_end[rw], jiffies);
> > -}
> > -
> > -/* Determine if previously allocated or extended slice is complete or not */
> > -static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
> > -{
> > -       if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
> > -               return 0;
> > -
> > -       return 1;
> > -}
> > -
> > -/* Trim the used slices and adjust slice start accordingly */
> > -static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
> > +static void tg_update_token(struct throtl_grp *tg, bool rw)
> >  {
> > -       unsigned long nr_slices, time_elapsed, io_trim;
> > -       u64 bytes_trim, tmp;
> > +       unsigned long tdiff = jiffies - tg->t_c[rw];
> > +       u64 token;
> >
> > -       BUG_ON(time_before(tg->slice_end[rw], tg->slice_start[rw]));
> > -
> > -       /*
> > -        * If bps are unlimited (-1), then time slice don't get
> > -        * renewed. Don't try to trim the slice if slice is used. A new
> > -        * slice will start when appropriate.
> > -        */
> > -       if (throtl_slice_used(tg, rw))
> > +       if (!tdiff)
> >                 return;
> >
> > -       /*
> > -        * A bio has been dispatched. Also adjust slice_end. It might happen
> > -        * that initially cgroup limit was very low resulting in high
> > -        * slice_end, but later limit was bumped up and bio was dispached
> > -        * sooner, then we need to reduce slice_end. A high bogus slice_end
> > -        * is bad because it does not allow new slice to start.
> > -        */
> > +       token = (u64)tg->iops[rw] * tdiff;
> > +       do_div(token, HZ);
> > +       tg->io_token[rw] += token;
> >
> > -       throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
> > +       token = (u64)tg->bps[rw] * tdiff;
> > +       do_div(token, HZ);
> > +       tg->bytes_token[rw] += token;
> >
> > -       time_elapsed = jiffies - tg->slice_start[rw];
> > -
> > -       nr_slices = time_elapsed / throtl_slice;
> > -
> > -       if (!nr_slices)
> > -               return;
> > -       tmp = tg->bps[rw] * throtl_slice * nr_slices;
> > -       do_div(tmp, HZ);
> > -       bytes_trim = tmp;
> > -
> > -       io_trim = (tg->iops[rw] * throtl_slice * nr_slices)/HZ;
> > -
> > -       if (!bytes_trim && !io_trim)
> > -               return;
> > -
> > -       if (tg->bytes_disp[rw] >= bytes_trim)
> > -               tg->bytes_disp[rw] -= bytes_trim;
> > -       else
> > -               tg->bytes_disp[rw] = 0;
> > -
> > -       if (tg->io_disp[rw] >= io_trim)
> > -               tg->io_disp[rw] -= io_trim;
> > -       else
> > -               tg->io_disp[rw] = 0;
> > -
> > -       tg->slice_start[rw] += nr_slices * throtl_slice;
> > -
> > -       throtl_log(&tg->service_queue,
> > -                  "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
> > -                  rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
> > -                  tg->slice_start[rw], tg->slice_end[rw], jiffies);
> > +       tg->t_c[rw] = jiffies;
> >  }
> >
> >  static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
> >                                   unsigned long *wait)
> >  {
> >         bool rw = bio_data_dir(bio);
> > -       unsigned int io_allowed;
> > -       unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> > -       u64 tmp;
> > -
> > -       jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> > -
> > -       /* Slice has just started. Consider one slice interval */
> > -       if (!jiffy_elapsed)
> > -               jiffy_elapsed_rnd = throtl_slice;
> > -
> > -       jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
> >
> > -       /*
> > -        * jiffy_elapsed_rnd should not be a big value as minimum iops can be
> > -        * 1 then at max jiffy elapsed should be equivalent of 1 second as we
> > -        * will allow dispatch after 1 second and after that slice should
> > -        * have been trimmed.
> > -        */
> > -
> > -       tmp = (u64)tg->iops[rw] * jiffy_elapsed_rnd;
> > -       do_div(tmp, HZ);
> > -
> > -       if (tmp > UINT_MAX)
> > -               io_allowed = UINT_MAX;
> > -       else
> > -               io_allowed = tmp;
> > -
> > -       if (tg->io_disp[rw] + 1 <= io_allowed) {
> > +       if (tg->io_token[rw]) {
> >                 if (wait)
> >                         *wait = 0;
> >                 return 1;
> >         }
> >
> >         /* Calc approx time to dispatch */
> > -       jiffy_wait = ((tg->io_disp[rw] + 1) * HZ)/tg->iops[rw] + 1;
> > -
> > -       if (jiffy_wait > jiffy_elapsed)
> > -               jiffy_wait = jiffy_wait - jiffy_elapsed;
> > -       else
> > -               jiffy_wait = 1;
> > -
> >         if (wait)
> > -               *wait = jiffy_wait;
> > +               *wait = HZ / tg->iops[rw] ?: 1;
> >         return 0;
> >  }
> >
> > @@ -852,41 +723,21 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
> >                                  unsigned long *wait)
> >  {
> >         bool rw = bio_data_dir(bio);
> > -       u64 bytes_allowed, extra_bytes, tmp;
> > -       unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
> > -
> > -       jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
> > +       u64 extra_bytes;
> > +       unsigned long jiffy_wait;
> >
> > -       /* Slice has just started. Consider one slice interval */
> > -       if (!jiffy_elapsed)
> > -               jiffy_elapsed_rnd = throtl_slice;
> > -
> > -       jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, throtl_slice);
> > -
> > -       tmp = tg->bps[rw] * jiffy_elapsed_rnd;
> > -       do_div(tmp, HZ);
> > -       bytes_allowed = tmp;
> > -
> > -       if (tg->bytes_disp[rw] + bio->bi_size <= bytes_allowed) {
> > +       if (bio->bi_size <= tg->bytes_token[rw]) {
> >                 if (wait)
> >                         *wait = 0;
> >                 return 1;
> >         }
> >
> >         /* Calc approx time to dispatch */
> > -       extra_bytes = tg->bytes_disp[rw] + bio->bi_size - bytes_allowed;
> > -       jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> > -
> > -       if (!jiffy_wait)
> > -               jiffy_wait = 1;
> > -
> > -       /*
> > -        * This wait time is without taking into consideration the rounding
> > -        * up we did. Add that time also.
> > -        */
> > -       jiffy_wait = jiffy_wait + (jiffy_elapsed_rnd - jiffy_elapsed);
> > -       if (wait)
> > -               *wait = jiffy_wait;
> > +       if (wait) {
> > +               extra_bytes = bio->bi_size - tg->bytes_token[rw];
> > +               jiffy_wait = div64_u64(extra_bytes * HZ, tg->bps[rw]);
> > +               *wait = jiffy_wait ?: 1;
> > +       }
> >         return 0;
> >  }
> >
> > @@ -916,18 +767,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
> >                 return 1;
> >         }
> >
> > -       /*
> > -        * If previous slice expired, start a new one otherwise renew/extend
> > -        * existing slice to make sure it is at least throtl_slice interval
> > -        * long since now.
> > -        */
> > -       if (throtl_slice_used(tg, rw))
> > -               throtl_start_new_slice(tg, rw);
> > -       else {
> > -               if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
> > -                       throtl_extend_slice(tg, rw, jiffies + throtl_slice);
> > -       }
> > -
> > +       tg_update_token(tg, rw);
> >         if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
> >             tg_with_in_iops_limit(tg, bio, &iops_wait)) {
> >                 if (wait)
> > @@ -940,9 +780,6 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
> >         if (wait)
> >                 *wait = max_wait;
> >
> > -       if (time_before(tg->slice_end[rw], jiffies + max_wait))
> > -               throtl_extend_slice(tg, rw, jiffies + max_wait);
> > -
> >         return 0;
> >  }
> >
> > @@ -976,9 +813,16 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
> >  {
> >         bool rw = bio_data_dir(bio);
> >
> > -       /* Charge the bio to the group */
> > -       tg->bytes_disp[rw] += bio->bi_size;
> > -       tg->io_disp[rw]++;
> > +       tg_update_token(tg, rw);
> > +
> > +       /* Charge the bio to the group and trim the bucket */
> > +       tg->bytes_token[rw] -= bio->bi_size;
> > +       if (tg->bytes_token[rw] > THROTL_BURST_BYTES)
> > +               tg->bytes_token[rw] = THROTL_BURST_BYTES;
> > +
> > +       tg->io_token[rw]--;
> > +       if (tg->io_token[rw] > THROTL_BURST_IO)
> > +               tg->io_token[rw] = THROTL_BURST_IO;
> >
> >         /*
> >          * REQ_THROTTLED is used to prevent the same bio to be throttled
> > @@ -1055,16 +899,6 @@ static void tg_update_disptime(struct throtl_grp *tg)
> >         tg->flags &= ~THROTL_TG_WAS_EMPTY;
> >  }
> >
> > -static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
> > -                                       struct throtl_grp *parent_tg, bool rw)
> > -{
> > -       if (throtl_slice_used(parent_tg, rw)) {
> > -               throtl_start_new_slice_with_credit(parent_tg, rw,
> > -                               child_tg->slice_start[rw]);
> > -       }
> > -
> > -}
> > -
> >  static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
> >  {
> >         struct throtl_service_queue *sq = &tg->service_queue;
> > @@ -1093,7 +927,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
> >          */
> >         if (parent_tg) {
> >                 throtl_add_bio_tg(bio, &tg->qnode_on_parent[rw], parent_tg);
> > -               start_parent_slice_with_credit(tg, parent_tg, rw);
> >         } else {
> >                 throtl_qnode_add_bio(bio, &tg->qnode_on_parent[rw],
> >                                      &parent_sq->queued[rw]);
> > @@ -1101,8 +934,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
> >                 tg->td->nr_queued[rw]--;
> >         }
> >
> > -       throtl_trim_slice(tg, rw);
> > -
> >         if (tg_to_put)
> >                 blkg_put(tg_to_blkg(tg_to_put));
> >  }
> > @@ -1386,12 +1217,8 @@ static int tg_set_conf(struct cgroup_subsys_state *css, struct cftype *cft,
> >          * We're already holding queue_lock and know @tg is valid.  Let's
> >          * apply the new config directly.
> >          *
> > -        * Restart the slices for both READ and WRITES. It might happen
> > -        * that a group's limit are dropped suddenly and we don't want to
> > -        * account recently dispatched IO with new low rate.
> > +        * Update dispatch time.
> >          */
> > -       throtl_start_new_slice(tg, 0);
> > -       throtl_start_new_slice(tg, 1);
> >
> >         if (tg->flags & THROTL_TG_PENDING) {
> >                 tg_update_disptime(tg);
> > @@ -1527,19 +1354,6 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
> >                 throtl_charge_bio(tg, bio);
> >
> >                 /*
> > -                * We need to trim slice even when bios are not being queued
> > -                * otherwise it might happen that a bio is not queued for
> > -                * a long time and slice keeps on extending and trim is not
> > -                * called for a long time. Now if limits are reduced suddenly
> > -                * we take into account all the IO dispatched so far at new
> > -                * low rate and * newly queued IO gets a really long dispatch
> > -                * time.
> > -                *
> > -                * So keep on trimming slice even if bio is not queued.
> > -                */
> > -               throtl_trim_slice(tg, rw);
> > -
> > -               /*
> >                  * @bio passed through this layer without being throttled.
> >                  * Climb up the ladder.  If we''re already at the top, it
> >                  * can be executed directly.
> > @@ -1552,10 +1366,11 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
> >         }
> >
> >         /* out-of-limit, queue to @tg */
> > -       throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
> > +       throtl_log(sq,
> > +                  "[%c] btk=%llu sz=%u bps=%llu iotk=%u iops=%u queued=%d/%d",
> >                    rw == READ ? 'R' : 'W',
> > -                  tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
> > -                  tg->io_disp[rw], tg->iops[rw],
> > +                  tg->bytes_token[rw], bio->bi_size, tg->bps[rw],
> > +                  tg->io_token[rw], tg->iops[rw],
> >                    sq->nr_queued[READ], sq->nr_queued[WRITE]);
> >
> >         bio_associate_current(bio);
> > --
> > 1.8.1.2
> >
> 
> 
> 
> -- 
> best regards
> Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
  2013-10-20 12:11     ` [PATCH v4 0/2] " Hong Zhiguo
  2013-10-20 12:11       ` [PATCH v4 1/2] " Hong Zhiguo
@ 2013-10-20 12:11       ` Hong Zhiguo
  2013-10-22 21:02         ` Vivek Goyal
  1 sibling, 1 reply; 25+ messages in thread
From: Hong Zhiguo @ 2013-10-20 12:11 UTC (permalink / raw)
  To: tj, vgoyal; +Cc: cgroups, axboe, linux-kernel, Hong Zhiguo

From: Hong Zhiguo <zhiguohong@tencent.com>

Why
====
Pointed out by Vivek: Tokens generated during idle period should
be trimmed. Otherwise a huge bio may be permited immediately.
Overlimit behaviour may be observed during short I/O throughput
test.

Vivek also pointed out: We should not over-trim for hierarchical
groups. Suppose a subtree of groups are idle when a big bio comes.
The token of the child group is trimmed and not enough. So the bio is
queued on the child group. After some jiffies the child group reserved
enough tokens and the bio climbs up. If we trim the parent group at
this time again, this bio will wait too much time than expected.

Analysis
========
When the bio is queued on child group, all it's ancestor groups
becomes non-idle. They should start to reserve tokens for that
bio from this moment. And their reserved tokens before should be
trimmed at this moment.

How
====
service_queue now has a new member nr_queued_tree[2], to represent
the the number of bios waiting on the subtree rooted by this sq.

When a bio is queued on the hierarchy first time, nr_queued_tree
of all ancestors and the child group itself are increased. When a
bio climbs up, nr_queued_tree of the child group is decreased.

When nr_queued_tree turns from zero to one, the tokens reserved
before are trimmed. And after this switch, this group will never
be trimmed to reserve tokens for the bio waiting on it's descendant
group.

Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com>
---
 block/blk-throttle.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 45d4d91..d10a5e1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -63,6 +63,10 @@ struct throtl_service_queue {
 	 */
 	struct list_head	queued[2];	/* throtl_qnode [READ/WRITE] */
 	unsigned int		nr_queued[2];	/* number of queued bios */
+	/*
+	 * number of bios queued on this subtree
+	 */
+	unsigned int		nr_queued_tree[2];

 	/*
 	 * RB tree of active children throtl_grp's, which are sorted by
@@ -699,6 +703,11 @@ static void tg_update_token(struct throtl_grp *tg, bool rw)
 	do_div(token, HZ);
 	tg->bytes_token[rw] += token;

+	/* trim token if the whole subtree is idle */
+	if (!tg->service_queue.nr_queued_tree[rw] &&
+	    tg->bytes_token[rw] > THROTL_BURST_BYTES)
+		token = THROTL_BURST_BYTES;
+
 	tg->t_c[rw] = jiffies;
 }

@@ -843,6 +852,28 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 }

 /**
+ * tg_update_nr_queue_tree - update nr_queued_tree of all ancestors
+ * @sq: the target service_queue
+ *
+ * Traverse from @sq to the top throtl_grp, increase their
+ * nr_queued_tree. If one sq has zero nr_queued_tree and now turns to
+ * one, trim the tokens generated before. From now on it will never
+ * get trimmed until same thing happens next time.
+ */
+static void tg_update_nr_queue_tree(struct throtl_service_queue *sq, bool rw)
+{
+	struct throtl_grp *tg;
+
+	while (sq && (tg = sq_to_tg(sq))) {
+		if (!sq->nr_queued_tree[rw])
+			tg_update_token(tg, rw);
+
+		sq->nr_queued_tree[rw]++;
+		sq = sq->parent_sq;
+	}
+}
+
+/**
  * throtl_add_bio_tg - add a bio to the specified throtl_grp
  * @bio: bio to add
  * @qn: qnode to use
@@ -916,6 +947,9 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 	bio = throtl_pop_queued(&sq->queued[rw], &tg_to_put);
 	sq->nr_queued[rw]--;

+	/* Here's the only place nr_queued_tree get decreased */
+	sq->nr_queued_tree[rw]--;
+
 	throtl_charge_bio(tg, bio);

 	/*
@@ -1375,6 +1409,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)

 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
+
+	/* Here's the only place nr_queued_tree get increased */
+	tg_update_nr_queue_tree(sq, rw);
+
 	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;

-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
  2013-10-20 12:11       ` [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree Hong Zhiguo
@ 2013-10-22 21:02         ` Vivek Goyal
  2013-10-23  3:30           ` Hong zhi guo
  2013-10-28  5:08           ` Hong zhi guo
  0 siblings, 2 replies; 25+ messages in thread
From: Vivek Goyal @ 2013-10-22 21:02 UTC (permalink / raw)
  To: Hong Zhiguo; +Cc: tj, cgroups, axboe, linux-kernel, Hong Zhiguo

On Sun, Oct 20, 2013 at 08:11:12PM +0800, Hong Zhiguo wrote:
> From: Hong Zhiguo <zhiguohong@tencent.com>
> 
> Why
> ====
> Pointed out by Vivek: Tokens generated during idle period should
> be trimmed. Otherwise a huge bio may be permited immediately.
> Overlimit behaviour may be observed during short I/O throughput
> test.
> 
> Vivek also pointed out: We should not over-trim for hierarchical
> groups. Suppose a subtree of groups are idle when a big bio comes.
> The token of the child group is trimmed and not enough. So the bio is
> queued on the child group. After some jiffies the child group reserved
> enough tokens and the bio climbs up. If we trim the parent group at
> this time again, this bio will wait too much time than expected.
> 
> Analysis
> ========
> When the bio is queued on child group, all it's ancestor groups
> becomes non-idle. They should start to reserve tokens for that
> bio from this moment. And their reserved tokens before should be
> trimmed at this moment.
> 
> How
> ====
> service_queue now has a new member nr_queued_tree[2], to represent
> the the number of bios waiting on the subtree rooted by this sq.
> 
> When a bio is queued on the hierarchy first time, nr_queued_tree
> of all ancestors and the child group itself are increased. When a
> bio climbs up, nr_queued_tree of the child group is decreased.
> 
> When nr_queued_tree turns from zero to one, the tokens reserved
> before are trimmed. And after this switch, this group will never
> be trimmed to reserve tokens for the bio waiting on it's descendant
> group.
> 

Hi Hong,

This approach looks good in general. Only downside I can think of
updation of nr_requests throughout the hierarchy. So deeper the 
hierarchy, higher the overhead.

I am not sure if that's a concern or not. I will have a closer look
a the patches tomorrow and do some testing too.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
  2013-10-22 21:02         ` Vivek Goyal
@ 2013-10-23  3:30           ` Hong zhi guo
  2013-10-28  5:08           ` Hong zhi guo
  1 sibling, 0 replies; 25+ messages in thread
From: Hong zhi guo @ 2013-10-23  3:30 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Tejun Heo, cgroups, Jens Axboe, linux-kernel, Hong Zhiguo

Hi, Vivek,

Sorry I don't have time to test them carefully this week. I'll do it
in this weekend.

The updating of nr_queued_tree level by level only happens once for
each bio. We have already the computing(and maybe queue operation)
level by level for every bio, even it's passed through without being
queued on the tree.  And the computing(and queue operation) on each
level is much bigger than updating one counter (nr_queued_tree).

So updating of nr_queued_tree doesn't introduce notable overhead
compared to existing overhead of blk-throttle.

Zhiguo

On Wed, Oct 23, 2013 at 5:02 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sun, Oct 20, 2013 at 08:11:12PM +0800, Hong Zhiguo wrote:
>> From: Hong Zhiguo <zhiguohong@tencent.com>
>>
>> Why
>> ====
>> Pointed out by Vivek: Tokens generated during idle period should
>> be trimmed. Otherwise a huge bio may be permited immediately.
>> Overlimit behaviour may be observed during short I/O throughput
>> test.
>>
>> Vivek also pointed out: We should not over-trim for hierarchical
>> groups. Suppose a subtree of groups are idle when a big bio comes.
>> The token of the child group is trimmed and not enough. So the bio is
>> queued on the child group. After some jiffies the child group reserved
>> enough tokens and the bio climbs up. If we trim the parent group at
>> this time again, this bio will wait too much time than expected.
>>
>> Analysis
>> ========
>> When the bio is queued on child group, all it's ancestor groups
>> becomes non-idle. They should start to reserve tokens for that
>> bio from this moment. And their reserved tokens before should be
>> trimmed at this moment.
>>
>> How
>> ====
>> service_queue now has a new member nr_queued_tree[2], to represent
>> the the number of bios waiting on the subtree rooted by this sq.
>>
>> When a bio is queued on the hierarchy first time, nr_queued_tree
>> of all ancestors and the child group itself are increased. When a
>> bio climbs up, nr_queued_tree of the child group is decreased.
>>
>> When nr_queued_tree turns from zero to one, the tokens reserved
>> before are trimmed. And after this switch, this group will never
>> be trimmed to reserve tokens for the bio waiting on it's descendant
>> group.
>>
>
> Hi Hong,
>
> This approach looks good in general. Only downside I can think of
> updation of nr_requests throughout the hierarchy. So deeper the
> hierarchy, higher the overhead.
>
> I am not sure if that's a concern or not. I will have a closer look
> a the patches tomorrow and do some testing too.
>
> Thanks
> Vivek



-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree
  2013-10-22 21:02         ` Vivek Goyal
  2013-10-23  3:30           ` Hong zhi guo
@ 2013-10-28  5:08           ` Hong zhi guo
  1 sibling, 0 replies; 25+ messages in thread
From: Hong zhi guo @ 2013-10-28  5:08 UTC (permalink / raw)
  To: Vivek Goyal, Tejun Heo; +Cc: cgroups, Jens Axboe, linux-kernel, Hong Zhiguo

Hi, Vivek,

I tested the PATCH v4 for some basic hierarchical setup as I did
before. And I get the similar result.

Preparation
============
1) mount subsys blkio with "__DEVEL__sane_behavior"
2) Create 3 levels of directories under the blkio mount point:
    mkdir 1
    mkdir 1/2
    mkdir 1/2/3
    mkdir 4
3) start 4 bash sessions, write their PIDs into:
    1/cgroup.procs
    1/2/cgroup.procs
    1/2/3/cgroup.procs
    4/cgroup.procs
4) prepare 4 10MB files on sdb(ext4 fs)

Note: in below hierarchy graph:
    "[50k]" means configured value for read_bps_device is 50kB/s
    "(50k)" means bandwidth reported by dd is 50kB/s

Test A: 1 process throttled by ancestor group
=============================================
Hierarchy set-up:
    (echo "8:16 204800" > 1/blkio.throttle.read_bps_device)
    |-- 1 [200k]
    |    `-- 2 [-]
    |         `-- 3 [-]
    `-- 4 [-]

dd within group 3:
    (drop cache then: dd if=10M-file-3 of=/dev/null)
Result:
    206kB/s (I did same test without the token-bucket patch, The
result is 205kB/s)

dd within group 2:
    (drop cache then: dd if=10M-file-2 of=/dev/null)
Result:
    206kB/s


Test B: 4 processes in 3 levels of hierarchy
=============================================
Hierarchy set-up:
    echo "8:16 204800" > 1/blkio.throttle.read_bps_device
    echo "8:16 102400" > 1/2/blkio.throttle.read_bps_device
    echo "8:16 51200"  > 1/2/3/blkio.throttle.read_bps_device
    echo "8:16 51200"  > 4/blkio.throttle.read_bps_device
    |-- 1 [200k]
    |    `-- 2 [100k]
    |         `-- 3 [50k]
    `-- 4 [50k]

start 4 dd processes from 4 bash sessions
    (dd if=10M-file-x of=/dev/null)
Result:
    |-- 1 (104k)
    |    `-- 2 (52.1k)
    |         `-- 3 (51.3k)
    `-- 4 (51.4k)


On Wed, Oct 23, 2013 at 5:02 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> Hi Hong,
>
> This approach looks good in general. Only downside I can think of
> updation of nr_requests throughout the hierarchy. So deeper the
> hierarchy, higher the overhead.
>
> I am not sure if that's a concern or not. I will have a closer look
> a the patches tomorrow and do some testing too.
>
> Thanks
> Vivek


-- 
best regards
Hong Zhiguo

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-04-10 13:32 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1381574794-7639-1-git-send-email-zhiguohong@tencent.com>
2013-10-14  9:09 ` [PATCH v2] blk-throttle: simplify logic by token bucket algorithm Hong Zhiguo
2013-10-14 13:36   ` Tejun Heo
2013-10-14 13:47     ` Hong zhi guo
2013-10-14 13:53     ` Hong zhi guo
2013-10-14 13:59       ` Tejun Heo
2013-10-15 12:35         ` Hong zhi guo
2013-10-15 16:19           ` Jens Axboe
2013-10-15 13:03     ` Vivek Goyal
2013-10-15 17:32   ` Vivek Goyal
2013-10-16  6:09     ` Hong zhi guo
2013-10-16 14:14       ` Vivek Goyal
2013-10-16 15:47         ` Hong zhi guo
2013-10-16 15:53         ` Tejun Heo
2013-10-16 16:22           ` Vivek Goyal
2013-10-17 12:17 ` [PATCH v3] " Hong Zhiguo
2013-10-18 15:55   ` Vivek Goyal
2013-10-20 12:08     ` Hong zhi guo
2013-10-20 12:11     ` [PATCH v4 0/2] " Hong Zhiguo
2013-10-20 12:11       ` [PATCH v4 1/2] " Hong Zhiguo
2014-04-10 10:07         ` Hong zhi guo
2014-04-10 13:32           ` Vivek Goyal
2013-10-20 12:11       ` [PATCH v4 2/2] blk-throttle: trim tokens generated for an idle tree Hong Zhiguo
2013-10-22 21:02         ` Vivek Goyal
2013-10-23  3:30           ` Hong zhi guo
2013-10-28  5:08           ` Hong zhi guo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).