linux-kernel.vger.kernel.org archive mirror
* [PATCHSET] blk-throttle: implement proper hierarchy support
@ 2013-05-02  0:39 Tejun Heo
  2013-05-02  0:39 ` [PATCH 01/31] blkcg: fix error return path in blkg_create() Tejun Heo
                   ` (32 more replies)
  0 siblings, 33 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal

blk-throttle is the last controller with broken hierarchy support
making blkcg the last one tagged with .broken_hierarchy.  This
patchset implements hierarchy support for blk-throttle.  The semantics
are pretty simple - limits on an intermediate node apply to the whole
subtree and the statistics remain local.

As this changes the meaning of the knobs in an incompatible manner -
e.g. configuring limits on root cgroup now means setting the limit for
the whole system - the hierarchy mode is enabled by "sane_behavior"
cgroup mount flag.  If the flag is not specified, the original broken
flat hierarchy behavior is retained.

While this patchset contains many patches, the implementation is
pretty straightforward.  throtl_grps form a tree anchored at
throtl_data and bios climb the tree as they get dispatched at each
level.  The bios which reach the top of the tree - throtl_data - are
issued.  The scheduling algorithm remains unchanged at each level and
blk-throttle should behave the same for flat hierarchy after the
changes.  The same algorithm is repeated until bios clear all limits
to the top of the tree.
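
As a rough illustration only (not code from this patchset), the climbing
behavior can be modeled in userspace C.  Everything here is hypothetical:
toy_node stands in for a throtl_grp, and a simple per-level counter
replaces the real bps/iops slice accounting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy node; loosely stands in for a throtl_grp. */
struct toy_node {
	struct toy_node *parent;	/* NULL at the top of the tree */
	unsigned int budget;		/* bios this level may still pass */
};

/*
 * Let one bio climb from @start towards the top.  Returns NULL when the
 * bio cleared every level and may be issued, or the node at which it has
 * to wait because that level's budget is exhausted.
 */
static struct toy_node *toy_climb(struct toy_node *start)
{
	struct toy_node *pos;

	for (pos = start; pos; pos = pos->parent) {
		if (!pos->budget)
			return pos;	/* throttled at this level */
		pos->budget--;		/* charged here, move one level up */
	}
	return NULL;			/* cleared the top: issue the bio */
}
```

With a child budget of 2 under a root budget of 1, the first bio clears
both levels while the second passes the child and then waits at the
root, which is the "limit applies to the whole subtree" semantics in
miniature.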

This patchset contains the following 31 patches.

 0001-blkcg-fix-error-return-path-in-blkg_create.patch
 0002-blkcg-move-blkg_for_each_descendant_pre-to-block-blk.patch
 0003-blkcg-implement-blkg_for_each_descendant_post.patch
 0004-blkcg-invoke-blkcg_policy-pd_init-after-parent-is-li.patch
 0005-blkcg-move-bulk-of-blkcg_gq-release-operations-to-th.patch
 0006-blk-throttle-remove-spurious-throtl_enqueue_tg-call-.patch
 0007-blk-throttle-removed-deferred-config-application-mec.patch
 0008-blk-throttle-collapse-throtl_dispatch-into-the-work-.patch
 0009-blk-throttle-relocate-throtl_schedule_delayed_work.patch
 0010-blk-throttle-remove-pointless-throtl_nr_queued-optim.patch
 0011-blk-throttle-rename-throtl_rb_root-to-throtl_service.patch
 0012-blk-throttle-simplify-throtl_grp-flag-handling.patch
 0013-blk-throttle-add-backlink-pointer-from-throtl_grp-to.patch
 0014-blk-throttle-pass-around-throtl_service_queue-instea.patch
 0015-blk-throttle-reorganize-throtl_service_queue-passed-.patch
 0016-blk-throttle-add-throtl_grp-service_queue.patch
 0017-blk-throttle-move-bio_lists-and-friends-to-throtl_se.patch
 0018-blk-throttle-dispatch-to-throtl_data-service_queue.b.patch
 0019-blk-throttle-generalize-update_disptime-optimization.patch
 0020-blk-throttle-add-throtl_service_queue-parent_sq.patch
 0021-blk-throttle-implement-sq_to_tg-sq_to_td-and-throtl_.patch
 0022-blk-throttle-set-REQ_THROTTLED-from-throtl_charge_bi.patch
 0023-blk-throttle-separate-out-throtl_service_queue-pendi.patch
 0024-blk-throttle-implement-dispatch-looping.patch
 0025-blk-throttle-dispatch-from-throtl_pending_timer_fn.patch
 0026-blk-throttle-make-blk_throtl_drain-ready-for-hierarc.patch
 0027-blk-throttle-make-blk_throtl_bio-ready-for-hierarchy.patch
 0028-blk-throttle-make-tg_dispatch_one_bio-ready-for-hier.patch
 0029-blk-throttle-make-throtl_pending_timer_fn-ready-for-.patch
 0030-blk-throttle-implement-throtl_grp-has_rules.patch
 0031-blk-throttle-implement-proper-hierarchy-support.patch

0001-0005 prepare blkcg so that hierarchy operations are easier.

0006-0016 reorganize code piece-by-piece so that hierarchy support can
be added.  These don't change behaviors.

0017-0025 prepare for hierarchy support.  They move the fields which
are used in the hierarchy to throtl_service_queue and define the
parent-child relationship.

0026-0030 make queueing, dispatching and configuration changes
propagate through the hierarchy.

0031 implements hierarchy support.

As we're in the middle of a merge window, this patchset is currently
based on cgroup/for-3.10.  Once 3.10-rc1 drops, I'll rebase the tree
and send a pull request to Jens so that it can be routed with other
block changes.  The patches are also available on the following git
branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-throtl-hierarchy

diffstat follows.  Thanks.

 block/blk-cgroup.c     |  105 ++---
 block/blk-cgroup.h     |   38 ++
 block/blk-throttle.c   |  875 +++++++++++++++++++++++++++++++------------------
 include/linux/cgroup.h |    2 
 4 files changed, 635 insertions(+), 385 deletions(-)

--
tejun


* [PATCH 01/31] blkcg: fix error return path in blkg_create()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 02/31] blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h Tejun Heo
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

In blkg_create(), after the lookup of the parent fails, control jumps
to the error path with the error code encoded into @blkg.  The error
path doesn't use @blkg for the return value; it returns ERR_PTR(ret).
Make the lookup failure path set @ret instead of @blkg.

Note that the parent lookup is guaranteed to succeed at that point and
the condition check is purely for sanity and triggers a WARN if it
fails.  As such, I don't think it's necessary to mark it for -stable.
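
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of an error path that returns an encoded @ret.  The toy_*
helpers are stand-ins written for this example, assumed to behave like
the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(); toy_create() is a
hypothetical function, not blkg_create() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define TOY_MAX_ERRNO	4095

static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_obj {
	int val;
};

/* Hypothetical create function whose error path returns the encoded @ret. */
static struct toy_obj *toy_create(int parent_ok)
{
	struct toy_obj *obj = malloc(sizeof(*obj));
	int ret;

	if (!obj)
		return toy_err_ptr(-ENOMEM);

	if (!parent_ok) {
		/* set @ret; the error path below is what reports it */
		ret = -EINVAL;
		goto err_free;
	}

	obj->val = 1;
	return obj;

err_free:
	free(obj);
	return toy_err_ptr(ret);
}
```

Had the failure branch overwritten the object pointer instead of setting
@ret, the shared error path would have returned whatever @ret happened
to hold, which is the kind of mismatch this patch removes.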

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b2b9837..0ab211a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -238,7 +238,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
 		if (WARN_ON_ONCE(!blkg->parent)) {
-			blkg = ERR_PTR(-EINVAL);
+			ret = -EINVAL;
 			goto err_put_css;
 		}
 		blkg_get(blkg->parent);
-- 
1.8.1.4



* [PATCH 02/31] blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
  2013-05-02  0:39 ` [PATCH 01/31] blkcg: fix error return path in blkg_create() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 03/31] blkcg: implement blkg_for_each_descendant_post() Tejun Heo
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

blk-throttle hierarchy support will make use of it.  Move
blkg_for_each_descendant_pre() from block/blk-cgroup.c to
block/blk-cgroup.h.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 24 ++----------------------
 block/blk-cgroup.h | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ab211a..6b10d5c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -32,26 +32,6 @@ EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
-static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q, bool update_hint);
-
-/**
- * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
- * @d_blkg: loop cursor pointing to the current descendant
- * @pos_cgrp: used for iteration
- * @p_blkg: target blkg to walk descendants of
- *
- * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
- * read locked.  If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs.  The caller may
- * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
- * subtree.
- */
-#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
-	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
-		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
-					      (p_blkg)->q, false)))
-
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
@@ -158,8 +138,8 @@ err_free:
  * @q's bypass state.  If @update_hint is %true, the caller should be
  * holding @q->queue_lock and lookup hint is updated on success.
  */
-static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q, bool update_hint)
+struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
+			       bool update_hint)
 {
 	struct blkcg_gq *blkg;
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 4e595ee..11f5b92 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -282,6 +282,26 @@ static inline void blkg_put(struct blkcg_gq *blkg)
 		__blkg_release(blkg);
 }
 
+struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
+			       bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
+ * read locked.  If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs.  The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
+					      (p_blkg)->q, false)))
+
 /**
  * blk_get_rl - get request_list to use
  * @q: request_queue of interest
-- 
1.8.1.4



* [PATCH 03/31] blkcg: implement blkg_for_each_descendant_post()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
  2013-05-02  0:39 ` [PATCH 01/31] blkcg: fix error return path in blkg_create() Tejun Heo
  2013-05-02  0:39 ` [PATCH 02/31] blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 04/31] blkcg: invoke blkcg_policy->pd_init() after parent is linked Tejun Heo
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

This will be used by blk-throttle hierarchy support.
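
To make the difference between the two walk orders concrete, here is a
small userspace sketch.  It is an assumption-laden toy: toy_cg and the
recursive walkers are invented for illustration and do not reflect how
cgroup_for_each_descendant_pre() or cgroup_for_each_descendant_post()
are implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node using first-child / next-sibling links. */
struct toy_cg {
	int id;
	struct toy_cg *child;	/* first child */
	struct toy_cg *sibling;	/* next sibling */
};

/* Pre-order: record a node before any of its descendants. */
static size_t toy_walk_pre(const struct toy_cg *n, int *out, size_t pos)
{
	for (; n; n = n->sibling) {
		out[pos++] = n->id;
		pos = toy_walk_pre(n->child, out, pos);
	}
	return pos;
}

/* Post-order: record a node only after all of its descendants. */
static size_t toy_walk_post(const struct toy_cg *n, int *out, size_t pos)
{
	for (; n; n = n->sibling) {
		pos = toy_walk_post(n->child, out, pos);
		out[pos++] = n->id;
	}
	return pos;
}
```

For a tree where node 1 has children 2 and 3, and node 2 has child 4,
the pre-order walk records 1, 2, 4, 3 and the post-order walk records
4, 2, 3, 1.  Post-order is the natural choice when children have to be
handled before their parents.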

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 11f5b92..e15f731 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -303,6 +303,20 @@ struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
 					      (p_blkg)->q, false)))
 
 /**
+ * blkg_for_each_descendant_post - post-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Similar to blkg_for_each_descendant_pre() but performs post-order
+ * traversal instead.  Synchronization rules are the same.
+ */
+#define blkg_for_each_descendant_post(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_post((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
+					      (p_blkg)->q, false)))
+
+/**
  * blk_get_rl - get request_list to use
  * @q: request_queue of interest
  * @bio: bio which will be attached to the allocated request (may be %NULL)
-- 
1.8.1.4



* [PATCH 04/31] blkcg: invoke blkcg_policy->pd_init() after parent is linked
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (2 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 03/31] blkcg: implement blkg_for_each_descendant_post() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 05/31] blkcg: move bulk of blkcg_gq release operations to the RCU callback Tejun Heo
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
invoked in blkg_alloc() before the parent is linked.  This makes it
difficult for policies to perform initializations which are dependent
on the parent.

This patch moves pd_init_fn() invocations to blkg_create() after the
parent blkg is linked where the new blkg is fully initialized.  As
this means that blkg_free() can't assume that pd's are initialized,
pd_exit_fn() invocations are moved to __blkg_release().  This
guarantees that pd_exit_fn() is also invoked with fully initialized
blkgs with valid parent pointers.

This will help implement hierarchy support in blk-throttle.
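
A minimal sketch of why the ordering matters, using hypothetical toy
types rather than blkcg_gq: an init hook that derives state from the
parent only works if the parent pointer is already set when it runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a group with a parent link. */
struct toy_blkg {
	struct toy_blkg *parent;
	int depth;		/* derived from the parent during init */
};

/* Init hook that depends on the parent already being linked. */
static void toy_pd_init(struct toy_blkg *blkg)
{
	blkg->depth = blkg->parent ? blkg->parent->depth + 1 : 0;
}

/* Link the parent first, then run the init hook. */
static void toy_blkg_create(struct toy_blkg *blkg, struct toy_blkg *parent)
{
	blkg->parent = parent;
	toy_pd_init(blkg);
}
```

If toy_pd_init() ran before the parent assignment, every group would
see a NULL parent and compute depth 0.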

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 39 ++++++++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 17 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6b10d5c..f13cf95 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -51,18 +51,8 @@ static void blkg_free(struct blkcg_gq *blkg)
 	if (!blkg)
 		return;
 
-	for (i = 0; i < BLKCG_MAX_POLS; i++) {
-		struct blkcg_policy *pol = blkcg_policy[i];
-		struct blkg_policy_data *pd = blkg->pd[i];
-
-		if (!pd)
-			continue;
-
-		if (pol && pol->pd_exit_fn)
-			pol->pd_exit_fn(blkg);
-
-		kfree(pd);
-	}
+	for (i = 0; i < BLKCG_MAX_POLS; i++)
+		kfree(blkg->pd[i]);
 
 	blk_exit_rl(&blkg->rl);
 	kfree(blkg);
@@ -114,10 +104,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
 		pd->plid = i;
-
-		/* invoke per-policy init */
-		if (pol->pd_init_fn)
-			pol->pd_init_fn(blkg);
 	}
 
 	return blkg;
@@ -214,7 +200,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	}
 	blkg = new_blkg;
 
-	/* link parent and insert */
+	/* link parent */
 	if (blkcg_parent(blkcg)) {
 		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
 		if (WARN_ON_ONCE(!blkg->parent)) {
@@ -224,6 +210,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 		blkg_get(blkg->parent);
 	}
 
+	/* invoke per-policy init */
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_init_fn)
+			pol->pd_init_fn(blkg);
+	}
+
+	/* insert */
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
@@ -381,6 +376,16 @@ static void blkg_rcu_free(struct rcu_head *rcu_head)
 
 void __blkg_release(struct blkcg_gq *blkg)
 {
+	int i;
+
+	/* tell policies that this one is being freed */
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_exit_fn)
+			pol->pd_exit_fn(blkg);
+	}
+
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
 	if (blkg->parent)
-- 
1.8.1.4



* [PATCH 05/31] blkcg: move bulk of blkcg_gq release operations to the RCU callback
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (3 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 04/31] blkcg: invoke blkcg_policy->pd_init() after parent is linked Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 06/31] blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch() Tejun Heo
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, when the last reference of a blkcg_gq is put, all the
release operations sans the actual freeing happen directly in
blkg_put().  As blkg_put() may be called under queue_lock, all
pd_exit_fn()s may be too.  This makes it impossible for pd_exit_fn()s
to use del_timer_sync() on timers which grab the queue_lock which is
an irq-safe lock due to the deadlock possibility described in the
comment on top of del_timer_sync().

This can be easily avoided by performing the release operations in the
RCU callback instead of directly from blkg_put().  This patch moves
the blkcg_gq release operations to the RCU callback.

As this leaves __blkg_release() with only call_rcu() invocation,
blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
call_rcu() invocation is now done directly from blkg_put() instead of
going through __blkg_release() which is removed.
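
The effect of the move can be sketched in userspace as follows.  This is
a toy under stated assumptions: a plain pending list replaces call_rcu()
(so there is no grace period), toy_lock_held models queue_lock, and all
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback head, loosely modeled on struct rcu_head. */
struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

static struct toy_head *toy_pending;	/* callbacks waiting to run */
static int toy_lock_held;		/* models queue_lock being held */
static int toy_exit_under_lock;		/* set if release ran under the lock */
static int toy_exit_calls;		/* number of release invocations */

struct toy_obj {
	int refcnt;
	struct toy_head head;
};

/* Release work; in this model it must not run with the lock held. */
static void toy_release(struct toy_head *head)
{
	(void)head;
	if (toy_lock_held)
		toy_exit_under_lock = 1;
	toy_exit_calls++;
}

/* Dropping the last reference only queues the release callback. */
static void toy_put(struct toy_obj *obj)
{
	if (--obj->refcnt)
		return;
	obj->head.func = toy_release;
	obj->head.next = toy_pending;
	toy_pending = &obj->head;
}

/* Later, outside the lock, run everything that was queued. */
static void toy_run_pending(void)
{
	while (toy_pending) {
		struct toy_head *head = toy_pending;

		toy_pending = head->next;
		head->func(head);
	}
}
```

Calling toy_put() while toy_lock_held is set does not run toy_release();
the release only happens once toy_run_pending() is invoked after the
lock is dropped, which mirrors why the exit callbacks no longer run
under queue_lock.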

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 34 ++++++++++++++++------------------
 block/blk-cgroup.h |  4 ++--
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f13cf95..af2ca27 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -369,13 +369,17 @@ static void blkg_destroy_all(struct request_queue *q)
 	q->root_rl.blkg = NULL;
 }
 
-static void blkg_rcu_free(struct rcu_head *rcu_head)
-{
-	blkg_free(container_of(rcu_head, struct blkcg_gq, rcu_head));
-}
-
-void __blkg_release(struct blkcg_gq *blkg)
+/*
+ * A group is RCU protected, but having an rcu lock does not mean that one
+ * can access all the fields of blkg and assume these are valid.  For
+ * example, don't try to follow throtl_data and request queue links.
+ *
+ * Having a reference to blkg under an rcu allows accesses to only values
+ * local to groups like group stats and group rate limits.
+ */
+void __blkg_release_rcu(struct rcu_head *rcu_head)
 {
+	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
 	int i;
 
 	/* tell policies that this one is being freed */
@@ -388,21 +392,15 @@ void __blkg_release(struct blkcg_gq *blkg)
 
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
-	if (blkg->parent)
+	if (blkg->parent) {
+		spin_lock_irq(blkg->q->queue_lock);
 		blkg_put(blkg->parent);
+		spin_unlock_irq(blkg->q->queue_lock);
+	}
 
-	/*
-	 * A group is freed in rcu manner. But having an rcu lock does not
-	 * mean that one can access all the fields of blkg and assume these
-	 * are valid. For example, don't try to follow throtl_data and
-	 * request queue links.
-	 *
-	 * Having a reference to blkg under an rcu allows acess to only
-	 * values local to groups like group stats and group rate limits
-	 */
-	call_rcu(&blkg->rcu_head, blkg_rcu_free);
+	blkg_free(blkg);
 }
-EXPORT_SYMBOL_GPL(__blkg_release);
+EXPORT_SYMBOL_GPL(__blkg_release_rcu);
 
 /*
  * The next function used by blk_queue_for_each_rl().  It's a bit tricky
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index e15f731..8056c03 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -266,7 +266,7 @@ static inline void blkg_get(struct blkcg_gq *blkg)
 	blkg->refcnt++;
 }
 
-void __blkg_release(struct blkcg_gq *blkg);
+void __blkg_release_rcu(struct rcu_head *rcu);
 
 /**
  * blkg_put - put a blkg reference
@@ -279,7 +279,7 @@ static inline void blkg_put(struct blkcg_gq *blkg)
 	lockdep_assert_held(blkg->q->queue_lock);
 	WARN_ON_ONCE(blkg->refcnt <= 0);
 	if (!--blkg->refcnt)
-		__blkg_release(blkg);
+		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
 }
 
 struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
-- 
1.8.1.4



* [PATCH 06/31] blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (4 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 05/31] blkcg: move bulk of blkcg_gq release operations to the RCU callback Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 07/31] blk-throttle: removed deferred config application mechanism Tejun Heo
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_select_dispatch() calls throtl_enqueue_tg() right after
tg_update_disptime(), which always calls the function anyway.  The
call is, while harmless, unnecessary.  Remove it.

This patch doesn't introduce any behavior difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3114622..3960787 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -816,10 +816,8 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 
 		nr_disp += throtl_dispatch_tg(td, tg, bl);
 
-		if (tg->nr_queued[0] || tg->nr_queued[1]) {
+		if (tg->nr_queued[0] || tg->nr_queued[1])
 			tg_update_disptime(td, tg);
-			throtl_enqueue_tg(td, tg);
-		}
 
 		if (nr_disp >= throtl_quantum)
 			break;
-- 
1.8.1.4



* [PATCH 07/31] blk-throttle: removed deferred config application mechanism
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (5 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 06/31] blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02 14:49   ` Vivek Goyal
  2013-05-02  0:39 ` [PATCH 08/31] blk-throttle: collapse throtl_dispatch() into the work function Tejun Heo
                   ` (25 subsequent siblings)
  32 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

When bps or iops configuration changes, blk-throttle records the new
configuration and sets a flag indicating that the config has changed.
The flag is checked in the bio dispatch path and applied.  This
deferred config application was necessary due to limitations in the
blkcg framework, which haven't existed for quite a while now.

This patch removes the deferred config application mechanism and
applies new configurations directly from tg_set_conf(), which is
simpler.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 75 +++++++++++++++-------------------------------------
 1 file changed, 22 insertions(+), 53 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3960787..7a5e08e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -85,9 +85,6 @@ struct throtl_grp {
 	unsigned long slice_start[2];
 	unsigned long slice_end[2];
 
-	/* Some throttle limits got updated for the group */
-	int limits_changed;
-
 	/* Per cpu stats pointer */
 	struct tg_stats_cpu __percpu *stats_cpu;
 
@@ -112,8 +109,6 @@ struct throtl_data
 
 	/* Work for dispatching throttled bios */
 	struct delayed_work throtl_work;
-
-	int limits_changed;
 };
 
 /* list and work item to allocate percpu group stats */
@@ -223,7 +218,6 @@ static void throtl_pd_init(struct blkcg_gq *blkg)
 	RB_CLEAR_NODE(&tg->rb_node);
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
-	tg->limits_changed = false;
 
 	tg->bps[READ] = -1;
 	tg->bps[WRITE] = -1;
@@ -826,45 +820,6 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 	return nr_disp;
 }
 
-static void throtl_process_limit_change(struct throtl_data *td)
-{
-	struct request_queue *q = td->queue;
-	struct blkcg_gq *blkg, *n;
-
-	if (!td->limits_changed)
-		return;
-
-	xchg(&td->limits_changed, false);
-
-	throtl_log(td, "limits changed");
-
-	list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
-		struct throtl_grp *tg = blkg_to_tg(blkg);
-
-		if (!tg->limits_changed)
-			continue;
-
-		if (!xchg(&tg->limits_changed, false))
-			continue;
-
-		throtl_log_tg(td, tg, "limit change rbps=%llu wbps=%llu"
-			" riops=%u wiops=%u", tg->bps[READ], tg->bps[WRITE],
-			tg->iops[READ], tg->iops[WRITE]);
-
-		/*
-		 * Restart the slices for both READ and WRITES. It
-		 * might happen that a group's limit are dropped
-		 * suddenly and we don't want to account recently
-		 * dispatched IO with new low rate
-		 */
-		throtl_start_new_slice(td, tg, 0);
-		throtl_start_new_slice(td, tg, 1);
-
-		if (throtl_tg_on_rr(tg))
-			tg_update_disptime(td, tg);
-	}
-}
-
 /* Dispatch throttled bios. Should be called without queue lock held. */
 static int throtl_dispatch(struct request_queue *q)
 {
@@ -876,8 +831,6 @@ static int throtl_dispatch(struct request_queue *q)
 
 	spin_lock_irq(q->queue_lock);
 
-	throtl_process_limit_change(td);
-
 	if (!total_nr_queued(td))
 		goto out;
 
@@ -925,8 +878,7 @@ throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay)
 
 	struct delayed_work *dwork = &td->throtl_work;
 
-	/* schedule work if limits changed even if no bio is queued */
-	if (total_nr_queued(td) || td->limits_changed) {
+	if (total_nr_queued(td)) {
 		mod_delayed_work(kthrotld_workqueue, dwork, delay);
 		throtl_log(td, "schedule work. delay=%lu jiffies=%lu",
 				delay, jiffies);
@@ -1023,9 +975,27 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	else
 		*(unsigned int *)((void *)tg + cft->private) = ctx.v;
 
-	/* XXX: we don't need the following deferred processing */
-	xchg(&tg->limits_changed, true);
-	xchg(&td->limits_changed, true);
+	throtl_log_tg(td, tg, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
+		      tg->bps[READ], tg->bps[WRITE],
+		      tg->iops[READ], tg->iops[WRITE]);
+
+	/*
+	 * We're already holding queue_lock and know @tg is valid.  Let's
+	 * apply the new config directly.
+	 *
+	 * Restart the slices for both READ and WRITES. It might happen
+	 * that a group's limit are dropped suddenly and we don't want to
+	 * account recently dispatched IO with new low rate.
+	 */
+	throtl_start_new_slice(td, tg, 0);
+	throtl_start_new_slice(td, tg, 1);
+
+	if (throtl_tg_on_rr(tg)) {
+		tg_update_disptime(td, tg);
+		throtl_schedule_next_dispatch(td);
+	}
+
+	/* kick dispatch in case disptime got shortened */
 	throtl_schedule_delayed_work(td, 0);
 
 	blkg_conf_finish(&ctx);
@@ -1239,7 +1209,6 @@ int blk_throtl_init(struct request_queue *q)
 		return -ENOMEM;
 
 	td->tg_service_tree = THROTL_RB_ROOT;
-	td->limits_changed = false;
 	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
 
 	q->td = td;
-- 
1.8.1.4



* [PATCH 08/31] blk-throttle: collapse throtl_dispatch() into the work function
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (6 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 07/31] blk-throttle: removed deferred config application mechanism Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 09/31] blk-throttle: relocate throtl_schedule_delayed_work() Tejun Heo
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

blk-throttle is about to go through major restructuring to support
hierarchy.  Do cosmetic updates in preparation.

* s/throtl_data->throtl_work/throtl_data->dispatch_work/

* s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/

* Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn()

This patch is purely cosmetic.
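
The collapsed work function recovers its throtl_data from the work
pointer it is handed.  A userspace sketch of that pointer arithmetic,
with hypothetical toy_* types and a locally defined macro assumed to
behave like the kernel's container_of():

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniatures of work_struct and delayed_work. */
struct toy_work {
	int pending;
};

struct toy_delayed_work {
	struct toy_work work;
	unsigned long delay;
};

/* Hypothetical owner structure embedding the delayed work item. */
struct toy_data {
	int id;
	struct toy_delayed_work dispatch_work;
};

/* Walk outwards: work -> delayed work -> owning structure. */
static struct toy_data *toy_work_to_data(struct toy_work *work)
{
	struct toy_delayed_work *dwork =
		toy_container_of(work, struct toy_delayed_work, work);

	return toy_container_of(dwork, struct toy_data, dispatch_work);
}
```

Because the owner can be derived from the embedded member, the work
function needs no separate argument to find its data.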

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7a5e08e..d50d8d1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -108,7 +108,7 @@ struct throtl_data
 	unsigned int nr_undestroyed_grps;
 
 	/* Work for dispatching throttled bios */
-	struct delayed_work throtl_work;
+	struct delayed_work dispatch_work;
 };
 
 /* list and work item to allocate percpu group stats */
@@ -820,10 +820,12 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 	return nr_disp;
 }
 
-/* Dispatch throttled bios. Should be called without queue lock held. */
-static int throtl_dispatch(struct request_queue *q)
+/* work function to dispatch throttled bios */
+void blk_throtl_dispatch_work_fn(struct work_struct *work)
 {
-	struct throtl_data *td = q->td;
+	struct throtl_data *td = container_of(to_delayed_work(work),
+					      struct throtl_data, dispatch_work);
+	struct request_queue *q = td->queue;
 	unsigned int nr_disp = 0;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
@@ -859,16 +861,6 @@ out:
 			generic_make_request(bio);
 		blk_finish_plug(&plug);
 	}
-	return nr_disp;
-}
-
-void blk_throtl_work(struct work_struct *work)
-{
-	struct throtl_data *td = container_of(work, struct throtl_data,
-					throtl_work.work);
-	struct request_queue *q = td->queue;
-
-	throtl_dispatch(q);
 }
 
 /* Call with queue lock held */
@@ -876,7 +868,7 @@ static void
 throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay)
 {
 
-	struct delayed_work *dwork = &td->throtl_work;
+	struct delayed_work *dwork = &td->dispatch_work;
 
 	if (total_nr_queued(td)) {
 		mod_delayed_work(kthrotld_workqueue, dwork, delay);
@@ -1060,7 +1052,7 @@ static void throtl_shutdown_wq(struct request_queue *q)
 {
 	struct throtl_data *td = q->td;
 
-	cancel_delayed_work_sync(&td->throtl_work);
+	cancel_delayed_work_sync(&td->dispatch_work);
 }
 
 static struct blkcg_policy blkcg_policy_throtl = {
@@ -1209,7 +1201,7 @@ int blk_throtl_init(struct request_queue *q)
 		return -ENOMEM;
 
 	td->tg_service_tree = THROTL_RB_ROOT;
-	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
+	INIT_DELAYED_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 
 	q->td = td;
 	td->queue = q;
-- 
1.8.1.4



* [PATCH 09/31] blk-throttle: relocate throtl_schedule_delayed_work()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (7 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 08/31] blk-throttle: collapse throtl_dispatch() into the work function Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 10/31] blk-throttle: remove pointless throtl_nr_queued() optimizations Tejun Heo
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Move throtl_schedule_delayed_work() above its first user so that the
forward declaration can be removed.

This patch is pure relocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d50d8d1..00e72f4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -25,8 +25,6 @@ static struct blkcg_policy blkcg_policy_throtl;
 
 /* A workqueue to queue throttle related work */
 static struct workqueue_struct *kthrotld_workqueue;
-static void throtl_schedule_delayed_work(struct throtl_data *td,
-				unsigned long delay);
 
 struct throtl_rb_root {
 	struct rb_root rb;
@@ -398,6 +396,19 @@ static void throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
 		__throtl_dequeue_tg(td, tg);
 }
 
+/* Call with queue lock held */
+static void throtl_schedule_delayed_work(struct throtl_data *td,
+					 unsigned long delay)
+{
+	struct delayed_work *dwork = &td->dispatch_work;
+
+	if (total_nr_queued(td)) {
+		mod_delayed_work(kthrotld_workqueue, dwork, delay);
+		throtl_log(td, "schedule work. delay=%lu jiffies=%lu",
+			   delay, jiffies);
+	}
+}
+
 static void throtl_schedule_next_dispatch(struct throtl_data *td)
 {
 	struct throtl_rb_root *st = &td->tg_service_tree;
@@ -863,20 +874,6 @@ out:
 	}
 }
 
-/* Call with queue lock held */
-static void
-throtl_schedule_delayed_work(struct throtl_data *td, unsigned long delay)
-{
-
-	struct delayed_work *dwork = &td->dispatch_work;
-
-	if (total_nr_queued(td)) {
-		mod_delayed_work(kthrotld_workqueue, dwork, delay);
-		throtl_log(td, "schedule work. delay=%lu jiffies=%lu",
-				delay, jiffies);
-	}
-}
-
 static u64 tg_prfill_cpu_rwstat(struct seq_file *sf,
 				struct blkg_policy_data *pd, int off)
 {
-- 
1.8.1.4



* [PATCH 10/31] blk-throttle: remove pointless throtl_nr_queued() optimizations
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (8 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 09/31] blk-throttle: relocate throtl_schedule_delayed_work() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 11/31] blk-throttle: rename throtl_rb_root to throtl_service_queue Tejun Heo
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_nr_queued() is used in several places to avoid performing
certain operations when the throtl_data is empty.  This is mostly
useless as those paths usually aren't traveled if there's no bio
queued.

* throtl_schedule_delayed_work() skips scheduling the dispatch work
  item if @td doesn't have any bios queued; however, the only path
  which can call it with an empty @td is tg_set_conf(), which isn't
  something we should be optimizing for.

* throtl_schedule_next_dispatch() takes a quick exit if @td is empty;
  however, right after that it triggers BUG if the service tree is
  empty.  The two conditions are equivalent and it can just test
  @st->count for the quick exit.

* blk_throtl_dispatch_work_fn() skips dispatch if @td is empty.  This
  work function isn't usually invoked when @td is empty.  The only
  possibility is from tg_set_conf() and, when that happens, the normal
  dispatching path can handle an empty @td fine.  There is no need for
  a special skip path.

This patch removes the above three unnecessary optimizations, which
leaves the throtl_log() call in blk_throtl_dispatch_work_fn() as the
only user of throtl_nr_queued().  Remove throtl_nr_queued() and open
code it in throtl_log().  I don't think we need td->nr_queued[] at
all.  Maybe we can remove it later.
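
As a rough illustration of the second bullet above, the following
userspace sketch models why the two quick-exit tests agree: a group
sits on the service tree exactly while it has bios queued.  All
model_* names are hypothetical stand-ins rather than kernel
identifiers, and the real dispatch path dequeues and re-enqueues
groups instead of flipping a flag in place.

```c
#include <stdbool.h>

/* Hypothetical stand-in for throtl_data: only the counters matter here. */
struct model_td {
	unsigned int nr_queued[2];	/* bios queued, per direction */
	unsigned int count;		/* groups on the service tree */
};

/* Hypothetical stand-in for throtl_grp. */
struct model_tg {
	unsigned int nr_queued[2];	/* bios queued in this group */
	bool pending;			/* is the group on the service tree? */
};

/* Queue one bio; the group joins the service tree on its first bio. */
static void model_add_bio(struct model_td *td, struct model_tg *tg, int rw)
{
	tg->nr_queued[rw]++;
	td->nr_queued[rw]++;
	if (!tg->pending) {
		tg->pending = true;
		td->count++;
	}
}

/* Dispatch one bio; the caller guarantees tg->nr_queued[rw] > 0. */
static void model_dispatch_bio(struct model_td *td, struct model_tg *tg, int rw)
{
	tg->nr_queued[rw]--;
	td->nr_queued[rw]--;
	if (tg->pending && !tg->nr_queued[0] && !tg->nr_queued[1]) {
		tg->pending = false;
		td->count--;
	}
}

/* Old quick-exit test: sum the queued bios over both directions. */
static bool model_td_idle_old(const struct model_td *td)
{
	return td->nr_queued[0] + td->nr_queued[1] == 0;
}

/* New quick-exit test: look only at the service tree population. */
static bool model_td_idle_new(const struct model_td *td)
{
	return td->count == 0;
}
```

Under this model both predicates return the same answer after any
sequence of add and dispatch calls, which is the equivalence the
patch relies on when it tests @st->count alone.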

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 00e72f4..6fd08a4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -166,11 +166,6 @@ THROTL_TG_FNS(on_rr);
 #define throtl_log(td, fmt, args...)	\
 	blk_add_trace_msg((td)->queue, "throtl " fmt, ##args)
 
-static inline unsigned int total_nr_queued(struct throtl_data *td)
-{
-	return td->nr_queued[0] + td->nr_queued[1];
-}
-
 /*
  * Worker for allocating per cpu stat for tgs. This is scheduled on the
  * system_wq once there are some groups on the alloc_list waiting for
@@ -402,25 +397,18 @@ static void throtl_schedule_delayed_work(struct throtl_data *td,
 {
 	struct delayed_work *dwork = &td->dispatch_work;
 
-	if (total_nr_queued(td)) {
-		mod_delayed_work(kthrotld_workqueue, dwork, delay);
-		throtl_log(td, "schedule work. delay=%lu jiffies=%lu",
-			   delay, jiffies);
-	}
+	mod_delayed_work(kthrotld_workqueue, dwork, delay);
+	throtl_log(td, "schedule work. delay=%lu jiffies=%lu", delay, jiffies);
 }
 
 static void throtl_schedule_next_dispatch(struct throtl_data *td)
 {
 	struct throtl_rb_root *st = &td->tg_service_tree;
 
-	/*
-	 * If there are more bios pending, schedule more work.
-	 */
-	if (!total_nr_queued(td))
+	/* any pending children left? */
+	if (!st->count)
 		return;
 
-	BUG_ON(!st->count);
-
 	update_min_dispatch_time(st);
 
 	if (time_before_eq(st->min_disptime, jiffies))
@@ -844,14 +832,11 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 
 	spin_lock_irq(q->queue_lock);
 
-	if (!total_nr_queued(td))
-		goto out;
-
 	bio_list_init(&bio_list_on_stack);
 
 	throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
-			total_nr_queued(td), td->nr_queued[READ],
-			td->nr_queued[WRITE]);
+		   td->nr_queued[READ] + td->nr_queued[WRITE],
+		   td->nr_queued[READ], td->nr_queued[WRITE]);
 
 	nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
 
@@ -859,7 +844,7 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		throtl_log(td, "bios disp=%u", nr_disp);
 
 	throtl_schedule_next_dispatch(td);
-out:
+
 	spin_unlock_irq(q->queue_lock);
 
 	/*
-- 
1.8.1.4



* [PATCH 11/31] blk-throttle: rename throtl_rb_root to throtl_service_queue
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (9 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 10/31] blk-throttle: remove pointless throtl_nr_queued() optimizations Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 12/31] blk-throttle: simplify throtl_grp flag handling Tejun Heo
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_rb_root will be expanded to cover more roles for hierarchy
support.  Rename it to throtl_service_queue and make its fields more
descriptive.

* rb		-> pending_tree
* left		-> first_pending
* count		-> nr_pending
* min_disptime	-> first_pending_disptime

This patch is purely cosmetic.
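
The new initializer in this patch names only pending_tree.  A minimal
userspace sketch of why that is enough: in C, members omitted from a
designated initializer are zero-initialized, so the other three
fields still start out as NULL or 0.  The model_* types and macros
below are hypothetical stand-ins for the kernel's rb_root, RB_ROOT
and throtl_service_queue.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's rbtree types. */
struct model_rb_node {
	struct model_rb_node *left, *right;
};

struct model_rb_root {
	struct model_rb_node *rb_node;
};

#define MODEL_RB_ROOT	(struct model_rb_root){ NULL }

/* Same field names as the renamed struct in this patch. */
struct model_service_queue {
	struct model_rb_root	pending_tree;	/* tree of active groups */
	struct model_rb_node	*first_pending;	/* first node in the tree */
	unsigned int		nr_pending;	/* # queued in the tree */
	unsigned long		first_pending_disptime;	/* disptime of first */
};

/* Only pending_tree is named; the remaining members become zero/NULL. */
#define MODEL_SERVICE_QUEUE_INITIALIZER				\
	(struct model_service_queue){ .pending_tree = MODEL_RB_ROOT }

/* Reset a service queue to its empty state, whatever it held before. */
static void model_service_queue_reset(struct model_service_queue *sq)
{
	*sq = MODEL_SERVICE_QUEUE_INITIALIZER;
}
```

Assigning the compound literal overwrites every member, so stale
values in first_pending, nr_pending and first_pending_disptime do not
survive a reset.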

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 84 ++++++++++++++++++++++++++--------------------------
 1 file changed, 42 insertions(+), 42 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6fd08a4..6723ca2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -26,15 +26,15 @@ static struct blkcg_policy blkcg_policy_throtl;
 /* A workqueue to queue throttle related work */
 static struct workqueue_struct *kthrotld_workqueue;
 
-struct throtl_rb_root {
-	struct rb_root rb;
-	struct rb_node *left;
-	unsigned int count;
-	unsigned long min_disptime;
+struct throtl_service_queue {
+	struct rb_root		pending_tree;	/* RB tree of active tgs */
+	struct rb_node		*first_pending;	/* first node in the tree */
+	unsigned int		nr_pending;	/* # queued in the tree */
+	unsigned long		first_pending_disptime;	/* disptime of the first tg */
 };
 
-#define THROTL_RB_ROOT	(struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \
-			.count = 0, .min_disptime = 0}
+#define THROTL_SERVICE_QUEUE_INITIALIZER				\
+	(struct throtl_service_queue){ .pending_tree = RB_ROOT }
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
@@ -50,7 +50,7 @@ struct throtl_grp {
 	/* must be the first member */
 	struct blkg_policy_data pd;
 
-	/* active throtl group service_tree member */
+	/* active throtl group service_queue member */
 	struct rb_node rb_node;
 
 	/*
@@ -93,7 +93,7 @@ struct throtl_grp {
 struct throtl_data
 {
 	/* service tree for active throtl groups */
-	struct throtl_rb_root tg_service_tree;
+	struct throtl_service_queue service_queue;
 
 	struct request_queue *queue;
 
@@ -296,17 +296,17 @@ static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td,
 	return tg;
 }
 
-static struct throtl_grp *throtl_rb_first(struct throtl_rb_root *root)
+static struct throtl_grp *throtl_rb_first(struct throtl_service_queue *sq)
 {
 	/* Service tree is empty */
-	if (!root->count)
+	if (!sq->nr_pending)
 		return NULL;
 
-	if (!root->left)
-		root->left = rb_first(&root->rb);
+	if (!sq->first_pending)
+		sq->first_pending = rb_first(&sq->pending_tree);
 
-	if (root->left)
-		return rb_entry_tg(root->left);
+	if (sq->first_pending)
+		return rb_entry_tg(sq->first_pending);
 
 	return NULL;
 }
@@ -317,29 +317,29 @@ static void rb_erase_init(struct rb_node *n, struct rb_root *root)
 	RB_CLEAR_NODE(n);
 }
 
-static void throtl_rb_erase(struct rb_node *n, struct throtl_rb_root *root)
+static void throtl_rb_erase(struct rb_node *n, struct throtl_service_queue *sq)
 {
-	if (root->left == n)
-		root->left = NULL;
-	rb_erase_init(n, &root->rb);
-	--root->count;
+	if (sq->first_pending == n)
+		sq->first_pending = NULL;
+	rb_erase_init(n, &sq->pending_tree);
+	--sq->nr_pending;
 }
 
-static void update_min_dispatch_time(struct throtl_rb_root *st)
+static void update_min_dispatch_time(struct throtl_service_queue *sq)
 {
 	struct throtl_grp *tg;
 
-	tg = throtl_rb_first(st);
+	tg = throtl_rb_first(sq);
 	if (!tg)
 		return;
 
-	st->min_disptime = tg->disptime;
+	sq->first_pending_disptime = tg->disptime;
 }
 
-static void
-tg_service_tree_add(struct throtl_rb_root *st, struct throtl_grp *tg)
+static void tg_service_queue_add(struct throtl_service_queue *sq,
+				 struct throtl_grp *tg)
 {
-	struct rb_node **node = &st->rb.rb_node;
+	struct rb_node **node = &sq->pending_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct throtl_grp *__tg;
 	unsigned long key = tg->disptime;
@@ -358,19 +358,19 @@ tg_service_tree_add(struct throtl_rb_root *st, struct throtl_grp *tg)
 	}
 
 	if (left)
-		st->left = &tg->rb_node;
+		sq->first_pending = &tg->rb_node;
 
 	rb_link_node(&tg->rb_node, parent, node);
-	rb_insert_color(&tg->rb_node, &st->rb);
+	rb_insert_color(&tg->rb_node, &sq->pending_tree);
 }
 
 static void __throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
 {
-	struct throtl_rb_root *st = &td->tg_service_tree;
+	struct throtl_service_queue *sq = &td->service_queue;
 
-	tg_service_tree_add(st, tg);
+	tg_service_queue_add(sq, tg);
 	throtl_mark_tg_on_rr(tg);
-	st->count++;
+	sq->nr_pending++;
 }
 
 static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
@@ -381,7 +381,7 @@ static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
 
 static void __throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
 {
-	throtl_rb_erase(&tg->rb_node, &td->tg_service_tree);
+	throtl_rb_erase(&tg->rb_node, &td->service_queue);
 	throtl_clear_tg_on_rr(tg);
 }
 
@@ -403,18 +403,18 @@ static void throtl_schedule_delayed_work(struct throtl_data *td,
 
 static void throtl_schedule_next_dispatch(struct throtl_data *td)
 {
-	struct throtl_rb_root *st = &td->tg_service_tree;
+	struct throtl_service_queue *sq = &td->service_queue;
 
 	/* any pending children left? */
-	if (!st->count)
+	if (!sq->nr_pending)
 		return;
 
-	update_min_dispatch_time(st);
+	update_min_dispatch_time(sq);
 
-	if (time_before_eq(st->min_disptime, jiffies))
+	if (time_before_eq(sq->first_pending_disptime, jiffies))
 		throtl_schedule_delayed_work(td, 0);
 	else
-		throtl_schedule_delayed_work(td, (st->min_disptime - jiffies));
+		throtl_schedule_delayed_work(td, sq->first_pending_disptime - jiffies);
 }
 
 static inline void
@@ -794,10 +794,10 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 {
 	unsigned int nr_disp = 0;
 	struct throtl_grp *tg;
-	struct throtl_rb_root *st = &td->tg_service_tree;
+	struct throtl_service_queue *sq = &td->service_queue;
 
 	while (1) {
-		tg = throtl_rb_first(st);
+		tg = throtl_rb_first(sq);
 
 		if (!tg)
 			break;
@@ -1148,7 +1148,7 @@ void blk_throtl_drain(struct request_queue *q)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct throtl_data *td = q->td;
-	struct throtl_rb_root *st = &td->tg_service_tree;
+	struct throtl_service_queue *sq = &td->service_queue;
 	struct throtl_grp *tg;
 	struct bio_list bl;
 	struct bio *bio;
@@ -1157,7 +1157,7 @@ void blk_throtl_drain(struct request_queue *q)
 
 	bio_list_init(&bl);
 
-	while ((tg = throtl_rb_first(st))) {
+	while ((tg = throtl_rb_first(sq))) {
 		throtl_dequeue_tg(td, tg);
 
 		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
@@ -1182,7 +1182,7 @@ int blk_throtl_init(struct request_queue *q)
 	if (!td)
 		return -ENOMEM;
 
-	td->tg_service_tree = THROTL_RB_ROOT;
+	td->service_queue = THROTL_SERVICE_QUEUE_INITIALIZER;
 	INIT_DELAYED_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 
 	q->td = td;
-- 
1.8.1.4



* [PATCH 12/31] blk-throttle: simplify throtl_grp flag handling
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (10 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 11/31] blk-throttle: rename throtl_rb_root to throtl_service_queue Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 13/31] blk-throttle: add backlink pointer from throtl_grp to throtl_data Tejun Heo
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

blk-throttle is still using function-defining macros to define flag
handling functions, which went out of style at least a decade ago.

Just define the flag as a bitmask and use direct bit operations.

This patch doesn't make any functional changes.
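
For readers unfamiliar with the idiom, here is a small userspace
sketch of the direct bit operations this patch switches to.  The
model_* names are hypothetical; only the set, clear and test
expressions mirror what the diff below does with THROTL_TG_PENDING.

```c
#include <stdbool.h>

/* The flag is defined as a mask, not as a bit index. */
enum model_tg_state_flags {
	MODEL_TG_PENDING	= 1 << 0,	/* on parent's pending tree */
};

/* Hypothetical stand-in for throtl_grp; only the flags word is modeled. */
struct model_tg {
	unsigned int flags;
};

/* Mark the group pending; returns true if the state actually changed. */
static bool model_enqueue_tg(struct model_tg *tg)
{
	if (tg->flags & MODEL_TG_PENDING)
		return false;
	tg->flags |= MODEL_TG_PENDING;
	return true;
}

/* Clear the pending mark; returns true if the state actually changed. */
static bool model_dequeue_tg(struct model_tg *tg)
{
	if (!(tg->flags & MODEL_TG_PENDING))
		return false;
	tg->flags &= ~MODEL_TG_PENDING;
	return true;
}
```

Because the enum value is already a mask, no shift is needed at the
call sites and other bits in the flags word are left untouched.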

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 34 +++++++++-------------------------
 1 file changed, 9 insertions(+), 25 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6723ca2..fc55dda 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -36,6 +36,10 @@ struct throtl_service_queue {
 #define THROTL_SERVICE_QUEUE_INITIALIZER				\
 	(struct throtl_service_queue){ .pending_tree = RB_ROOT }
 
+enum tg_state_flags {
+	THROTL_TG_PENDING	= 1 << 0,	/* on parent's pending tree */
+};
+
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
 
 /* Per-cpu group stats */
@@ -136,26 +140,6 @@ static inline struct throtl_grp *td_root_tg(struct throtl_data *td)
 	return blkg_to_tg(td->queue->root_blkg);
 }
 
-enum tg_state_flags {
-	THROTL_TG_FLAG_on_rr = 0,	/* on round-robin busy list */
-};
-
-#define THROTL_TG_FNS(name)						\
-static inline void throtl_mark_tg_##name(struct throtl_grp *tg)		\
-{									\
-	(tg)->flags |= (1 << THROTL_TG_FLAG_##name);			\
-}									\
-static inline void throtl_clear_tg_##name(struct throtl_grp *tg)	\
-{									\
-	(tg)->flags &= ~(1 << THROTL_TG_FLAG_##name);			\
-}									\
-static inline int throtl_tg_##name(const struct throtl_grp *tg)		\
-{									\
-	return ((tg)->flags & (1 << THROTL_TG_FLAG_##name)) != 0;	\
-}
-
-THROTL_TG_FNS(on_rr);
-
 #define throtl_log_tg(td, tg, fmt, args...)	do {			\
 	char __pbuf[128];						\
 									\
@@ -369,25 +353,25 @@ static void __throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
 	struct throtl_service_queue *sq = &td->service_queue;
 
 	tg_service_queue_add(sq, tg);
-	throtl_mark_tg_on_rr(tg);
+	tg->flags |= THROTL_TG_PENDING;
 	sq->nr_pending++;
 }
 
 static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
 {
-	if (!throtl_tg_on_rr(tg))
+	if (!(tg->flags & THROTL_TG_PENDING))
 		__throtl_enqueue_tg(td, tg);
 }
 
 static void __throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
 {
 	throtl_rb_erase(&tg->rb_node, &td->service_queue);
-	throtl_clear_tg_on_rr(tg);
+	tg->flags &= ~THROTL_TG_PENDING;
 }
 
 static void throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
 {
-	if (throtl_tg_on_rr(tg))
+	if (tg->flags & THROTL_TG_PENDING)
 		__throtl_dequeue_tg(td, tg);
 }
 
@@ -964,7 +948,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	throtl_start_new_slice(td, tg, 0);
 	throtl_start_new_slice(td, tg, 1);
 
-	if (throtl_tg_on_rr(tg)) {
+	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(td, tg);
 		throtl_schedule_next_dispatch(td);
 	}
-- 
1.8.1.4



* [PATCH 13/31] blk-throttle: add backlink pointer from throtl_grp to throtl_data
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (11 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 12/31] blk-throttle: simplify throtl_grp flag handling Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 14/31] blk-throttle: pass around throtl_service_queue instead of throtl_data Tejun Heo
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Add throtl_grp->td so that the td (throtl_data) a given tg
(throtl_grp) belongs to can be determined, and remove @td argument
from functions which take both @td and @tg as the former now can be
determined from the latter.

This generally simplifies the code and removes a number of cases where
@td is passed as an argument without being actually used.  This will
also help hierarchy support implementation.

While at it, in multi-line conditions, move logical operators which
begin a continuation line to the end of the preceding line.
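
The effect of the backlink can be sketched in userspace as follows.
This is a simplified model with hypothetical model_* names, not the
kernel code: once each group records the throtl_data it belongs to,
helpers can reach the shared counters through the group alone and no
longer need a separate @td argument.

```c
#include <assert.h>

/* Hypothetical stand-in for throtl_data: per-device totals. */
struct model_td {
	unsigned int nr_queued[2];
};

/* Hypothetical stand-in for throtl_grp, with the new backlink. */
struct model_tg {
	struct model_td *td;		/* throtl_data this group belongs to */
	unsigned int nr_queued[2];
};

/* Set the backlink once at init time, as throtl_pd_init() does. */
static void model_tg_init(struct model_tg *tg, struct model_td *td)
{
	tg->td = td;
	tg->nr_queued[0] = 0;
	tg->nr_queued[1] = 0;
}

/* Queue a bio: the device-wide counter is reached via tg->td. */
static void model_add_bio(struct model_tg *tg, int rw)
{
	tg->nr_queued[rw]++;
	tg->td->nr_queued[rw]++;
}

/* Dispatch a bio: no @td parameter is needed any more. */
static void model_dispatch_one_bio(struct model_tg *tg, int rw)
{
	assert(tg->td->nr_queued[rw] > 0);	/* models the BUG_ON() */
	tg->nr_queued[rw]--;
	tg->td->nr_queued[rw]--;
}
```

Two groups initialized with the same model_td update the same totals,
which is all the callers previously achieved by passing @td next to
@tg.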

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 106 +++++++++++++++++++++++++--------------------------
 1 file changed, 53 insertions(+), 53 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fc55dda..6aae239 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -57,6 +57,9 @@ struct throtl_grp {
 	/* active throtl group service_queue member */
 	struct rb_node rb_node;
 
+	/* throtl_data this group belongs to */
+	struct throtl_data *td;
+
 	/*
 	 * Dispatch time in jiffies. This is the estimated time when group
 	 * will unthrottle and is ready to dispatch more bio. It is used as
@@ -140,11 +143,11 @@ static inline struct throtl_grp *td_root_tg(struct throtl_data *td)
 	return blkg_to_tg(td->queue->root_blkg);
 }
 
-#define throtl_log_tg(td, tg, fmt, args...)	do {			\
+#define throtl_log_tg(tg, fmt, args...)	do {				\
 	char __pbuf[128];						\
 									\
 	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
-	blk_add_trace_msg((td)->queue, "throtl %s " fmt, __pbuf, ##args); \
+	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
 } while (0)
 
 #define throtl_log(td, fmt, args...)	\
@@ -193,6 +196,7 @@ static void throtl_pd_init(struct blkcg_gq *blkg)
 	unsigned long flags;
 
 	RB_CLEAR_NODE(&tg->rb_node);
+	tg->td = blkg->q->td;
 	bio_list_init(&tg->bio_lists[0]);
 	bio_list_init(&tg->bio_lists[1]);
 
@@ -401,36 +405,34 @@ static void throtl_schedule_next_dispatch(struct throtl_data *td)
 		throtl_schedule_delayed_work(td, sq->first_pending_disptime - jiffies);
 }
 
-static inline void
-throtl_start_new_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
 	tg->bytes_disp[rw] = 0;
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log_tg(td, tg, "[%c] new slice start=%lu end=%lu jiffies=%lu",
+	throtl_log_tg(tg, "[%c] new slice start=%lu end=%lu jiffies=%lu",
 			rw == READ ? 'R' : 'W', tg->slice_start[rw],
 			tg->slice_end[rw], jiffies);
 }
 
-static inline void throtl_set_slice_end(struct throtl_data *td,
-		struct throtl_grp *tg, bool rw, unsigned long jiffy_end)
+static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
+					unsigned long jiffy_end)
 {
 	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
 }
 
-static inline void throtl_extend_slice(struct throtl_data *td,
-		struct throtl_grp *tg, bool rw, unsigned long jiffy_end)
+static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
+				       unsigned long jiffy_end)
 {
 	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log_tg(td, tg, "[%c] extend slice start=%lu end=%lu jiffies=%lu",
+	throtl_log_tg(tg, "[%c] extend slice start=%lu end=%lu jiffies=%lu",
 			rw == READ ? 'R' : 'W', tg->slice_start[rw],
 			tg->slice_end[rw], jiffies);
 }
 
 /* Determine if previously allocated or extended slice is complete or not */
-static bool
-throtl_slice_used(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static bool throtl_slice_used(struct throtl_grp *tg, bool rw)
 {
 	if (time_in_range(jiffies, tg->slice_start[rw], tg->slice_end[rw]))
 		return 0;
@@ -439,8 +441,7 @@ throtl_slice_used(struct throtl_data *td, struct throtl_grp *tg, bool rw)
 }
 
 /* Trim the used slices and adjust slice start accordingly */
-static inline void
-throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
+static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 {
 	unsigned long nr_slices, time_elapsed, io_trim;
 	u64 bytes_trim, tmp;
@@ -452,7 +453,7 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
 	 * renewed. Don't try to trim the slice if slice is used. A new
 	 * slice will start when appropriate.
 	 */
-	if (throtl_slice_used(td, tg, rw))
+	if (throtl_slice_used(tg, rw))
 		return;
 
 	/*
@@ -463,7 +464,7 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
 	 * is bad because it does not allow new slice to start.
 	 */
 
-	throtl_set_slice_end(td, tg, rw, jiffies + throtl_slice);
+	throtl_set_slice_end(tg, rw, jiffies + throtl_slice);
 
 	time_elapsed = jiffies - tg->slice_start[rw];
 
@@ -492,14 +493,14 @@ throtl_trim_slice(struct throtl_data *td, struct throtl_grp *tg, bool rw)
 
 	tg->slice_start[rw] += nr_slices * throtl_slice;
 
-	throtl_log_tg(td, tg, "[%c] trim slice nr=%lu bytes=%llu io=%lu"
+	throtl_log_tg(tg, "[%c] trim slice nr=%lu bytes=%llu io=%lu"
 			" start=%lu end=%lu jiffies=%lu",
 			rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
 			tg->slice_start[rw], tg->slice_end[rw], jiffies);
 }
 
-static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
-		struct bio *bio, unsigned long *wait)
+static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
+				  unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	unsigned int io_allowed;
@@ -548,8 +549,8 @@ static bool tg_with_in_iops_limit(struct throtl_data *td, struct throtl_grp *tg,
 	return 0;
 }
 
-static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
-		struct bio *bio, unsigned long *wait)
+static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
+				 unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	u64 bytes_allowed, extra_bytes, tmp;
@@ -600,8 +601,8 @@ static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
  * Returns whether one can dispatch a bio or not. Also returns approx number
  * of jiffies to wait before this bio is with-in IO rate and can be dispatched
  */
-static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
-				struct bio *bio, unsigned long *wait)
+static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
+			    unsigned long *wait)
 {
 	bool rw = bio_data_dir(bio);
 	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
@@ -626,15 +627,15 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
 	 * existing slice to make sure it is at least throtl_slice interval
 	 * long since now.
 	 */
-	if (throtl_slice_used(td, tg, rw))
-		throtl_start_new_slice(td, tg, rw);
+	if (throtl_slice_used(tg, rw))
+		throtl_start_new_slice(tg, rw);
 	else {
 		if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
-			throtl_extend_slice(td, tg, rw, jiffies + throtl_slice);
+			throtl_extend_slice(tg, rw, jiffies + throtl_slice);
 	}
 
-	if (tg_with_in_bps_limit(td, tg, bio, &bps_wait)
-	    && tg_with_in_iops_limit(td, tg, bio, &iops_wait)) {
+	if (tg_with_in_bps_limit(tg, bio, &bps_wait) &&
+	    tg_with_in_iops_limit(tg, bio, &iops_wait)) {
 		if (wait)
 			*wait = 0;
 		return 1;
@@ -646,7 +647,7 @@ static bool tg_may_dispatch(struct throtl_data *td, struct throtl_grp *tg,
 		*wait = max_wait;
 
 	if (time_before(tg->slice_end[rw], jiffies + max_wait))
-		throtl_extend_slice(td, tg, rw, jiffies + max_wait);
+		throtl_extend_slice(tg, rw, jiffies + max_wait);
 
 	return 0;
 }
@@ -707,10 +708,10 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
 	struct bio *bio;
 
 	if ((bio = bio_list_peek(&tg->bio_lists[READ])))
-		tg_may_dispatch(td, tg, bio, &read_wait);
+		tg_may_dispatch(tg, bio, &read_wait);
 
 	if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
-		tg_may_dispatch(td, tg, bio, &write_wait);
+		tg_may_dispatch(tg, bio, &write_wait);
 
 	min_wait = min(read_wait, write_wait);
 	disptime = jiffies + min_wait;
@@ -721,8 +722,8 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
 	throtl_enqueue_tg(td, tg);
 }
 
-static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
-				bool rw, struct bio_list *bl)
+static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
+				struct bio_list *bl)
 {
 	struct bio *bio;
 
@@ -731,18 +732,17 @@ static void tg_dispatch_one_bio(struct throtl_data *td, struct throtl_grp *tg,
 	/* Drop bio reference on blkg */
 	blkg_put(tg_to_blkg(tg));
 
-	BUG_ON(td->nr_queued[rw] <= 0);
-	td->nr_queued[rw]--;
+	BUG_ON(tg->td->nr_queued[rw] <= 0);
+	tg->td->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
 	bio_list_add(bl, bio);
 	bio->bi_rw |= REQ_THROTTLED;
 
-	throtl_trim_slice(td, tg, rw);
+	throtl_trim_slice(tg, rw);
 }
 
-static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
-				struct bio_list *bl)
+static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 {
 	unsigned int nr_reads = 0, nr_writes = 0;
 	unsigned int max_nr_reads = throtl_grp_quantum*3/4;
@@ -751,20 +751,20 @@ static int throtl_dispatch_tg(struct throtl_data *td, struct throtl_grp *tg,
 
 	/* Try to dispatch 75% READS and 25% WRITES */
 
-	while ((bio = bio_list_peek(&tg->bio_lists[READ]))
-		&& tg_may_dispatch(td, tg, bio, NULL)) {
+	while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
+	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
 		nr_reads++;
 
 		if (nr_reads >= max_nr_reads)
 			break;
 	}
 
-	while ((bio = bio_list_peek(&tg->bio_lists[WRITE]))
-		&& tg_may_dispatch(td, tg, bio, NULL)) {
+	while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
+	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(td, tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
 		nr_writes++;
 
 		if (nr_writes >= max_nr_writes)
@@ -791,7 +791,7 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 
 		throtl_dequeue_tg(td, tg);
 
-		nr_disp += throtl_dispatch_tg(td, tg, bl);
+		nr_disp += throtl_dispatch_tg(tg, bl);
 
 		if (tg->nr_queued[0] || tg->nr_queued[1])
 			tg_update_disptime(td, tg);
@@ -933,7 +933,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	else
 		*(unsigned int *)((void *)tg + cft->private) = ctx.v;
 
-	throtl_log_tg(td, tg, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
+	throtl_log_tg(tg, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
 		      tg->bps[READ], tg->bps[WRITE],
 		      tg->iops[READ], tg->iops[WRITE]);
 
@@ -945,8 +945,8 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	 * that a group's limit are dropped suddenly and we don't want to
 	 * account recently dispatched IO with new low rate.
 	 */
-	throtl_start_new_slice(td, tg, 0);
-	throtl_start_new_slice(td, tg, 1);
+	throtl_start_new_slice(tg, 0);
+	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(td, tg);
@@ -1079,7 +1079,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	}
 
 	/* Bio is with-in rate limit of group */
-	if (tg_may_dispatch(td, tg, bio, NULL)) {
+	if (tg_may_dispatch(tg, bio, NULL)) {
 		throtl_charge_bio(tg, bio);
 
 		/*
@@ -1093,12 +1093,12 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		 *
 		 * So keep on trimming slice even if bio is not queued.
 		 */
-		throtl_trim_slice(td, tg, rw);
+		throtl_trim_slice(tg, rw);
 		goto out_unlock;
 	}
 
 queue_bio:
-	throtl_log_tg(td, tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
+	throtl_log_tg(tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
 			" iodisp=%u iops=%u queued=%d/%d",
 			rw == READ ? 'R' : 'W',
 			tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
@@ -1145,9 +1145,9 @@ void blk_throtl_drain(struct request_queue *q)
 		throtl_dequeue_tg(td, tg);
 
 		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
-			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
 		while ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
-			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
 	}
 	spin_unlock_irq(q->queue_lock);
 
-- 
1.8.1.4



* [PATCH 14/31] blk-throttle: pass around throtl_service_queue instead of throtl_data
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (12 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 13/31] blk-throttle: add backlink pointer from throtl_grp to throtl_data Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument Tejun Heo
                   ` (18 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_service_queue will be used as the basic building block to
implement hierarchy support.  Pass around throtl_service_queue *sq
instead of throtl_data *td in the following functions which will be
used across multiple levels of hierarchy.

* [__]throtl_enqueue/dequeue_tg()

* throtl_add_bio_tg()

* tg_update_disptime()

* throtl_select_dispatch()
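
A minimal userspace sketch of the calling convention, assuming
hypothetical model_* names: once the enqueue and dequeue helpers take
the service queue explicitly, the same code can operate on whichever
queue the caller hands in.  That a later patch gives intermediate
levels their own queue is an assumption drawn from the description
above, not something this patch does.

```c
#define MODEL_TG_PENDING	(1u << 0)	/* group is on a pending tree */

/* Hypothetical stand-in for throtl_service_queue. */
struct model_sq {
	unsigned int nr_pending;	/* # groups queued on this queue */
};

/* Hypothetical stand-in for throtl_grp. */
struct model_tg {
	unsigned int flags;
};

/* Enqueue @tg on @sq unless it is already pending somewhere. */
static void model_enqueue_tg(struct model_sq *sq, struct model_tg *tg)
{
	if (tg->flags & MODEL_TG_PENDING)
		return;
	tg->flags |= MODEL_TG_PENDING;
	sq->nr_pending++;
}

/* Dequeue @tg from @sq; the caller passes the queue it was put on. */
static void model_dequeue_tg(struct model_sq *sq, struct model_tg *tg)
{
	if (!(tg->flags & MODEL_TG_PENDING))
		return;
	tg->flags &= ~MODEL_TG_PENDING;
	sq->nr_pending--;
}
```

The helpers never look at a throtl_data equivalent, so nothing in
them assumes that there is only one service queue per device.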

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 53 +++++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 25 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6aae239..a81e10b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -352,31 +352,33 @@ static void tg_service_queue_add(struct throtl_service_queue *sq,
 	rb_insert_color(&tg->rb_node, &sq->pending_tree);
 }
 
-static void __throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void __throtl_enqueue_tg(struct throtl_service_queue *sq,
+				struct throtl_grp *tg)
 {
-	struct throtl_service_queue *sq = &td->service_queue;
-
 	tg_service_queue_add(sq, tg);
 	tg->flags |= THROTL_TG_PENDING;
 	sq->nr_pending++;
 }
 
-static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void throtl_enqueue_tg(struct throtl_service_queue *sq,
+			      struct throtl_grp *tg)
 {
 	if (!(tg->flags & THROTL_TG_PENDING))
-		__throtl_enqueue_tg(td, tg);
+		__throtl_enqueue_tg(sq, tg);
 }
 
-static void __throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void __throtl_dequeue_tg(struct throtl_service_queue *sq,
+				struct throtl_grp *tg)
 {
-	throtl_rb_erase(&tg->rb_node, &td->service_queue);
+	throtl_rb_erase(&tg->rb_node, sq);
 	tg->flags &= ~THROTL_TG_PENDING;
 }
 
-static void throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg)
+static void throtl_dequeue_tg(struct throtl_service_queue *sq,
+			      struct throtl_grp *tg)
 {
 	if (tg->flags & THROTL_TG_PENDING)
-		__throtl_dequeue_tg(td, tg);
+		__throtl_dequeue_tg(sq, tg);
 }
 
 /* Call with queue lock held */
@@ -689,8 +691,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size, bio->bi_rw);
 }
 
-static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
-			struct bio *bio)
+static void throtl_add_bio_tg(struct throtl_service_queue *sq,
+			      struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
 
@@ -698,11 +700,12 @@ static void throtl_add_bio_tg(struct throtl_data *td, struct throtl_grp *tg,
 	/* Take a bio reference on tg */
 	blkg_get(tg_to_blkg(tg));
 	tg->nr_queued[rw]++;
-	td->nr_queued[rw]++;
-	throtl_enqueue_tg(td, tg);
+	tg->td->nr_queued[rw]++;
+	throtl_enqueue_tg(sq, tg);
 }
 
-static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
+static void tg_update_disptime(struct throtl_service_queue *sq,
+			       struct throtl_grp *tg)
 {
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
@@ -717,9 +720,9 @@ static void tg_update_disptime(struct throtl_data *td, struct throtl_grp *tg)
 	disptime = jiffies + min_wait;
 
 	/* Update dispatch time */
-	throtl_dequeue_tg(td, tg);
+	throtl_dequeue_tg(sq, tg);
 	tg->disptime = disptime;
-	throtl_enqueue_tg(td, tg);
+	throtl_enqueue_tg(sq, tg);
 }
 
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
@@ -774,11 +777,11 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 	return nr_reads + nr_writes;
 }
 
-static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
+static int throtl_select_dispatch(struct throtl_service_queue *sq,
+				  struct bio_list *bl)
 {
 	unsigned int nr_disp = 0;
 	struct throtl_grp *tg;
-	struct throtl_service_queue *sq = &td->service_queue;
 
 	while (1) {
 		tg = throtl_rb_first(sq);
@@ -789,12 +792,12 @@ static int throtl_select_dispatch(struct throtl_data *td, struct bio_list *bl)
 		if (time_before(jiffies, tg->disptime))
 			break;
 
-		throtl_dequeue_tg(td, tg);
+		throtl_dequeue_tg(sq, tg);
 
 		nr_disp += throtl_dispatch_tg(tg, bl);
 
 		if (tg->nr_queued[0] || tg->nr_queued[1])
-			tg_update_disptime(td, tg);
+			tg_update_disptime(sq, tg);
 
 		if (nr_disp >= throtl_quantum)
 			break;
@@ -822,7 +825,7 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		   td->nr_queued[READ] + td->nr_queued[WRITE],
 		   td->nr_queued[READ], td->nr_queued[WRITE]);
 
-	nr_disp = throtl_select_dispatch(td, &bio_list_on_stack);
+	nr_disp = throtl_select_dispatch(&td->service_queue, &bio_list_on_stack);
 
 	if (nr_disp)
 		throtl_log(td, "bios disp=%u", nr_disp);
@@ -949,7 +952,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
-		tg_update_disptime(td, tg);
+		tg_update_disptime(&td->service_queue, tg);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1106,11 +1109,11 @@ queue_bio:
 			tg->nr_queued[READ], tg->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-	throtl_add_bio_tg(q->td, tg, bio);
+	throtl_add_bio_tg(&q->td->service_queue, tg, bio);
 	throttled = true;
 
 	if (update_disptime) {
-		tg_update_disptime(td, tg);
+		tg_update_disptime(&td->service_queue, tg);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1142,7 +1145,7 @@ void blk_throtl_drain(struct request_queue *q)
 	bio_list_init(&bl);
 
 	while ((tg = throtl_rb_first(sq))) {
-		throtl_dequeue_tg(td, tg);
+		throtl_dequeue_tg(sq, tg);
 
 		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
-- 
1.8.1.4



* [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (13 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 14/31] blk-throttle: pass around throtl_service_queue instead of throtl_data Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02 15:21   ` Vivek Goyal
  2013-05-02  0:39 ` [PATCH 16/31] blk-throttle: add throtl_grp->service_queue Tejun Heo
                   ` (17 subsequent siblings)
  32 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_service_queue will be the building block of hierarchy support
and will form a tree.  This patch updates its usages as arguments to
reduce confusion.

* When a service queue is used in the parent role - the host of the
  rbtree - use @parent_sq instead of @sq.

* For functions taking both @tg and @parent_sq, reorder them so that
  the order is (@tg, @parent_sq) not the other way around.  This makes
  the code follow the usual convention of specifying the primary
  target of the operation as the first argument.

This patch makes no functional changes.
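
The (@tg, @parent_sq) convention can be sketched in userspace as
follows.  This is only an illustration under stated assumptions: a
sorted singly linked list stands in for the rbtree, plain integer
comparison replaces the wraparound-safe jiffies helpers, and all names
(sq_insert(), sq_remove(), update_disptime()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; a sorted list replaces the kernel rbtree. */
struct group {
	unsigned long disptime;	/* sort key */
	struct group *next;
	int pending;
};

struct service_queue {
	struct group *first;	/* earliest disptime first */
};

/* Primary target (@tg) first, hosting queue (@parent_sq) second. */
void sq_insert(struct group *tg, struct service_queue *parent_sq)
{
	struct group **pos = &parent_sq->first;

	while (*pos && (*pos)->disptime <= tg->disptime)
		pos = &(*pos)->next;
	tg->next = *pos;
	*pos = tg;
	tg->pending = 1;
}

void sq_remove(struct group *tg, struct service_queue *parent_sq)
{
	struct group **pos = &parent_sq->first;

	while (*pos && *pos != tg)
		pos = &(*pos)->next;
	if (*pos) {
		*pos = tg->next;
		tg->next = NULL;
		tg->pending = 0;
	}
}

/* Re-key @tg: take it off the parent, change the key, put it back. */
void update_disptime(struct group *tg, struct service_queue *parent_sq,
		     unsigned long disptime)
{
	sq_remove(tg, parent_sq);
	tg->disptime = disptime;
	sq_insert(tg, parent_sq);
}
```

The key is only changed while the group is off the sorted structure,
which mirrors the dequeue / update / enqueue sequence in
tg_update_disptime().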

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 100 ++++++++++++++++++++++++++-------------------------
 1 file changed, 51 insertions(+), 49 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a81e10b..56b5e2a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -284,17 +284,18 @@ static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td,
 	return tg;
 }
 
-static struct throtl_grp *throtl_rb_first(struct throtl_service_queue *sq)
+static struct throtl_grp *
+throtl_rb_first(struct throtl_service_queue *parent_sq)
 {
 	/* Service tree is empty */
-	if (!sq->nr_pending)
+	if (!parent_sq->nr_pending)
 		return NULL;
 
-	if (!sq->first_pending)
-		sq->first_pending = rb_first(&sq->pending_tree);
+	if (!parent_sq->first_pending)
+		parent_sq->first_pending = rb_first(&parent_sq->pending_tree);
 
-	if (sq->first_pending)
-		return rb_entry_tg(sq->first_pending);
+	if (parent_sq->first_pending)
+		return rb_entry_tg(parent_sq->first_pending);
 
 	return NULL;
 }
@@ -305,29 +306,30 @@ static void rb_erase_init(struct rb_node *n, struct rb_root *root)
 	RB_CLEAR_NODE(n);
 }
 
-static void throtl_rb_erase(struct rb_node *n, struct throtl_service_queue *sq)
+static void throtl_rb_erase(struct rb_node *n,
+			    struct throtl_service_queue *parent_sq)
 {
-	if (sq->first_pending == n)
-		sq->first_pending = NULL;
-	rb_erase_init(n, &sq->pending_tree);
-	--sq->nr_pending;
+	if (parent_sq->first_pending == n)
+		parent_sq->first_pending = NULL;
+	rb_erase_init(n, &parent_sq->pending_tree);
+	--parent_sq->nr_pending;
 }
 
-static void update_min_dispatch_time(struct throtl_service_queue *sq)
+static void update_min_dispatch_time(struct throtl_service_queue *parent_sq)
 {
 	struct throtl_grp *tg;
 
-	tg = throtl_rb_first(sq);
+	tg = throtl_rb_first(parent_sq);
 	if (!tg)
 		return;
 
-	sq->first_pending_disptime = tg->disptime;
+	parent_sq->first_pending_disptime = tg->disptime;
 }
 
-static void tg_service_queue_add(struct throtl_service_queue *sq,
-				 struct throtl_grp *tg)
+static void tg_service_queue_add(struct throtl_grp *tg,
+				 struct throtl_service_queue *parent_sq)
 {
-	struct rb_node **node = &sq->pending_tree.rb_node;
+	struct rb_node **node = &parent_sq->pending_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct throtl_grp *__tg;
 	unsigned long key = tg->disptime;
@@ -346,39 +348,39 @@ static void tg_service_queue_add(struct throtl_service_queue *sq,
 	}
 
 	if (left)
-		sq->first_pending = &tg->rb_node;
+		parent_sq->first_pending = &tg->rb_node;
 
 	rb_link_node(&tg->rb_node, parent, node);
-	rb_insert_color(&tg->rb_node, &sq->pending_tree);
+	rb_insert_color(&tg->rb_node, &parent_sq->pending_tree);
 }
 
-static void __throtl_enqueue_tg(struct throtl_service_queue *sq,
-				struct throtl_grp *tg)
+static void __throtl_enqueue_tg(struct throtl_grp *tg,
+				struct throtl_service_queue *parent_sq)
 {
-	tg_service_queue_add(sq, tg);
+	tg_service_queue_add(tg, parent_sq);
 	tg->flags |= THROTL_TG_PENDING;
-	sq->nr_pending++;
+	parent_sq->nr_pending++;
 }
 
-static void throtl_enqueue_tg(struct throtl_service_queue *sq,
-			      struct throtl_grp *tg)
+static void throtl_enqueue_tg(struct throtl_grp *tg,
+			      struct throtl_service_queue *parent_sq)
 {
 	if (!(tg->flags & THROTL_TG_PENDING))
-		__throtl_enqueue_tg(sq, tg);
+		__throtl_enqueue_tg(tg, parent_sq);
 }
 
-static void __throtl_dequeue_tg(struct throtl_service_queue *sq,
-				struct throtl_grp *tg)
+static void __throtl_dequeue_tg(struct throtl_grp *tg,
+				struct throtl_service_queue *parent_sq)
 {
-	throtl_rb_erase(&tg->rb_node, sq);
+	throtl_rb_erase(&tg->rb_node, parent_sq);
 	tg->flags &= ~THROTL_TG_PENDING;
 }
 
-static void throtl_dequeue_tg(struct throtl_service_queue *sq,
-			      struct throtl_grp *tg)
+static void throtl_dequeue_tg(struct throtl_grp *tg,
+			      struct throtl_service_queue *parent_sq)
 {
 	if (tg->flags & THROTL_TG_PENDING)
-		__throtl_dequeue_tg(sq, tg);
+		__throtl_dequeue_tg(tg, parent_sq);
 }
 
 /* Call with queue lock held */
@@ -691,8 +693,8 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size, bio->bi_rw);
 }
 
-static void throtl_add_bio_tg(struct throtl_service_queue *sq,
-			      struct throtl_grp *tg, struct bio *bio)
+static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
+			      struct throtl_service_queue *parent_sq)
 {
 	bool rw = bio_data_dir(bio);
 
@@ -701,11 +703,11 @@ static void throtl_add_bio_tg(struct throtl_service_queue *sq,
 	blkg_get(tg_to_blkg(tg));
 	tg->nr_queued[rw]++;
 	tg->td->nr_queued[rw]++;
-	throtl_enqueue_tg(sq, tg);
+	throtl_enqueue_tg(tg, parent_sq);
 }
 
-static void tg_update_disptime(struct throtl_service_queue *sq,
-			       struct throtl_grp *tg)
+static void tg_update_disptime(struct throtl_grp *tg,
+			       struct throtl_service_queue *parent_sq)
 {
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
@@ -720,9 +722,9 @@ static void tg_update_disptime(struct throtl_service_queue *sq,
 	disptime = jiffies + min_wait;
 
 	/* Update dispatch time */
-	throtl_dequeue_tg(sq, tg);
+	throtl_dequeue_tg(tg, parent_sq);
 	tg->disptime = disptime;
-	throtl_enqueue_tg(sq, tg);
+	throtl_enqueue_tg(tg, parent_sq);
 }
 
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
@@ -777,14 +779,14 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 	return nr_reads + nr_writes;
 }
 
-static int throtl_select_dispatch(struct throtl_service_queue *sq,
+static int throtl_select_dispatch(struct throtl_service_queue *parent_sq,
 				  struct bio_list *bl)
 {
 	unsigned int nr_disp = 0;
 	struct throtl_grp *tg;
 
 	while (1) {
-		tg = throtl_rb_first(sq);
+		tg = throtl_rb_first(parent_sq);
 
 		if (!tg)
 			break;
@@ -792,12 +794,12 @@ static int throtl_select_dispatch(struct throtl_service_queue *sq,
 		if (time_before(jiffies, tg->disptime))
 			break;
 
-		throtl_dequeue_tg(sq, tg);
+		throtl_dequeue_tg(tg, parent_sq);
 
 		nr_disp += throtl_dispatch_tg(tg, bl);
 
 		if (tg->nr_queued[0] || tg->nr_queued[1])
-			tg_update_disptime(sq, tg);
+			tg_update_disptime(tg, parent_sq);
 
 		if (nr_disp >= throtl_quantum)
 			break;
@@ -952,7 +954,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
-		tg_update_disptime(&td->service_queue, tg);
+		tg_update_disptime(tg, &td->service_queue);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1109,11 +1111,11 @@ queue_bio:
 			tg->nr_queued[READ], tg->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-	throtl_add_bio_tg(&q->td->service_queue, tg, bio);
+	throtl_add_bio_tg(bio, tg, &q->td->service_queue);
 	throttled = true;
 
 	if (update_disptime) {
-		tg_update_disptime(&td->service_queue, tg);
+		tg_update_disptime(tg, &td->service_queue);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1135,7 +1137,7 @@ void blk_throtl_drain(struct request_queue *q)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct throtl_data *td = q->td;
-	struct throtl_service_queue *sq = &td->service_queue;
+	struct throtl_service_queue *parent_sq = &td->service_queue;
 	struct throtl_grp *tg;
 	struct bio_list bl;
 	struct bio *bio;
@@ -1144,8 +1146,8 @@ void blk_throtl_drain(struct request_queue *q)
 
 	bio_list_init(&bl);
 
-	while ((tg = throtl_rb_first(sq))) {
-		throtl_dequeue_tg(sq, tg);
+	while ((tg = throtl_rb_first(parent_sq))) {
+		throtl_dequeue_tg(tg, parent_sq);
 
 		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
-- 
1.8.1.4



* [PATCH 16/31] blk-throttle: add throtl_grp->service_queue
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (14 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 17/31] blk-throttle: move bio_lists[] and friends to throtl_service_queue Tejun Heo
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, there's a single service_queue per queue -
throtl_data->service_queue.  All active throtl_grp's are queued on the
queue and dispatched according to their limits.  To support hierarchy,
this will be expanded such that active throtl_grp's form a tree
anchored at throtl_data->service_queue and chained through each
intermediate throtl_grp's service_queue.

This patch adds throtl_grp->service_queue to prepare for hierarchy
support.  The initialization function - throtl_service_queue_init() -
is added and replaces the macro initializer.  The newly added
tg->service_queue isn't used yet.  Following patches will use it.
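
As a rough userspace sketch of the idea (hypothetical names
throughout; struct tree_root merely stands in for struct rb_root), one
init function can serve both the per-device queue and the queue
embedded in each group, as long as the caller has zeroed the memory
first:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the kernel types. */
struct node;
struct tree_root { struct node *top; };
#define TREE_ROOT_INIT ((struct tree_root){ NULL })

struct service_queue {
	struct tree_root pending_tree;
	unsigned int nr_pending;
};

struct group {
	struct service_queue service_queue;	/* this group's own queue */
	unsigned long disptime;
};

struct device_data {
	struct service_queue service_queue;	/* top-level queue */
};

/* Init a service queue; assumes the caller zeroed the other fields. */
void service_queue_init(struct service_queue *sq)
{
	sq->pending_tree = TREE_ROOT_INIT;
}

/* Allocate a zeroed group and set up its embedded queue. */
struct group *group_alloc(void)
{
	struct group *tg = calloc(1, sizeof(*tg));

	if (tg)
		service_queue_init(&tg->service_queue);
	return tg;
}
```

Compared with a struct-assignment macro, a function gives a single
place to add further setup when more fields are moved into the
service queue.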

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 56b5e2a..ee615af 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -33,9 +33,6 @@ struct throtl_service_queue {
 	unsigned long		first_pending_disptime;	/* disptime of the first tg */
 };
 
-#define THROTL_SERVICE_QUEUE_INITIALIZER				\
-	(struct throtl_service_queue){ .pending_tree = RB_ROOT }
-
 enum tg_state_flags {
 	THROTL_TG_PENDING	= 1 << 0,	/* on parent's pending tree */
 };
@@ -60,6 +57,9 @@ struct throtl_grp {
 	/* throtl_data this group belongs to */
 	struct throtl_data *td;
 
+	/* this group's service queue */
+	struct throtl_service_queue service_queue;
+
 	/*
 	 * Dispatch time in jiffies. This is the estimated time when group
 	 * will unthrottle and is ready to dispatch more bio. It is used as
@@ -190,11 +190,18 @@ alloc_stats:
 		goto alloc_stats;
 }
 
+/* init a service_queue, assumes the caller zeroed it */
+static void throtl_service_queue_init(struct throtl_service_queue *sq)
+{
+	sq->pending_tree = RB_ROOT;
+}
+
 static void throtl_pd_init(struct blkcg_gq *blkg)
 {
 	struct throtl_grp *tg = blkg_to_tg(blkg);
 	unsigned long flags;
 
+	throtl_service_queue_init(&tg->service_queue);
 	RB_CLEAR_NODE(&tg->rb_node);
 	tg->td = blkg->q->td;
 	bio_list_init(&tg->bio_lists[0]);
@@ -1171,8 +1178,8 @@ int blk_throtl_init(struct request_queue *q)
 	if (!td)
 		return -ENOMEM;
 
-	td->service_queue = THROTL_SERVICE_QUEUE_INITIALIZER;
 	INIT_DELAYED_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
+	throtl_service_queue_init(&td->service_queue);
 
 	q->td = td;
 	td->queue = q;
-- 
1.8.1.4



* [PATCH 17/31] blk-throttle: move bio_lists[] and friends to throtl_service_queue
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (15 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 16/31] blk-throttle: add throtl_grp->service_queue Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 18/31] blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] Tejun Heo
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_service_queues will eventually form a tree which is anchored at
throtl_data->service_queue and queued bios will climb the tree to the
top service_queue to be issued.

This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
service_queue to prepare for that.  As currently only the
throtl_data->service_queue is in use, this patch just ends up moving
throtl_grp->bio_lists[] and ->nr_queued[] to
throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
any functional changes.
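
The resulting layout can be pictured with a small userspace sketch.
Everything here is a hypothetical stand-in (struct io for a bio,
struct io_list for bio_list, DIR_READ/DIR_WRITE for READ/WRITE); the
point is only that the per-direction lists and counters live in the
service queue and the group reaches them through its embedded queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel types. */
enum { DIR_READ = 0, DIR_WRITE = 1 };

struct io {
	int dir;
	struct io *next;
};

struct io_list {
	struct io *head, *tail;
};

struct service_queue {
	struct io_list lists[2];	/* queued I/O, one list per direction */
	unsigned int nr_queued[2];	/* number queued per direction */
};

struct group {
	struct service_queue service_queue;
};

/* Append @io to the tail of @l (FIFO order). */
void io_list_add(struct io_list *l, struct io *io)
{
	io->next = NULL;
	if (l->tail)
		l->tail->next = io;
	else
		l->head = io;
	l->tail = io;
}

/* Detach and return the head of @l, or NULL if it is empty. */
struct io *io_list_pop(struct io_list *l)
{
	struct io *io = l->head;

	if (io) {
		l->head = io->next;
		if (!l->head)
			l->tail = NULL;
		io->next = NULL;
	}
	return io;
}

/* Queue @io on @tg by way of the group's own service queue. */
void group_add_io(struct group *tg, struct io *io)
{
	struct service_queue *sq = &tg->service_queue;

	io_list_add(&sq->lists[io->dir], io);
	sq->nr_queued[io->dir]++;
}
```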

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 63 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ee615af..bebe14b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -27,6 +27,17 @@ static struct blkcg_policy blkcg_policy_throtl;
 static struct workqueue_struct *kthrotld_workqueue;
 
 struct throtl_service_queue {
+	/*
+	 * Bios queued directly to this service_queue or dispatched from
+	 * children throtl_grp's.
+	 */
+	struct bio_list		bio_lists[2];	/* queued bios [READ/WRITE] */
+	unsigned int		nr_queued[2];	/* number of queued bios */
+
+	/*
+	 * RB tree of active children throtl_grp's, which are sorted by
+	 * their ->disptime.
+	 */
 	struct rb_root		pending_tree;	/* RB tree of active tgs */
 	struct rb_node		*first_pending;	/* first node in the tree */
 	unsigned int		nr_pending;	/* # queued in the tree */
@@ -69,12 +80,6 @@ struct throtl_grp {
 
 	unsigned int flags;
 
-	/* Two lists for READ and WRITE */
-	struct bio_list bio_lists[2];
-
-	/* Number of queued bios on READ and WRITE lists */
-	unsigned int nr_queued[2];
-
 	/* bytes per second rate limits */
 	uint64_t bps[2];
 
@@ -193,6 +198,8 @@ alloc_stats:
 /* init a service_queue, assumes the caller zeroed it */
 static void throtl_service_queue_init(struct throtl_service_queue *sq)
 {
+	bio_list_init(&sq->bio_lists[0]);
+	bio_list_init(&sq->bio_lists[1]);
 	sq->pending_tree = RB_ROOT;
 }
 
@@ -204,8 +211,6 @@ static void throtl_pd_init(struct blkcg_gq *blkg)
 	throtl_service_queue_init(&tg->service_queue);
 	RB_CLEAR_NODE(&tg->rb_node);
 	tg->td = blkg->q->td;
-	bio_list_init(&tg->bio_lists[0]);
-	bio_list_init(&tg->bio_lists[1]);
 
 	tg->bps[READ] = -1;
 	tg->bps[WRITE] = -1;
@@ -624,7 +629,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	 * this function with a different bio if there are other bios
 	 * queued.
 	 */
-	BUG_ON(tg->nr_queued[rw] && bio != bio_list_peek(&tg->bio_lists[rw]));
+	BUG_ON(tg->service_queue.nr_queued[rw] &&
+	       bio != bio_list_peek(&tg->service_queue.bio_lists[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
@@ -703,12 +709,13 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
 			      struct throtl_service_queue *parent_sq)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
 
-	bio_list_add(&tg->bio_lists[rw], bio);
+	bio_list_add(&sq->bio_lists[rw], bio);
 	/* Take a bio reference on tg */
 	blkg_get(tg_to_blkg(tg));
-	tg->nr_queued[rw]++;
+	sq->nr_queued[rw]++;
 	tg->td->nr_queued[rw]++;
 	throtl_enqueue_tg(tg, parent_sq);
 }
@@ -716,13 +723,14 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
 static void tg_update_disptime(struct throtl_grp *tg,
 			       struct throtl_service_queue *parent_sq)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
 
-	if ((bio = bio_list_peek(&tg->bio_lists[READ])))
+	if ((bio = bio_list_peek(&sq->bio_lists[READ])))
 		tg_may_dispatch(tg, bio, &read_wait);
 
-	if ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
+	if ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
 		tg_may_dispatch(tg, bio, &write_wait);
 
 	min_wait = min(read_wait, write_wait);
@@ -737,10 +745,11 @@ static void tg_update_disptime(struct throtl_grp *tg,
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
 				struct bio_list *bl)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	struct bio *bio;
 
-	bio = bio_list_pop(&tg->bio_lists[rw]);
-	tg->nr_queued[rw]--;
+	bio = bio_list_pop(&sq->bio_lists[rw]);
+	sq->nr_queued[rw]--;
 	/* Drop bio reference on blkg */
 	blkg_put(tg_to_blkg(tg));
 
@@ -756,6 +765,7 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
 
 static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 {
+	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned int nr_reads = 0, nr_writes = 0;
 	unsigned int max_nr_reads = throtl_grp_quantum*3/4;
 	unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
@@ -763,7 +773,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 
 	/* Try to dispatch 75% READS and 25% WRITES */
 
-	while ((bio = bio_list_peek(&tg->bio_lists[READ])) &&
+	while ((bio = bio_list_peek(&sq->bio_lists[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
@@ -773,7 +783,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 			break;
 	}
 
-	while ((bio = bio_list_peek(&tg->bio_lists[WRITE])) &&
+	while ((bio = bio_list_peek(&sq->bio_lists[WRITE])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
@@ -790,10 +800,10 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq,
 				  struct bio_list *bl)
 {
 	unsigned int nr_disp = 0;
-	struct throtl_grp *tg;
 
 	while (1) {
-		tg = throtl_rb_first(parent_sq);
+		struct throtl_grp *tg = throtl_rb_first(parent_sq);
+		struct throtl_service_queue *sq = &tg->service_queue;
 
 		if (!tg)
 			break;
@@ -805,7 +815,7 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq,
 
 		nr_disp += throtl_dispatch_tg(tg, bl);
 
-		if (tg->nr_queued[0] || tg->nr_queued[1])
+		if (sq->nr_queued[0] || sq->nr_queued[1])
 			tg_update_disptime(tg, parent_sq);
 
 		if (nr_disp >= throtl_quantum)
@@ -1046,6 +1056,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	struct throtl_data *td = q->td;
 	struct throtl_grp *tg;
+	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio), update_disptime = true;
 	struct blkcg *blkcg;
 	bool throttled = false;
@@ -1080,7 +1091,9 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	if (unlikely(!tg))
 		goto out_unlock;
 
-	if (tg->nr_queued[rw]) {
+	sq = &tg->service_queue;
+
+	if (sq->nr_queued[rw]) {
 		/*
 		 * There is already another bio queued in same dir. No
 		 * need to update dispatch time.
@@ -1115,7 +1128,7 @@ queue_bio:
 			rw == READ ? 'R' : 'W',
 			tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
 			tg->io_disp[rw], tg->iops[rw],
-			tg->nr_queued[READ], tg->nr_queued[WRITE]);
+			sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
 	throtl_add_bio_tg(bio, tg, &q->td->service_queue);
@@ -1154,11 +1167,13 @@ void blk_throtl_drain(struct request_queue *q)
 	bio_list_init(&bl);
 
 	while ((tg = throtl_rb_first(parent_sq))) {
+		struct throtl_service_queue *sq = &tg->service_queue;
+
 		throtl_dequeue_tg(tg, parent_sq);
 
-		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
+		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
-		while ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
+		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
 	}
 	spin_unlock_irq(q->queue_lock);
-- 
1.8.1.4



* [PATCH 18/31] blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (16 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 17/31] blk-throttle: move bio_lists[] and friends to throtl_service_queue Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 19/31] blk-throttle: generalize update_disptime optimization in blk_throtl_bio() Tejun Heo
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_service_queues will eventually form a tree which is anchored at
throtl_data->service_queue and queued bios will climb the tree to the
top service_queue to be issued.

This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
and blk_throtl_drain() dispatch bios to
throtl_data->service_queue.bio_lists[] instead of the on-stack
bio_lists.  This makes the final dispatch to the top level
service_queue share the same mechanism as dispatches through the rest
of the hierarchy.

As bio's should be issued in a sleepable context,
blk_throtl_dispatch_work_fn() transfers all dispatched bio's from the
service_queue bio_lists[] into an on-stack one before dropping
queue_lock and issuing the bio's.
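
The transfer step can be sketched in userspace like this.  It is an
illustration only: struct io and struct io_list are hypothetical
stand-ins for bios and bio_list, locking is omitted, and
collect_dispatched() is an invented name for the merge-then-reinit
loop.  The merge helper is written so that it leaves the source list
untouched, which is why the source is reinitialized afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel types. */
struct io {
	int id;
	struct io *next;
};

struct io_list {
	struct io *head, *tail;
};

struct service_queue {
	struct io_list lists[2];	/* [0] reads, [1] writes */
};

void io_list_init(struct io_list *l)
{
	l->head = l->tail = NULL;
}

void io_list_add(struct io_list *l, struct io *io)
{
	io->next = NULL;
	if (l->tail)
		l->tail->next = io;
	else
		l->head = io;
	l->tail = io;
}

/* Append every entry of @src to @dst; @src itself is left untouched. */
void io_list_merge(struct io_list *dst, const struct io_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
}

struct io *io_list_pop(struct io_list *l)
{
	struct io *io = l->head;

	if (io) {
		l->head = io->next;
		if (!l->head)
			l->tail = NULL;
		io->next = NULL;
	}
	return io;
}

/*
 * Move everything dispatched to @sq into the caller's private list.
 * In the real code this would run with the lock held, so that the
 * entries can be issued from @out after the lock is dropped.
 */
void collect_dispatched(struct service_queue *sq, struct io_list *out)
{
	for (int rw = 0; rw < 2; rw++) {
		io_list_merge(out, &sq->lists[rw]);
		io_list_init(&sq->lists[rw]);
	}
}
```

Once the private list is filled, the caller can pop and issue entries
without touching the shared service queue.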

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index bebe14b..95e1d2a 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -743,7 +743,7 @@ static void tg_update_disptime(struct throtl_grp *tg,
 }
 
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
-				struct bio_list *bl)
+				struct throtl_service_queue *parent_sq)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct bio *bio;
@@ -757,13 +757,14 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
 	tg->td->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
-	bio_list_add(bl, bio);
+	bio_list_add(&parent_sq->bio_lists[rw], bio);
 	bio->bi_rw |= REQ_THROTTLED;
 
 	throtl_trim_slice(tg, rw);
 }
 
-static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
+static int throtl_dispatch_tg(struct throtl_grp *tg,
+			      struct throtl_service_queue *parent_sq)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned int nr_reads = 0, nr_writes = 0;
@@ -776,7 +777,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 	while ((bio = bio_list_peek(&sq->bio_lists[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
 		nr_reads++;
 
 		if (nr_reads >= max_nr_reads)
@@ -786,7 +787,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 	while ((bio = bio_list_peek(&sq->bio_lists[WRITE])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(tg, bio_data_dir(bio), bl);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
 		nr_writes++;
 
 		if (nr_writes >= max_nr_writes)
@@ -796,8 +797,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg, struct bio_list *bl)
 	return nr_reads + nr_writes;
 }
 
-static int throtl_select_dispatch(struct throtl_service_queue *parent_sq,
-				  struct bio_list *bl)
+static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 {
 	unsigned int nr_disp = 0;
 
@@ -813,7 +813,7 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq,
 
 		throtl_dequeue_tg(tg, parent_sq);
 
-		nr_disp += throtl_dispatch_tg(tg, bl);
+		nr_disp += throtl_dispatch_tg(tg, parent_sq);
 
 		if (sq->nr_queued[0] || sq->nr_queued[1])
 			tg_update_disptime(tg, parent_sq);
@@ -830,11 +830,13 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 {
 	struct throtl_data *td = container_of(to_delayed_work(work),
 					      struct throtl_data, dispatch_work);
+	struct throtl_service_queue *sq = &td->service_queue;
 	struct request_queue *q = td->queue;
 	unsigned int nr_disp = 0;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
 	struct blk_plug plug;
+	int rw;
 
 	spin_lock_irq(q->queue_lock);
 
@@ -844,10 +846,15 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		   td->nr_queued[READ] + td->nr_queued[WRITE],
 		   td->nr_queued[READ], td->nr_queued[WRITE]);
 
-	nr_disp = throtl_select_dispatch(&td->service_queue, &bio_list_on_stack);
+	nr_disp = throtl_select_dispatch(sq);
 
-	if (nr_disp)
+	if (nr_disp) {
+		for (rw = READ; rw <= WRITE; rw++) {
+			bio_list_merge(&bio_list_on_stack, &sq->bio_lists[rw]);
+			bio_list_init(&sq->bio_lists[rw]);
+		}
 		throtl_log(td, "bios disp=%u", nr_disp);
+	}
 
 	throtl_schedule_next_dispatch(td);
 
@@ -1159,27 +1166,26 @@ void blk_throtl_drain(struct request_queue *q)
 	struct throtl_data *td = q->td;
 	struct throtl_service_queue *parent_sq = &td->service_queue;
 	struct throtl_grp *tg;
-	struct bio_list bl;
 	struct bio *bio;
+	int rw;
 
 	queue_lockdep_assert_held(q);
 
-	bio_list_init(&bl);
-
 	while ((tg = throtl_rb_first(parent_sq))) {
 		struct throtl_service_queue *sq = &tg->service_queue;
 
 		throtl_dequeue_tg(tg, parent_sq);
 
 		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
 		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio), &bl);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
 	}
 	spin_unlock_irq(q->queue_lock);
 
-	while ((bio = bio_list_pop(&bl)))
-		generic_make_request(bio);
+	for (rw = READ; rw <= WRITE; rw++)
+		while ((bio = bio_list_pop(&parent_sq->bio_lists[rw])))
+			generic_make_request(bio);
 
 	spin_lock_irq(q->queue_lock);
 }
-- 
1.8.1.4



* [PATCH 19/31] blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (17 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 18/31] blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 20/31] blk-throttle: add throtl_service_queue->parent_sq Tejun Heo
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
avoids invoking tg_update_disptime() and
throtl_schedule_next_dispatch() if the tg already has bios queued in
that direction.  As a new bio is appended after the existing ones, it
can't change the tg's next dispatch time or the parent's dispatch
schedule.

This optimization is currently open coded in blk_throtl_bio().
Whether the target biolist was occupied was recorded in a local
variable and later used to skip disptime update.  This patch
generalizes it so that throtl_add_bio_tg() sets a new flag
THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
added.  tg_update_disptime() clears the flag automatically.
blk_throtl_bio() is updated to simply test the flag before updating
disptime.

This patch doesn't make any functional differences now but will enable
using the same optimization for recursive dispatch.
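
As an illustration only, the flag handoff described above can be
modelled in user space roughly as follows.  This is a minimal sketch,
not the kernel code: struct model_tg, the model_*() helpers and the
nr_disptime_updates counter are hypothetical names invented for the
example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names only mirror the patch. */
enum { TG_WAS_EMPTY = 1 << 1 };

struct model_tg {
	unsigned int flags;
	unsigned int nr_queued[2];		/* [0] = READ, [1] = WRITE */
	unsigned int nr_disptime_updates;	/* illustration-only counter */
};

/* Queue one bio in direction @rw and note whether that list was empty. */
static void model_add_bio(struct model_tg *tg, int rw)
{
	assert(rw == 0 || rw == 1);
	if (!tg->nr_queued[rw])
		tg->flags |= TG_WAS_EMPTY;
	tg->nr_queued[rw]++;
}

/* Stand-in for tg_update_disptime(): recompute, then clear the flag. */
static void model_update_disptime(struct model_tg *tg)
{
	tg->nr_disptime_updates++;
	tg->flags &= ~TG_WAS_EMPTY;
}

/* Caller side: update only when the new bio can change the schedule. */
static bool model_queue_bio(struct model_tg *tg, int rw)
{
	model_add_bio(tg, rw);
	if (tg->flags & TG_WAS_EMPTY) {
		model_update_disptime(tg);
		return true;
	}
	return false;
}
```

In this model the second bio queued in a direction skips the update
because the flag stays clear, which is the behavior the open coded
local variable used to provide.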

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 95e1d2a..65b8b38 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -46,6 +46,7 @@ struct throtl_service_queue {
 
 enum tg_state_flags {
 	THROTL_TG_PENDING	= 1 << 0,	/* on parent's pending tree */
+	THROTL_TG_WAS_EMPTY	= 1 << 1,	/* bio_lists[] became non-empty */
 };
 
 #define rb_entry_tg(node)	rb_entry((node), struct throtl_grp, rb_node)
@@ -712,6 +713,15 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
 	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
 
+	/*
+	 * If @tg doesn't currently have any bios queued in the same
+	 * direction, queueing @bio can change when @tg should be
+	 * dispatched.  Mark that @tg was empty.  This is automatically
+	 * cleared on the next tg_update_disptime().
+	 */
+	if (!sq->nr_queued[rw])
+		tg->flags |= THROTL_TG_WAS_EMPTY;
+
 	bio_list_add(&sq->bio_lists[rw], bio);
 	/* Take a bio reference on tg */
 	blkg_get(tg_to_blkg(tg));
@@ -740,6 +750,9 @@ static void tg_update_disptime(struct throtl_grp *tg,
 	throtl_dequeue_tg(tg, parent_sq);
 	tg->disptime = disptime;
 	throtl_enqueue_tg(tg, parent_sq);
+
+	/* see throtl_add_bio_tg() */
+	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
@@ -1064,7 +1077,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	struct throtl_data *td = q->td;
 	struct throtl_grp *tg;
 	struct throtl_service_queue *sq;
-	bool rw = bio_data_dir(bio), update_disptime = true;
+	bool rw = bio_data_dir(bio);
 	struct blkcg *blkcg;
 	bool throttled = false;
 
@@ -1100,16 +1113,10 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 
 	sq = &tg->service_queue;
 
-	if (sq->nr_queued[rw]) {
-		/*
-		 * There is already another bio queued in same dir. No
-		 * need to update dispatch time.
-		 */
-		update_disptime = false;
+	/* throtl is FIFO - if other bios are already queued, should queue */
+	if (sq->nr_queued[rw])
 		goto queue_bio;
 
-	}
-
 	/* Bio is with-in rate limit of group */
 	if (tg_may_dispatch(tg, bio, NULL)) {
 		throtl_charge_bio(tg, bio);
@@ -1141,7 +1148,8 @@ queue_bio:
 	throtl_add_bio_tg(bio, tg, &q->td->service_queue);
 	throttled = true;
 
-	if (update_disptime) {
+	/* update @tg's dispatch time if @tg was empty before @bio */
+	if (tg->flags & THROTL_TG_WAS_EMPTY) {
 		tg_update_disptime(tg, &td->service_queue);
 		throtl_schedule_next_dispatch(td);
 	}
-- 
1.8.1.4



* [PATCH 20/31] blk-throttle: add throtl_service_queue->parent_sq
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (18 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 19/31] blk-throttle: generalize update_disptime optimization in blk_throtl_bio() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() Tejun Heo
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

To prepare for hierarchy support, this patch adds
throtl_service_queue->parent_sq which points to the parent
service_queue.  Currently, for all service_queues embedded in
throtl_grps, it points to throtl_data->service_queue.  As
throtl_data->service_queue doesn't have a parent, its parent_sq is set
to NULL.

There are a number of functions which take both throtl_grp *tg and
throtl_service_queue *parent_sq.  With this patch, the parent
service_queue can be determined from @tg and the @parent_sq arguments
are removed.

This patch doesn't make any behavior differences.
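
For readers skimming the diff, the shape of the change can be sketched
in user space as below.  This is an assumption-laden model, not the
kernel code: struct model_sq, struct model_tg, the pending field and
the model_*() helpers are hypothetical, and the rbtree is reduced to a
counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a service_queue that knows its parent. */
struct model_sq {
	struct model_sq *parent_sq;	/* NULL for the top-level queue */
	unsigned int nr_pending;	/* children currently pending on us */
};

struct model_tg {
	struct model_sq service_queue;
	int pending;			/* stands in for THROTL_TG_PENDING */
};

static void model_sq_init(struct model_sq *sq, struct model_sq *parent_sq)
{
	sq->parent_sq = parent_sq;
	sq->nr_pending = 0;
}

/* No @parent_sq argument: the parent is derived from @tg itself. */
static void model_enqueue_tg(struct model_tg *tg)
{
	assert(tg->service_queue.parent_sq != NULL);
	if (!tg->pending) {
		tg->pending = 1;
		tg->service_queue.parent_sq->nr_pending++;
	}
}
```

Because the parent is stored once at init time, callers can no longer
pass a parent that disagrees with the one the group was set up with.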

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 81 +++++++++++++++++++++++++---------------------------
 1 file changed, 39 insertions(+), 42 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 65b8b38..64d1923 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -27,6 +27,8 @@ static struct blkcg_policy blkcg_policy_throtl;
 static struct workqueue_struct *kthrotld_workqueue;
 
 struct throtl_service_queue {
+	struct throtl_service_queue *parent_sq;	/* the parent service_queue */
+
 	/*
 	 * Bios queued directly to this service_queue or dispatched from
 	 * children throtl_grp's.
@@ -197,21 +199,24 @@ alloc_stats:
 }
 
 /* init a service_queue, assumes the caller zeroed it */
-static void throtl_service_queue_init(struct throtl_service_queue *sq)
+static void throtl_service_queue_init(struct throtl_service_queue *sq,
+				      struct throtl_service_queue *parent_sq)
 {
 	bio_list_init(&sq->bio_lists[0]);
 	bio_list_init(&sq->bio_lists[1]);
 	sq->pending_tree = RB_ROOT;
+	sq->parent_sq = parent_sq;
 }
 
 static void throtl_pd_init(struct blkcg_gq *blkg)
 {
 	struct throtl_grp *tg = blkg_to_tg(blkg);
+	struct throtl_data *td = blkg->q->td;
 	unsigned long flags;
 
-	throtl_service_queue_init(&tg->service_queue);
+	throtl_service_queue_init(&tg->service_queue, &td->service_queue);
 	RB_CLEAR_NODE(&tg->rb_node);
-	tg->td = blkg->q->td;
+	tg->td = td;
 
 	tg->bps[READ] = -1;
 	tg->bps[WRITE] = -1;
@@ -339,9 +344,9 @@ static void update_min_dispatch_time(struct throtl_service_queue *parent_sq)
 	parent_sq->first_pending_disptime = tg->disptime;
 }
 
-static void tg_service_queue_add(struct throtl_grp *tg,
-				 struct throtl_service_queue *parent_sq)
+static void tg_service_queue_add(struct throtl_grp *tg)
 {
+	struct throtl_service_queue *parent_sq = tg->service_queue.parent_sq;
 	struct rb_node **node = &parent_sq->pending_tree.rb_node;
 	struct rb_node *parent = NULL;
 	struct throtl_grp *__tg;
@@ -367,33 +372,29 @@ static void tg_service_queue_add(struct throtl_grp *tg,
 	rb_insert_color(&tg->rb_node, &parent_sq->pending_tree);
 }
 
-static void __throtl_enqueue_tg(struct throtl_grp *tg,
-				struct throtl_service_queue *parent_sq)
+static void __throtl_enqueue_tg(struct throtl_grp *tg)
 {
-	tg_service_queue_add(tg, parent_sq);
+	tg_service_queue_add(tg);
 	tg->flags |= THROTL_TG_PENDING;
-	parent_sq->nr_pending++;
+	tg->service_queue.parent_sq->nr_pending++;
 }
 
-static void throtl_enqueue_tg(struct throtl_grp *tg,
-			      struct throtl_service_queue *parent_sq)
+static void throtl_enqueue_tg(struct throtl_grp *tg)
 {
 	if (!(tg->flags & THROTL_TG_PENDING))
-		__throtl_enqueue_tg(tg, parent_sq);
+		__throtl_enqueue_tg(tg);
 }
 
-static void __throtl_dequeue_tg(struct throtl_grp *tg,
-				struct throtl_service_queue *parent_sq)
+static void __throtl_dequeue_tg(struct throtl_grp *tg)
 {
-	throtl_rb_erase(&tg->rb_node, parent_sq);
+	throtl_rb_erase(&tg->rb_node, tg->service_queue.parent_sq);
 	tg->flags &= ~THROTL_TG_PENDING;
 }
 
-static void throtl_dequeue_tg(struct throtl_grp *tg,
-			      struct throtl_service_queue *parent_sq)
+static void throtl_dequeue_tg(struct throtl_grp *tg)
 {
 	if (tg->flags & THROTL_TG_PENDING)
-		__throtl_dequeue_tg(tg, parent_sq);
+		__throtl_dequeue_tg(tg);
 }
 
 /* Call with queue lock held */
@@ -707,8 +708,7 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size, bio->bi_rw);
 }
 
-static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
-			      struct throtl_service_queue *parent_sq)
+static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
@@ -727,11 +727,10 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg,
 	blkg_get(tg_to_blkg(tg));
 	sq->nr_queued[rw]++;
 	tg->td->nr_queued[rw]++;
-	throtl_enqueue_tg(tg, parent_sq);
+	throtl_enqueue_tg(tg);
 }
 
-static void tg_update_disptime(struct throtl_grp *tg,
-			       struct throtl_service_queue *parent_sq)
+static void tg_update_disptime(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
@@ -747,16 +746,15 @@ static void tg_update_disptime(struct throtl_grp *tg,
 	disptime = jiffies + min_wait;
 
 	/* Update dispatch time */
-	throtl_dequeue_tg(tg, parent_sq);
+	throtl_dequeue_tg(tg);
 	tg->disptime = disptime;
-	throtl_enqueue_tg(tg, parent_sq);
+	throtl_enqueue_tg(tg);
 
 	/* see throtl_add_bio_tg() */
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
-static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
-				struct throtl_service_queue *parent_sq)
+static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct bio *bio;
@@ -770,14 +768,13 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw,
 	tg->td->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
-	bio_list_add(&parent_sq->bio_lists[rw], bio);
+	bio_list_add(&sq->parent_sq->bio_lists[rw], bio);
 	bio->bi_rw |= REQ_THROTTLED;
 
 	throtl_trim_slice(tg, rw);
 }
 
-static int throtl_dispatch_tg(struct throtl_grp *tg,
-			      struct throtl_service_queue *parent_sq)
+static int throtl_dispatch_tg(struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	unsigned int nr_reads = 0, nr_writes = 0;
@@ -790,7 +787,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg,
 	while ((bio = bio_list_peek(&sq->bio_lists[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio));
 		nr_reads++;
 
 		if (nr_reads >= max_nr_reads)
@@ -800,7 +797,7 @@ static int throtl_dispatch_tg(struct throtl_grp *tg,
 	while ((bio = bio_list_peek(&sq->bio_lists[WRITE])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
-		tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
+		tg_dispatch_one_bio(tg, bio_data_dir(bio));
 		nr_writes++;
 
 		if (nr_writes >= max_nr_writes)
@@ -824,12 +821,12 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 		if (time_before(jiffies, tg->disptime))
 			break;
 
-		throtl_dequeue_tg(tg, parent_sq);
+		throtl_dequeue_tg(tg);
 
-		nr_disp += throtl_dispatch_tg(tg, parent_sq);
+		nr_disp += throtl_dispatch_tg(tg);
 
 		if (sq->nr_queued[0] || sq->nr_queued[1])
-			tg_update_disptime(tg, parent_sq);
+			tg_update_disptime(tg);
 
 		if (nr_disp >= throtl_quantum)
 			break;
@@ -991,7 +988,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	throtl_start_new_slice(tg, 1);
 
 	if (tg->flags & THROTL_TG_PENDING) {
-		tg_update_disptime(tg, &td->service_queue);
+		tg_update_disptime(tg);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1145,12 +1142,12 @@ queue_bio:
 			sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
-	throtl_add_bio_tg(bio, tg, &q->td->service_queue);
+	throtl_add_bio_tg(bio, tg);
 	throttled = true;
 
 	/* update @tg's dispatch time if @tg was empty before @bio */
 	if (tg->flags & THROTL_TG_WAS_EMPTY) {
-		tg_update_disptime(tg, &td->service_queue);
+		tg_update_disptime(tg);
 		throtl_schedule_next_dispatch(td);
 	}
 
@@ -1182,12 +1179,12 @@ void blk_throtl_drain(struct request_queue *q)
 	while ((tg = throtl_rb_first(parent_sq))) {
 		struct throtl_service_queue *sq = &tg->service_queue;
 
-		throtl_dequeue_tg(tg, parent_sq);
+		throtl_dequeue_tg(tg);
 
 		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
 		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio), parent_sq);
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
 	}
 	spin_unlock_irq(q->queue_lock);
 
@@ -1208,7 +1205,7 @@ int blk_throtl_init(struct request_queue *q)
 		return -ENOMEM;
 
 	INIT_DELAYED_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
-	throtl_service_queue_init(&td->service_queue);
+	throtl_service_queue_init(&td->service_queue, NULL);
 
 	q->td = td;
 	td->queue = q;
-- 
1.8.1.4



* [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (19 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 20/31] blk-throttle: add throtl_service_queue->parent_sq Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-06 17:36   ` Vivek Goyal
  2013-05-02  0:39 ` [PATCH 22/31] blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it Tejun Heo
                   ` (11 subsequent siblings)
  32 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Now that both throtl_data and throtl_grp embed throtl_service_queue,
we can unify throtl_log() and throtl_log_tg().

* sq_to_tg() is added.  This returns the throtl_grp a service_queue is
  embedded in.  If the service_queue is the top-level one embedded in
  throtl_data, NULL is returned.

* sq_to_td() is added.  A service_queue is always associated with a
  throtl_data.  This function finds the associated td and returns it.

* throtl_log() is updated to take throtl_service_queue instead of
  throtl_data.  If the service_queue is one embedded in throtl_grp, it
  prints the same header as throtl_log_tg() did.  If it's one embedded
  in throtl_data, it behaves the same as before.  This renders
  throtl_log_tg() unnecessary.  Removed.

This change is necessary for hierarchy support as we're gonna be using
the same code paths to dispatch bios to intermediate service_queues
embedded in throtl_grps and the top-level service_queue embedded in
throtl_data.

This patch doesn't make any behavior changes.
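
The two lookups rely on the fact that only the top-level
service_queue has a NULL parent_sq.  A user-space sketch of that idea,
assuming simplified structs (struct model_td, struct model_tg and
model_container_of() are hypothetical stand-ins, not the kernel
definitions):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(). */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct model_sq {
	struct model_sq *parent_sq;	/* NULL only for the top-level queue */
};

struct model_td {
	int id;
	struct model_sq service_queue;
};

struct model_tg {
	struct model_td *td;
	struct model_sq service_queue;
};

/* A queue with a parent must be embedded in a group. */
static struct model_tg *model_sq_to_tg(struct model_sq *sq)
{
	if (sq && sq->parent_sq)
		return model_container_of(sq, struct model_tg, service_queue);
	return NULL;
}

/* Either follow the group's td pointer or unwrap the top-level queue. */
static struct model_td *model_sq_to_td(struct model_sq *sq)
{
	struct model_tg *tg = model_sq_to_tg(sq);

	assert(sq != NULL);
	if (tg)
		return tg->td;
	return model_container_of(sq, struct model_td, service_queue);
}
```

Note that the parent_sq test is the only thing that tells the two
embedding types apart, so every group queue must be initialized with a
non-NULL parent for the lookup to be valid.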

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 105 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 77 insertions(+), 28 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 64d1923..9bbe312 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -151,16 +151,62 @@ static inline struct throtl_grp *td_root_tg(struct throtl_data *td)
 	return blkg_to_tg(td->queue->root_blkg);
 }
 
-#define throtl_log_tg(tg, fmt, args...)	do {				\
+/**
+ * sq_to_tg - return the throtl_grp the specified service queue belongs to
+ * @sq: the throtl_service_queue of interest
+ *
+ * Return the throtl_grp @sq belongs to.  If @sq is the top-level one
+ * embedded in throtl_data, %NULL is returned.
+ */
+static struct throtl_grp *sq_to_tg(struct throtl_service_queue *sq)
+{
+	if (sq && sq->parent_sq)
+		return container_of(sq, struct throtl_grp, service_queue);
+	else
+		return NULL;
+}
+
+/**
+ * sq_to_td - return throtl_data the specified service queue belongs to
+ * @sq: the throtl_service_queue of interest
+ *
+ * A service_queue can be embedded in either a throtl_grp or throtl_data.
+ * Determine the associated throtl_data accordingly and return it.
+ */
+static struct throtl_data *sq_to_td(struct throtl_service_queue *sq)
+{
+	struct throtl_grp *tg = sq_to_tg(sq);
+
+	if (tg)
+		return tg->td;
+	else
+		return container_of(sq, struct throtl_data, service_queue);
+}
+
+/**
+ * throtl_log - log debug message via blktrace
+ * @sq: the service_queue being reported
+ * @fmt: printf format string
+ * @args: printf args
+ *
+ * The messages are prefixed with "throtl BLKG_NAME" if @sq belongs to a
+ * throtl_grp; otherwise, just "throtl".
+ *
+ * TODO: this should be made a function and name formatting should happen
+ * after testing whether blktrace is enabled.
+ */
+#define throtl_log(sq, fmt, args...)	do {				\
+	struct throtl_grp *__tg = sq_to_tg((sq));			\
+	struct throtl_data *__td = sq_to_td((sq));			\
 	char __pbuf[128];						\
 									\
-	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
-	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
+	__pbuf[0] = ' ';						\
+	__pbuf[1] = '\0';						\
+	if ((__tg))							\
+		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
+	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
 } while (0)
 
-#define throtl_log(td, fmt, args...)	\
-	blk_add_trace_msg((td)->queue, "throtl " fmt, ##args)
-
 /*
  * Worker for allocating per cpu stat for tgs. This is scheduled on the
  * system_wq once there are some groups on the alloc_list waiting for
@@ -402,9 +448,10 @@ static void throtl_schedule_delayed_work(struct throtl_data *td,
 					 unsigned long delay)
 {
 	struct delayed_work *dwork = &td->dispatch_work;
+	struct throtl_service_queue *sq = &td->service_queue;
 
 	mod_delayed_work(kthrotld_workqueue, dwork, delay);
-	throtl_log(td, "schedule work. delay=%lu jiffies=%lu", delay, jiffies);
+	throtl_log(sq, "schedule work. delay=%lu jiffies=%lu", delay, jiffies);
 }
 
 static void throtl_schedule_next_dispatch(struct throtl_data *td)
@@ -429,9 +476,10 @@ static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 	tg->io_disp[rw] = 0;
 	tg->slice_start[rw] = jiffies;
 	tg->slice_end[rw] = jiffies + throtl_slice;
-	throtl_log_tg(tg, "[%c] new slice start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', tg->slice_start[rw],
-			tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] new slice start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
 }
 
 static inline void throtl_set_slice_end(struct throtl_grp *tg, bool rw,
@@ -444,9 +492,10 @@ static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
 				       unsigned long jiffy_end)
 {
 	tg->slice_end[rw] = roundup(jiffy_end, throtl_slice);
-	throtl_log_tg(tg, "[%c] extend slice start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', tg->slice_start[rw],
-			tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
 }
 
 /* Determine if previously allocated or extended slice is complete or not */
@@ -511,10 +560,10 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
 
 	tg->slice_start[rw] += nr_slices * throtl_slice;
 
-	throtl_log_tg(tg, "[%c] trim slice nr=%lu bytes=%llu io=%lu"
-			" start=%lu end=%lu jiffies=%lu",
-			rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
-			tg->slice_start[rw], tg->slice_end[rw], jiffies);
+	throtl_log(&tg->service_queue,
+		   "[%c] trim slice nr=%lu bytes=%llu io=%lu start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', nr_slices, bytes_trim, io_trim,
+		   tg->slice_start[rw], tg->slice_end[rw], jiffies);
 }
 
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
@@ -852,7 +901,7 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 
 	bio_list_init(&bio_list_on_stack);
 
-	throtl_log(td, "dispatch nr_queued=%u read=%u write=%u",
+	throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
 		   td->nr_queued[READ] + td->nr_queued[WRITE],
 		   td->nr_queued[READ], td->nr_queued[WRITE]);
 
@@ -863,7 +912,7 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 			bio_list_merge(&bio_list_on_stack, &sq->bio_lists[rw]);
 			bio_list_init(&sq->bio_lists[rw]);
 		}
-		throtl_log(td, "bios disp=%u", nr_disp);
+		throtl_log(sq, "bios disp=%u", nr_disp);
 	}
 
 	throtl_schedule_next_dispatch(td);
@@ -972,9 +1021,10 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	else
 		*(unsigned int *)((void *)tg + cft->private) = ctx.v;
 
-	throtl_log_tg(tg, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
-		      tg->bps[READ], tg->bps[WRITE],
-		      tg->iops[READ], tg->iops[WRITE]);
+	throtl_log(&tg->service_queue,
+		   "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
+		   tg->bps[READ], tg->bps[WRITE],
+		   tg->iops[READ], tg->iops[WRITE]);
 
 	/*
 	 * We're already holding queue_lock and know @tg is valid.  Let's
@@ -1134,12 +1184,11 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	}
 
 queue_bio:
-	throtl_log_tg(tg, "[%c] bio. bdisp=%llu sz=%u bps=%llu"
-			" iodisp=%u iops=%u queued=%d/%d",
-			rw == READ ? 'R' : 'W',
-			tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-			tg->io_disp[rw], tg->iops[rw],
-			sq->nr_queued[READ], sq->nr_queued[WRITE]);
+	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+		   rw == READ ? 'R' : 'W',
+		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
+		   tg->io_disp[rw], tg->iops[rw],
+		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
 	throtl_add_bio_tg(bio, tg);
-- 
1.8.1.4



* [PATCH 22/31] blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (20 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 23/31] blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work Tejun Heo
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

With proper hierarchy support, a bio can be dispatched multiple times
until it reaches the top-level service_queue and we don't want to
update dispatch stats at each step.  They are local stats and will be
kept local.  If recursive stats are necessary, they should be
implemented separately and definitely not by updating counters
recursively on each dispatch.

This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gates
stats update with it so that dispatch stats are updated only on the
first time the bio is charged to a throtl_grp, which will always be
the throtl_grp the bio was originally queued to.

This means that REQ_THROTTLED would be set even for bios which don't
get throttled.  As we don't want bios to leave blk-throtl with the
flag set, move REQ_THROTTLED clearing to the end of blk_throtl_bio()
and clear if the bio is being issued directly.
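
The gating can be pictured with a small user-space model.  This is
only a sketch under stated assumptions: struct model_bio, struct
model_tg, MODEL_REQ_THROTTLED and the stat_bytes field are hypothetical
stand-ins for the bio flag and the per-group dispatch stats.

```c
#include <assert.h>

/* Hypothetical flag standing in for REQ_THROTTLED. */
enum { MODEL_REQ_THROTTLED = 1 << 0 };

struct model_bio {
	unsigned int flags;
	unsigned int size;
};

struct model_tg {
	unsigned long long bytes_disp;	/* charged against the limit */
	unsigned long long stat_bytes;	/* local dispatch statistics */
};

/*
 * Charge @bio to @tg.  The limit accounting runs on every charge, but
 * the statistics are updated only by the first group to charge the bio,
 * which is the group the bio was originally queued to.
 */
static void model_charge_bio(struct model_tg *tg, struct model_bio *bio)
{
	assert(tg && bio);
	tg->bytes_disp += bio->size;

	if (!(bio->flags & MODEL_REQ_THROTTLED)) {
		bio->flags |= MODEL_REQ_THROTTLED;
		tg->stat_bytes += bio->size;
	}
}
```

Charging the same bio first to a child and then to its parent in this
model consumes budget in both groups while only the child's statistics
move.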

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9bbe312..7f8ca43 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -754,7 +754,22 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	tg->bytes_disp[rw] += bio->bi_size;
 	tg->io_disp[rw]++;
 
-	throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size, bio->bi_rw);
+	/*
+	 * REQ_THROTTLED is used to prevent the same bio from being throttled
+	 * more than once as a throttled bio will go through blk-throtl the
+	 * second time when it eventually gets issued.  Set it when a bio
+	 * is being charged to a tg.
+	 *
+	 * Dispatch stats aren't recursive and each @bio should only be
+	 * accounted by the @tg it was originally associated with.  Let's
+	 * update the stats when setting REQ_THROTTLED for the first time
+	 * which is guaranteed to be for the @bio's original tg.
+	 */
+	if (!(bio->bi_rw & REQ_THROTTLED)) {
+		bio->bi_rw |= REQ_THROTTLED;
+		throtl_update_dispatch_stats(tg_to_blkg(tg), bio->bi_size,
+					     bio->bi_rw);
+	}
 }
 
 static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg)
@@ -818,7 +833,6 @@ static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 
 	throtl_charge_bio(tg, bio);
 	bio_list_add(&sq->parent_sq->bio_lists[rw], bio);
-	bio->bi_rw |= REQ_THROTTLED;
 
 	throtl_trim_slice(tg, rw);
 }
@@ -1128,10 +1142,9 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	struct blkcg *blkcg;
 	bool throttled = false;
 
-	if (bio->bi_rw & REQ_THROTTLED) {
-		bio->bi_rw &= ~REQ_THROTTLED;
+	/* see throtl_charge_bio() */
+	if (bio->bi_rw & REQ_THROTTLED)
 		goto out;
-	}
 
 	/*
 	 * A throtl_grp pointer retrieved under rcu can be used to access
@@ -1205,6 +1218,13 @@ out_unlock:
 out_unlock_rcu:
 	rcu_read_unlock();
 out:
+	/*
+	 * As multiple blk-throtls may stack in the same issue path, we
+	 * don't want bios to leave with the flag set.  Clear the flag if
+	 * being issued.
+	 */
+	if (!throttled)
+		bio->bi_rw &= ~REQ_THROTTLED;
 	return throttled;
 }
 
-- 
1.8.1.4



* [PATCH 23/31] blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (21 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 22/31] blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 24/31] blk-throttle: implement dispatch looping Tejun Heo
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, throtl_data->dispatch_work is a delayed_work item which
handles both delayed dispatch and issuing bios.  The two tasks will be
separated to support proper hierarchy.  To prepare for that, this
patch separates out the timer into throtl_service_queue->pending_timer
from throtl_data->dispatch_work and makes the latter a work_struct.

* As the timer is now per-service_queue, it's initialized and
  del_sync'd as its corresponding service_queue is created and
  destroyed.  The timer, when triggered, simply schedules
  throtl_data->dispatch_work for execution.

* throtl_schedule_delayed_work() is renamed to
  throtl_schedule_pending_timer() and takes @sq and @expires now.

* Similarly, throtl_schedule_next_dispatch() now takes @sq, which
  should be the parent_sq of the service_queue which just got a new
  bio or updated.  As the parent_sq is always the top-level
  service_queue now, this doesn't change anything at this point.

This patch doesn't introduce any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 70 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 46 insertions(+), 24 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7f8ca43..9270663 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -44,6 +44,7 @@ struct throtl_service_queue {
 	struct rb_node		*first_pending;	/* first node in the tree */
 	unsigned int		nr_pending;	/* # queued in the tree */
 	unsigned long		first_pending_disptime;	/* disptime of the first tg */
+	struct timer_list	pending_timer;	/* fires on first_pending_disptime */
 };
 
 enum tg_state_flags {
@@ -121,7 +122,7 @@ struct throtl_data
 	unsigned int nr_undestroyed_grps;
 
 	/* Work for dispatching throttled bios */
-	struct delayed_work dispatch_work;
+	struct work_struct dispatch_work;
 };
 
 /* list and work item to allocate percpu group stats */
@@ -131,6 +132,8 @@ static LIST_HEAD(tg_stats_alloc_list);
 static void tg_stats_alloc_fn(struct work_struct *);
 static DECLARE_DELAYED_WORK(tg_stats_alloc_work, tg_stats_alloc_fn);
 
+static void throtl_pending_timer_fn(unsigned long arg);
+
 static inline struct throtl_grp *pd_to_tg(struct blkg_policy_data *pd)
 {
 	return pd ? container_of(pd, struct throtl_grp, pd) : NULL;
@@ -252,6 +255,13 @@ static void throtl_service_queue_init(struct throtl_service_queue *sq,
 	bio_list_init(&sq->bio_lists[1]);
 	sq->pending_tree = RB_ROOT;
 	sq->parent_sq = parent_sq;
+	setup_timer(&sq->pending_timer, throtl_pending_timer_fn,
+		    (unsigned long)sq);
+}
+
+static void throtl_service_queue_exit(struct throtl_service_queue *sq)
+{
+	del_timer_sync(&sq->pending_timer);
 }
 
 static void throtl_pd_init(struct blkcg_gq *blkg)
@@ -290,6 +300,8 @@ static void throtl_pd_exit(struct blkcg_gq *blkg)
 	spin_unlock_irqrestore(&tg_stats_alloc_lock, flags);
 
 	free_percpu(tg->stats_cpu);
+
+	throtl_service_queue_exit(&tg->service_queue);
 }
 
 static void throtl_pd_reset_stats(struct blkcg_gq *blkg)
@@ -444,19 +456,17 @@ static void throtl_dequeue_tg(struct throtl_grp *tg)
 }
 
 /* Call with queue lock held */
-static void throtl_schedule_delayed_work(struct throtl_data *td,
-					 unsigned long delay)
+static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
+					  unsigned long expires)
 {
-	struct delayed_work *dwork = &td->dispatch_work;
-	struct throtl_service_queue *sq = &td->service_queue;
-
-	mod_delayed_work(kthrotld_workqueue, dwork, delay);
-	throtl_log(sq, "schedule work. delay=%lu jiffies=%lu", delay, jiffies);
+	mod_timer(&sq->pending_timer, expires);
+	throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
+		   expires - jiffies, jiffies);
 }
 
-static void throtl_schedule_next_dispatch(struct throtl_data *td)
+static void throtl_schedule_next_dispatch(struct throtl_service_queue *sq)
 {
-	struct throtl_service_queue *sq = &td->service_queue;
+	struct throtl_data *td = sq_to_td(sq);
 
 	/* any pending children left? */
 	if (!sq->nr_pending)
@@ -464,10 +474,14 @@ static void throtl_schedule_next_dispatch(struct throtl_data *td)
 
 	update_min_dispatch_time(sq);
 
-	if (time_before_eq(sq->first_pending_disptime, jiffies))
-		throtl_schedule_delayed_work(td, 0);
-	else
-		throtl_schedule_delayed_work(td, sq->first_pending_disptime - jiffies);
+	/* is the next dispatch time in the future? */
+	if (time_after(sq->first_pending_disptime, jiffies)) {
+		throtl_schedule_pending_timer(sq, sq->first_pending_disptime);
+		return;
+	}
+
+	/* kick immediate execution */
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
@@ -898,11 +912,19 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+static void throtl_pending_timer_fn(unsigned long arg)
+{
+	struct throtl_service_queue *sq = (void *)arg;
+	struct throtl_data *td = sq_to_td(sq);
+
+	queue_work(kthrotld_workqueue, &td->dispatch_work);
+}
+
 /* work function to dispatch throttled bios */
 void blk_throtl_dispatch_work_fn(struct work_struct *work)
 {
-	struct throtl_data *td = container_of(to_delayed_work(work),
-					      struct throtl_data, dispatch_work);
+	struct throtl_data *td = container_of(work, struct throtl_data,
+					      dispatch_work);
 	struct throtl_service_queue *sq = &td->service_queue;
 	struct request_queue *q = td->queue;
 	unsigned int nr_disp = 0;
@@ -929,7 +951,7 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		throtl_log(sq, "bios disp=%u", nr_disp);
 	}
 
-	throtl_schedule_next_dispatch(td);
+	throtl_schedule_next_dispatch(sq);
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -1017,7 +1039,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
-	struct throtl_data *td;
+	struct throtl_service_queue *sq;
 	int ret;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
@@ -1025,7 +1047,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 		return ret;
 
 	tg = blkg_to_tg(ctx.blkg);
-	td = ctx.blkg->q->td;
+	sq = &tg->service_queue;
 
 	if (!ctx.v)
 		ctx.v = -1;
@@ -1053,11 +1075,11 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
-		throtl_schedule_next_dispatch(td);
+		throtl_schedule_next_dispatch(sq->parent_sq);
 	}
 
 	/* kick dispatch in case disptime got shortened */
-	throtl_schedule_delayed_work(td, 0);
+	throtl_schedule_pending_timer(sq->parent_sq, jiffies);
 
 	blkg_conf_finish(&ctx);
 	return 0;
@@ -1121,7 +1143,7 @@ static void throtl_shutdown_wq(struct request_queue *q)
 {
 	struct throtl_data *td = q->td;
 
-	cancel_delayed_work_sync(&td->dispatch_work);
+	cancel_work_sync(&td->dispatch_work);
 }
 
 static struct blkcg_policy blkcg_policy_throtl = {
@@ -1210,7 +1232,7 @@ queue_bio:
 	/* update @tg's dispatch time if @tg was empty before @bio */
 	if (tg->flags & THROTL_TG_WAS_EMPTY) {
 		tg_update_disptime(tg);
-		throtl_schedule_next_dispatch(td);
+		throtl_schedule_next_dispatch(tg->service_queue.parent_sq);
 	}
 
 out_unlock:
@@ -1273,7 +1295,7 @@ int blk_throtl_init(struct request_queue *q)
 	if (!td)
 		return -ENOMEM;
 
-	INIT_DELAYED_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
+	INIT_WORK(&td->dispatch_work, blk_throtl_dispatch_work_fn);
 	throtl_service_queue_init(&td->service_queue, NULL);
 
 	q->td = td;
-- 
1.8.1.4



* [PATCH 24/31] blk-throttle: implement dispatch looping
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (22 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 23/31] blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 25/31] blk-throttle: dispatch from throtl_pending_timer_fn() Tejun Heo
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_select_dispatch() only dispatches throtl_quantum bios on each
invocation.  blk_throtl_dispatch_work_fn() in turn depends on
throtl_schedule_next_dispatch() scheduling the next dispatch window
immediately so that undue delays aren't incurred.  This effectively
chains multiple dispatch work item executions back-to-back when there
are more than throtl_quantum bios to dispatch on a given tick.

There is no reason to finish the current work item just to repeat it
immediately.  This patch makes throtl_schedule_next_dispatch() return
%false without doing anything if the current dispatch window is still
open and updates blk_throtl_dispatch_work_fn() to repeat dispatching
after cpu_relax() on %false return.

This change will help implement hierarchy support, as dispatching
will be done from pending_timer, and immediately rescheduling a timer
function isn't supported and doesn't make much sense.

While this patch changes how dispatch behaves when there are more than
throtl_quantum bios to dispatch on a single tick, the behavior change
is immaterial.
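
The looping described above can be illustrated with a small userspace
sketch.  It is not the kernel code: struct sketch_sq, SKETCH_QUANTUM and
the sketch_*() helpers are hypothetical stand-ins, and the timer is
reduced to a flag.  It only shows how a %false return keeps the caller
dispatching until the window closes.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_QUANTUM 8	/* hypothetical stand-in for throtl_quantum */

/* simplified model of a service queue, not the kernel structure */
struct sketch_sq {
	unsigned int nr_ready;	/* bios whose dispatch time has passed */
	unsigned int nr_future;	/* bios whose dispatch time is still ahead */
	bool timer_armed;	/* models an armed pending_timer */
};

/* dispatch at most one quantum of ready bios, return how many */
static unsigned int sketch_select_dispatch(struct sketch_sq *sq)
{
	unsigned int n = sq->nr_ready < SKETCH_QUANTUM ?
			 sq->nr_ready : SKETCH_QUANTUM;

	sq->nr_ready -= n;
	return n;
}

/*
 * Return true if nothing is left or the timer got armed, false if the
 * current window is still open and the caller should keep dispatching.
 */
static bool sketch_schedule_next_dispatch(struct sketch_sq *sq, bool force)
{
	if (!sq->nr_ready && !sq->nr_future)
		return true;

	if (force || !sq->nr_ready) {
		sq->timer_armed = true;
		return true;
	}

	return false;
}

/* repeat dispatching in place instead of requeueing the work item */
static unsigned int sketch_dispatch_loop(struct sketch_sq *sq)
{
	unsigned int total = 0;

	for (;;) {
		total += sketch_select_dispatch(sq);
		if (sketch_schedule_next_dispatch(sq, false))
			break;
		/* the patch drops queue_lock and calls cpu_relax() here */
	}
	return total;
}
```

With 20 ready bios and a quantum of 8, the loop runs three rounds in a
single invocation and arms the timer only once the remaining bios are
all in the future.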

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 82 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 56 insertions(+), 26 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 9270663..d573cdf 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -464,24 +464,41 @@ static void throtl_schedule_pending_timer(struct throtl_service_queue *sq,
 		   expires - jiffies, jiffies);
 }
 
-static void throtl_schedule_next_dispatch(struct throtl_service_queue *sq)
+/**
+ * throtl_schedule_next_dispatch - schedule the next dispatch cycle
+ * @sq: the service_queue to schedule dispatch for
+ * @force: force scheduling
+ *
+ * Arm @sq->pending_timer so that the next dispatch cycle starts on the
+ * dispatch time of the first pending child.  Returns %true if either timer
+ * is armed or there's no pending child left.  %false if the current
+ * dispatch window is still open and the caller should continue
+ * dispatching.
+ *
+ * If @force is %true, the dispatch timer is always scheduled and this
+ * function is guaranteed to return %true.  This is to be used when the
+ * caller can't dispatch itself and needs to invoke pending_timer
+ * unconditionally.  Note that forced scheduling is likely to induce short
+ * delay before dispatch starts even if @sq->first_pending_disptime is not
+ * in the future and thus shouldn't be used in hot paths.
+ */
+static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
+					  bool force)
 {
-	struct throtl_data *td = sq_to_td(sq);
-
 	/* any pending children left? */
 	if (!sq->nr_pending)
-		return;
+		return true;
 
 	update_min_dispatch_time(sq);
 
 	/* is the next dispatch time in the future? */
-	if (time_after(sq->first_pending_disptime, jiffies)) {
+	if (force || time_after(sq->first_pending_disptime, jiffies)) {
 		throtl_schedule_pending_timer(sq, sq->first_pending_disptime);
-		return;
+		return true;
 	}
 
-	/* kick immediate execution */
-	queue_work(kthrotld_workqueue, &td->dispatch_work);
+	/* tell the caller to continue dispatching */
+	return false;
 }
 
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
@@ -927,39 +944,47 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 					      dispatch_work);
 	struct throtl_service_queue *sq = &td->service_queue;
 	struct request_queue *q = td->queue;
-	unsigned int nr_disp = 0;
 	struct bio_list bio_list_on_stack;
 	struct bio *bio;
 	struct blk_plug plug;
-	int rw;
+	bool dispatched = false;
+	int rw, ret;
 
 	spin_lock_irq(q->queue_lock);
 
 	bio_list_init(&bio_list_on_stack);
 
-	throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
-		   td->nr_queued[READ] + td->nr_queued[WRITE],
-		   td->nr_queued[READ], td->nr_queued[WRITE]);
+	while (true) {
+		throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
+			   td->nr_queued[READ] + td->nr_queued[WRITE],
+			   td->nr_queued[READ], td->nr_queued[WRITE]);
+
+		ret = throtl_select_dispatch(sq);
+		if (ret) {
+			for (rw = READ; rw <= WRITE; rw++) {
+				bio_list_merge(&bio_list_on_stack, &sq->bio_lists[rw]);
+				bio_list_init(&sq->bio_lists[rw]);
+			}
+			throtl_log(sq, "bios disp=%u", ret);
+			dispatched = true;
+		}
 
-	nr_disp = throtl_select_dispatch(sq);
+		if (throtl_schedule_next_dispatch(sq, false))
+			break;
 
-	if (nr_disp) {
-		for (rw = READ; rw <= WRITE; rw++) {
-			bio_list_merge(&bio_list_on_stack, &sq->bio_lists[rw]);
-			bio_list_init(&sq->bio_lists[rw]);
-		}
-		throtl_log(sq, "bios disp=%u", nr_disp);
+		/* this dispatch window is still open, relax and repeat */
+		spin_unlock_irq(q->queue_lock);
+		cpu_relax();
+		spin_lock_irq(q->queue_lock);
 	}
 
-	throtl_schedule_next_dispatch(sq);
-
 	spin_unlock_irq(q->queue_lock);
 
 	/*
 	 * If we dispatched some requests, unplug the queue to make sure
 	 * immediate dispatch
 	 */
-	if (nr_disp) {
+	if (dispatched) {
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
@@ -1075,7 +1100,7 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 
 	if (tg->flags & THROTL_TG_PENDING) {
 		tg_update_disptime(tg);
-		throtl_schedule_next_dispatch(sq->parent_sq);
+		throtl_schedule_next_dispatch(sq->parent_sq, true);
 	}
 
 	/* kick dispatch in case disptime got shortened */
@@ -1229,10 +1254,15 @@ queue_bio:
 	throtl_add_bio_tg(bio, tg);
 	throttled = true;
 
-	/* update @tg's dispatch time if @tg was empty before @bio */
+	/*
+	 * Update @tg's dispatch time and force schedule dispatch if @tg
+	 * was empty before @bio.  The forced scheduling isn't likely to
+	 * cause undue delay as @bio is likely to be dispatched directly if
+	 * its @tg's disptime is not in the future.
+	 */
 	if (tg->flags & THROTL_TG_WAS_EMPTY) {
 		tg_update_disptime(tg);
-		throtl_schedule_next_dispatch(tg->service_queue.parent_sq);
+		throtl_schedule_next_dispatch(tg->service_queue.parent_sq, true);
 	}
 
 out_unlock:
-- 
1.8.1.4



* [PATCH 25/31] blk-throttle: dispatch from throtl_pending_timer_fn()
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (23 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 24/31] blk-throttle: implement dispatch looping Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 26/31] blk-throttle: make blk_throtl_drain() ready for hierarchy Tejun Heo
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, blk_throtl_dispatch_work_fn() is responsible for both
dispatching bio's from throtl_grp's according to their limits and then
issuing the dispatched bios.

This patch moves the dispatch part to throtl_pending_timer_fn() so
that the work item is kicked iff there are bio's to issue.  This is to
avoid work item execution at each step when hierarchy support is
enabled.  bio's will be dispatched towards the top-level service_queue
from the timers at each layer and the work item will only be used to
issue the bio's which reached the top-level service_queue.

While fetching bio's to issue from bio_lists[],
blk_throtl_dispatch_work_fn() fetches all READs before WRITEs.  While
the original code also dispatched READs first, if multiple throtl_grps
are dispatched on the same run, WRITEs from throtl_grp which is
dispatched first would precede READs from throtl_grps which are
dispatched later.  While this is a behavior change, given that the
previous code already prioritized READs and block layer generally
prioritizes and segregates READs from WRITEs, this isn't likely to
make any noticeable differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 69 +++++++++++++++++++++++++++++++++-------------------
 1 file changed, 44 insertions(+), 25 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index d573cdf..8f435a7 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -929,31 +929,26 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
 	return nr_disp;
 }
 
+/**
+ * throtl_pending_timer_fn - timer function for service_queue->pending_timer
+ * @arg: the throtl_service_queue being serviced
+ *
+ * This timer is armed when a child throtl_grp with active bio's become
+ * pending and queued on the service_queue's pending_tree and expires when
+ * the first child throtl_grp should be dispatched.  This function
+ * dispatches bio's from the children throtl_grps and kicks
+ * throtl_data->dispatch_work if there are bio's ready to be issued.
+ */
 static void throtl_pending_timer_fn(unsigned long arg)
 {
 	struct throtl_service_queue *sq = (void *)arg;
 	struct throtl_data *td = sq_to_td(sq);
-
-	queue_work(kthrotld_workqueue, &td->dispatch_work);
-}
-
-/* work function to dispatch throttled bios */
-void blk_throtl_dispatch_work_fn(struct work_struct *work)
-{
-	struct throtl_data *td = container_of(work, struct throtl_data,
-					      dispatch_work);
-	struct throtl_service_queue *sq = &td->service_queue;
 	struct request_queue *q = td->queue;
-	struct bio_list bio_list_on_stack;
-	struct bio *bio;
-	struct blk_plug plug;
 	bool dispatched = false;
-	int rw, ret;
+	int ret;
 
 	spin_lock_irq(q->queue_lock);
 
-	bio_list_init(&bio_list_on_stack);
-
 	while (true) {
 		throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
 			   td->nr_queued[READ] + td->nr_queued[WRITE],
@@ -961,10 +956,6 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 
 		ret = throtl_select_dispatch(sq);
 		if (ret) {
-			for (rw = READ; rw <= WRITE; rw++) {
-				bio_list_merge(&bio_list_on_stack, &sq->bio_lists[rw]);
-				bio_list_init(&sq->bio_lists[rw]);
-			}
 			throtl_log(sq, "bios disp=%u", ret);
 			dispatched = true;
 		}
@@ -978,13 +969,41 @@ void blk_throtl_dispatch_work_fn(struct work_struct *work)
 		spin_lock_irq(q->queue_lock);
 	}
 
+	if (dispatched)
+		queue_work(kthrotld_workqueue, &td->dispatch_work);
+
 	spin_unlock_irq(q->queue_lock);
+}
 
-	/*
-	 * If we dispatched some requests, unplug the queue to make sure
-	 * immediate dispatch
-	 */
-	if (dispatched) {
+/**
+ * blk_throtl_dispatch_work_fn - work function for throtl_data->dispatch_work
+ * @work: work item being executed
+ *
+ * This function is queued for execution when bio's reach the bio_lists[]
+ * of throtl_data->service_queue.  Those bio's are ready and issued by this
+ * function.
+ */
+void blk_throtl_dispatch_work_fn(struct work_struct *work)
+{
+	struct throtl_data *td = container_of(work, struct throtl_data,
+					      dispatch_work);
+	struct throtl_service_queue *td_sq = &td->service_queue;
+	struct request_queue *q = td->queue;
+	struct bio_list bio_list_on_stack;
+	struct bio *bio;
+	struct blk_plug plug;
+	int rw;
+
+	bio_list_init(&bio_list_on_stack);
+
+	spin_lock_irq(q->queue_lock);
+	for (rw = READ; rw <= WRITE; rw++) {
+		bio_list_merge(&bio_list_on_stack, &td_sq->bio_lists[rw]);
+		bio_list_init(&td_sq->bio_lists[rw]);
+	}
+	spin_unlock_irq(q->queue_lock);
+
+	if (!bio_list_empty(&bio_list_on_stack)) {
 		blk_start_plug(&plug);
 		while((bio = bio_list_pop(&bio_list_on_stack)))
 			generic_make_request(bio);
-- 
1.8.1.4



* [PATCH 26/31] blk-throttle: make blk_throtl_drain() ready for hierarchy
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (24 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 25/31] blk-throttle: dispatch from throtl_pending_timer_fn() Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 27/31] blk-throttle: make blk_throtl_bio() " Tejun Heo
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

The current blk_throtl_drain() assumes that all active throtl_grps are
queued on throtl_data->service_queue, which won't be true once
hierarchy support is implemented.

This patch makes blk_throtl_drain() perform a post-order walk of the
blkg hierarchy, draining each associated throtl_grp, which guarantees
that all bios will eventually be pushed to the top-level service_queue
in throtl_data.
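
Why post-order matters can be seen in a minimal userspace sketch.  The
names here (struct sketch_grp, sketch_add_child(), sketch_drain()) are
hypothetical and bios are reduced to a counter; the real code walks
blkgs with blkg_for_each_descendant_post() rather than recursing.  The
point is only that every child is emptied before its parent, so nothing
is left behind at an intermediate level.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CHILDREN 4

/* simplified group node, bios are modeled as a plain counter */
struct sketch_grp {
	struct sketch_grp *parent;	/* NULL for the top level */
	struct sketch_grp *children[SKETCH_MAX_CHILDREN];
	size_t nr_children;
	unsigned int nr_bios;		/* bios held at this level */
};

static void sketch_add_child(struct sketch_grp *parent,
			     struct sketch_grp *child)
{
	assert(parent->nr_children < SKETCH_MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/*
 * Post-order drain: empty all descendants first, then hand everything
 * accumulated at this node to its parent.  The top-level node has no
 * parent and simply keeps what it received.
 */
static void sketch_drain(struct sketch_grp *grp)
{
	size_t i;

	for (i = 0; i < grp->nr_children; i++)
		sketch_drain(grp->children[i]);

	if (grp->parent) {
		grp->parent->nr_bios += grp->nr_bios;
		grp->nr_bios = 0;
	}
}
```

A pre-order walk would push a parent's bios up before its children had
delivered theirs, leaving those bios stranded one level below the top.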

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 51 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 40 insertions(+), 11 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8f435a7..1080563 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1299,6 +1299,28 @@ out:
 	return throttled;
 }
 
+/*
+ * Dispatch all bios from all children tg's queued on @parent_sq.  On
+ * return, @parent_sq is guaranteed to not have any active children tg's
+ * and all bios from previously active tg's are on @parent_sq->bio_lists[].
+ */
+static void tg_drain_bios(struct throtl_service_queue *parent_sq)
+{
+	struct throtl_grp *tg;
+
+	while ((tg = throtl_rb_first(parent_sq))) {
+		struct throtl_service_queue *sq = &tg->service_queue;
+		struct bio *bio;
+
+		throtl_dequeue_tg(tg);
+
+		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
+		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
+			tg_dispatch_one_bio(tg, bio_data_dir(bio));
+	}
+}
+
 /**
  * blk_throtl_drain - drain throttled bios
  * @q: request_queue to drain throttled bios for
@@ -1309,27 +1331,34 @@ void blk_throtl_drain(struct request_queue *q)
 	__releases(q->queue_lock) __acquires(q->queue_lock)
 {
 	struct throtl_data *td = q->td;
-	struct throtl_service_queue *parent_sq = &td->service_queue;
-	struct throtl_grp *tg;
+	struct blkcg_gq *blkg;
+	struct cgroup *pos_cgrp;
 	struct bio *bio;
 	int rw;
 
 	queue_lockdep_assert_held(q);
+	rcu_read_lock();
 
-	while ((tg = throtl_rb_first(parent_sq))) {
-		struct throtl_service_queue *sq = &tg->service_queue;
+	/*
+	 * Drain each tg while doing post-order walk on the blkg tree, so
+	 * that all bios are propagated to td->service_queue.  It'd be
+	 * better to walk service_queue tree directly but blkg walk is
+	 * easier.
+	 */
+	blkg_for_each_descendant_post(blkg, pos_cgrp, td->queue->root_blkg)
+		tg_drain_bios(&blkg_to_tg(blkg)->service_queue);
 
-		throtl_dequeue_tg(tg);
+	tg_drain_bios(&td_root_tg(td)->service_queue);
 
-		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio));
-		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
-			tg_dispatch_one_bio(tg, bio_data_dir(bio));
-	}
+	/* finally, transfer bios from top-level tg's into the td */
+	tg_drain_bios(&td->service_queue);
+
+	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
 
+	/* all bios now should be in td->service_queue, issue them */
 	for (rw = READ; rw <= WRITE; rw++)
-		while ((bio = bio_list_pop(&parent_sq->bio_lists[rw])))
+		while ((bio = bio_list_pop(&td->service_queue.bio_lists[rw])))
 			generic_make_request(bio);
 
 	spin_lock_irq(q->queue_lock);
-- 
1.8.1.4



* [PATCH 27/31] blk-throttle: make blk_throtl_bio() ready for hierarchy
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (25 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 26/31] blk-throttle: make blk_throtl_drain() ready for hierarchy Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 28/31] blk-throttle: make tg_dispatch_one_bio() " Tejun Heo
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

Currently, blk_throtl_bio() issues the passed in bio directly if it's
within limits of its associated tg (throtl_grp).  This behavior
becomes incorrect with hierarchy support as the bio should be
accounted to and throttled by the ancestor throtl_grps too.

This patch makes the direct issue path of blk_throtl_bio() loop
until it reaches the top-level service_queue or gets throttled.  If
the former, the bio can be issued directly; otherwise, it gets queued
at the first layer where it was above limits.

As tg->parent_sq is always the top-level service queue currently, this
patch in itself doesn't make any behavior differences.
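
The climb can be sketched in userspace as follows.  This is an
illustration under simplifying assumptions, not the patch itself:
struct climb_tg and climb_bio() are hypothetical, and the limit check
is reduced to a single remaining byte budget per level instead of
tg_may_dispatch().

```c
#include <assert.h>
#include <stddef.h>

/* simplified group: one budget, one queue counter, one direction */
struct climb_tg {
	struct climb_tg *parent;	/* NULL: parent is the top level */
	unsigned int budget;		/* bytes still allowed in this slice */
	unsigned int nr_queued;		/* bios already waiting here */
};

/*
 * Walk from @tg towards the top.  Returns NULL if the bio cleared
 * every level and can be issued directly, otherwise the group where
 * it has to be queued.
 */
static struct climb_tg *climb_bio(struct climb_tg *tg, unsigned int size)
{
	while (tg) {
		/* FIFO: never overtake bios already queued at this level */
		if (tg->nr_queued)
			return tg;

		/* above limits at this level, queue here */
		if (size > tg->budget)
			return tg;

		/* within limits: charge this level and climb */
		tg->budget -= size;
		tg = tg->parent;
	}
	return NULL;
}
```

Note that the levels which were already passed stay charged even when
the bio ends up queued further up, which matches the charge-then-climb
order of the loop in the patch.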

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1080563..99e1e78 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1239,12 +1239,16 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 
 	sq = &tg->service_queue;
 
-	/* throtl is FIFO - if other bios are already queued, should queue */
-	if (sq->nr_queued[rw])
-		goto queue_bio;
+	while (true) {
+		/* throtl is FIFO - if bios are already queued, should queue */
+		if (sq->nr_queued[rw])
+			break;
 
-	/* Bio is with-in rate limit of group */
-	if (tg_may_dispatch(tg, bio, NULL)) {
+		/* if above limits, break to queue */
+		if (!tg_may_dispatch(tg, bio, NULL))
+			break;
+
+		/* within limits, let's charge and dispatch directly */
 		throtl_charge_bio(tg, bio);
 
 		/*
@@ -1259,10 +1263,19 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		 * So keep on trimming slice even if bio is not queued.
 		 */
 		throtl_trim_slice(tg, rw);
-		goto out_unlock;
+
+		/*
+		 * @bio passed through this layer without being throttled.
+		 * Climb up the ladder.  If we're already at the top, it
+		 * can be executed directly.
+		 */
+		sq = sq->parent_sq;
+		tg = sq_to_tg(sq);
+		if (!tg)
+			goto out_unlock;
 	}
 
-queue_bio:
+	/* out-of-limit, queue to @tg */
 	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
 		   rw == READ ? 'R' : 'W',
 		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
-- 
1.8.1.4



* [PATCH 28/31] blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (26 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 27/31] blk-throttle: make blk_throtl_bio() " Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 29/31] blk-throttle: make throtl_pending_timer_fn() " Tejun Heo
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

tg_dispatch_one_bio() currently assumes that the parent_sq is the top
level one and the bio being dispatched is ready to be issued; however,
this assumption will be wrong with proper hierarchy support.  This
patch makes the following changes to make tg_dispatch_one_bio() ready
for hierarchy.

* throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
  of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
  transfer a bio from a child tg to its parent.

* tg_dispatch_one_bio() is updated to distinguish whether its parent
  is another throtl_grp or the throtl_data.  If the former, the bio is
  transferred to the parent throtl_grp using throtl_add_bio_tg().  If
  the latter, the bio is ready to be issued and put on the top-level
  service_queue's bio_lists[] and throtl_data->nr_queued is
  decremented.

As all throtl_grps currently have the top level service_queue as their
->parent_sq, this patch in itself doesn't make any behavior
difference.
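
The accounting rule behind the two changes can be sketched in
userspace.  The types and helpers below (struct acct_td, struct
acct_tg, acct_throttle_bio(), acct_dispatch_one()) are hypothetical
and bios are reduced to counters; the sketch only shows that the
device-wide count changes on entry and on reaching the top, never on
a child-to-parent transfer.

```c
#include <assert.h>
#include <stddef.h>

/* device-wide state, a stand-in for throtl_data */
struct acct_td {
	unsigned int nr_queued;	/* bios throttled anywhere in the tree */
	unsigned int nr_ready;	/* bios on the top-level list, ready to issue */
};

/* simplified group */
struct acct_tg {
	struct acct_tg *parent;	/* NULL: parent is the top-level queue */
	struct acct_td *td;
	unsigned int nr_queued;	/* bios waiting in this group */
};

/* entry point: the bio is counted in @td exactly once, when throttled */
static void acct_throttle_bio(struct acct_tg *tg)
{
	tg->td->nr_queued++;
	tg->nr_queued++;
}

/* move one bio from @tg one level up */
static void acct_dispatch_one(struct acct_tg *tg)
{
	assert(tg->nr_queued > 0);
	tg->nr_queued--;

	if (tg->parent) {
		/* plain transfer, the device-wide total is unchanged */
		tg->parent->nr_queued++;
	} else {
		/* reached the top: no longer throttled, ready to issue */
		assert(tg->td->nr_queued > 0);
		tg->td->nr_queued--;
		tg->td->nr_ready++;
	}
}
```

Had the increment stayed in the per-group add path, each transfer to a
parent would have counted the same bio again.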

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 99e1e78..367c269 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -821,7 +821,6 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg)
 	/* Take a bio reference on tg */
 	blkg_get(tg_to_blkg(tg));
 	sq->nr_queued[rw]++;
-	tg->td->nr_queued[rw]++;
 	throtl_enqueue_tg(tg);
 }
 
@@ -852,20 +851,34 @@ static void tg_update_disptime(struct throtl_grp *tg)
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
+	struct throtl_service_queue *parent_sq = sq->parent_sq;
+	struct throtl_grp *parent_tg = sq_to_tg(parent_sq);
 	struct bio *bio;
 
 	bio = bio_list_pop(&sq->bio_lists[rw]);
 	sq->nr_queued[rw]--;
-	/* Drop bio reference on blkg */
-	blkg_put(tg_to_blkg(tg));
-
-	BUG_ON(tg->td->nr_queued[rw] <= 0);
-	tg->td->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
-	bio_list_add(&sq->parent_sq->bio_lists[rw], bio);
+
+	/*
+	 * If our parent is another tg, we just need to transfer @bio to
+	 * the parent using throtl_add_bio_tg().  If our parent is
+	 * @td->service_queue, @bio is ready to be issued.  Put it on its
+	 * bio_lists[] and decrease total number queued.  The caller is
+	 * responsible for issuing these bios.
+	 */
+	if (parent_tg) {
+		throtl_add_bio_tg(bio, parent_tg);
+	} else {
+		bio_list_add(&parent_sq->bio_lists[rw], bio);
+		BUG_ON(tg->td->nr_queued[rw] <= 0);
+		tg->td->nr_queued[rw]--;
+	}
 
 	throtl_trim_slice(tg, rw);
+
+	/* @bio is transferred to parent, drop its blkg reference */
+	blkg_put(tg_to_blkg(tg));
 }
 
 static int throtl_dispatch_tg(struct throtl_grp *tg)
@@ -1283,6 +1296,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 	bio_associate_current(bio);
+	tg->td->nr_queued[rw]++;
 	throtl_add_bio_tg(bio, tg);
 	throttled = true;
 
-- 
1.8.1.4



* [PATCH 29/31] blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (27 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 28/31] blk-throttle: make tg_dispatch_one_bio() " Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 30/31] blk-throttle: implement throtl_grp->has_rules[] Tejun Heo
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

throtl_pending_timer_fn() currently assumes that the parent_sq is the
top level one and the bio's dispatched are ready to be issued;
however, this assumption will be wrong with proper hierarchy support.
This patch makes the following changes to make
throtl_pending_timer_fn() ready for hierarchy.

* If the parent_sq isn't the top-level one, update the parent
  throtl_grp's dispatch time and schedule the next dispatch as
  necessary.  If the parent's dispatch time is now, repeat the
  function for the parent throtl_grp.

* If the parent_sq is the top-level one, kick issue work_item as
  before.

* The debug message printed by throtl_log() now prints out the
  service_queue's nr_queued[] instead of the total nr_queued as the
  latter becomes uninteresting and misleading with hierarchical
  dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 367c269..f3bd278 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -949,23 +949,33 @@ static int throtl_select_dispatch(struct throtl_service_queue *parent_sq)
  * This timer is armed when a child throtl_grp with active bio's become
  * pending and queued on the service_queue's pending_tree and expires when
  * the first child throtl_grp should be dispatched.  This function
- * dispatches bio's from the children throtl_grps and kicks
- * throtl_data->dispatch_work if there are bio's ready to be issued.
+ * dispatches bio's from the children throtl_grps to the parent
+ * service_queue.
+ *
+ * If the parent's parent is another throtl_grp, dispatching is propagated
+ * by either arming its pending_timer or repeating dispatch directly.  If
+ * the top-level service_tree is reached, throtl_data->dispatch_work is
+ * kicked so that the ready bio's are issued.
  */
 static void throtl_pending_timer_fn(unsigned long arg)
 {
 	struct throtl_service_queue *sq = (void *)arg;
+	struct throtl_grp *tg = sq_to_tg(sq);
 	struct throtl_data *td = sq_to_td(sq);
 	struct request_queue *q = td->queue;
-	bool dispatched = false;
+	struct throtl_service_queue *parent_sq;
+	bool dispatched;
 	int ret;
 
 	spin_lock_irq(q->queue_lock);
+again:
+	parent_sq = sq->parent_sq;
+	dispatched = false;
 
 	while (true) {
 		throtl_log(sq, "dispatch nr_queued=%u read=%u write=%u",
-			   td->nr_queued[READ] + td->nr_queued[WRITE],
-			   td->nr_queued[READ], td->nr_queued[WRITE]);
+			   sq->nr_queued[READ] + sq->nr_queued[WRITE],
+			   sq->nr_queued[READ], sq->nr_queued[WRITE]);
 
 		ret = throtl_select_dispatch(sq);
 		if (ret) {
@@ -982,9 +992,25 @@ static void throtl_pending_timer_fn(unsigned long arg)
 		spin_lock_irq(q->queue_lock);
 	}
 
-	if (dispatched)
-		queue_work(kthrotld_workqueue, &td->dispatch_work);
+	if (!dispatched)
+		goto out_unlock;
 
+	if (parent_sq) {
+		/* @parent_sq is another throtl_grp, propagate dispatch */
+		if (tg->flags & THROTL_TG_WAS_EMPTY) {
+			tg_update_disptime(tg);
+			if (!throtl_schedule_next_dispatch(parent_sq, false)) {
+				/* window is already open, repeat dispatching */
+				sq = parent_sq;
+				tg = sq_to_tg(sq);
+				goto again;
+			}
+		}
+	} else {
+		/* reached the top-level, queue issuing */
+		queue_work(kthrotld_workqueue, &td->dispatch_work);
+	}
+out_unlock:
 	spin_unlock_irq(q->queue_lock);
 }
 
-- 
1.8.1.4



* [PATCH 30/31] blk-throttle: implement throtl_grp->has_rules[]
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (28 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 29/31] blk-throttle: make throtl_pending_timer_fn() " Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02  0:39 ` [PATCH 31/31] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

blk_throtl_bio() has a quick exit path for throtl_grps without limits
configured.  It looks at the bps and iops limits and if both are not
configured, the bio is issued immediately.  While this is correct in
the current flat hierarchy as each throtl_grp behaves completely
independently, it would become wrong in proper hierarchy mode.  A
group without any limits could still be limited by one of its
ancestors and bio's queued for such group should not bypass
blk-throtl.

As having a quick bypass mechanism is beneficial, this patch
reimplements the mechanism such that it's correct even with proper
hierarchy.  throtl_grp->has_rules[] is added.  These booleans are
updated for the whole subtree whenever a config is updated so that
has_rules[] of the whole subtree stays synchronized.  They're also
updated when a new throtl_grp comes online so that it can't escape the
limits of its ancestors.

As no throtl_grp has another throtl_grp as parent now, this patch
doesn't yet make any behavior differences.
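
The propagation can be sketched in userspace as below.  This is a
simplified model, not the patch: struct rules_tg, rules_update() and
rules_update_all() are hypothetical, only one direction is tracked,
and RULES_NO_LIMIT plays the role of the -1 "unlimited" value.  The
pre-order walk is modeled as an array that lists parents before their
children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RULES_NO_LIMIT UINT64_MAX	/* stand-in for the -1 "unlimited" value */

/* simplified group, single direction only */
struct rules_tg {
	struct rules_tg *parent;	/* NULL for a top-level group */
	uint64_t bps;
	uint64_t iops;
	bool has_rules;			/* limited by itself or an ancestor? */
};

/*
 * Only the direct parent is consulted: its has_rules is assumed to be
 * up to date already, so no walk to the top is needed.
 */
static void rules_update(struct rules_tg *tg)
{
	bool own = tg->bps != RULES_NO_LIMIT || tg->iops != RULES_NO_LIMIT;

	tg->has_rules = (tg->parent && tg->parent->has_rules) || own;
}

/* @pre_order must list every parent before its children */
static void rules_update_all(struct rules_tg **pre_order, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		rules_update(pre_order[i]);
}
```

Visiting parents first is what makes the one-level lookup sufficient;
a post-order walk would read stale parent flags.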

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-throttle.c | 49 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f3bd278..970355e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -84,6 +84,9 @@ struct throtl_grp {
 
 	unsigned int flags;
 
+	/* are there any throtl rules between this group and td? */
+	bool has_rules[2];
+
 	/* bytes per second rate limits */
 	uint64_t bps[2];
 
@@ -290,6 +293,30 @@ static void throtl_pd_init(struct blkcg_gq *blkg)
 	spin_unlock_irqrestore(&tg_stats_alloc_lock, flags);
 }
 
+/*
+ * Set has_rules[] if @tg or any of its parents have limits configured.
+ * This doesn't require walking up to the top of the hierarchy as the
+ * parent's has_rules[] is guaranteed to be correct.
+ */
+static void tg_update_has_rules(struct throtl_grp *tg)
+{
+	struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq);
+	int rw;
+
+	for (rw = READ; rw <= WRITE; rw++)
+		tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) ||
+				    (tg->bps[rw] != -1 || tg->iops[rw] != -1);
+}
+
+static void throtl_pd_online(struct blkcg_gq *blkg)
+{
+	/*
+	 * We don't want new groups to escape the limits of its ancestors.
+	 * Update has_rules[] after a new group is brought online.
+	 */
+	tg_update_has_rules(blkg_to_tg(blkg));
+}
+
 static void throtl_pd_exit(struct blkcg_gq *blkg)
 {
 	struct throtl_grp *tg = blkg_to_tg(blkg);
@@ -689,12 +716,6 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	return 0;
 }
 
-static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
-	if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
-		return 1;
-	return 0;
-}
-
 /*
  * Returns whether one can dispatch a bio or not. Also returns approx number
  * of jiffies to wait before this bio is with-in IO rate and can be dispatched
@@ -1123,6 +1144,8 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 	struct blkg_conf_ctx ctx;
 	struct throtl_grp *tg;
 	struct throtl_service_queue *sq;
+	struct blkcg_gq *blkg;
+	struct cgroup *pos_cgrp;
 	int ret;
 
 	ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
@@ -1146,6 +1169,17 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
 		   tg->iops[READ], tg->iops[WRITE]);
 
 	/*
+	 * Update has_rules[] flags for the updated tg's subtree.  A tg is
+	 * considered to have rules if either the tg itself or any of its
+	 * ancestors has rules.  This identifies groups without any
+	 * restrictions in the whole hierarchy and allows them to bypass
+	 * blk-throttle.
+	 */
+	tg_update_has_rules(tg);
+	blkg_for_each_descendant_pre(blkg, pos_cgrp, ctx.blkg)
+		tg_update_has_rules(blkg_to_tg(blkg));
+
+	/*
 	 * We're already holding queue_lock and know @tg is valid.  Let's
 	 * apply the new config directly.
 	 *
@@ -1234,6 +1268,7 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.cftypes		= throtl_files,
 
 	.pd_init_fn		= throtl_pd_init,
+	.pd_online_fn		= throtl_pd_online,
 	.pd_exit_fn		= throtl_pd_exit,
 	.pd_reset_stats_fn	= throtl_pd_reset_stats,
 };
@@ -1260,7 +1295,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	blkcg = bio_blkcg(bio);
 	tg = throtl_lookup_tg(td, blkcg);
 	if (tg) {
-		if (tg_no_rule_group(tg, rw)) {
+		if (!tg->has_rules[rw]) {
 			throtl_update_dispatch_stats(tg_to_blkg(tg),
 						     bio->bi_size, bio->bi_rw);
 			goto out_unlock_rcu;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH 31/31] blk-throttle: implement proper hierarchy support
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (29 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 30/31] blk-throttle: implement throtl_grp->has_rules[] Tejun Heo
@ 2013-05-02  0:39 ` Tejun Heo
  2013-05-02 17:34 ` [PATCHSET] " Vivek Goyal
  2013-05-04  0:50 ` [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness Tejun Heo
  32 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02  0:39 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal, Tejun Heo

With the recent updates, blk-throttle is finally ready for proper
hierarchy support.  Dispatching now honors service_queue->parent_sq
and propagates correctly.  The only thing missing is setting
->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
hierarchy.

This patch updates throtl_pd_init() such that service_queues form the
same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
As this concludes proper hierarchy support for blkcg, the shameful
.broken_hierarchy tag is removed from blkio_subsys.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
---
 block/blk-cgroup.c     |  8 --------
 block/blk-throttle.c   | 21 ++++++++++++++++++++-
 include/linux/cgroup.h |  2 ++
 3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index af2ca27..8d9edc8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -911,14 +911,6 @@ struct cgroup_subsys blkio_subsys = {
 	.subsys_id = blkio_subsys_id,
 	.base_cftypes = blkcg_files,
 	.module = THIS_MODULE,
-
-	/*
-	 * blkio subsystem is utterly broken in terms of hierarchy support.
-	 * It treats all cgroups equally regardless of where they're
-	 * located in the hierarchy - all cgroups are treated as if they're
-	 * right below the root.  Fix it and remove the following.
-	 */
-	.broken_hierarchy = true,
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 970355e..fa3e237 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -271,9 +271,28 @@ static void throtl_pd_init(struct blkcg_gq *blkg)
 {
 	struct throtl_grp *tg = blkg_to_tg(blkg);
 	struct throtl_data *td = blkg->q->td;
+	struct throtl_service_queue *parent_sq;
 	unsigned long flags;
 
-	throtl_service_queue_init(&tg->service_queue, &td->service_queue);
+	/*
+	 * If sane_behavior is enabled, we switch to properly hierarchical
+	 * behavior where limits on a given throtl_grp are applied to the
+	 * whole subtree rather than just the group itself.  e.g. If 16M
+	 * read_bps limit is set on the root group, the whole system can't
+	 * exceed 16M for the device.
+	 *
+	 * If sane_behavior is not enabled, the broken flat hierarchy
+	 * behavior is retained where all throtl_grps are treated as if
+	 * they're all separate root groups right below throtl_data.
+	 * Limits of a group don't interact with limits of other groups
+	 * regardless of the position of the group in the hierarchy.
+	 */
+	parent_sq = &td->service_queue;
+
+	if (cgroup_sane_behavior(blkg->blkcg->css.cgroup) && blkg->parent)
+		parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
+
+	throtl_service_queue_init(&tg->service_queue, parent_sq);
 	RB_CLEAR_NODE(&tg->rb_node);
 	tg->td = td;
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c371888..3c5f780 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -271,6 +271,8 @@ enum {
 	 * - memcg: use_hierarchy is on by default and the cgroup file for
 	 *   the flag is not created.
 	 *
+	 * - blkcg: blk-throttle becomes properly hierarchical.
+	 *
 	 * The followings are planned changes.
 	 *
 	 * - release_agent will be disallowed once replacement notification
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH 07/31] blk-throttle: removed deferred config application mechanism
  2013-05-02  0:39 ` [PATCH 07/31] blk-throttle: removed deferred config application mechanism Tejun Heo
@ 2013-05-02 14:49   ` Vivek Goyal
  2013-05-02 17:27     ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 14:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Wed, May 01, 2013 at 05:39:25PM -0700, Tejun Heo wrote:

[..]
> @@ -1023,9 +975,27 @@ static int tg_set_conf(struct cgroup *cgrp, struct cftype *cft, const char *buf,
>  	else
>  		*(unsigned int *)((void *)tg + cft->private) = ctx.v;
>  
> -	/* XXX: we don't need the following deferred processing */
> -	xchg(&tg->limits_changed, true);
> -	xchg(&td->limits_changed, true);
> +	throtl_log_tg(td, tg, "limit change rbps=%llu wbps=%llu riops=%u wiops=%u",
> +		      tg->bps[READ], tg->bps[WRITE],
> +		      tg->iops[READ], tg->iops[WRITE]);
> +
> +	/*
> +	 * We're already holding queue_lock and know @tg is valid.  Let's
> +	 * apply the new config directly.
> +	 *
> +	 * Restart the slices for both READ and WRITES. It might happen
> +	 * that a group's limit are dropped suddenly and we don't want to
> +	 * account recently dispatched IO with new low rate.
> +	 */
> +	throtl_start_new_slice(td, tg, 0);
> +	throtl_start_new_slice(td, tg, 1);
> +
> +	if (throtl_tg_on_rr(tg)) {
> +		tg_update_disptime(td, tg);
> +		throtl_schedule_next_dispatch(td);
> +	}
> +
> +	/* kick dispatch in case disptime got shortened */
>  	throtl_schedule_delayed_work(td, 0);

Hi Tejun,

Do we need above throtl_schedule_delayed_work() now?
throtl_schedule_next_dispatch() should take care of it. And if group
is not on service tree at the time of limit change, then anyway, we don't
have to schedule any work.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument
  2013-05-02  0:39 ` [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument Tejun Heo
@ 2013-05-02 15:21   ` Vivek Goyal
  2013-05-02 17:29     ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 15:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Wed, May 01, 2013 at 05:39:33PM -0700, Tejun Heo wrote:
> throtl_service_queue will be the building block of hierarchy support
> and will form a tree.  This patch updates its usages as arguments to
> reduce confusion.
> 
> * When a service queue is used as the parent role - the host of the
>   rbtree - use @parent_sq instead of @sq.

Hi Tejun,

Apart from parent_sq, what other kinds of sq are present?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 07/31] blk-throttle: removed deferred config application mechanism
  2013-05-02 14:49   ` Vivek Goyal
@ 2013-05-02 17:27     ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 17:27 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 10:49:12AM -0400, Vivek Goyal wrote:
> > +	/* kick dispatch in case disptime got shortened */
> >  	throtl_schedule_delayed_work(td, 0);
> 
> Hi Tejun,
> 
> Do we need above throtl_schedule_delayed_work() now?
> throtl_schedule_next_dispatch() should take care of it. And if group
> is not on service tree at the time of limit change, then anyway, we don't
> have to schedule any work.

Right, that one can go away.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument
  2013-05-02 15:21   ` Vivek Goyal
@ 2013-05-02 17:29     ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 17:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 11:21:48AM -0400, Vivek Goyal wrote:
> On Wed, May 01, 2013 at 05:39:33PM -0700, Tejun Heo wrote:
> > throtl_service_queue will be the building block of hierarchy support
> > and will form a tree.  This patch updates its usages as arguments to
> > reduce confusion.
> > 
> > * When a service queue is used as the parent role - the host of the
> >   rbtree - use @parent_sq instead of @sq.
> 
> Hi Tejun,
> 
> Apart from parent_sq, what other kinds of sq are present?

Just sq and parent_sq where the former is used when it isn't
particularly about being the parent or used as the child variable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (30 preceding siblings ...)
  2013-05-02  0:39 ` [PATCH 31/31] blk-throttle: implement proper hierarchy support Tejun Heo
@ 2013-05-02 17:34 ` Vivek Goyal
  2013-05-02 17:57   ` Tejun Heo
  2013-05-02 18:08   ` Vivek Goyal
  2013-05-04  0:50 ` [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness Tejun Heo
  32 siblings, 2 replies; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 17:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Wed, May 01, 2013 at 05:39:18PM -0700, Tejun Heo wrote:

[..]
> While this patchset contains many patches, the implementation is
> pretty straight-forward.  throtl_grp's form a tree anchored at
> throtl_data and bios climb the tree as they get dispatched at each
> level.  The bios which reach the top of the tree - throtl_data - are
> issued. 

Have a question here. Looks like when bio climbs from child group
to parent group, then parent group slice starts fresh if parent
was empty. So if we have a parent with 1MB/s limit and a child with
1MB/s limit and a bio gets queued in child, then looks like effective
IO rate would be .5MB/s and not 1MB/s?

IOW, when child gets queued, we should start time accounting for
all parents in the hierarchy too.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 17:34 ` [PATCHSET] " Vivek Goyal
@ 2013-05-02 17:57   ` Tejun Heo
  2013-05-02 18:17     ` Vivek Goyal
  2013-05-02 18:08   ` Vivek Goyal
  1 sibling, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 17:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

Hey, Vivek.

On Thu, May 02, 2013 at 01:34:28PM -0400, Vivek Goyal wrote:
> On Wed, May 01, 2013 at 05:39:18PM -0700, Tejun Heo wrote:
> 
> [..]
> > While this patchset contains many patches, the implementation is
> > pretty straight-forward.  throtl_grp's form a tree anchored at
> > throtl_data and bios climb the tree as they get dispatched at each
> > level.  The bios which reach the top of the tree - throtl_data - are
> > issued. 
> 
> Have a question here. Looks like when bio climbs from child group
> to parent group, then parent group slice starts fresh if parent
> was empty. So if we have a parent with 1MB/s limit and a child with
> 1MB/s limit and a bio gets queued in child, then looks like effective
> IO rate would be .5MB/s and not 1MB/s?

Hmmm.... not that drastic but when the same limit is configured in
both parent and its single active child, the child gets penalized by
about 15%, which is not nice.

> IOW, when child gets queued, we should start time accounting for
> all parents in the hierarchy too.

I don't particularly like doing that as a separate step, maybe we can
just push the child's start time to the parent while dispatching?
Does that sound doable to you?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 17:34 ` [PATCHSET] " Vivek Goyal
  2013-05-02 17:57   ` Tejun Heo
@ 2013-05-02 18:08   ` Vivek Goyal
  2013-05-02 18:44     ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 18:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 01:34:28PM -0400, Vivek Goyal wrote:
> On Wed, May 01, 2013 at 05:39:18PM -0700, Tejun Heo wrote:
> 
> [..]
> > While this patchset contains many patches, the implementation is
> > pretty straight-forward.  throtl_grp's form a tree anchored at
> > throtl_data and bios climb the tree as they get dispatched at each
> > level.  The bios which reach the top of the tree - throtl_data - are
> > issued. 
> 
> Have a question here. Looks like when bio climbs from child group
> to parent group, then parent group slice starts fresh if parent
> was empty. So if we have a parent with 1MB/s limit and a child with
> 1MB/s limit and a bio gets queued in child, then looks like effective
> IO rate would be .5MB/s and not 1MB/s?
> 
> IOW, when child gets queued, we should start time accounting for
> all parents in the hierarchy too.

Hi Tejun,

Also, IIUC, there might be bandwidth sharing problems. Once a bio
climbs up the ladder, it gets queued at the end of the parent queue. And
that can lead to unfair distribution of available bandwidth. For
example,

			G1
		       /  \
	              T1  G2
			  |
			  T2

G1 and G2 are 2 groups and T1 and T2 are tasks in groups respectively.
Assume both G1 and G2 are having 1MB/s IO rate limit. Assume T1 and
T2 are doing enough IO to keep respective queues backlogged.

Now, while T2 is backlogged in G2, T1 can queue up multiple 1MB
bios, and all of them will be served first, followed by one bio
from T2. This can repeat for a long time, and the problem only worsens
with hierarchy depth.

Ideally, T1 and G2 should share G1's 1MB/s limit equally (that is, .5MB/s
each), but in this case T1 can run away with a lot more than its fair share.

Can't think how to get fair distribution of available bandwidth with
bio climbing the tree model.

I was thinking that we should implement it something along the lines
of what cpu scheduler has done. All parent groups get enqueued on 
service tree when IO gets queued in any of child groups. Time slice
accounting starts at each level. And at each level we do round robin
for dispatch of bio from each eligible child group/queue.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 17:57   ` Tejun Heo
@ 2013-05-02 18:17     ` Vivek Goyal
  2013-05-02 18:29       ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 18:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 10:57:01AM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Thu, May 02, 2013 at 01:34:28PM -0400, Vivek Goyal wrote:
> > On Wed, May 01, 2013 at 05:39:18PM -0700, Tejun Heo wrote:
> > 
> > [..]
> > > While this patchset contains many patches, the implementation is
> > > pretty straight-forward.  throtl_grp's form a tree anchored at
> > > throtl_data and bios climb the tree as they get dispatched at each
> > > level.  The bios which reach the top of the tree - throtl_data - are
> > > issued. 
> > 
> > Have a question here. Looks like when bio climbs from child group
> > to parent group, then parent group slice starts fresh if parent
> > was empty. So if we have a parent with 1MB/s limit and a child with
> > 1MB/s limit and a bio gets queued in child, then looks like effective
> > IO rate would be .5MB/s and not 1MB/s?
> 
> Hmmm.... not that drastic but when the same limit is configured in
> both parent and its single active child, the child gets penalized by
> about 15%, which is not nice.

Sorry, did not understand how did you arrive at 15% penalty. I think
in worst case it will be 50%. Assume size of bio is 1MB. So it will
wait for 1 second in child group and then it will wait again for
another second in parent group. Assume next bio gets queued only
after first bio gets dispatched. 

That means each 1MB bio will wait for 2 seconds, which will lead to an
effective rate of .5MB/second.
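
The worst-case arithmetic above can be checked with a small sketch.  It
models only the scenario described here: each level makes the bio wait
bio_bytes / limit from a fresh slice, and the issuer queues the next bio
only after the previous one is dispatched.  The function names are
hypothetical, and this is not a model of what blk-throttle measures in
practice.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Seconds a single bio spends waiting when every level of the hierarchy
 * starts a fresh slice and makes it wait bio_bytes / limit on its own.
 */
static double serial_wait_secs(double bio_bytes, const double *limits_bps,
			       size_t nr_levels)
{
	double wait = 0.0;

	for (size_t i = 0; i < nr_levels; i++) {
		assert(limits_bps[i] > 0.0);
		wait += bio_bytes / limits_bps[i];
	}
	return wait;
}

/*
 * Throughput seen by an issuer that submits one bio at a time and only
 * queues the next one after the previous bio has been dispatched.
 * Requires at least one level, otherwise the wait would be zero.
 */
static double effective_rate_bps(double bio_bytes, const double *limits_bps,
				 size_t nr_levels)
{
	assert(nr_levels > 0);
	return bio_bytes / serial_wait_secs(bio_bytes, limits_bps, nr_levels);
}
```

For a 1MB bio with a 1MB/s child and a 1MB/s parent, the two waits add
up to 2 seconds and the rate comes out at half of the configured limit.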

> 
> > IOW, when child gets queued, we should start time accounting for
> > all parents in the hierarchy too.
> 
> I don't particularly like doing that as a separate step, maybe we can
> just push the child's start time to the parent while dispatching?
> Does that sound doable to you?

May be. But climbing the ladder has unfairness problems too. We might
have to rethink about the hierarchical algorithm altogether.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:17     ` Vivek Goyal
@ 2013-05-02 18:29       ` Tejun Heo
  2013-05-02 18:45         ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 18:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

Hello, Vivek.

On Thu, May 02, 2013 at 02:17:48PM -0400, Vivek Goyal wrote:
> Sorry, did not understand how did you arrive at 15% penalty. I think
> in worst case it will be 50%. Assume size of bio is 1MB. So it will

Oh, that's the number I got by running test.

> wait for 1 second in child group and then it will wait again for
> another second in parent group. Assume next bio gets queued only
> after first bio gets dispatched. 
> 
> That means each 1MB bio will wait for 2 seconds, which will lead to an
> effective rate of .5MB/second.

Then the scheduling algorithm is broken in itself regardless of
hierarchy.  That means an issuer which issues at exactly the
configured pace gets penalized, right?

> > I don't particularly like doing that as a separate step, maybe we can
> > just push the child's start time to the parent while dispatching?
> > Does that sound doable to you?
> 
> May be. But climbing the ladder has unfairness problems too. We might
> have to rethink about the hierarchical algorithm altogether.

Meh, it's not different from flat case.  It gives workable enough
fairness from the simple fact that the queues at each layer are FIFO.
Beyond that, as long as the limits are honored, it's fine.  If this
thing is going to be used for high-bandwidth applications, it should
probably get reimplemented as per-cpu token distributing hierarchy,
but blk-throttle is not gonna hold back the whole cgroup hierarchy
support.  If you want to reimplement the whole thing, please feel free
to do so afterwards.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:08   ` Vivek Goyal
@ 2013-05-02 18:44     ` Tejun Heo
  2013-05-02 18:59       ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 18:44 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

Hello, Vivek.

On Thu, May 02, 2013 at 02:08:15PM -0400, Vivek Goyal wrote:
> 			G1
> 		       /  \
> 	              T1  G2
> 			  |
> 			  T2
> 
> G1 and G2 are 2 groups and T1 and T2 are tasks in groups respectively.
> Assume both G1 and G2 are having 1MB/s IO rate limit. Assume T1 and
> T2 are doing enough IO to keep respective queues backlogged.

For the most part, I don't really care as long as the limits are
followed.  We can implement something better when dispatching from
child group into ->bio_lists[].  ->bio_lists[] could be organized in a
way that it round robins certain number of bios from different sources
- ie. it becomes FIFO lists of different sources of bios which is
fetched in round-robin.  We already have a similar logic in
select_dispatch() BTW.

> I was thinking that we should implement it something along the lines
> of what cpu scheduler has done. All parent groups get enqueued on 
> service tree when IO gets queued in any of child groups. Time slice
> accounting starts at each level. And at each level we do round robin
> for dispatch of bio from each eligible child group/queue.

Let's please not do something which is gonna take a lot of time and
effort.  If the fairness bothers you, please implement something
simple on top.  It really just comes down to doing RR when taking bios
from ->bio_lists[].  If you wanna reimplement the whole thing, that's
fine too but let's please do that after getting the basic hierarchy
support working because blkcg literally is the last subsystem with
.broken_hierarchy at this point.

Also, if you're actually thinking about reimplementing blk-throttle,
please do consider the following.

* Currently, blk-throttle doesn't throttle the number of bios being
  queued.  Note that this breaks the basic back-pressure mechanism
  where IO pressure is propagated back to the issuer by throttling the
  issuing task.  blk-throttle breaks that link and converts it to a
  memory pressure.

* It's almost inherently unscalable with high-iops devices.  Given that
  IO limiting doesn't require very fine granularity, I think doing
  this per-cpu shouldn't be too hard.  e.g. build a per-cpu token
  distributing hierarchy with rebalancing across CPUs happening
  periodically.
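
One possible shape for the per-cpu token idea in the last bullet, as a
single-threaded userspace sketch.  Everything here is hypothetical
(struct token_pool, the batch size, the rebalance policy, and
NR_CPUS_SKETCH); real per-cpu code would also need locking or per-cpu
primitives, which are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* assumption: fixed CPU count for the sketch */

struct token_pool {
	uint64_t global;			/* tokens not handed to any CPU */
	uint64_t percpu[NR_CPUS_SKETCH];	/* tokens cached on each CPU */
};

/*
 * Charge @cost tokens on @cpu.  If the local cache is short, pull at
 * least @batch tokens from the global pool (or whatever is left).
 * Returns false when the limit is exhausted and the caller would have
 * to throttle.
 */
static bool pool_charge(struct token_pool *p, int cpu, uint64_t cost,
			uint64_t batch)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);

	if (p->percpu[cpu] < cost) {
		uint64_t want = cost - p->percpu[cpu];

		if (want < batch)
			want = batch;
		if (want > p->global)
			want = p->global;
		p->global -= want;
		p->percpu[cpu] += want;
	}
	if (p->percpu[cpu] < cost)
		return false;
	p->percpu[cpu] -= cost;
	return true;
}

/* Periodic rebalance: return every CPU's unused tokens to the global pool. */
static void pool_rebalance(struct token_pool *p)
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		p->global += p->percpu[cpu];
		p->percpu[cpu] = 0;
	}
}
```

The common case touches only the local counter; the shared pool is
visited once per batch, and the rebalance keeps idle CPUs from holding
tokens that a busy CPU needs.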

In short, right now, the goal is getting the hierarchy support
acceptably working ASAP and yeap we wanna get the nested limits and at
least certain level of fairness, but let's please implement something
simple for now and strive for sophistication later because it's
holding back everyone else.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:29       ` Tejun Heo
@ 2013-05-02 18:45         ` Vivek Goyal
  2013-05-02 18:49           ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 18:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 11:29:33AM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, May 02, 2013 at 02:17:48PM -0400, Vivek Goyal wrote:
> > Sorry, did not understand how did you arrive at 15% penalty. I think
> > in worst case it will be 50%. Assume size of bio is 1MB. So it will
> 
> Oh, that's the number I got by running test.
> 
> > wait for 1 second in child group and then it will wait again for
> > another second in parent group. Assume next bio gets queued only
> > after first bio gets dispatched. 
> > 
> > That means each 1MB bio will wait for 2 seconds, which will lead to an
> > effective rate of .5MB/second.
> 
> Then the scheduling algorithm is broken in itself regardless of
> hierarchy.  That means an issuer which issues at exactly the
> configured pace gets penalized, right?

I did not understand this point. In flat model, application issuing
at the configured pace will not get penalized.

This penalty is coming from the fact that we are moving bios after the
wait and make them wait in another queue.

In flat model there is no such problem. So to me, it is the problem
of how hierarchical scheduling is implemented. In flat model, I did
not have to deal with it.

> 
> > > I don't particularly like doing that as a separate step, maybe we can
> > > just push the child's start time to the parent while dispatching?
> > > Does that sound doable to you?
> > 
> > May be. But climbing the ladder has unfairness problems too. We might
> > have to rethink about the hierarchical algorithm altogether.
> 
> Meh, it's not different from flat case.  It gives workable enough
> fairness from the simple fact that the queues at each layer are FIFO.
> Beyond that, as long as the limits are honored, it's fine.  If this
> thing is going to be used for high-bandwidth applications, it should
> probably get reimplemented as per-cpu token distributing hierarchy,
> but blk-throttle is not gonna hold back the whole cgroup hierarchy
> support.  If you want to reimplement the whole thing, please feel free
> to do so afterwards.

Ok. Not having a perfect algorithm now is fine. We can always redo it
later.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:45         ` Vivek Goyal
@ 2013-05-02 18:49           ` Tejun Heo
  2013-05-02 19:07             ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 18:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

Hello,

On Thu, May 02, 2013 at 02:45:14PM -0400, Vivek Goyal wrote:
> I did not understand this point. In flat model, application issuing
> at the configured pace will not get penalized.
> 
> This penalty is coming from the fact that we are moving bios after the
> wait and make them wait in another queue.
> 
> In flat model there is no such problem. So to me, it is the problem
> of how hierarchical scheduling is implemented. In flat model, I did
> not have to deal with it.

But seen from the parent, the child isn't different from any other
issuer in flat hierarchy.  It's just being repeated, so if you assume
a process which behaves in the exact same manner, that process would
get penalized too.  e.g. imagine an application which throttles itself
and issues exactly 1MB/s amount of data in direct IO.  It'd get
penalized the same way, right?

> Ok. Not having a perfect algorithm now is fine. We can always redo it
> later.

I think we can do source-based RR on bio_lists[] fetching which is
simple enough and should be able to avoid most of the problems, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:44     ` Tejun Heo
@ 2013-05-02 18:59       ` Vivek Goyal
  0 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 18:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 11:44:26AM -0700, Tejun Heo wrote:

[..]
> Also, if you're actually thinking about reimplementing blk-throttle,
> please do consider the following.
> 
> * Currently, blk-throttle doesn't throttle the number of bios being
>   queued.  Note that this breaks the basic back-pressure mechanism
>   where IO pressure is propagated back to the issuer by throttling the
>   issuing task.  blk-throttle breaks that link and converts it to a
>   memory pressure.

Agreed. Implementing something along the lines of per group nr_requests
is needed.

> 
> * It's almost inherently unscalable with high-iops devices.  Given that
>   IO limiting doesn't require very fine granularity, I think doing
>   this per-cpu shouldn't be too hard.  e.g. build a per-cpu token
>   distributing hierarchy with rebalancing across CPUs happening
>   periodically.

Interesting. I thought for high-iops devices we will have these multi-queue
patches from Jens, and there we can probably implement per-queue tokens
and rebalance tokens across queues periodically.

> 
> In short, right now, the goal is getting the hierarchy support
> acceptably working ASAP and yeap we wanna get the nested limits and at
> least certain level of fairness, but let's please implement something
> simple for now and strive for sophistication later because it's
> holding back everyone else.

I am fine with this. Having some hierarchical algorithm and not blocking
full hierarchical support for cgroup is better than not having any
hierarchical support and waiting for a better implementation.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 18:49           ` Tejun Heo
@ 2013-05-02 19:07             ` Vivek Goyal
  2013-05-02 19:11               ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 19:07 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Thu, May 02, 2013 at 11:49:53AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, May 02, 2013 at 02:45:14PM -0400, Vivek Goyal wrote:
> > I did not understand this point. In flat model, application issuing
> > at the configured pace will not get penalized.
> > 
> > This penalty is coming from the fact that we are moving bios after the
> > wait and make them wait in another queue.
> > 
> > In flat model there is no such problem. So to me, it is the problem
> > of how hierarchical scheduling is implemented. In flat model, I did
> > not have to deal with it.
> 
> But seen from the parent, the child isn't different from any other
> issuer in flat hierarchy.  It's just being repeated, so if you assume
> a process which behaves in the exact same manner, that process would
> get penalized too.  e.g. imagine an application which throttles itself
> and issues exactly 1MB/s amount of data in direct IO.  It'd get
> penalized the same way, right?

It should not. Why do you think in flat model an application which
throttles itself will be penalized?

So the application issues a bio of size 1MB in a group of rate 1MB/s. The bio
gets queued and gets dispatched after 1 second. Almost immediately next
bio will come from application (as application also is throttling 
itself at 1MB/s rate). And then this bio waits for a second. So we
almost get steady rate of 1MB/s as configured.
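
The arithmetic in the paragraph above can be sketched as a small helper. This
is only a toy model: toy_wait_ms() is a hypothetical name, it assumes the
group's slice has just started with nothing dispatched yet, and it works in
milliseconds, whereas blk-throttle itself accounts for elapsed slice time and
rounds to jiffies.

```c
#include <stdint.h>

/*
 * Toy model, not kernel code: how long a bio of bio_bytes has to wait in a
 * group limited to bps bytes/second, assuming a freshly started slice with
 * nothing dispatched so far.
 */
static uint64_t toy_wait_ms(uint64_t bio_bytes, uint64_t bps)
{
	if (bps == 0)
		return 0;	/* this sketch treats 0 as "no limit" */
	return bio_bytes * 1000 / bps;
}
```

With a 1MB bio and a 1MB/s limit this gives 1000ms, which matches the one
second wait per bio described above.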

Sorry, I am not able to understand the problem here.

> 
> > Ok. Not having a perfect algorithm now is fine. We can always redo it
> > later.
> 
> I think we can do source-based RR on bio_lists[] fetching which is
> simple enough and should be able to avoid most of the problems, right?

Yes, maintaining per child/source bio_lists[] in parent and doing round
robin there should mitigate the fairness problem to a great extent.
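
A minimal user-space sketch of that idea, assuming fixed-size per-source
queues and integer bio ids. All toy_* names are hypothetical and this is not
the blk-throttle data structure; it only illustrates why popping in
round-robin order keeps one busy source from starving the others.

```c
#include <stddef.h>

#define TOY_NR_SOURCES	3
#define TOY_QLEN	8

/* one FIFO of bio ids per child/source */
struct toy_source {
	int	bios[TOY_QLEN];
	size_t	head, tail;
};

struct toy_parent {
	struct toy_source	src[TOY_NR_SOURCES];
	size_t			next;	/* source to try first on next pop */
};

/* queue a bio id on source s; silently dropped if that FIFO is full */
static void toy_push(struct toy_parent *p, size_t s, int bio)
{
	struct toy_source *q = &p->src[s];

	if (q->tail < TOY_QLEN)
		q->bios[q->tail++] = bio;
}

/* pop one bio id in round-robin order; returns -1 if every FIFO is empty */
static int toy_pop_rr(struct toy_parent *p)
{
	size_t i;

	for (i = 0; i < TOY_NR_SOURCES; i++) {
		size_t s = (p->next + i) % TOY_NR_SOURCES;
		struct toy_source *q = &p->src[s];

		if (q->head < q->tail) {
			/* start after this source next time */
			p->next = (s + 1) % TOY_NR_SOURCES;
			return q->bios[q->head++];
		}
	}
	return -1;
}
```

If source 0 queues bios 1, 2, 3 and source 1 queues bio 10, the pop order is
1, 10, 2, 3, so source 1 is served after a single bio from source 0 rather
than after all three.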

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 19:07             ` Vivek Goyal
@ 2013-05-02 19:11               ` Tejun Heo
  2013-05-02 19:31                 ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 19:11 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

Hey,


On Thu, May 2, 2013 at 12:07 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> It should not. Why do you think in flat model an application which
> throttles itself will be penalized.
>
> So application issue an bio of size 1MB in a group of rate 1MB/s. bio
> gets queued and gets dispatched after 1 second. Almost immediately next
> bio will come from application (as application also is throttling
> itself at 1MB/s rate). And then this bio waits for a second. So we
> almost get steady rate of 1MB/s as configured.

I'm confused now, so why does the hierarchy make any difference? When
seen from the parent, what's the difference between a process issuing
IO directly and an IO which already went through another throttle
layer if the IOs arrive at the same intervals?

>
> >
> > > Ok. Not having a perfect algorithm now is fine. We can always redo it
> > > later.
> >
> > I think we can do source-based RR on bio_lists[] fetching which is
> > simple enough and should be able to avoid most of the problems, right?
>
> Yes, maintaining per child/source bio_lists[] in parent and doing round
> robin there should mitigate the fairness problem to a great extent.

Yeap, will implement that.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 19:11               ` Tejun Heo
@ 2013-05-02 19:31                 ` Vivek Goyal
  2013-05-02 23:13                   ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-02 19:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Thu, May 02, 2013 at 12:11:30PM -0700, Tejun Heo wrote:
> Hey,
> 
> 
> On Thu, May 2, 2013 at 12:07 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > It should not. Why do you think in flat model an application which
> > throttles itself will be penalized.
> >
> > So application issue an bio of size 1MB in a group of rate 1MB/s. bio
> > gets queued and gets dispatched after 1 second. Almost immediately next
> > bio will come from application (as application also is throttling
> > itself at 1MB/s rate). And then this bio waits for a second. So we
> > almost get steady rate of 1MB/s as configured.
> 
> I'm confused now, so why does the hierarchy make any difference? When
> seen from the parent, what's the difference between a process issuing
> IO directly and an IO which already went through another throttle
> layer if the IOs arrive at the same intervals?

I think my example was a little flawed previously. I think you are right.
The penalty is probably not as bad as I have been thinking.

So if both parent and child have limit of 1MB/s and application is doing
IO (say at 2MB/sec), in long term it should still see 1MB/s rate.

		T1	T2	T3	T4	T5	T6
Parent group:		B1	B2	B3	B4	B5
Child group:	B1	B2	B3	B4	B5	B6 

Above, B1 to B6 are bios of 1MB size and T1 to T6 are 1 second time intervals.
B1 waits for the T1 interval in the child group and then for the T2 interval
in the parent group and then gets dispatched. But a pipeline has formed in
the child group and B2 is waiting in the child group in the T2 slice. So the
penalty is not double.

So each group migration will add one extra wait period. In the above case,
5 bios are dispatched in 6 seconds. The delay stays constant at one time
interval, so the longer the sampling interval, the lower the % penalty.
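
The pipeline in the table above can be reproduced with a toy simulation. This
is only a sketch under simplifying assumptions: every level holds exactly one
1MB bio per 1 second interval, the source is always backlogged, and
toy_bios_issued() and TOY_MAX_LEVELS are hypothetical names.

```c
#define TOY_MAX_LEVELS	8

/*
 * Toy model, not kernel code.  held[i] is the bio id waiting at level i
 * (0 = leaf group, levels - 1 = topmost group, 0 means empty).  Returns the
 * number of bios issued from the topmost level after `intervals` intervals,
 * or -1 if `levels` is out of range.
 */
static int toy_bios_issued(int levels, int intervals)
{
	int held[TOY_MAX_LEVELS] = { 0 };
	int next_bio = 1, issued = 0;
	int t, lvl;

	if (levels < 1 || levels > TOY_MAX_LEVELS)
		return -1;

	for (t = 1; t <= intervals; t++) {
		/* the leaf takes a new bio at the start of each interval */
		held[0] = next_bio++;

		/* end of interval: the topmost level issues its bio ... */
		if (held[levels - 1])
			issued++;

		/* ... and every other level hands its bio one level up */
		for (lvl = levels - 1; lvl > 0; lvl--)
			held[lvl] = held[lvl - 1];
		held[0] = 0;
	}
	return issued;
}
```

toy_bios_issued(2, 6) returns 5, matching the table, while
toy_bios_issued(2, 60) returns 59: the one-interval delay is constant, so its
share of the total shrinks as the measurement window grows.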

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 19:31                 ` Vivek Goyal
@ 2013-05-02 23:13                   ` Tejun Heo
  2013-05-03 17:56                     ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-02 23:13 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

Hello, Vivek.

On Thu, May 02, 2013 at 03:31:39PM -0400, Vivek Goyal wrote:
> I think my example was little flawed previously. I think you are right.
> Penalty is not probably as bad as I have been thinking.
> 
> So if both parent and child have limit of 1MB/s and application is doing
> IO (say at 2MB/sec), in long term it should still see 1MB/s rate.
> 
> 		T1	T2	T3	T4	T5	T6
> Parent group:		B1	B2	B3	B4	B5
> Child group:	B1	B2	B3	B4	B5	B6 
> 
> Above B1 to B6 are bios of 1MB size. T1 to T6 are 1 second time interval.
> B1 waits for T1 interval in child group and then for T2 interval in
> parent group and then gets dispatched. But a pipe line has formed in
> child group and B2 is waiting in child group in T2 slice. So penalty
> is not double.
> 
> So each group migration will add one extra wait period. In above case
> 5 bios dispatched in 6 seconds. Longer the sampling interval, delay
> remains the constant to one time interval and % penalty goes down.

Yeah, I think that's what *should* be happening but not what I'm
seeing.  I'm seeing ~15% penalty.  It works fine if there are more
than one active children but with a single child configured at the
same limit, it doesn't work as expected.  I'm a bit lost where the
difference is coming from.  Hmmm... also in the above example, we
really should be doing the following.

 		T1	T2	T3	T4	T5	T6
 Parent group:	B1	B2	B3	B4	B5	B6
 Child group:	B1	B2	B3	B4	B5	B6 

I mean, if there's no other IO going on, there's no point in delaying
the first IO.  ie. the slice should be considered as started before so
that B1 can be issued immediately, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-02 23:13                   ` Tejun Heo
@ 2013-05-03 17:56                     ` Vivek Goyal
  2013-05-03 18:57                       ` Tejun Heo
  0 siblings, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-03 17:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Thu, May 02, 2013 at 04:13:07PM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Thu, May 02, 2013 at 03:31:39PM -0400, Vivek Goyal wrote:
> > I think my example was little flawed previously. I think you are right.
> > Penalty is not probably as bad as I have been thinking.
> > 
> > So if both parent and child have limit of 1MB/s and application is doing
> > IO (say at 2MB/sec), in long term it should still see 1MB/s rate.
> > 
> > 		T1	T2	T3	T4	T5	T6
> > Parent group:		B1	B2	B3	B4	B5
> > Child group:	B1	B2	B3	B4	B5	B6 
> > 
> > Above B1 to B6 are bios of 1MB size. T1 to T6 are 1 second time interval.
> > B1 waits for T1 interval in child group and then for T2 interval in
> > parent group and then gets dispatched. But a pipe line has formed in
> > child group and B2 is waiting in child group in T2 slice. So penalty
> > is not double.
> > 
> > So each group migration will add one extra wait period. In above case
> > 5 bios dispatched in 6 seconds. Longer the sampling interval, delay
> > remains the constant to one time interval and % penalty goes down.
> 
> Yeah, I think that's what *should* be happening but not what I'm
> seeing.  I'm seeing ~15% penalty.

What test are you running? I am running a simple dd with direct IO and
I am not seeing any penalty.

# set limit to 1000000 bytes/second both in parent and child cgroup
# dd if=/dev/vdb of=/dev/null iflag=direct

I will capture blktrace and analyze it though to understand better
what's happening.

>  It works fine if there are more
> than one active children but with a single child configured at the
> same limit, it doesn't work as expected.  I'm a bit lost where the
> difference is coming from.  Hmmm... also in the above example, we
> really should be doing the following.
> 
>  		T1	T2	T3	T4	T5	T6
>  Parent group:	B1	B2	B3	B4	B5	B6
>  Child group:	B1	B2	B3	B4	B5	B6 
> 
> I mean, if there's no other IO going on, there's no point in delaying
> the first IO.  ie. the slice should be considered as started before so
> that B1 can be issued immediately, right?

Yes, that's the right thing to do. So maybe we can tell the parent group when
the bio was queued. The parent will have to start a new time slice. It also
needs to look at when its last slice finished and take the greater of that
time and the time passed up by the child.
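
As a rough sketch of that rule, with hypothetical names and plain unsigned
long timestamps standing in for jiffies: the comparison is done on the signed
difference, the same idea as the kernel's time_after(), so it still picks the
later time if the counter has wrapped.

```c
/*
 * Toy model, not kernel code: pick the start of the parent's new slice as
 * the later of (a) when the parent's last slice finished and (b) the time
 * passed up by the child.
 */
static unsigned long toy_parent_slice_start(unsigned long parent_last_end,
					    unsigned long child_time)
{
	/* wrap-safe "child_time is after parent_last_end" */
	if ((long)(child_time - parent_last_end) > 0)
		return child_time;
	return parent_last_end;
}
```

For example, toy_parent_slice_start(100, 150) yields 150 and
toy_parent_slice_start(200, 150) yields 200.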

/me needs to think a little more about time slice management.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 17:56                     ` Vivek Goyal
@ 2013-05-03 18:57                       ` Tejun Heo
  2013-05-03 18:58                         ` Tejun Heo
  2013-05-03 19:08                         ` Vivek Goyal
  0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-03 18:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 01:56:52PM -0400, Vivek Goyal wrote:
> > Yeah, I think that's what *should* be happening but not what I'm
> > seeing.  I'm seeing ~15% penalty.
> 
> What test are you running. I am running a simple dd with directIO and
> I am not seeing any penalty.

Combination of dd and a test program that I've been using for some
while which can generate concurrent direct random IOs.  Attaching the
source code for the latter.

> # set limit to 1000000 bytes/second both in parent and child cgroup
> # dd if=/dev/vdb of=/dev/null iflag=direct
> 
> I will capture blktrace and analyze it though to understand better
> what's happening.

Try using larger block size.  It looks like dispatch windows being
reset depending on timing is hurting the overall bandwidth.  It
becomes pronounced with larger IOs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 18:57                       ` Tejun Heo
@ 2013-05-03 18:58                         ` Tejun Heo
  2013-05-03 19:08                         ` Vivek Goyal
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-03 18:58 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

[-- Attachment #1: Type: text/plain, Size: 557 bytes --]

On Fri, May 03, 2013 at 11:57:51AM -0700, Tejun Heo wrote:
> On Fri, May 03, 2013 at 01:56:52PM -0400, Vivek Goyal wrote:
> > > Yeah, I think that's what *should* be happening but not what I'm
> > > seeing.  I'm seeing ~15% penalty.
> > 
> > What test are you running. I am running a simple dd with directIO and
> > I am not seeing any penalty.
> 
> Combination of dd and a test program that I've been using for some
> while which can generate concurrent direct random IOs.  Attaching the
> source code for the latter.

And actually attaching...

-- 
tejun

[-- Attachment #2: test_rawio.c --]
[-- Type: text/plain, Size: 5824 bytes --]

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <ctype.h>
#include <unistd.h>
#include <inttypes.h>
#include <sys/ioctl.h>
#include <signal.h>
#include <pthread.h>
#include <time.h>
#include <string.h>
#include <sys/time.h>

#include <sys/user.h>
#include <linux/fs.h>

static int dev_fd, blocks_per_rq, concurrency, do_write;
static int block_size;
static uint64_t device_size, nr_blocks;

static int exiting, nr_exited;

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static uint64_t *dispenser_ar;
static unsigned nr_succeeded, nr_failed;

static void sigexit_handler(int dummy)
{
	exiting = 1;
}

static uint64_t dispense_block(int idx)
{
	while (1) {
		uint64_t block;
		int i;
		block = ((uint64_t)random() << 31 | random())
			% (nr_blocks - blocks_per_rq + 1);
		for (i = 0; i < concurrency; i++) {
			if (block + blocks_per_rq > dispenser_ar[i] &&
			    block < dispenser_ar[i] + blocks_per_rq)
				break;
		}
		if (i == concurrency) {
			dispenser_ar[idx] = block;
			return block;
		}
	}
}

static void * do_rawio(void *arg)
{
	int idx = (int)(unsigned long)arg, my_exiting = 0, i;
	size_t bufsz = blocks_per_rq * block_size;
	char *rbuf, *wbuf = NULL;
	uint64_t block;
	ssize_t ret;

	if ((rbuf = malloc(bufsz + PAGE_SIZE)) == NULL ||
	    (do_write && (wbuf = malloc(bufsz + PAGE_SIZE)) == NULL)) {
		perror("malloc");
		exit(1);
	}

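	/* round the buffers up to a page boundary for O_DIRECT */
	rbuf = (void *)((unsigned long)(rbuf + PAGE_SIZE-1) & ~(PAGE_SIZE-1));
	if (do_write) {
		wbuf = (void *)((unsigned long)(wbuf + PAGE_SIZE-1) & ~(PAGE_SIZE-1));

		/* fill the whole write buffer with a per-thread pattern */
		for (i = 0; i < (int)bufsz; i++)
			wbuf[i] = idx + i;
	}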

	pthread_mutex_lock(&mutex);
 again:
	if (exiting || my_exiting) {
		nr_exited++;
		pthread_mutex_unlock(&mutex);
		return NULL;
	}
	block = dispense_block(idx);
	pthread_mutex_unlock(&mutex);

	if (do_write) {
		ret = pwrite(dev_fd, wbuf, bufsz, block * block_size);
		if (ret != bufsz) {
			fprintf(stderr, "\rThread %02d: write failed on "
				"block %"PRIu64" ret=%zd errno=%d wbuf=%p\n",
				idx, block, ret, errno, wbuf);
			goto failed;
		}
	}

	ret = pread(dev_fd, rbuf, bufsz, block * block_size);
	if (ret != bufsz) {
		fprintf(stderr, "\rThread %02d: read failed on block "
			"%"PRIu64" ret=%zd errno=%d rbuf=%p\n",
			idx, block, ret, errno, rbuf);
		goto failed;
	}

	if (do_write && memcmp(wbuf, rbuf, bufsz) != 0) {
		fprintf(stderr, "\rThread %02d: data mismatch on block "
			"%"PRIu64" ret=%zd errno=%d\n", idx, block, ret, errno);
		goto failed;
	}

	/* update the shared counters under the mutex to avoid a data race */
	pthread_mutex_lock(&mutex);
	nr_succeeded++;
	goto again;

 failed:
	pthread_mutex_lock(&mutex);
	nr_failed++;
	my_exiting = 1;
	goto again;
}

static uint64_t now_in_usec(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(int argc, char **argv)
{
	struct stat sbuf;
	int i, summary_only;
	pthread_t *thrs;
	uint64_t started_at, last_tstmp;
	unsigned last_succeeded = 0;
	double iops = 0;

	if (argc < 5) {
		fprintf(stderr,
		"Usage: test_rawio BLOCKDEV BLOCKS_PER_RQ CONCURRENCY (r|w) [s(ummary)|w(ait)]\n");
		return 1;
	}

	blocks_per_rq = atoi(argv[2]);
	concurrency = atoi(argv[3]);

	if (blocks_per_rq <= 0 || concurrency <= 0) {
		fprintf(stderr, "invalid parameters\n");
		return 1;
	}

	if (!(dispenser_ar = malloc(sizeof(dispenser_ar[0]) * concurrency)) ||
	    !(thrs = malloc(sizeof(thrs[0]) * concurrency))) {
		perror("malloc");
		return 1;
	}
	memset(dispenser_ar, 0, sizeof(dispenser_ar[0]) * concurrency);

	do_write = tolower(argv[4][0]) == 'w';

	summary_only = 0;
	if (argc >= 6 && strchr(argv[5], 's'))
		summary_only = 1;

	if (argc >= 6 && strchr(argv[5], 'w')) {
		char buf[64];
		printf("press enter to continue\n");
		fgets(buf, sizeof(buf), stdin);
	}

	dev_fd = open(argv[1], (do_write ? O_RDWR : O_RDONLY) | O_DIRECT);
	if (dev_fd < 0) {
		perror("open");
		return 1;
	}

	if (fstat(dev_fd, &sbuf) < 0) {
		perror("fstat");
		return 1;
	}

	if (!S_ISBLK(sbuf.st_mode)) {
		fprintf(stderr, "not a block device\n");
		return 1;
	}

	if (ioctl(dev_fd, BLKSSZGET, &block_size) < 0 ||
	    ioctl(dev_fd, BLKGETSIZE64, &device_size) < 0) {
		perror("ioctl");
		return 1;
	}
	nr_blocks = device_size / block_size;

	if (!summary_only)
		printf("%s block_size=%d nr_blocks=%"PRIu64" (%.2lfGiB)\n",
		       argv[1], block_size, nr_blocks,
		       (double)device_size / (1 << 30));

	if (signal(SIGINT, sigexit_handler) == SIG_ERR) {
		perror("signal");
		return 1;
	}

	srandom(getpid());

	for (i = 0; i < concurrency; i++)
		if ((errno = pthread_create(&thrs[i], NULL, do_rawio,
					    (void *)(unsigned long)i))) {
			perror("pthread_create");
			return 1;
		}

	started_at = last_tstmp = now_in_usec();

	while (nr_exited < concurrency) {
		struct timespec ts_200ms = { 0, 200 * 1000 * 1000 };
		const char pgstr[] = "|/-\\";

		if (!summary_only) {
			uint64_t now = now_in_usec();
			double time_delta = ((double)now - last_tstmp) / 1000000;
			double io_delta = nr_succeeded - last_succeeded;

			if (last_tstmp - started_at < 1000000)
				iops = io_delta / time_delta;
			else
				iops = iops * 0.9 + io_delta / time_delta * 0.1;

			printf("\rnr_succeeded=%-8u nr_failed=%-8u iops=%7.03lf kbps=%9.03lf %s%c",
			       nr_succeeded, nr_failed, iops,
			       iops * block_size * blocks_per_rq / 1024,
			       exiting ? "exiting..." : "",
			       pgstr[i++%(sizeof(pgstr)-1)]);

			last_tstmp = now;
			last_succeeded += io_delta;
		}

		fflush(stdout);
		nanosleep(&ts_200ms, NULL);
	}

	if (!summary_only)
		printf("\n");
	else
		printf("nr_succeeded=%u nr_failed=%8u iops=%03.03lf\n",
		       nr_succeeded, nr_failed,
		       (double)nr_succeeded /
		       (((double)now_in_usec() - started_at) / 1000000));

	return 0;
}

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 18:57                       ` Tejun Heo
  2013-05-03 18:58                         ` Tejun Heo
@ 2013-05-03 19:08                         ` Vivek Goyal
  2013-05-03 19:14                           ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-03 19:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 11:57:51AM -0700, Tejun Heo wrote:

[..]
> > # set limit to 1000000 bytes/second both in parent and child cgroup
> > # dd if=/dev/vdb of=/dev/null iflag=direct
> > 
> > I will capture blktrace and analyze it though to understand better
> > what's happening.
> 
> Try using larger block size.  It looks like dispatch windows being
> reset depending on timing is hurting the overall bandwidth.  It
> becomes pronounced with larger IOs.

Ok, I tried dd with block size 1M and I can now see it happening.

dd if=/dev/vdb of=/dev/null bs=1M iflag=direct

dd program sends down 2-3 bios of 512K each. And then it is waiting
for all the bios to finish before it issues more IO.

So if three bios b1, b2, and b3 have been sent down, b4 does not
get issued till b3 has finished. Hence following happens.

		T1	T2	T3	T4	T5	T6	T7
parent:			b1	b2	b3		b4 	b5
child: 		b1	b2	b3		b4	b5	


So continuity breaks down because the application is waiting for previous
IO to finish. This forces expiry of the existing time slices and the start of
new time slices in both child and parent, and the penalty keeps on increasing.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 19:08                         ` Vivek Goyal
@ 2013-05-03 19:14                           ` Tejun Heo
  2013-05-03 19:26                             ` Vivek Goyal
  2013-05-03 21:05                             ` Vivek Goyal
  0 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-03 19:14 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 03:08:23PM -0400, Vivek Goyal wrote:
> 		T1	T2	T3	T4	T5	T6	T7
> parent:			b1	b2	b3		b4 	b5
> child: 		b1	b2	b3		b4	b5	
> 
> 
> So continuity breaks down because application is waiting for previous
> IO to finish. This forces expiry of existing time slices and new time
> slice start both in child and parent and penalty keep on increasing.

It's a problem even in flat mode as the "child" above can easily be
just a process which is throttling itself and it won't be able to get
the configured bandwidth due to the scheduling bubbles introduced
whenever new slice is started.  Shouldn't be too difficult to get rid
of, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 19:14                           ` Tejun Heo
@ 2013-05-03 19:26                             ` Vivek Goyal
  2013-05-03 21:05                             ` Vivek Goyal
  1 sibling, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2013-05-03 19:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 12:14:18PM -0700, Tejun Heo wrote:
> On Fri, May 03, 2013 at 03:08:23PM -0400, Vivek Goyal wrote:
> > 		T1	T2	T3	T4	T5	T6	T7
> > parent:			b1	b2	b3		b4 	b5
> > child: 		b1	b2	b3		b4	b5	
> > 
> > 
> > So continuity breaks down because application is waiting for previous
> > IO to finish. This forces expiry of existing time slices and new time
> > slice start both in child and parent and penalty keep on increasing.
> 
> It's a problem even in flat mode as the "child" above can easily be
> just a process which is throttling itself and it won't be able to get
> the configured bandwidth due to the scheduling bubbles introduced
> whenever new slice is started.  Shouldn't be too difficult to get rid
> of, right?

The key thing here is when to start a new slice. Generally, when an IO has
been dispatched from a group, we do not expire the slice immediately. We
give the group a grace period of throtl_slice (100ms). If the next IO does
not arrive within that duration, we start a fresh slice upon the next IO
arrival.
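
A toy model of that grace period, as a sketch only. The toy_* names and the
struct layout are hypothetical, times are in milliseconds rather than
jiffies, and counter wraparound is ignored for simplicity.

```c
#define TOY_THROTL_SLICE_MS	100UL

struct toy_slice {
	unsigned long	start_ms;
	unsigned long	end_ms;
	unsigned long	bytes_disp;	/* bytes dispatched in this slice */
};

/* after a dispatch: keep the slice alive for at least one more grace period */
static void toy_after_dispatch(struct toy_slice *s, unsigned long now_ms,
			       unsigned long bytes)
{
	s->bytes_disp += bytes;
	if (s->end_ms < now_ms + TOY_THROTL_SLICE_MS)
		s->end_ms = now_ms + TOY_THROTL_SLICE_MS;
}

/* on the next bio arrival: returns 1 if a fresh slice had to be started */
static int toy_on_arrival(struct toy_slice *s, unsigned long now_ms)
{
	if (now_ms <= s->end_ms)
		return 0;	/* within the grace period, accounting continues */

	s->start_ms = now_ms;
	s->end_ms = now_ms + TOY_THROTL_SLICE_MS;
	s->bytes_disp = 0;
	return 1;
}
```

In this model a bio arriving 50ms after the last dispatch continues the
existing slice, while one arriving 200ms later starts a fresh slice and the
earlier accounting is discarded.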

I think a similar problem should happen if there are two stacked devices,
both doing throttling, and the delay between two IOs is big enough to force
expiry of the slice on each device.

At least for the hierarchy case, we should be able to start a fresh time
slice when the child transfers a bio to the parent. I will write a patch and
do some experiments.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 19:14                           ` Tejun Heo
  2013-05-03 19:26                             ` Vivek Goyal
@ 2013-05-03 21:05                             ` Vivek Goyal
  2013-05-03 23:54                               ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-03 21:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 12:14:18PM -0700, Tejun Heo wrote:
> On Fri, May 03, 2013 at 03:08:23PM -0400, Vivek Goyal wrote:
> > 		T1	T2	T3	T4	T5	T6	T7
> > parent:			b1	b2	b3		b4 	b5
> > child: 		b1	b2	b3		b4	b5	
> > 
> > 
> > So continuity breaks down because application is waiting for previous
> > IO to finish. This forces expiry of existing time slices and new time
> > slice start both in child and parent and penalty keep on increasing.
> 
> It's a problem even in flat mode as the "child" above can easily be
> just a process which is throttling itself and it won't be able to get
> the configured bandwidth due to the scheduling bubbles introduced
> whenever new slice is started.  Shouldn't be too difficult to get rid
> of, right?

Hi Tejun,

Following inline patch implements transferring child's start time to 
parent, if parent slice had expired at the time of bio migration.

It does seem to help a lot on my machine. Can you please give it a try?

I think even this approach is flawed. If there are multiple children,
it might happen that one child pushes up the bio early (as it is smaller
size bio or group rate is high). And later other child pushes up the bio
(bio was big or group limit was slow). We still lost time and effective
rate will still be lower than anticipated.

Thanks
Vivek


---
 block/blk-throttle.c |   40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

Index: linux-2.6/block/blk-throttle.c
===================================================================
--- linux-2.6.orig/block/blk-throttle.c	2013-05-03 16:03:10.279016400 -0400
+++ linux-2.6/block/blk-throttle.c	2013-05-03 16:46:11.398936631 -0400
@@ -547,6 +547,30 @@ static bool throtl_schedule_next_dispatc
 	return false;
 }
 
+static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
+		bool rw, unsigned long start)
+{
+	tg->bytes_disp[rw] = 0;
+	tg->io_disp[rw] = 0;
+
+	/*
+	 * Previous slice has expired. We must have trimmed it after last
+	 * bio dispatch. That means since start of last slice, we never used
+	 * that bandwidth. Do try to make use of that bandwidth while giving
+	 * credit
+	 */
+	if (time_after_eq(start, tg->slice_start[rw])) {
+		tg->slice_start[rw] = start;
+	} else
+		tg->slice_start[rw] = tg->slice_start[rw];
+
+	tg->slice_end[rw] = jiffies + throtl_slice;
+	throtl_log(&tg->service_queue,
+		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
+}
+
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
 	tg->bytes_disp[rw] = 0;
@@ -888,6 +912,16 @@ static void tg_update_disptime(struct th
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
+static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
+					struct throtl_grp *parent_tg, bool rw)
+{
+	if (throtl_slice_used(parent_tg, rw)) {
+		throtl_start_new_slice_with_credit(parent_tg, rw,
+				child_tg->slice_start[rw]);
+	}
+
+}
+
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -908,7 +942,9 @@ static void tg_dispatch_one_bio(struct t
 	 * responsible for issuing these bios.
 	 */
 	if (parent_tg) {
+		throtl_log(parent_sq, "add bio sz=%u", bio->bi_size);
 		throtl_add_bio_tg(bio, parent_tg);
+		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
 		bio_list_add(&parent_sq->bio_lists[rw], bio);
 		BUG_ON(tg->td->nr_queued[rw] <= 0);
@@ -1369,11 +1405,11 @@ bool blk_throtl_bio(struct request_queue
 	}
 
 	/* out-of-limit, queue to @tg */
-	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d",
+	throtl_log(sq, "[%c] bio. bdisp=%llu sz=%u bps=%llu iodisp=%u iops=%u queued=%d/%d sz=%u",
 		   rw == READ ? 'R' : 'W',
 		   tg->bytes_disp[rw], bio->bi_size, tg->bps[rw],
 		   tg->io_disp[rw], tg->iops[rw],
-		   sq->nr_queued[READ], sq->nr_queued[WRITE]);
+		   sq->nr_queued[READ], sq->nr_queued[WRITE], bio->bi_size);
 
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 21:05                             ` Vivek Goyal
@ 2013-05-03 23:54                               ` Tejun Heo
  2013-05-06 17:33                                 ` Vivek Goyal
  0 siblings, 1 reply; 68+ messages in thread
From: Tejun Heo @ 2013-05-03 23:54 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

Hey,

On Fri, May 03, 2013 at 05:05:13PM -0400, Vivek Goyal wrote:
> Following inline patch implements transferring child's start time to 
> parent, if parent slice had expired at the time of bio migration.
> 
> I does seem to help a lot on my machine. Can you please give it a try.

Cool, will give it a try.  Can you please make it a proper patch with
SOB?  Please feel free to base it on top of the series.  It can
probably go right below the final patch but the rebase there should be
trivial.

> I think even this approach is flawed. If there are multiple children,
> it might happen that one child pushes up the bio early (as it is smaller
> size bio or group rate is high). And later other child pushes up the bio
> (bio was big or group limit was slow). We still lost time and effective
> rate will still be lower than anticipated.

Hmmm... I see.  Well, as long as it behaves acceptably on three level
nesting, I think it's good for now.

Here's RR dispatch patch I'm working on now.  I think it's almost
ready now.  I'll go over it one more time, split and merge it into the
patch series.

Thanks.

Index: work/block/blk-throttle.c
===================================================================
--- work.orig/block/blk-throttle.c
+++ work/block/blk-throttle.c
@@ -26,6 +26,29 @@ static struct blkcg_policy blkcg_policy_
 /* A workqueue to queue throttle related work */
 static struct workqueue_struct *kthrotld_workqueue;
 
+/*
+ * To implement hierarchical throttling, throtl_grps form a tree and bios
+ * are dispatched upwards level by level until they reach the top and get
+ * issued.  When dispatching bios from the children and local group at each
+ * level, if the bios are dispatched into a single bio_list, there's a risk
+ * of a local or child group which can queue many bios at once filling up
+ * the list starving others.
+ *
+ * To avoid such starvation, dispatched bios are queued separately
+ * according to where they came from.  When they are again dispatched to
+ * the parent, they're popped in round-robin order so that no single source
+ * hogs the dispatch window.
+ *
+ * throtl_qnode is used to keep the queued bios separated by their sources.
+ * Bios are queued to throtl_qnode which in turn is queued to
+ * throtl_service_queue and then dispatched in round-robin order.
+ */
+struct throtl_qnode {
+	struct list_head	node;		/* service_queue->queued[] */
+	struct bio_list		bios;		/* queued bios */
+	struct throtl_grp	*tg;		/* tg this qnode belongs to */
+};
+
 struct throtl_service_queue {
 	struct throtl_service_queue *parent_sq;	/* the parent service_queue */
 
@@ -33,7 +56,7 @@ struct throtl_service_queue {
 	 * Bios queued directly to this service_queue or dispatched from
 	 * children throtl_grp's.
 	 */
-	struct bio_list		bio_lists[2];	/* queued bios [READ/WRITE] */
+	struct list_head	queued[2];	/* throtl_qnode [READ/WRITE] */
 	unsigned int		nr_queued[2];	/* number of queued bios */
 
 	/*
@@ -76,6 +99,17 @@ struct throtl_grp {
 	struct throtl_service_queue service_queue;
 
 	/*
+	 * qnode_on_self is used when bios are directly queued to this
+	 * throtl_grp so that local bios compete fairly with bios
+	 * dispatched from children.  qnode_on_parent is used when bios are
+	 * dispatched from this throtl_grp into its parent and will
+	 * compete with sibling qnode_on_parents and the parent's
+	 * qnode_on_self.
+	 */
+	struct throtl_qnode qnode_on_self;
+	struct throtl_qnode qnode_on_parent;
+
+	/*
 	 * Dispatch time in jiffies. This is the estimated time when group
 	 * will unthrottle and is ready to dispatch more bio. It is used as
 	 * key to sort active groups in service tree.
@@ -250,12 +284,81 @@ alloc_stats:
 		goto alloc_stats;
 }
 
+static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
+{
+	INIT_LIST_HEAD(&qn->node);
+	bio_list_init(&qn->bios);
+	qn->tg = tg;
+}
+
+/**
+ * throtl_qnode_add_bio - add a bio to a throtl_qnode and activate it
+ * @bio: bio being added
+ * @qn: qnode to add bio to
+ * @queued: the service_queue->queued[] list @qn belongs to
+ */
+static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn,
+				 struct list_head *queued)
+{
+	bio_list_add(&qn->bios, bio);
+	if (list_empty(&qn->node)) {
+		list_add_tail(&qn->node, queued);
+		blkg_get(tg_to_blkg(qn->tg));
+	}
+}
+
+/**
+ * throtl_peek_queued - peek the first bio on a qnode list
+ * @queued: the qnode list to peek
+ */
+static struct bio *throtl_peek_queued(struct list_head *queued)
+{
+	struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node);
+	struct bio *bio;
+
+	if (list_empty(queued))
+		return NULL;
+
+	bio = bio_list_peek(&qn->bios);
+	WARN_ON_ONCE(!bio);
+	return bio;
+}
+
+/**
+ * throtl_pop_queued - pop the first bio from a qnode list
+ * @queued: the qnode list to pop a bio from
+ *
+ * Pop the first bio from the qnode list @queued.  After popping, the first
+ * qnode is removed from @queued if empty or moved to the end of @queued so
+ * that the popping order is round-robin.
+ */
+static struct bio *throtl_pop_queued(struct list_head *queued)
+{
+	struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node);
+	struct bio *bio;
+
+	if (list_empty(queued))
+		return NULL;
+
+	bio = bio_list_pop(&qn->bios);
+	WARN_ON_ONCE(!bio);
+
+	if (bio_list_empty(&qn->bios)) {
+		list_del_init(&qn->node);
+		blkg_put(tg_to_blkg(qn->tg));
+	} else {
+		list_move_tail(&qn->node, queued);
+	}
+
+	return bio;
+}
+
 /* init a service_queue, assumes the caller zeroed it */
 static void throtl_service_queue_init(struct throtl_service_queue *sq,
 				      struct throtl_service_queue *parent_sq)
 {
-	bio_list_init(&sq->bio_lists[0]);
-	bio_list_init(&sq->bio_lists[1]);
+	INIT_LIST_HEAD(&sq->queued[0]);
+	INIT_LIST_HEAD(&sq->queued[1]);
 	sq->pending_tree = RB_ROOT;
 	sq->parent_sq = parent_sq;
 	setup_timer(&sq->pending_timer, throtl_pending_timer_fn,
@@ -293,6 +396,9 @@ static void throtl_pd_init(struct blkcg_
 		parent_sq = &blkg_to_tg(blkg->parent)->service_queue;
 
 	throtl_service_queue_init(&tg->service_queue, parent_sq);
+	throtl_qnode_init(&tg->qnode_on_self, tg);
+	throtl_qnode_init(&tg->qnode_on_parent, tg);
+
 	RB_CLEAR_NODE(&tg->rb_node);
 	tg->td = td;
 
@@ -752,7 +858,7 @@ static bool tg_may_dispatch(struct throt
 	 * queued.
 	 */
 	BUG_ON(tg->service_queue.nr_queued[rw] &&
-	       bio != bio_list_peek(&tg->service_queue.bio_lists[rw]));
+	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
@@ -843,11 +949,24 @@ static void throtl_charge_bio(struct thr
 	}
 }
 
-static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg)
+/**
+ * throtl_add_bio_tg - add a bio to the specified throtl_grp
+ * @bio: bio to add
+ * @qn: qnode to use
+ * @tg: the target throtl_grp
+ *
+ * Add @bio to @tg's service_queue using @qn.  If @qn is not specified,
+ * tg->qnode_on_self is used.
+ */
+static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
+			      struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
 
+	if (!qn)
+		qn = &tg->qnode_on_self;
+
 	/*
 	 * If @tg doesn't currently have any bios queued in the same
 	 * direction, queueing @bio can change when @tg should be
@@ -857,9 +976,8 @@ static void throtl_add_bio_tg(struct bio
 	if (!sq->nr_queued[rw])
 		tg->flags |= THROTL_TG_WAS_EMPTY;
 
-	bio_list_add(&sq->bio_lists[rw], bio);
-	/* Take a bio reference on tg */
-	blkg_get(tg_to_blkg(tg));
+	throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
+
 	sq->nr_queued[rw]++;
 	throtl_enqueue_tg(tg);
 }
@@ -870,10 +988,10 @@ static void tg_update_disptime(struct th
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
 
-	if ((bio = bio_list_peek(&sq->bio_lists[READ])))
+	if ((bio = throtl_peek_queued(&sq->queued[READ])))
 		tg_may_dispatch(tg, bio, &read_wait);
 
-	if ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
+	if ((bio = throtl_peek_queued(&sq->queued[WRITE])))
 		tg_may_dispatch(tg, bio, &write_wait);
 
 	min_wait = min(read_wait, write_wait);
@@ -895,7 +1013,7 @@ static void tg_dispatch_one_bio(struct t
 	struct throtl_grp *parent_tg = sq_to_tg(parent_sq);
 	struct bio *bio;
 
-	bio = bio_list_pop(&sq->bio_lists[rw]);
+	bio = throtl_pop_queued(&sq->queued[rw]);
 	sq->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
@@ -908,17 +1026,15 @@ static void tg_dispatch_one_bio(struct t
 	 * responsible for issuing these bios.
 	 */
 	if (parent_tg) {
-		throtl_add_bio_tg(bio, parent_tg);
+		throtl_add_bio_tg(bio, &tg->qnode_on_parent, parent_tg);
 	} else {
-		bio_list_add(&parent_sq->bio_lists[rw], bio);
+		throtl_qnode_add_bio(bio, &tg->qnode_on_parent,
+				     &parent_sq->queued[rw]);
 		BUG_ON(tg->td->nr_queued[rw] <= 0);
 		tg->td->nr_queued[rw]--;
 	}
 
 	throtl_trim_slice(tg, rw);
-
-	/* @bio is transferred to parent, drop its blkg reference */
-	blkg_put(tg_to_blkg(tg));
 }
 
 static int throtl_dispatch_tg(struct throtl_grp *tg)
@@ -931,7 +1047,7 @@ static int throtl_dispatch_tg(struct thr
 
 	/* Try to dispatch 75% READS and 25% WRITES */
 
-	while ((bio = bio_list_peek(&sq->bio_lists[READ])) &&
+	while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio));
@@ -941,7 +1057,7 @@ static int throtl_dispatch_tg(struct thr
 			break;
 	}
 
-	while ((bio = bio_list_peek(&sq->bio_lists[WRITE])) &&
+	while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio));
@@ -1076,10 +1192,9 @@ void blk_throtl_dispatch_work_fn(struct
 	bio_list_init(&bio_list_on_stack);
 
 	spin_lock_irq(q->queue_lock);
-	for (rw = READ; rw <= WRITE; rw++) {
-		bio_list_merge(&bio_list_on_stack, &td_sq->bio_lists[rw]);
-		bio_list_init(&td_sq->bio_lists[rw]);
-	}
+	for (rw = READ; rw <= WRITE; rw++)
+		while ((bio = throtl_pop_queued(&td_sq->queued[rw])))
+			bio_list_add(&bio_list_on_stack, bio);
 	spin_unlock_irq(q->queue_lock);
 
 	if (!bio_list_empty(&bio_list_on_stack)) {
@@ -1295,6 +1410,7 @@ static struct blkcg_policy blkcg_policy_
 bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	struct throtl_data *td = q->td;
+	struct throtl_qnode *qn = NULL;
 	struct throtl_grp *tg;
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
@@ -1362,6 +1478,7 @@ bool blk_throtl_bio(struct request_queue
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
 		 */
+		qn = &tg->qnode_on_parent;
 		sq = sq->parent_sq;
 		tg = sq_to_tg(sq);
 		if (!tg)
@@ -1377,7 +1494,7 @@ bool blk_throtl_bio(struct request_queue
 
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
-	throtl_add_bio_tg(bio, tg);
+	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
 	/*
@@ -1421,9 +1538,9 @@ static void tg_drain_bios(struct throtl_
 
 		throtl_dequeue_tg(tg);
 
-		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
+		while ((bio = throtl_peek_queued(&sq->queued[READ])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio));
-		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
+		while ((bio = throtl_peek_queued(&sq->queued[WRITE])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio));
 	}
 }
@@ -1465,7 +1582,7 @@ void blk_throtl_drain(struct request_que
 
 	/* all bios now should be in td->service_queue, issue them */
 	for (rw = READ; rw <= WRITE; rw++)
-		while ((bio = bio_list_pop(&td->service_queue.bio_lists[rw])))
+		while ((bio = throtl_pop_queued(&td->service_queue.queued[rw])))
 			generic_make_request(bio);
 
 	spin_lock_irq(q->queue_lock);


* [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness
  2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
                   ` (31 preceding siblings ...)
  2013-05-02 17:34 ` [PATCHSET] " Vivek Goyal
@ 2013-05-04  0:50 ` Tejun Heo
  2013-05-04  0:53   ` Tejun Heo
  2013-05-06 16:00   ` Vivek Goyal
  32 siblings, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-04  0:50 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal

With flat hierarchy, there's only a single level of dispatching
happening and fairness beyond that point is the responsibility of the
rest of the block layer and driver, which usually works out okay;
however, with the planned hierarchy support,
service_queue->bio_lists[] can be filled up by bios from a single
source.  While the limits would still be honored, it'd be very easy to
starve IOs from siblings or children.

To avoid such starvation, this patch implements throtl_qnode and
converts service_queue->bio_lists[] to lists of per-source qnodes
which in turn contain the bios.  For example, when a bio is
dispatched from a child group, the bio doesn't get queued on
->bio_lists[] directly but it first gets queued on the group's qnode
which in turn gets queued on service_queue->queued[].  When
dispatching for the upper level, the ->queued[] list is consumed in
round-robin order so that the dispatch window is consumed fairly by
all IO sources.
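
As a rough user-space illustration of the round-robin consumption
described above (not the kernel code: struct qnode, struct queued,
qnode_add() and pop_queued() are hypothetical stand-ins for
throtl_qnode, service_queue->queued[], throtl_qnode_add_bio() and
throtl_pop_queued(), ints stand in for bios, and -1 is assumed to mean
"nothing queued"), the idea can be sketched as:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 8	/* illustrative capacity of one source's FIFO */
#define MAX_SRCS  4	/* illustrative number of sources per queue */

/* Hypothetical stand-in for throtl_qnode: one FIFO of items per source. */
struct qnode {
	int items[MAX_ITEMS];	/* stand-in for the qnode's bio list */
	size_t head, tail;	/* items[head..tail) are currently queued */
};

/* Hypothetical stand-in for service_queue->queued[rw]: active qnodes. */
struct queued {
	struct qnode *active[MAX_SRCS];
	size_t nr;
};

/* Queue @item on @qn; activate @qn at the tail of @q if it was idle. */
static void qnode_add(struct queued *q, struct qnode *qn, int item)
{
	int was_idle = (qn->head == qn->tail);

	assert(qn->tail < MAX_ITEMS);
	qn->items[qn->tail++] = item;
	if (was_idle) {
		assert(q->nr < MAX_SRCS);
		q->active[q->nr++] = qn;
	}
}

/*
 * Pop one item from the first active qnode.  A drained qnode is dropped
 * from @q; otherwise it is moved to the tail, which makes the popping
 * order round-robin across sources.  Returns -1 if nothing is queued.
 */
static int pop_queued(struct queued *q)
{
	struct qnode *qn;
	size_t i;
	int item;

	if (!q->nr)
		return -1;

	qn = q->active[0];
	item = qn->items[qn->head++];

	for (i = 1; i < q->nr; i++)	/* shift the remaining qnodes up */
		q->active[i - 1] = q->active[i];

	if (qn->head != qn->tail)
		q->active[q->nr - 1] = qn;	/* still has items: to the tail */
	else
		q->nr--;			/* drained: deactivate */

	return item;
}
```

In this sketch, if one source queues 1, 2, 3 and a second source then
queues 10, successive pops return 1, 10, 2, 3, so the second source is
not stuck behind the first one's backlog.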

There are two ways a bio can come to a throtl_grp - directly queued to
the group or dispatched from a child.  For the former
throtl_grp->qnode_on_self is used.  For the latter, the child's
->qnode_on_parent.

Note that this means that the child which is contributing a bio to its
parent should stay pinned until all its bios are dispatched to its
grand-parent.  This patch moves blkg refcnting from bio add/remove
spots to qnode activation/deactivation so that the blkg containing an
active qnode is always pinned.  As a child pins its parent, this is
sufficient for keeping the relevant sub-tree pinned while bios are in
flight.
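
A minimal user-space sketch of the refcounting rule described above,
assuming made-up types (struct group, struct qnode) and helpers
(qnode_add_bio(), qnode_pop_bio(), group_put()) that only model the
counting and are not the kernel interfaces: the owner is pinned once per
activation rather than once per bio, and the final put is handed back to
the caller so it can be deferred until the bio has been passed on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a blkg with a reference count. */
struct group {
	int refcnt;
};

/* Hypothetical stand-in for throtl_qnode; only the bio count is modeled. */
struct qnode {
	struct group *owner;	/* group this qnode belongs to */
	int nr_bios;		/* bios currently held by this qnode */
};

/* Adding the first bio activates the qnode and pins its owner. */
static void qnode_add_bio(struct qnode *qn)
{
	if (qn->nr_bios++ == 0)
		qn->owner->refcnt++;
}

/*
 * Remove one bio.  Returns the group whose reference the caller should
 * drop once the bio has been handed to the next level, or NULL if the
 * qnode still holds bios and therefore stays active.
 */
static struct group *qnode_pop_bio(struct qnode *qn)
{
	assert(qn->nr_bios > 0);
	if (--qn->nr_bios == 0)
		return qn->owner;
	return NULL;
}

/* Drop one reference on @g. */
static void group_put(struct group *g)
{
	assert(g->refcnt > 0);
	g->refcnt--;
}
```

With two bios queued, the count in this model rises by one, not two, and
drops back only when the second bio is popped and the caller performs the
deferred put.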

The starvation issue was spotted by Vivek Goyal.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
It took some modifications but it looks good to me now.  This patch
does cause minor conflicts for the following patches but nothing
serious.  Will update the whole series after including your patch.

Thanks.

 block/blk-throttle.c |  198 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 173 insertions(+), 25 deletions(-)

--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -26,6 +26,35 @@ static struct blkcg_policy blkcg_policy_
 /* A workqueue to queue throttle related work */
 static struct workqueue_struct *kthrotld_workqueue;
 
+/*
+ * To implement hierarchical throttling, throtl_grps form a tree and bios
+ * are dispatched upwards level by level until they reach the top and get
+ * issued.  When dispatching bios from the children and local group at each
+ * level, if the bios are dispatched into a single bio_list, there's a risk
+ * of a local or child group which can queue many bios at once filling up
+ * the list starving others.
+ *
+ * To avoid such starvation, dispatched bios are queued separately
+ * according to where they came from.  When they are again dispatched to
+ * the parent, they're popped in round-robin order so that no single source
+ * hogs the dispatch window.
+ *
+ * throtl_qnode is used to keep the queued bios separated by their sources.
+ * Bios are queued to throtl_qnode which in turn is queued to
+ * throtl_service_queue and then dispatched in round-robin order.
+ *
+ * It's also used to track the reference counts on blkg's.  A qnode always
+ * belongs to a throtl_grp and gets queued on itself or the parent, so
+ * incrementing the reference of the associated throtl_grp when a qnode is
+ * queued and decrementing when dequeued is enough to keep the whole blkg
+ * tree pinned while bios are in flight.
+ */
+struct throtl_qnode {
+	struct list_head	node;		/* service_queue->queued[] */
+	struct bio_list		bios;		/* queued bios */
+	struct throtl_grp	*tg;		/* tg this qnode belongs to */
+};
+
 struct throtl_service_queue {
 	struct throtl_service_queue *parent_sq;	/* the parent service_queue */
 
@@ -33,7 +62,7 @@ struct throtl_service_queue {
 	 * Bios queued directly to this service_queue or dispatched from
 	 * children throtl_grp's.
 	 */
-	struct bio_list		bio_lists[2];	/* queued bios [READ/WRITE] */
+	struct list_head	queued[2];	/* throtl_qnode [READ/WRITE] */
 	unsigned int		nr_queued[2];	/* number of queued bios */
 
 	/*
@@ -76,6 +105,17 @@ struct throtl_grp {
 	struct throtl_service_queue service_queue;
 
 	/*
+	 * qnode_on_self is used when bios are directly queued to this
+	 * throtl_grp so that local bios compete fairly with bios
+	 * dispatched from children.  qnode_on_parent is used when bios are
+	 * dispatched from this throtl_grp into its parent and will compete
+	 * with the sibling qnode_on_parents and the parent's
+	 * qnode_on_self.
+	 */
+	struct throtl_qnode qnode_on_self;
+	struct throtl_qnode qnode_on_parent;
+
+	/*
 	 * Dispatch time in jiffies. This is the estimated time when group
 	 * will unthrottle and is ready to dispatch more bio. It is used as
 	 * key to sort active groups in service tree.
@@ -247,12 +287,95 @@ alloc_stats:
 		goto alloc_stats;
 }
 
+static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
+{
+	INIT_LIST_HEAD(&qn->node);
+	bio_list_init(&qn->bios);
+	qn->tg = tg;
+}
+
+/**
+ * throtl_qnode_add_bio - add a bio to a throtl_qnode and activate it
+ * @bio: bio being added
+ * @qn: qnode to add bio to
+ * @queued: the service_queue->queued[] list @qn belongs to
+ *
+ * Add @bio to @qn and put @qn on @queued if it's not already on.
+ * @qn->tg's reference count is bumped when @qn is activated.  See the
+ * comment on top of throtl_qnode definition for details.
+ */
+static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn,
+				 struct list_head *queued)
+{
+	bio_list_add(&qn->bios, bio);
+	if (list_empty(&qn->node)) {
+		list_add_tail(&qn->node, queued);
+		blkg_get(tg_to_blkg(qn->tg));
+	}
+}
+
+/**
+ * throtl_peek_queued - peek the first bio on a qnode list
+ * @queued: the qnode list to peek
+ */
+static struct bio *throtl_peek_queued(struct list_head *queued)
+{
+	struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node);
+	struct bio *bio;
+
+	if (list_empty(queued))
+		return NULL;
+
+	bio = bio_list_peek(&qn->bios);
+	WARN_ON_ONCE(!bio);
+	return bio;
+}
+
+/**
+ * throtl_pop_queued - pop the first bio from a qnode list
+ * @queued: the qnode list to pop a bio from
+ * @tg_to_put: optional out argument for throtl_grp to put
+ *
+ * Pop the first bio from the qnode list @queued.  After popping, the first
+ * qnode is removed from @queued if empty or moved to the end of @queued so
+ * that the popping order is round-robin.
+ *
+ * When the first qnode is removed, its associated throtl_grp should be put
+ * too.  If @tg_to_put is NULL, this function automatically puts it;
+ * otherwise, *@tg_to_put is set to the throtl_grp to put and the caller is
+ * responsible for putting it.
+ */
+static struct bio *throtl_pop_queued(struct list_head *queued,
+				     struct throtl_grp **tg_to_put)
+{
+	struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node);
+	struct bio *bio;
+
+	if (list_empty(queued))
+		return NULL;
+
+	bio = bio_list_pop(&qn->bios);
+	WARN_ON_ONCE(!bio);
+
+	if (bio_list_empty(&qn->bios)) {
+		list_del_init(&qn->node);
+		if (tg_to_put)
+			*tg_to_put = qn->tg;
+		else
+			blkg_put(tg_to_blkg(tg_to_put));
+	} else {
+		list_move_tail(&qn->node, queued);
+	}
+
+	return bio;
+}
+
 /* init a service_queue, assumes the caller zeroed it */
 static void throtl_service_queue_init(struct throtl_service_queue *sq,
 				      struct throtl_service_queue *parent_sq)
 {
-	bio_list_init(&sq->bio_lists[0]);
-	bio_list_init(&sq->bio_lists[1]);
+	INIT_LIST_HEAD(&sq->queued[0]);
+	INIT_LIST_HEAD(&sq->queued[1]);
 	sq->pending_tree = RB_ROOT;
 	sq->parent_sq = parent_sq;
 	setup_timer(&sq->pending_timer, throtl_pending_timer_fn,
@@ -271,6 +394,9 @@ static void throtl_pd_init(struct blkcg_
 	unsigned long flags;
 
 	throtl_service_queue_init(&tg->service_queue, &td->service_queue);
+	throtl_qnode_init(&tg->qnode_on_self, tg);
+	throtl_qnode_init(&tg->qnode_on_parent, tg);
+
 	RB_CLEAR_NODE(&tg->rb_node);
 	tg->td = td;
 
@@ -712,7 +838,7 @@ static bool tg_may_dispatch(struct throt
 	 * queued.
 	 */
 	BUG_ON(tg->service_queue.nr_queued[rw] &&
-	       bio != bio_list_peek(&tg->service_queue.bio_lists[rw]));
+	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
 	if (tg->bps[rw] == -1 && tg->iops[rw] == -1) {
@@ -803,11 +929,24 @@ static void throtl_charge_bio(struct thr
 	}
 }
 
-static void throtl_add_bio_tg(struct bio *bio, struct throtl_grp *tg)
+/**
+ * throtl_add_bio_tg - add a bio to the specified throtl_grp
+ * @bio: bio to add
+ * @qn: qnode to use
+ * @tg: the target throtl_grp
+ *
+ * Add @bio to @tg's service_queue using @qn.  If @qn is not specified,
+ * tg->qnode_on_self is used.
+ */
+static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
+			      struct throtl_grp *tg)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
 	bool rw = bio_data_dir(bio);
 
+	if (!qn)
+		qn = &tg->qnode_on_self;
+
 	/*
 	 * If @tg doesn't currently have any bios queued in the same
 	 * direction, queueing @bio can change when @tg should be
@@ -817,9 +956,8 @@ static void throtl_add_bio_tg(struct bio
 	if (!sq->nr_queued[rw])
 		tg->flags |= THROTL_TG_WAS_EMPTY;
 
-	bio_list_add(&sq->bio_lists[rw], bio);
-	/* Take a bio reference on tg */
-	blkg_get(tg_to_blkg(tg));
+	throtl_qnode_add_bio(bio, qn, &sq->queued[rw]);
+
 	sq->nr_queued[rw]++;
 	throtl_enqueue_tg(tg);
 }
@@ -830,10 +968,10 @@ static void tg_update_disptime(struct th
 	unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
 	struct bio *bio;
 
-	if ((bio = bio_list_peek(&sq->bio_lists[READ])))
+	if ((bio = throtl_peek_queued(&sq->queued[READ])))
 		tg_may_dispatch(tg, bio, &read_wait);
 
-	if ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
+	if ((bio = throtl_peek_queued(&sq->queued[WRITE])))
 		tg_may_dispatch(tg, bio, &write_wait);
 
 	min_wait = min(read_wait, write_wait);
@@ -853,9 +991,16 @@ static void tg_dispatch_one_bio(struct t
 	struct throtl_service_queue *sq = &tg->service_queue;
 	struct throtl_service_queue *parent_sq = sq->parent_sq;
 	struct throtl_grp *parent_tg = sq_to_tg(parent_sq);
+	struct throtl_grp *tg_to_put = NULL;
 	struct bio *bio;
 
-	bio = bio_list_pop(&sq->bio_lists[rw]);
+	/*
+	 * @bio is being transferred from @tg to @parent_sq.  Popping a bio
+	 * from @tg may put its reference and @parent_sq might end up
+	 * getting released prematurely.  Remember the tg to put and put it
+	 * after @bio is transferred to @parent_sq.
+	 */
+	bio = throtl_pop_queued(&sq->queued[rw], &tg_to_put);
 	sq->nr_queued[rw]--;
 
 	throtl_charge_bio(tg, bio);
@@ -868,17 +1013,18 @@ static void tg_dispatch_one_bio(struct t
 	 * responsible for issuing these bios.
 	 */
 	if (parent_tg) {
-		throtl_add_bio_tg(bio, parent_tg);
+		throtl_add_bio_tg(bio, &tg->qnode_on_parent, parent_tg);
 	} else {
-		bio_list_add(&parent_sq->bio_lists[rw], bio);
+		throtl_qnode_add_bio(bio, &tg->qnode_on_parent,
+				     &parent_sq->queued[rw]);
 		BUG_ON(tg->td->nr_queued[rw] <= 0);
 		tg->td->nr_queued[rw]--;
 	}
 
 	throtl_trim_slice(tg, rw);
 
-	/* @bio is transferred to parent, drop its blkg reference */
-	blkg_put(tg_to_blkg(tg));
+	if (tg_to_put)
+		blkg_put(tg_to_blkg(tg_to_put));
 }
 
 static int throtl_dispatch_tg(struct throtl_grp *tg)
@@ -891,7 +1037,7 @@ static int throtl_dispatch_tg(struct thr
 
 	/* Try to dispatch 75% READS and 25% WRITES */
 
-	while ((bio = bio_list_peek(&sq->bio_lists[READ])) &&
+	while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio));
@@ -901,7 +1047,7 @@ static int throtl_dispatch_tg(struct thr
 			break;
 	}
 
-	while ((bio = bio_list_peek(&sq->bio_lists[WRITE])) &&
+	while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
 	       tg_may_dispatch(tg, bio, NULL)) {
 
 		tg_dispatch_one_bio(tg, bio_data_dir(bio));
@@ -1036,10 +1182,9 @@ void blk_throtl_dispatch_work_fn(struct
 	bio_list_init(&bio_list_on_stack);
 
 	spin_lock_irq(q->queue_lock);
-	for (rw = READ; rw <= WRITE; rw++) {
-		bio_list_merge(&bio_list_on_stack, &td_sq->bio_lists[rw]);
-		bio_list_init(&td_sq->bio_lists[rw]);
-	}
+	for (rw = READ; rw <= WRITE; rw++)
+		while ((bio = throtl_pop_queued(&td_sq->queued[rw], NULL)))
+			bio_list_add(&bio_list_on_stack, bio);
 	spin_unlock_irq(q->queue_lock);
 
 	if (!bio_list_empty(&bio_list_on_stack)) {
@@ -1241,6 +1386,7 @@ static struct blkcg_policy blkcg_policy_
 bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	struct throtl_data *td = q->td;
+	struct throtl_qnode *qn = NULL;
 	struct throtl_grp *tg;
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
@@ -1308,6 +1454,7 @@ bool blk_throtl_bio(struct request_queue
 		 * Climb up the ladder.  If we''re already at the top, it
 		 * can be executed directly.
 		 */
+		qn = &tg->qnode_on_parent;
 		sq = sq->parent_sq;
 		tg = sq_to_tg(sq);
 		if (!tg)
@@ -1323,7 +1470,7 @@ bool blk_throtl_bio(struct request_queue
 
 	bio_associate_current(bio);
 	tg->td->nr_queued[rw]++;
-	throtl_add_bio_tg(bio, tg);
+	throtl_add_bio_tg(bio, qn, tg);
 	throttled = true;
 
 	/*
@@ -1367,9 +1514,9 @@ static void tg_drain_bios(struct throtl_
 
 		throtl_dequeue_tg(tg);
 
-		while ((bio = bio_list_peek(&sq->bio_lists[READ])))
+		while ((bio = throtl_peek_queued(&sq->queued[READ])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio));
-		while ((bio = bio_list_peek(&sq->bio_lists[WRITE])))
+		while ((bio = throtl_peek_queued(&sq->queued[WRITE])))
 			tg_dispatch_one_bio(tg, bio_data_dir(bio));
 	}
 }
@@ -1411,7 +1558,8 @@ void blk_throtl_drain(struct request_que
 
 	/* all bios now should be in td->service_queue, issue them */
 	for (rw = READ; rw <= WRITE; rw++)
-		while ((bio = bio_list_pop(&td->service_queue.bio_lists[rw])))
+		while ((bio = throtl_pop_queued(&td->service_queue.queued[rw],
+						NULL)))
 			generic_make_request(bio);
 
 	spin_lock_irq(q->queue_lock);


* Re: [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness
  2013-05-04  0:50 ` [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness Tejun Heo
@ 2013-05-04  0:53   ` Tejun Heo
  2013-05-06 16:00   ` Vivek Goyal
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-04  0:53 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, lizefan, containers, cgroups, vgoyal

On Fri, May 03, 2013 at 05:50:44PM -0700, Tejun Heo wrote:
....
> +static struct bio *throtl_pop_queued(struct list_head *queued,
> +				     struct throtl_grp **tg_to_put)
> +{
> +	struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node);
> +	struct bio *bio;
> +
> +	if (list_empty(queued))
> +		return NULL;
> +
> +	bio = bio_list_pop(&qn->bios);
> +	WARN_ON_ONCE(!bio);
> +
> +	if (bio_list_empty(&qn->bios)) {
> +		list_del_init(&qn->node);
> +		if (tg_to_put)
> +			*tg_to_put = qn->tg;
> +		else
> +			blkg_put(tg_to_blkg(tg_to_put));

Oops, this should have been

			blkg_put(tg_to_blkg(qn->tg));

Thanks.

-- 
tejun


* Re: [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness
  2013-05-04  0:50 ` [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness Tejun Heo
  2013-05-04  0:53   ` Tejun Heo
@ 2013-05-06 16:00   ` Vivek Goyal
  2013-05-06 18:35     ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-06 16:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Fri, May 03, 2013 at 05:50:44PM -0700, Tejun Heo wrote:

[..]
> +	 * qnode_on_self is used when bios are directly queued to this
> +	 * throtl_grp so that local bios compete fairly with bios
> +	 * dispatched from children.  qnode_on_parent is used when bios are
> +	 * dispatched from this throtl_grp into its parent and will compete
> +	 * with the sibling qnode_on_parents and the parent's
> +	 * qnode_on_self.
> +	 */
> +	struct throtl_qnode qnode_on_self;
> +	struct throtl_qnode qnode_on_parent;

Do we need one throtl_qnode for each IO dir (read/write) so that we can
queue up the right direction throtl_qnode in the right sq->queued[rw]?
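
To illustrate the concern with a toy model (an assumption-laden sketch,
not the kernel code: struct qnode, add_bio() and the DIR_* constants are
made up here, and only "which list is the qnode linked on" is modeled):
a qnode is linked onto a queued[] list only when it is not already
linked, so a single qnode shared by both directions stays on whichever
list it joined first.

```c
#include <assert.h>

enum { DIR_READ = 0, DIR_WRITE = 1 };

/* Hypothetical model of a qnode: is it linked, and on which list? */
struct qnode {
	int linked;	/* nonzero once the qnode sits on a queued[] list */
	int list_dir;	/* direction of the list it was linked on */
};

/*
 * Mimics the "link only if not already linked" rule.  Returns the
 * direction of the list the newly added bio will later be popped from.
 */
static int add_bio(struct qnode *qn, int bio_dir)
{
	if (!qn->linked) {
		qn->linked = 1;
		qn->list_dir = bio_dir;
	}
	return qn->list_dir;
}

/* Self-check: a shared qnode misfiles the write, per-direction ones do not. */
static void qnode_dir_selftest(void)
{
	struct qnode shared = {0};
	struct qnode per_dir[2] = {{0}, {0}};

	assert(add_bio(&shared, DIR_READ) == DIR_READ);
	assert(add_bio(&shared, DIR_WRITE) == DIR_READ);	/* wrong list */

	assert(add_bio(&per_dir[DIR_READ], DIR_READ) == DIR_READ);
	assert(add_bio(&per_dir[DIR_WRITE], DIR_WRITE) == DIR_WRITE);
}
```

In this model, a write added to a qnode that is already on the read list
ends up being popped from the read list, which one qnode per direction
avoids.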

Thanks
Vivek


* Re: [PATCHSET] blk-throttle: implement proper hierarchy support
  2013-05-03 23:54                               ` Tejun Heo
@ 2013-05-06 17:33                                 ` Vivek Goyal
  0 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2013-05-06 17:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, lkml, Li Zefan, containers, Cgroups

On Fri, May 03, 2013 at 04:54:55PM -0700, Tejun Heo wrote:
> Hey,
> 
> On Fri, May 03, 2013 at 05:05:13PM -0400, Vivek Goyal wrote:
> > Following inline patch implements transferring child's start time to 
> > parent, if parent slice had expired at the time of bio migration.
> > 
> > I does seem to help a lot on my machine. Can you please give it a try.
> 
> Cool, will give it a try.  Can you please make it a proper patch with
> SOB?  Please feel free to base it on top of the series.  It can
> probably go right below the final patch but the rebase there should be
> trivial.

Hi Tejun,

Here is the patch to fix the issue. It is on top of your old series
(no throtl_qnode stuff). Do let me know if you want me to refresh it
on top of the throtl_qnode patch.

Thanks
Vivek


blk-throtl: Account for child group's start time in parent while bio climbs up

Now with hierarchy support, a bio climbs up the tree before actually
being dispatched. This makes sure the bio is also subjected to the
parent's throttling limits, if any.

It might happen that the parent is idle and when a bio is transferred
to the parent, a new slice starts fresh. But that is not good as the
parent's wait time should have started when the bio was queued in the
child group.

Given the fact that we have not written the hierarchical algorithm
in a way where the child's and parent's time slices are synchronized,
we transfer the child's start time to the parent if the parent was
idling. If the parent was busy dispatching other bios all this while,
this is not an issue.

The child's slice start time is passed to the parent. The parent looks
at its last expired slice start time. If the child's start time is at
or after the parent's old start time, that means the parent had been
idle and after the parent went idle, the child had an IO queued. So use
the child's start time as the parent's start time.

If the parent's start time is after the child's start time, that means
that when the IO got queued in the child group, the parent was not
idle. But later it dispatched some IO, its slice got trimmed and then
it went idle. After a while the child's request got shifted to the
parent group. In this case use the parent's old start time as the new
start time as that's the duration of the slice we did not use.
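
The start-time selection described above can be sketched in user space
as follows. This is only an illustration under assumptions: after_eq()
and credited_start() are hypothetical names, plain unsigned long values
stand in for jiffies, and after_eq() follows the wraparound-safe
comparison idea behind the kernel's time_after_eq().

```c
#include <assert.h>

/* Wraparound-safe "a is at or after b" for free-running tick counters. */
static int after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Slice start the parent would use once its previous slice has expired:
 * the child's start if the child queued its IO at or after the parent's
 * old slice start, otherwise the parent's old (unused) slice start.
 */
static unsigned long credited_start(unsigned long parent_old_start,
				    unsigned long child_start)
{
	return after_eq(child_start, parent_old_start) ?
		child_start : parent_old_start;
}

/* Self-check covering both cases from the description. */
static void credited_start_selftest(void)
{
	/* Parent went idle at 100, child queued IO at 150: use 150. */
	assert(credited_start(100UL, 150UL) == 150UL);
	/* Child queued at 150, parent's old slice began at 200: keep 200. */
	assert(credited_start(200UL, 150UL) == 200UL);
}
```

In both cases the result is the later of the two timestamps, so the
parent is never credited with time from before its own last slice
started.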

This logic is far from perfect: if there are multiple children, the
first child transferring a bio decides the start time, while a bio
might have been queued even earlier in another child which is yet to
transfer it up to the parent. In that case we will lose time and
bandwidth in the parent. This patch is just an approximation to make
the situation somewhat better.
 
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Index: linux-2.6/block/blk-throttle.c
===================================================================
--- linux-2.6.orig/block/blk-throttle.c	2013-05-06 12:59:20.585535099 -0400
+++ linux-2.6/block/blk-throttle.c	2013-05-06 13:03:37.857552427 -0400
@@ -547,6 +547,28 @@ static bool throtl_schedule_next_dispatc
 	return false;
 }
 
+static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg,
+		bool rw, unsigned long start)
+{
+	tg->bytes_disp[rw] = 0;
+	tg->io_disp[rw] = 0;
+
+	/*
+	 * Previous slice has expired. We must have trimmed it after last
+	 * bio dispatch. That means since start of last slice, we never used
+	 * that bandwidth. Do try to make use of that bandwidth while giving
+	 * credit.
+	 */
+	if (time_after_eq(start, tg->slice_start[rw]))
+		tg->slice_start[rw] = start;
+
+	tg->slice_end[rw] = jiffies + throtl_slice;
+	throtl_log(&tg->service_queue,
+		   "[%c] new slice with credit start=%lu end=%lu jiffies=%lu",
+		   rw == READ ? 'R' : 'W', tg->slice_start[rw],
+		   tg->slice_end[rw], jiffies);
+}
+
 static inline void throtl_start_new_slice(struct throtl_grp *tg, bool rw)
 {
 	tg->bytes_disp[rw] = 0;
@@ -888,6 +910,16 @@ static void tg_update_disptime(struct th
 	tg->flags &= ~THROTL_TG_WAS_EMPTY;
 }
 
+static void start_parent_slice_with_credit(struct throtl_grp *child_tg,
+					struct throtl_grp *parent_tg, bool rw)
+{
+	if (throtl_slice_used(parent_tg, rw)) {
+		throtl_start_new_slice_with_credit(parent_tg, rw,
+				child_tg->slice_start[rw]);
+	}
+
+}
+
 static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
 {
 	struct throtl_service_queue *sq = &tg->service_queue;
@@ -909,6 +941,7 @@ static void tg_dispatch_one_bio(struct t
 	 */
 	if (parent_tg) {
 		throtl_add_bio_tg(bio, parent_tg);
+		start_parent_slice_with_credit(tg, parent_tg, rw);
 	} else {
 		bio_list_add(&parent_sq->bio_lists[rw], bio);
 		BUG_ON(tg->td->nr_queued[rw] <= 0);


* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-02  0:39 ` [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() Tejun Heo
@ 2013-05-06 17:36   ` Vivek Goyal
  2013-05-06 18:38     ` Tejun Heo
  2013-05-06 20:38     ` Tejun Heo
  0 siblings, 2 replies; 68+ messages in thread
From: Vivek Goyal @ 2013-05-06 17:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:

[..]
> +#define throtl_log(sq, fmt, args...)	do {				\
> +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> +	struct throtl_data *__td = sq_to_td((sq));			\
>  	char __pbuf[128];						\
>  									\
> -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> +	__pbuf[0] = ' ';						\
> +	__pbuf[1] = '\0';						\
> +	if ((__tg))							\
> +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
						^^

We need one extra space ("throtl%s "), otherwise all the messages
start right after cgroup path. We need a space between cgroup path
and actual message.
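
A small user-space sketch of the difference, assuming a hypothetical
format_msg() helper that builds the prefix the same way the quoted macro
does (a leading space followed by the cgroup path) and uses snprintf()
in place of blk_add_trace_msg():

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "throtl<prefix><msg>" into @out.  @path may
 * be NULL, modeling a service_queue without a throtl_grp.  If @sep is
 * nonzero, a space is placed between the prefix and the message.
 */
static void format_msg(char *out, size_t len, const char *path,
		       int sep, const char *msg)
{
	char pbuf[128];

	assert(out && len > 0);

	/* Same prefix handling as the quoted macro: a space, then the path. */
	pbuf[0] = ' ';
	pbuf[1] = '\0';
	if (path)
		snprintf(pbuf + 1, sizeof(pbuf) - 1, "%s", path);

	if (sep)
		snprintf(out, len, "throtl%s %s", pbuf, msg);
	else
		snprintf(out, len, "throtl%s%s", pbuf, msg);
}
```

For a path of "/grp" and a message of "bio queued", the variant without
the separator produces "throtl /grpbio queued", while the variant with
it produces "throtl /grp bio queued".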

Thanks
Vivek


* Re: [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness
  2013-05-06 16:00   ` Vivek Goyal
@ 2013-05-06 18:35     ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-06 18:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 12:00:06PM -0400, Vivek Goyal wrote:
> On Fri, May 03, 2013 at 05:50:44PM -0700, Tejun Heo wrote:
> 
> [..]
> > +	 * qnode_on_self is used when bios are directly queued to this
> > +	 * throtl_grp so that local bios compete fairly with bios
> > +	 * dispatched from children.  qnode_on_parent is used when bios are
> > +	 * dispatched from this throtl_grp into its parent and will compete
> > +	 * with the sibling qnode_on_parents and the parent's
> > +	 * qnode_on_self.
> > +	 */
> > +	struct throtl_qnode qnode_on_self;
> > +	struct throtl_qnode qnode_on_parent;
> 
> Do we need one throtl_qnode for each IO dir (read/write) so that we can
> queue up the right direction throtl_qnode in the right sq->queued[rw]?

Yes, we do.  Thanks for spotting it.  Will post an updated version soon.

-- 
tejun


* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-06 17:36   ` Vivek Goyal
@ 2013-05-06 18:38     ` Tejun Heo
  2013-05-06 20:38     ` Tejun Heo
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-06 18:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 01:36:44PM -0400, Vivek Goyal wrote:
> On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:
> 
> [..]
> > +#define throtl_log(sq, fmt, args...)	do {				\
> > +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> > +	struct throtl_data *__td = sq_to_td((sq));			\
> >  	char __pbuf[128];						\
> >  									\
> > -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> > -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> > +	__pbuf[0] = ' ';						\
> > +	__pbuf[1] = '\0';						\
> > +	if ((__tg))							\
> > +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> > +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
> 						^^
> 
> We need one extra space ("throtl%s "), otherwise all the messages
> start right after cgroup path. We need a space between cgroup path
> and actual message.

Will update.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-06 17:36   ` Vivek Goyal
  2013-05-06 18:38     ` Tejun Heo
@ 2013-05-06 20:38     ` Tejun Heo
  2013-05-06 20:39       ` Tejun Heo
  2013-05-06 20:41       ` Vivek Goyal
  1 sibling, 2 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-06 20:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 01:36:44PM -0400, Vivek Goyal wrote:
> On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:
> 
> [..]
> > +#define throtl_log(sq, fmt, args...)	do {				\
> > +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> > +	struct throtl_data *__td = sq_to_td((sq));			\
> >  	char __pbuf[128];						\
> >  									\
> > -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> > -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> > +	__pbuf[0] = ' ';						\
> > +	__pbuf[1] = '\0';						\
> > +	if ((__tg))							\
> > +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> > +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
> 
> We need one extra space ("throtl%s "), otherwise all the messages
> start right after cgroup path. We need a space between cgroup path
> and actual message.

Okay, we don't - __pbuf[0] = ' '.  The reason it's doing the above
instead of "throtl %s " is because when the path is nil, we don't
wanna print out two consecutive spaces, so AFAICS the code is correct.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-06 20:38     ` Tejun Heo
@ 2013-05-06 20:39       ` Tejun Heo
  2013-05-06 20:41       ` Vivek Goyal
  1 sibling, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-06 20:39 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 01:38:27PM -0700, Tejun Heo wrote:
> On Mon, May 06, 2013 at 01:36:44PM -0400, Vivek Goyal wrote:
> > On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:
> > 
> > [..]
> > > +#define throtl_log(sq, fmt, args...)	do {				\
> > > +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> > > +	struct throtl_data *__td = sq_to_td((sq));			\
> > >  	char __pbuf[128];						\
> > >  									\
> > > -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> > > -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> > > +	__pbuf[0] = ' ';						\
> > > +	__pbuf[1] = '\0';						\
> > > +	if ((__tg))							\
> > > +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> > > +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
> > 
> > We need one extra space ("throtl%s "), otherwise all the messages
> > start right after cgroup path. We need a space between cgroup path
> > and actual message.
> 
> Okay, we don't - __pbuf[0] = ' '.  The reason it's doing the above
> instead of "throtl %s " is because when the path is nil, we don't
> wanna print out two consecutive spaces, so AFAICS the code is correct.

Oops, I was wrong again, you mean after '%s'.  Yeah, right.  I need to
change it so that it does __pbuf[0] = '\0' if no path and add space
after '%s'.  Updating...

Sorry about the confusion.
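
For illustration, the intended formatting can be sketched in userspace C.
This is one reading of the fix described above, not the updated macro:
format_throtl_msg() and its parameters are hypothetical stand-ins for
throtl_log(), blkg_path() and blk_add_trace_msg(), and a NULL path models
the case where there is no throtl_grp.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for throtl_log().  Builds the string
 * that would be passed on as the trace message.  @path is the cgroup
 * path, or NULL when there is none.
 */
static int format_throtl_msg(char *out, size_t out_len,
			     const char *path, const char *msg)
{
	char pbuf[128];

	pbuf[0] = '\0';			/* nothing to print if no path */
	if (path)
		snprintf(pbuf, sizeof(pbuf), " %s", path);

	/* the space after the first %s separates the path from the message */
	return snprintf(out, out_len, "throtl%s %s", pbuf, msg);
}
```

In this sketch a path of "/grp" yields "throtl /grp dispatch" and a NULL
path yields "throtl dispatch", so there is always exactly one space before
the message and never two in a row.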

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-06 20:38     ` Tejun Heo
  2013-05-06 20:39       ` Tejun Heo
@ 2013-05-06 20:41       ` Vivek Goyal
  2013-05-06 20:43         ` Tejun Heo
  1 sibling, 1 reply; 68+ messages in thread
From: Vivek Goyal @ 2013-05-06 20:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 01:38:27PM -0700, Tejun Heo wrote:
> On Mon, May 06, 2013 at 01:36:44PM -0400, Vivek Goyal wrote:
> > On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:
> > 
> > [..]
> > > +#define throtl_log(sq, fmt, args...)	do {				\
> > > +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> > > +	struct throtl_data *__td = sq_to_td((sq));			\
> > >  	char __pbuf[128];						\
> > >  									\
> > > -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> > > -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> > > +	__pbuf[0] = ' ';						\
> > > +	__pbuf[1] = '\0';						\
> > > +	if ((__tg))							\
> > > +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> > > +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
> > 
> > We need one extra space ("throtl%s "), otherwise all the messages
> > start right after cgroup path. We need a space between cgroup path
> > and actual message.
> 
> Okay, we don't - __pbuf[0] = ' '.  The reason it's doing the above
> instead of "throtl %s " is because when the path is nil, we don't
> wanna print out two consecutive spaces, so AFAICS the code is correct.

What about the space after the cgroup path info (when the path is not nil)?

Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
  2013-05-06 20:41       ` Vivek Goyal
@ 2013-05-06 20:43         ` Tejun Heo
  0 siblings, 0 replies; 68+ messages in thread
From: Tejun Heo @ 2013-05-06 20:43 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, lizefan, containers, cgroups

On Mon, May 06, 2013 at 04:41:41PM -0400, Vivek Goyal wrote:
> On Mon, May 06, 2013 at 01:38:27PM -0700, Tejun Heo wrote:
> > On Mon, May 06, 2013 at 01:36:44PM -0400, Vivek Goyal wrote:
> > > On Wed, May 01, 2013 at 05:39:39PM -0700, Tejun Heo wrote:
> > > 
> > > [..]
> > > > +#define throtl_log(sq, fmt, args...)	do {				\
> > > > +	struct throtl_grp *__tg = sq_to_tg((sq));			\
> > > > +	struct throtl_data *__td = sq_to_td((sq));			\
> > > >  	char __pbuf[128];						\
> > > >  									\
> > > > -	blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf));		\
> > > > -	blk_add_trace_msg((tg)->td->queue, "throtl %s " fmt, __pbuf, ##args); \
> > > > +	__pbuf[0] = ' ';						\
> > > > +	__pbuf[1] = '\0';						\
> > > > +	if ((__tg))							\
> > > > +		blkg_path(tg_to_blkg(__tg), __pbuf + 1, sizeof(__pbuf) - 1); \
> > > > +	blk_add_trace_msg(__td->queue, "throtl%s" fmt, __pbuf, ##args); \
> > > 
> > > We need one extra space ("throtl%s "), otherwise all the messages
> > > start right after cgroup path. We need a space between cgroup path
> > > and actual message.
> > 
> > Okay, we don't - __pbuf[0] = ' '.  The reason it's doing the above
> > instead of "throtl %s " is because when the path is nil, we don't
> > wanna print out two consecutive spaces, so AFAICS the code is correct.
> 
> What about space after the cgroup path info (when path is not nil).

Yeah, I mis-read it.  Fixing it right now.

-- 
tejun

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2013-05-06 20:45 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-02  0:39 [PATCHSET] blk-throttle: implement proper hierarchy support Tejun Heo
2013-05-02  0:39 ` [PATCH 01/31] blkcg: fix error return path in blkg_create() Tejun Heo
2013-05-02  0:39 ` [PATCH 02/31] blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h Tejun Heo
2013-05-02  0:39 ` [PATCH 03/31] blkcg: implement blkg_for_each_descendant_post() Tejun Heo
2013-05-02  0:39 ` [PATCH 04/31] blkcg: invoke blkcg_policy->pd_init() after parent is linked Tejun Heo
2013-05-02  0:39 ` [PATCH 05/31] blkcg: move bulk of blkcg_gq release operations to the RCU callback Tejun Heo
2013-05-02  0:39 ` [PATCH 06/31] blk-throttle: remove spurious throtl_enqueue_tg() call from throtl_select_dispatch() Tejun Heo
2013-05-02  0:39 ` [PATCH 07/31] blk-throttle: removed deferred config application mechanism Tejun Heo
2013-05-02 14:49   ` Vivek Goyal
2013-05-02 17:27     ` Tejun Heo
2013-05-02  0:39 ` [PATCH 08/31] blk-throttle: collapse throtl_dispatch() into the work function Tejun Heo
2013-05-02  0:39 ` [PATCH 09/31] blk-throttle: relocate throtl_schedule_delayed_work() Tejun Heo
2013-05-02  0:39 ` [PATCH 10/31] blk-throttle: remove pointless throtl_nr_queued() optimizations Tejun Heo
2013-05-02  0:39 ` [PATCH 11/31] blk-throttle: rename throtl_rb_root to throtl_service_queue Tejun Heo
2013-05-02  0:39 ` [PATCH 12/31] blk-throttle: simplify throtl_grp flag handling Tejun Heo
2013-05-02  0:39 ` [PATCH 13/31] blk-throttle: add backlink pointer from throtl_grp to throtl_data Tejun Heo
2013-05-02  0:39 ` [PATCH 14/31] blk-throttle: pass around throtl_service_queue instead of throtl_data Tejun Heo
2013-05-02  0:39 ` [PATCH 15/31] blk-throttle: reorganize throtl_service_queue passed around as argument Tejun Heo
2013-05-02 15:21   ` Vivek Goyal
2013-05-02 17:29     ` Tejun Heo
2013-05-02  0:39 ` [PATCH 16/31] blk-throttle: add throtl_grp->service_queue Tejun Heo
2013-05-02  0:39 ` [PATCH 17/31] blk-throttle: move bio_lists[] and friends to throtl_service_queue Tejun Heo
2013-05-02  0:39 ` [PATCH 18/31] blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] Tejun Heo
2013-05-02  0:39 ` [PATCH 19/31] blk-throttle: generalize update_disptime optimization in blk_throtl_bio() Tejun Heo
2013-05-02  0:39 ` [PATCH 20/31] blk-throttle: add throtl_service_queue->parent_sq Tejun Heo
2013-05-02  0:39 ` [PATCH 21/31] blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() Tejun Heo
2013-05-06 17:36   ` Vivek Goyal
2013-05-06 18:38     ` Tejun Heo
2013-05-06 20:38     ` Tejun Heo
2013-05-06 20:39       ` Tejun Heo
2013-05-06 20:41       ` Vivek Goyal
2013-05-06 20:43         ` Tejun Heo
2013-05-02  0:39 ` [PATCH 22/31] blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it Tejun Heo
2013-05-02  0:39 ` [PATCH 23/31] blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work Tejun Heo
2013-05-02  0:39 ` [PATCH 24/31] blk-throttle: implement dispatch looping Tejun Heo
2013-05-02  0:39 ` [PATCH 25/31] blk-throttle: dispatch from throtl_pending_timer_fn() Tejun Heo
2013-05-02  0:39 ` [PATCH 26/31] blk-throttle: make blk_throtl_drain() ready for hierarchy Tejun Heo
2013-05-02  0:39 ` [PATCH 27/31] blk-throttle: make blk_throtl_bio() " Tejun Heo
2013-05-02  0:39 ` [PATCH 28/31] blk-throttle: make tg_dispatch_one_bio() " Tejun Heo
2013-05-02  0:39 ` [PATCH 29/31] blk-throttle: make throtl_pending_timer_fn() " Tejun Heo
2013-05-02  0:39 ` [PATCH 30/31] blk-throttle: implement throtl_grp->has_rules[] Tejun Heo
2013-05-02  0:39 ` [PATCH 31/31] blk-throttle: implement proper hierarchy support Tejun Heo
2013-05-02 17:34 ` [PATCHSET] " Vivek Goyal
2013-05-02 17:57   ` Tejun Heo
2013-05-02 18:17     ` Vivek Goyal
2013-05-02 18:29       ` Tejun Heo
2013-05-02 18:45         ` Vivek Goyal
2013-05-02 18:49           ` Tejun Heo
2013-05-02 19:07             ` Vivek Goyal
2013-05-02 19:11               ` Tejun Heo
2013-05-02 19:31                 ` Vivek Goyal
2013-05-02 23:13                   ` Tejun Heo
2013-05-03 17:56                     ` Vivek Goyal
2013-05-03 18:57                       ` Tejun Heo
2013-05-03 18:58                         ` Tejun Heo
2013-05-03 19:08                         ` Vivek Goyal
2013-05-03 19:14                           ` Tejun Heo
2013-05-03 19:26                             ` Vivek Goyal
2013-05-03 21:05                             ` Vivek Goyal
2013-05-03 23:54                               ` Tejun Heo
2013-05-06 17:33                                 ` Vivek Goyal
2013-05-02 18:08   ` Vivek Goyal
2013-05-02 18:44     ` Tejun Heo
2013-05-02 18:59       ` Vivek Goyal
2013-05-04  0:50 ` [PATCH 29.5/32] blk-throttle: add throtl_qnode for dispatch fairness Tejun Heo
2013-05-04  0:53   ` Tejun Heo
2013-05-06 16:00   ` Vivek Goyal
2013-05-06 18:35     ` Tejun Heo
