* [PATCHSET] block: implement blkcg hierarchy support in cfq
@ 2012-12-14 22:41 Tejun Heo
  2012-12-14 22:41 ` [PATCH 01/12] blkcg: fix minor bug in blkg_alloc() Tejun Heo
                   ` (12 more replies)
  0 siblings, 13 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal; +Cc: containers, cgroups, linux-kernel, ctalbott, rni

Hello,

cfq-iosched is currently utterly broken in how it handles cgroup
hierarchy.  It ignores the hierarchy structure and just treats every
blkcg equally.  This is simply broken.  This breakage makes blkcg
behave very differently from other properly-hierarchical controllers
and makes it impossible to give any uniform interpretation to the
hierarchy, which in turn makes it impossible to implement unified
hierarchy.

Given the relative simplicity of cfqg scheduling, implementing proper
hierarchy support isn't that difficult.  All that's necessary is
determining what fraction each cfqg on the service tree has claim to
when the hierarchy is taken into account.  The calculation can be done
by maintaining the sum of active weights at each level and compounding
the ratios from the cfqg in question up to the root.  The overhead
isn't significant: tree traversals happen only when cfqgs are added to
or removed from the service tree, and they go from the cfqg being
modified to the root.
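
As an illustration (weights made up), consider a two-level hierarchy
where the root has children A (weight 500) and B (weight 250), and A
has children AA (weight 500) and AB (weight 1000), with only the
leaves holding active tasks.  The active weight sums are then 750 at
the root level and 1500 under A, and compounding the per-level ratios
gives

  AA: (500  / 1500) * (500 / 750) =~ 0.2222
  AB: (1000 / 1500) * (500 / 750) =~ 0.4444
   B:                 (250 / 750) =~ 0.3333

ie. each cfqg's claim on the total disk time.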

There are some design choices which are worth mentioning.

* Internal (non-leaf) cfqgs w/ tasks treat the tasks as a single unit
  competing against the children cfqgs.  New config knobs -
  blkio.leaf_weight[_device] - are added to configure the weight of
  these tasks.  Another way to look at it is that each cfqg has a
  hidden leaf child node attached to it which hosts all tasks and
  leaf_weight controls the weight of that hidden node (illustrated
  below).

  Treating cfqqs and cfqgs as equals doesn't make much sense to me and
  is hairy - we need to establish ioprio to weight mapping and the
  weights fluctuate as processes fork and exit.  This becomes hairier
  when considering multiple controllers.  Such mappings can't be
  established consistently across different controllers and the
  weights are given out differently - ie. blkcg gives weights out to
  io_contexts while cpu gives them to tasks, which may share
  io_contexts.  It's difficult to make sense of what's going on.

  The goal is to bring cpu, currently the only other controller which
  implements weight-based resource allocation, to similar behavior.

* The existing stats aren't converted to be hierarchical; instead, new
  hierarchical ones are added.  There isn't a way to convert them w/o
  introducing nasty silent surprises for the existing flat hierarchy
  users, so while this is a bit clumsy, I can't see a better way.

* I based it on top of Vivek's cleanup patchset[1] but not the cfqq,
  cfqg scheduling unification patchset.  I don't think it's necessary
  or beneficial to mix the two and would really like to avoid messing
  with !blkcg scheduling logic.
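
To picture the hidden leaf node mentioned in the first point (weights
made up), a cfqg A which has tasks of its own plus children AA and AB
effectively schedules as

        A
       /|\
  AA:500  AB:1000  [leaf]:A->leaf_weight

where [leaf] is the implicit node hosting A's own tasks; it competes
against AA and AB using A's leaf_weight.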

The hierarchical scheduling itself is fairly simple.  The cfq part is
only ~260 lines with ~60 lines being comment, and the hierarchical
weight scaling is really straightforward.

This patchset contains the following 12 patches.

 0001-blkcg-fix-minor-bug-in-blkg_alloc.patch
 0002-blkcg-reorganize-blkg_lookup_create-and-friends.patch
 0003-blkcg-cosmetic-updates-to-blkg_create.patch
 0004-blkcg-make-blkcg_gq-s-hierarchical.patch
 0005-cfq-iosched-add-leaf_weight.patch
 0006-cfq-iosched-implement-cfq_group-nr_active-and-level_.patch
 0007-cfq-iosched-implement-hierarchy-ready-cfq_group-char.patch
 0008-cfq-iosched-convert-cfq_group_slice-to-use-cfqg-vfra.patch
 0009-cfq-iosched-enable-full-blkcg-hierarchy-support.patch
 0010-blkcg-add-blkg_policy_data-plid.patch
 0011-blkcg-implement-blkg_prfill_-rw-stat_recursive.patch
 0012-cfq-iosched-add-hierarchical-cfq_group-statistics.patch

0001-0003 are prep patches.

0004 makes blkcg core always allocate non-leaf blkgs so that any given
blkg is guaranteed to have all its ancestor blkgs up to the root.

0005-0006 prepare for hierarchical scheduling.

0007-0008 implement hierarchy-ready cfqg scheduling.

0009 enables hierarchical scheduling.

0010-0012 implement hierarchical stats.

This patchset is on top of

  linus#master (d42b3a2906a10b732ea7d7f849d49be79d242ef0)
+ [1] "cfq-iosched: Some minor cleanups" patchset

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy

Thanks.

 block/blk-cgroup.c  |  263 ++++++++++++++++++++++++++++++++++++-----
 block/blk-cgroup.h  |   26 +++-
 block/cfq-iosched.c |  329 ++++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 560 insertions(+), 58 deletions(-)

--
tejun

[1] https://lkml.org/lkml/2012/10/3/502


* [PATCH 01/12] blkcg: fix minor bug in blkg_alloc()
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 19:10   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends Tejun Heo
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
The latter test should have been on whether pol->pd_init_fn() exists.
This doesn't cause actual problems because both blkcg policies
implement pol->pd_init_fn().  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3f6d39d..feda49f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -114,7 +114,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		pd->blkg = blkg;
 
 		/* invoke per-policy init */
-		if (blkcg_policy_enabled(blkg->q, pol))
+		if (pol->pd_init_fn)
 			pol->pd_init_fn(blkg);
 	}
 
-- 
1.7.11.7



* [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
  2012-12-14 22:41 ` [PATCH 01/12] blkcg: fix minor bug in blkg_alloc() Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 19:28   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 03/12] blkcg: cosmetic updates to blkg_create() Tejun Heo
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Reorganize such that

* __blkg_lookup() takes bool param @update_hint to determine whether
  to update hint.

* __blkg_lookup_create() no longer performs lookup before trying to
  create.  Renamed to blkg_create().

* blkg_lookup_create() now performs lookup and then invokes
  blkg_create() if lookup fails.

* root_blkg creation in blkcg_activate_policy() updated accordingly.
  Note that blkcg_activate_policy() no longer updates lookup hint if
  root_blkg already exists.

Except for the last lookup hint bit which is immaterial, this is pure
reorganization and doesn't introduce any visible behavior change.
This is to prepare for proper hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index feda49f..ffbd237 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -126,7 +126,7 @@ err_free:
 }
 
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q)
+				      struct request_queue *q, bool update_hint)
 {
 	struct blkcg_gq *blkg;
 
@@ -135,14 +135,19 @@ static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 		return blkg;
 
 	/*
-	 * Hint didn't match.  Look up from the radix tree.  Note that we
-	 * may not be holding queue_lock and thus are not sure whether
-	 * @blkg from blkg_tree has already been removed or not, so we
-	 * can't update hint to the lookup result.  Leave it to the caller.
+	 * Hint didn't match.  Look up from the radix tree.  Note that the
+	 * hint can only be updated under queue_lock as otherwise @blkg
+	 * could have already been removed from blkg_tree.  The caller is
+	 * responsible for grabbing queue_lock if @update_hint.
 	 */
 	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
-	if (blkg && blkg->q == q)
+	if (blkg && blkg->q == q) {
+		if (update_hint) {
+			lockdep_assert_held(q->queue_lock);
+			rcu_assign_pointer(blkcg->blkg_hint, blkg);
+		}
 		return blkg;
+	}
 
 	return NULL;
 }
@@ -162,7 +167,7 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
 
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
-	return __blkg_lookup(blkcg, q);
+	return __blkg_lookup(blkcg, q, false);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
 
@@ -170,9 +175,9 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
  * If @new_blkg is %NULL, this function tries to allocate a new one as
  * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
  */
-static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q,
-					     struct blkcg_gq *new_blkg)
+static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
+				    struct request_queue *q,
+				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -180,13 +185,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
-	/* lookup and update hint on success, see __blkg_lookup() for details */
-	blkg = __blkg_lookup(blkcg, q);
-	if (blkg) {
-		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		goto out_free;
-	}
-
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
 		blkg = ERR_PTR(-EINVAL);
@@ -223,16 +221,39 @@ out_free:
 	return blkg;
 }
 
+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
+ * create one.  This function should be called under RCU read lock and
+ * @q->queue_lock.
+ *
+ * Returns pointer to the looked up or created blkg on success, ERR_PTR()
+ * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
+ * dead and bypassing, returns ERR_PTR(-EBUSY).
+ */
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 				    struct request_queue *q)
 {
+	struct blkcg_gq *blkg;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
+
 	/*
 	 * This could be the first entry point of blkcg implementation and
 	 * we shouldn't allow anything to go through for a bypassing queue.
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q, NULL);
+
+	blkg = __blkg_lookup(blkcg, q, true);
+	if (blkg)
+		return blkg;
+
+	return blkg_create(blkcg, q, NULL);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -777,7 +798,7 @@ int blkcg_activate_policy(struct request_queue *q,
 			  const struct blkcg_policy *pol)
 {
 	LIST_HEAD(pds);
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg, *new_blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
 	bool preloaded;
@@ -786,19 +807,27 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	/* preallocations for root blkg */
-	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
-	if (!blkg)
+	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!new_blkg)
 		return -ENOMEM;
 
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	blk_queue_bypass_start(q);
 
-	/* make sure the root blkg exists and count the existing blkgs */
+	/*
+	 * Make sure the root blkg exists and count the existing blkgs.  As
+	 * @q is bypassing at this point, blkg_lookup_create() can't be
+	 * used.  Open code it.
+	 */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
+	blkg = __blkg_lookup(&blkcg_root, q, false);
+	if (blkg)
+		blkg_free(new_blkg);
+	else
+		blkg = blkg_create(&blkcg_root, q, new_blkg);
 	rcu_read_unlock();
 
 	if (preloaded)
-- 
1.7.11.7



* [PATCH 03/12] blkcg: cosmetic updates to blkg_create()
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
  2012-12-14 22:41 ` [PATCH 01/12] blkcg: fix minor bug in blkg_alloc() Tejun Heo
  2012-12-14 22:41 ` [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 19:37   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 04/12] blkcg: make blkcg_gq's hierarchical Tejun Heo
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

* Rename out_* labels to err_*.

* Do ERR_PTR() conversion once in the error return path.

This patch is cosmetic and to prepare for the hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ffbd237..fde2286 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -187,16 +187,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
-		blkg = ERR_PTR(-EINVAL);
-		goto out_free;
+		ret = -EINVAL;
+		goto err_free_blkg;
 	}
 
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
-			blkg = ERR_PTR(-ENOMEM);
-			goto out_put;
+			ret = -ENOMEM;
+			goto err_put_css;
 		}
 	}
 	blkg = new_blkg;
@@ -213,12 +213,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (!ret)
 		return blkg;
 
-	blkg = ERR_PTR(ret);
-out_put:
+err_put_css:
 	css_put(&blkcg->css);
-out_free:
+err_free_blkg:
 	blkg_free(new_blkg);
-	return blkg;
+	return ERR_PTR(ret);
 }
 
 /**
-- 
1.7.11.7



* [PATCH 04/12] blkcg: make blkcg_gq's hierarchical
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (2 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 03/12] blkcg: cosmetic updates to blkg_create() Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 20:04   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 05/12] cfq-iosched: add leaf_weight Tejun Heo
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Currently a child blkg (blkcg_gq) can be created even if its parent
doesn't exist - ie. given a blkg, it's not guaranteed that its
ancestors will exist.  This makes it difficult to implement proper
hierarchy support for blkcg policies.

Always create blkgs recursively and make a child blkg hold a reference
to its parent.  blkg->parent is added so that finding the parent is
easy.  blkcg_parent() is also added in the process.
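
For example, if IO is issued in a nested cgroup /a/b on a queue which
so far only has the root blkg, a's blkg is created first and then
b's, with each child holding a reference on its parent.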

This change can be visible to userland.  e.g. while issuing IO in a
nested cgroup previously didn't affect the ancestors at all, it now
initializes all ancestor blkgs, and zero stats for the request_queue
will always appear on them.  While this is userland visible, it
shouldn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 42 +++++++++++++++++++++++++++++++++++++-----
 block/blk-cgroup.h | 18 ++++++++++++++++++
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fde2286..c858628 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -201,7 +201,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	}
 	blkg = new_blkg;
 
-	/* insert */
+	/* link parent and insert */
+	if (blkcg_parent(blkcg)) {
+		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
+		if (WARN_ON_ONCE(!blkg->parent)) {
+			blkg = ERR_PTR(-EINVAL);
+			goto err_put_css;
+		}
+		blkg_get(blkg->parent);
+	}
+
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
@@ -213,6 +222,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (!ret)
 		return blkg;
 
+	/* @blkg failed to be fully initialized, use the usual release path */
+	blkg_put(blkg);
+	return ERR_PTR(ret);
+
 err_put_css:
 	css_put(&blkcg->css);
 err_free_blkg:
@@ -226,8 +239,9 @@ err_free_blkg:
  * @q: request_queue of interest
  *
  * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
- * create one.  This function should be called under RCU read lock and
- * @q->queue_lock.
+ * create one.  blkg creation is performed recursively from blkcg_root such
+ * that all non-root blkg's have access to the parent blkg.  This function
+ * should be called under RCU read lock and @q->queue_lock.
  *
  * Returns pointer to the looked up or created blkg on success, ERR_PTR()
  * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
@@ -252,7 +266,23 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 	if (blkg)
 		return blkg;
 
-	return blkg_create(blkcg, q, NULL);
+	/*
+	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
+	 * non-root blkgs have access to their parents.
+	 */
+	while (true) {
+		struct blkcg *pos = blkcg;
+		struct blkcg *parent = blkcg_parent(blkcg);
+
+		while (parent && !__blkg_lookup(parent, q, false)) {
+			pos = parent;
+			parent = blkcg_parent(parent);
+		}
+
+		blkg = blkg_create(pos, q, NULL);
+		if (pos == blkcg || IS_ERR(blkg))
+			return blkg;
+	}
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -321,8 +351,10 @@ static void blkg_rcu_free(struct rcu_head *rcu_head)
 
 void __blkg_release(struct blkcg_gq *blkg)
 {
-	/* release the extra blkcg reference this blkg has been holding */
+	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
+	if (blkg->parent)
+		blkg_put(blkg->parent);
 
 	/*
 	 * A group is freed in rcu manner. But having an rcu lock does not
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2459730..b26ed58 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -94,8 +94,13 @@ struct blkcg_gq {
 	struct list_head		q_node;
 	struct hlist_node		blkcg_node;
 	struct blkcg			*blkcg;
+
+	/* all non-root blkcg_gq's are guaranteed to have access to parent */
+	struct blkcg_gq			*parent;
+
 	/* request allocation list for this blkcg-q pair */
 	struct request_list		rl;
+
 	/* reference count */
 	int				refcnt;
 
@@ -181,6 +186,19 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
 }
 
 /**
+ * blkcg_parent - get the parent of a blkcg
+ * @blkcg: blkcg of interest
+ *
+ * Return the parent blkcg of @blkcg.  Can be called anytime.
+ */
+static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
+{
+	struct cgroup *pcg = blkcg->css.cgroup->parent;
+
+	return pcg ? cgroup_to_blkcg(pcg) : NULL;
+}
+
+/**
  * blkg_to_pdata - get policy private data
  * @blkg: blkg of interest
  * @pol: policy of interest
-- 
1.7.11.7



* [PATCH 05/12] cfq-iosched: add leaf_weight
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (3 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 04/12] blkcg: make blkcg_gq's hierarchical Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-14 22:41 ` [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight Tejun Heo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

cfq blkcg is about to grow proper hierarchy handling, where a child
blkg's weight would nest inside the parent's.  This makes tasks in a
blkg compete against both the tasks in the sibling blkgs and the
tasks of the child blkgs.

We're gonna use the existing weight as the group weight which decides
the blkg's weight against its siblings.  This patch introduces a new
weight - leaf_weight - which decides the weight of a blkg against the
child blkgs.

It's named leaf_weight because another way to look at it is that each
internal blkg node has a hidden child leaf node which contains all
its tasks, and leaf_weight is the weight of that leaf node, handled
the same as the weights of the child blkgs.
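
For example (weights made up), if a cfqg has leaf_weight 200 and two
children with weight 100 each, its own tasks as a group get
200 / (200 + 100 + 100) = half of the cfqg's share, and each child
gets a quarter.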

This patch only adds leaf_weight fields and exposes it to userland.
The new weight isn't actually used anywhere yet.  Note that
cfq-iosched currently officially supports only a single level of
hierarchy and root blkgs compete with the first level blkgs - ie. the
root weight is basically being used as leaf_weight.  For root blkgs,
the two weights are kept in sync for backward compatibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c  |   4 +-
 block/blk-cgroup.h  |   1 +
 block/cfq-iosched.c | 133 ++++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 129 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c858628..16ea5a1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -26,7 +26,8 @@
 
 static DEFINE_MUTEX(blkcg_pol_mutex);
 
-struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT };
+struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT,
+			    .cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
 EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
@@ -710,6 +711,7 @@ static struct cgroup_subsys_state *blkcg_css_alloc(struct cgroup *cgroup)
 		return ERR_PTR(-ENOMEM);
 
 	blkcg->cfq_weight = CFQ_WEIGHT_DEFAULT;
+	blkcg->cfq_leaf_weight = CFQ_WEIGHT_DEFAULT;
 	blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
 done:
 	spin_lock_init(&blkcg->lock);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index b26ed58..2446225 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -54,6 +54,7 @@ struct blkcg {
 
 	/* TODO: per-policy storage in blkcg */
 	unsigned int			cfq_weight;	/* belongs to cfq */
+	unsigned int			cfq_leaf_weight;
 };
 
 struct blkg_stat {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 2f2b215..5f23763 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -223,10 +223,21 @@ struct cfq_group {
 
 	/* group service_tree key */
 	u64 vdisktime;
+
+	/*
+	 * There are two weights - (internal) weight is the weight of this
+	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
+	 * this cfqg against the child cfqgs.  For the root cfqg, both
+	 * weights are kept in sync for backward compatibility.
+	 */
 	unsigned int weight;
 	unsigned int new_weight;
 	unsigned int dev_weight;
 
+	unsigned int leaf_weight;
+	unsigned int new_leaf_weight;
+	unsigned int dev_leaf_weight;
+
 	/* number of cfqq currently on this group */
 	int nr_cfqq;
 
@@ -1182,10 +1193,16 @@ static void
 cfq_update_group_weight(struct cfq_group *cfqg)
 {
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
 	if (cfqg->new_weight) {
 		cfqg->weight = cfqg->new_weight;
 		cfqg->new_weight = 0;
 	}
+
+	if (cfqg->new_leaf_weight) {
+		cfqg->leaf_weight = cfqg->new_leaf_weight;
+		cfqg->new_leaf_weight = 0;
+	}
 }
 
 static void
@@ -1348,6 +1365,7 @@ static void cfq_pd_init(struct blkcg_gq *blkg)
 
 	cfq_init_cfqg_base(cfqg);
 	cfqg->weight = blkg->blkcg->cfq_weight;
+	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
 }
 
 /*
@@ -1404,6 +1422,26 @@ static int cfqg_print_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 cfqg_prfill_leaf_weight_device(struct seq_file *sf,
+					  struct blkg_policy_data *pd, int off)
+{
+	struct cfq_group *cfqg = pd_to_cfqg(pd);
+
+	if (!cfqg->dev_leaf_weight)
+		return 0;
+	return __blkg_prfill_u64(sf, pd, cfqg->dev_leaf_weight);
+}
+
+static int cfqg_print_leaf_weight_device(struct cgroup *cgrp,
+					 struct cftype *cft,
+					 struct seq_file *sf)
+{
+	blkcg_print_blkgs(sf, cgroup_to_blkcg(cgrp),
+			  cfqg_prfill_leaf_weight_device, &blkcg_policy_cfq, 0,
+			  false);
+	return 0;
+}
+
 static int cfq_print_weight(struct cgroup *cgrp, struct cftype *cft,
 			    struct seq_file *sf)
 {
@@ -1411,8 +1449,16 @@ static int cfq_print_weight(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
-static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
-				  const char *buf)
+static int cfq_print_leaf_weight(struct cgroup *cgrp, struct cftype *cft,
+				 struct seq_file *sf)
+{
+	seq_printf(sf, "%u\n",
+		   cgroup_to_blkcg(cgrp)->cfq_leaf_weight);
+	return 0;
+}
+
+static int __cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				    const char *buf, bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkg_conf_ctx ctx;
@@ -1426,8 +1472,13 @@ static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	ret = -EINVAL;
 	cfqg = blkg_to_cfqg(ctx.blkg);
 	if (!ctx.v || (ctx.v >= CFQ_WEIGHT_MIN && ctx.v <= CFQ_WEIGHT_MAX)) {
-		cfqg->dev_weight = ctx.v;
-		cfqg->new_weight = cfqg->dev_weight ?: blkcg->cfq_weight;
+		if (!is_leaf_weight) {
+			cfqg->dev_weight = ctx.v;
+			cfqg->new_weight = ctx.v ?: blkcg->cfq_weight;
+		} else {
+			cfqg->dev_leaf_weight = ctx.v;
+			cfqg->new_leaf_weight = ctx.v ?: blkcg->cfq_leaf_weight;
+		}
 		ret = 0;
 	}
 
@@ -1435,7 +1486,20 @@ static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	return ret;
 }
 
-static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				  const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, false);
+}
+
+static int cfqg_set_leaf_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				       const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, true);
+}
+
+static int __cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val,
+			    bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkcg_gq *blkg;
@@ -1445,19 +1509,41 @@ static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		return -EINVAL;
 
 	spin_lock_irq(&blkcg->lock);
-	blkcg->cfq_weight = (unsigned int)val;
+
+	if (!is_leaf_weight)
+		blkcg->cfq_weight = val;
+	else
+		blkcg->cfq_leaf_weight = val;
 
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		struct cfq_group *cfqg = blkg_to_cfqg(blkg);
 
-		if (cfqg && !cfqg->dev_weight)
-			cfqg->new_weight = blkcg->cfq_weight;
+		if (!cfqg)
+			continue;
+
+		if (!is_leaf_weight) {
+			if (!cfqg->dev_weight)
+				cfqg->new_weight = blkcg->cfq_weight;
+		} else {
+			if (!cfqg->dev_leaf_weight)
+				cfqg->new_leaf_weight = blkcg->cfq_leaf_weight;
+		}
 	}
 
 	spin_unlock_irq(&blkcg->lock);
 	return 0;
 }
 
+static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, false);
+}
+
+static int cfq_set_leaf_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, true);
+}
+
 static int cfqg_print_stat(struct cgroup *cgrp, struct cftype *cft,
 			   struct seq_file *sf)
 {
@@ -1518,6 +1604,37 @@ static struct cftype cfq_blkcg_files[] = {
 		.read_seq_string = cfq_print_weight,
 		.write_u64 = cfq_set_weight,
 	},
+
+	/* on root, leaf_weight is mapped to weight */
+	{
+		.name = "leaf_weight_device",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfqg_print_weight_device,
+		.write_string = cfqg_set_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "leaf_weight",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfq_print_weight,
+		.write_u64 = cfq_set_weight,
+	},
+
+	/* no such mapping necessary for !roots */
+	{
+		.name = "leaf_weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "leaf_weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
+	},
+
 	{
 		.name = "time",
 		.private = offsetof(struct cfq_group, stats.time),
-- 
1.7.11.7



* [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (4 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 05/12] cfq-iosched: add leaf_weight Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 20:46   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling Tejun Heo
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

To prepare for blkcg hierarchy support, add cfqg->nr_active and
->level_weight.  cfqg->nr_active counts the number of active cfqgs at
the cfqg's level and ->level_weight is the sum of the weights of
those cfqgs.  The level covers itself (cfqg->leaf_weight) and its
immediate children.

The two values are updated when a cfqg enters and leaves the group
service tree.  Unless the hierarchy is very deep, the added overhead
should be negligible.
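
For example (weights made up), once real parents are in use (patch
0009; cfqg_flat_parent() currently returns the root for everyone),
activating a previously idle cfqg AA under an idle parent A does:

  AA->nr_active 0 -> 1;    AA->level_weight += AA->leaf_weight
   A->nr_active 0 -> 1;     A->level_weight += AA->weight
  root->nr_active 0 -> 1;  root->level_weight += A->weight

If an ancestor is already active, it still receives the newly
activated child's weight but the propagation stops there, as the
ancestor's own weight is already accounted for further up.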

Currently, the parent is determined using cfqg_flat_parent() which
makes the root cfqg the parent of all other cfqgs.  This is to make
the transition to hierarchy-aware scheduling gradual.  Scheduling
logic will be converted to use cfqg->level_weight without actually
changing the behavior.  When everything is ready, cfqg_flat_parent()
will be replaced with a proper parent function.

This patch doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5f23763..eb290a0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -225,6 +225,18 @@ struct cfq_group {
 	u64 vdisktime;
 
 	/*
+	 * The number of active cfqgs and sum of their weights under this
+	 * cfqg.  This covers this cfqg's leaf_weight and all children's
+	 * weights, but does not cover weights of further descendants.
+	 *
+	 * If a cfqg is on the service tree, it's active.  An active cfqg
+	 * also activates its parent and contributes to the level_weight of
+	 * the parent.
+	 */
+	int nr_active;
+	unsigned int level_weight;
+
+	/*
 	 * There are two weights - (internal) weight is the weight of this
 	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
 	 * this cfqg against the child cfqgs.  For the root cfqg, both
@@ -583,6 +595,22 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
+/*
+ * Determine the parent cfqg for weight calculation.  Currently, cfqg
+ * scheduling is flat and the root is the parent of everyone else.
+ */
+static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+{
+	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
+	struct cfq_group *root;
+
+	while (blkg->parent)
+		blkg = blkg->parent;
+	root = blkg_to_cfqg(blkg);
+
+	return root != cfqg ? root : NULL;
+}
+
 static inline void cfqg_get(struct cfq_group *cfqg)
 {
 	return blkg_get(cfqg_to_blkg(cfqg));
@@ -683,6 +711,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
+static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1208,11 +1237,33 @@ cfq_update_group_weight(struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	struct cfq_group *pos = cfqg;
+	bool propagate;
+
+	/* add to the service tree */
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
 
 	cfq_update_group_weight(cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
 	st->total_weight += cfqg->weight;
+
+	/*
+	 * Activate @cfqg and propagate activation upwards until we meet an
+	 * already activated node or reach root.
+	 */
+	propagate = !pos->nr_active++;
+	pos->level_weight += pos->leaf_weight;
+
+	while (propagate) {
+		struct cfq_group *parent = cfqg_flat_parent(pos);
+
+		if (!parent)
+			break;
+
+		propagate = !parent->nr_active++;
+		parent->level_weight += pos->weight;
+		pos = parent;
+	}
 }
 
 static void
@@ -1243,6 +1294,31 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	struct cfq_group *pos = cfqg;
+	bool propagate;
+
+	/*
+	 * Undo activation from cfq_group_service_tree_add().  Deactivate
+	 * @cfqg and propagate deactivation upwards.
+	 */
+	propagate = !--pos->nr_active;
+	pos->level_weight -= pos->leaf_weight;
+
+	while (propagate) {
+		struct cfq_group *parent = cfqg_flat_parent(pos);
+
+		/* @pos has 0 nr_active at this point */
+		WARN_ON_ONCE(pos->level_weight);
+
+		if (!parent)
+			break;
+
+		propagate = !--parent->nr_active;
+		parent->level_weight -= pos->weight;
+		pos = parent;
+	}
+
+	/* remove from the service tree */
 	st->total_weight -= cfqg->weight;
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
-- 
1.7.11.7



* [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (5 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-17 20:53   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 08/12] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction Tejun Heo
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Currently, cfqg charges are scaled directly according to cfqg->weight.
Regardless of the number of active cfqgs or the amount of active
weights, a given weight value always scales charge the same way.  This
works fine as long as all cfqgs are treated equally regardless of
their positions in the hierarchy, which is what cfq currently
implements.  It can't work in hierarchical settings because the
interpretation of a given weight value depends on where the weight is
located in the hierarchy.

This patch reimplements cfqg charge scaling so that it can be used to
support hierarchy properly.  The scheme is fairly simple and
light-weight.

* When a cfqg is added to the service tree, its vfraction is
  calculated.  The code walks up the tree to the root, calculating the
  fraction the cfqg has at each level as

    cfqg->weight / parent->level_weight

  By compounding these ratios, the global fraction of vdisktime the
  cfqg has claim to - the vfraction - can be determined.

* When the cfqg needs to be charged, the charge is scaled in inverse
  proportion to the vfraction.
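
For example (numbers made up), a cfqg whose vfraction works out to
0.25 and which consumes 10ms of disk time is charged

  scaled = 10ms / 0.25 = 40ms

of vdisktime - owning a quarter of the hierarchy makes vdisktime
advance four times as fast as actual usage.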

The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
representation as before; however, the smallest scaling factor is now
1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before, where
1 was for CFQ_WEIGHT_DEFAULT and a higher weight would result in a
smaller scaling factor.

While this shifts the global scale of vdisktime a bit, it doesn't
change the relative relationships among cfqgs and the scheduling
result isn't different.

cfq_group_notify_queue_add uses a fixed CFQ_IDLE_DELAY when appending
a new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
didn't have any relevance to vdisktime before and is unlikely to cause
any visible behavior difference now, especially as the scale shift
isn't that large.

As the new scheme now makes a proper distinction between cfqg->weight
and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
root, both weights are now mapped to ->leaf_weight instead of the
other way around.

Because we're still using cfqg_flat_parent(), this patch shouldn't
change the scheduling behavior in any noticeable way.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 103 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 30 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index eb290a0..663a0f0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -237,6 +237,15 @@ struct cfq_group {
 	unsigned int level_weight;
 
 	/*
+	 * vfraction is the fraction of vdisktime that a cfqg is entitled
+	 * to.  It is in fixed point w/ CFQ_SERVICE_SHIFT and the sum of
+	 * all vfractions on a service tree is approximately 1.  The sum
+	 * may deviate a bit due to rounding errors and fluctuations caused
+	 * by cfqgs entering and leaving the service tree.
+	 */
+	unsigned int vfraction;
+
+	/*
 	 * There are two weights - (internal) weight is the weight of this
 	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
 	 * this cfqg against the child cfqgs.  For the root cfqg, both
@@ -891,13 +900,27 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+/**
+ * cfqg_scale_charge - scale disk time charge according to cfqg weight
+ * @charge: disk time being charged
+ * @vfraction: vfraction of the cfqg, fixed point w/ CFQ_SERVICE_SHIFT
+ *
+ * Scale @charge according to @vfraction, which is in range (0, 1].  The
+ * scaling is inversely proportional.
+ *
+ * scaled = charge / vfraction
+ *
+ * The result is also in fixed point w/ CFQ_SERVICE_SHIFT.
+ */
+static inline u64 cfqg_scale_charge(unsigned long charge,
+				    unsigned int vfraction)
 {
-	u64 d = delta << CFQ_SERVICE_SHIFT;
+	u64 c = charge << CFQ_SERVICE_SHIFT;	/* make it fixed point */
 
-	d = d * CFQ_WEIGHT_DEFAULT;
-	do_div(d, cfqg->weight);
-	return d;
+	/* charge / vfraction */
+	c <<= CFQ_SERVICE_SHIFT;
+	do_div(c, vfraction);
+	return c;
 }
 
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -1237,7 +1260,9 @@ cfq_update_group_weight(struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	unsigned int vfr = 1 << CFQ_SERVICE_SHIFT;	/* start with 1 */
 	struct cfq_group *pos = cfqg;
+	struct cfq_group *parent;
 	bool propagate;
 
 	/* add to the service tree */
@@ -1248,22 +1273,33 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	st->total_weight += cfqg->weight;
 
 	/*
-	 * Activate @cfqg and propagate activation upwards until we meet an
-	 * already activated node or reach root.
+	 * Activate @cfqg and calculate the portion of vfraction @cfqg is
+	 * entitled to.  vfraction is calculated by walking the tree
+	 * towards the root calculating the fraction it has at each level.
+	 * The compounded ratio is how much vfraction @cfqg owns.
+	 *
+	 * Start with activating and calculating vfraction for @cfqg.
 	 */
 	propagate = !pos->nr_active++;
 	pos->level_weight += pos->leaf_weight;
+	vfr = vfr * pos->leaf_weight / pos->level_weight;
 
-	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
-
-		if (!parent)
-			break;
-
-		propagate = !parent->nr_active++;
-		parent->level_weight += pos->weight;
+	/*
+	 * Walk up the tree.  Both activation and vfraction calculation are
+	 * done in the same loop.  Propagation stops once an already
+	 * activated node is met.  vfraction calculation should always
+	 * continue to the root.
+	 */
+	while ((parent = cfqg_flat_parent(pos))) {
+		if (propagate) {
+			propagate = !parent->nr_active++;
+			parent->level_weight += pos->weight;
+		}
+		vfr = vfr * pos->weight / parent->level_weight;
 		pos = parent;
 	}
+
+	cfqg->vfraction = max_t(unsigned, vfr, 1);
 }
 
 static void
@@ -1309,6 +1345,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->level_weight);
+		pos->vfraction = 0;
 
 		if (!parent)
 			break;
@@ -1381,6 +1418,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	unsigned int used_sl, charge, unaccounted_sl = 0;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
+	unsigned int vfr;
 
 	BUG_ON(nr_sync < 0);
 	used_sl = charge = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);
@@ -1390,10 +1428,15 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;
 
-	/* Can't update vdisktime while group is on service tree */
+	/*
+	 * Can't update vdisktime while on service tree and cfqg->vfraction
+	 * is valid only while on it.  Cache vfr, leave the service tree,
+	 * update vdisktime and go back on.  The re-addition to the tree
+	 * will also update the weights as necessary.
+	 */
+	vfr = cfqg->vfraction;
 	cfq_group_service_tree_del(st, cfqg);
-	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
-	/* If a new weight was requested, update now, off tree */
+	cfqg->vdisktime += cfqg_scale_charge(charge, vfr);
 	cfq_group_service_tree_add(st, cfqg);
 
 	/* This group is being expired. Save the context */
@@ -1669,44 +1712,44 @@ static int cfqg_print_avg_queue_size(struct cgroup *cgrp, struct cftype *cft,
 #endif	/* CONFIG_DEBUG_BLK_CGROUP */
 
 static struct cftype cfq_blkcg_files[] = {
+	/* on root, weight is mapped to leaf_weight */
 	{
 		.name = "weight_device",
-		.read_seq_string = cfqg_print_weight_device,
-		.write_string = cfqg_set_weight_device,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
 		.max_write_len = 256,
 	},
 	{
 		.name = "weight",
-		.read_seq_string = cfq_print_weight,
-		.write_u64 = cfq_set_weight,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
 	},
 
-	/* on root, leaf_weight is mapped to weight */
+	/* no such mapping necessary for !roots */
 	{
-		.name = "leaf_weight_device",
-		.flags = CFTYPE_ONLY_ON_ROOT,
+		.name = "weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfqg_print_weight_device,
 		.write_string = cfqg_set_weight_device,
 		.max_write_len = 256,
 	},
 	{
-		.name = "leaf_weight",
-		.flags = CFTYPE_ONLY_ON_ROOT,
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfq_print_weight,
 		.write_u64 = cfq_set_weight,
 	},
 
-	/* no such mapping necessary for !roots */
 	{
 		.name = "leaf_weight_device",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfqg_print_leaf_weight_device,
 		.write_string = cfqg_set_leaf_weight_device,
 		.max_write_len = 256,
 	},
 	{
 		.name = "leaf_weight",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfq_print_leaf_weight,
 		.write_u64 = cfq_set_leaf_weight,
 	},
-- 
1.7.11.7



* [PATCH 08/12] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (6 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-14 22:41 ` [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

cfq_group_slice() calculates the slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg->weight against
service_tree->total_weight.  This currently works only because all
cfqgs are treated as being at the same level.

To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
the hierarchy considered, the slice can simply be calculated by
multiplying cfq_target_latency by cfqg->vfraction (with the fixed
point shift factored in).

As the vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce a noticeable
behavior difference.
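
For example (numbers made up), with cfq_target_latency at 300ms, a
cfqg whose vfraction is 0.2 is allotted

  300ms * 0.2 = 60ms

as its slice, regardless of its depth in the hierarchy.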

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 663a0f0..fd2f4b4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -85,7 +85,6 @@ struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
 	unsigned count;
-	unsigned total_weight;
 	u64 min_vdisktime;
 	struct cfq_ttime ttime;
 };
@@ -976,9 +975,7 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
 static inline unsigned
 cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-
-	return cfqd->cfq_target_latency * cfqg->weight / st->total_weight;
+	return cfqd->cfq_target_latency * cfqg->vfraction >> CFQ_SERVICE_SHIFT;
 }
 
 static inline unsigned
@@ -1270,7 +1267,6 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 
 	cfq_update_group_weight(cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
-	st->total_weight += cfqg->weight;
 
 	/*
 	 * Activate @cfqg and calculate the portion of vfraction @cfqg is
@@ -1356,7 +1352,6 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	}
 
 	/* remove from the service tree */
-	st->total_weight -= cfqg->weight;
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
 }
-- 
1.7.11.7



* [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (7 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 08/12] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-18 18:40   ` Vivek Goyal
  2012-12-14 22:41 ` [PATCH 10/12] blkcg: add blkg_policy_data->plid Tejun Heo
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support.  The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes the
vfraction calculation treat all non-root cfqgs as children of the
root cfqg.

Replace it with cfqg_parent() which returns the real parent.  This
enables full blkcg hierarchy support for cfq-iosched.  For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree.  For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd2f4b4..ceade6e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -603,20 +603,11 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
-/*
- * Determine the parent cfqg for weight calculation.  Currently, cfqg
- * scheduling is flat and the root is the parent of everyone else.
- */
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
 {
-	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
-	struct cfq_group *root;
-
-	while (blkg->parent)
-		blkg = blkg->parent;
-	root = blkg_to_cfqg(blkg);
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
 
-	return root != cfqg ? root : NULL;
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
 }
 
 static inline void cfqg_get(struct cfq_group *cfqg)
@@ -719,7 +710,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1286,7 +1277,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	 * activated node is met.  vfraction calculation should always
 	 * continue to the root.
 	 */
-	while ((parent = cfqg_flat_parent(pos))) {
+	while ((parent = cfqg_parent(pos))) {
 		if (propagate) {
 			propagate = !parent->nr_active++;
 			parent->level_weight += pos->weight;
@@ -1337,7 +1328,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	pos->level_weight -= pos->leaf_weight;
 
 	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
+		struct cfq_group *parent = cfqg_parent(pos);
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->level_weight);
-- 
1.7.11.7



* [PATCH 10/12] blkcg: add blkg_policy_data->plid
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (8 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-14 22:41 ` [PATCH 11/12] blkcg: implement blkg_prfill_[rw]stat_recursive() Tejun Heo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Add pd->plid so that the policy a pd belongs to can be identified
easily.  This will be used to implement hierarchical blkg_[rw]stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 2 ++
 block/blk-cgroup.h | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 16ea5a1..fbf96ce 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -113,6 +113,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
+		pd->plid = i;
 
 		/* invoke per-policy init */
 		if (pol->pd_init_fn)
@@ -908,6 +909,7 @@ int blkcg_activate_policy(struct request_queue *q,
 
 		blkg->pd[pol->plid] = pd;
 		pd->blkg = blkg;
+		pd->plid = pol->plid;
 		pol->pd_init_fn(blkg);
 
 		spin_unlock(&blkg->blkcg->lock);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2446225..40f5b97 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -81,8 +81,9 @@ struct blkg_rwstat {
  * beginning and pd_size can't be smaller than pd.
  */
 struct blkg_policy_data {
-	/* the blkg this per-policy data belongs to */
+	/* the blkg and policy id this per-policy data belongs to */
 	struct blkcg_gq			*blkg;
+	int				plid;
 
 	/* used during policy activation */
 	struct list_head		alloc_node;
-- 
1.7.11.7



* [PATCH 11/12] blkcg: implement blkg_prfill_[rw]stat_recursive()
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (9 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 10/12] blkcg: add blkg_policy_data->plid Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-14 22:41 ` [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
  2012-12-17 16:52 ` [PATCHSET] block: implement blkcg hierarchy support in cfq Vivek Goyal
  12 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

blkg_prfill_[rw]stat_recursive() collect the specified [rw]stats from
the target policy_data and its descendants and print the sum.  They
can be used to make stat printing hierarchical.

blkg_for_each_descendant_pre() and blkg_[rw]stat_collect() are added
to implement the new prfill functions.  While at it, a function
comment is added to __blkg_lookup() which is now also used by the
for_each macro.
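
As a minimal sketch of how a policy might hook these up (modeled on
the existing cfqg_print_*() helpers; cfqg_print_stat_recursive is a
hypothetical name here, with cft->private holding the stat offset):

  static int cfqg_print_stat_recursive(struct cgroup *cgrp,
                                       struct cftype *cft,
                                       struct seq_file *sf)
  {
          /* sum the blkg_stat at cft->private over the subtree and print */
          blkcg_print_blkgs(sf, cgroup_to_blkcg(cgrp),
                            blkg_prfill_stat_recursive, &blkcg_policy_cfq,
                            cft->private, false);
          return 0;
  }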

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-cgroup.h |   4 ++
 2 files changed, 133 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fbf96ce..8e9caf0 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -32,6 +32,26 @@ EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
+static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
+				      struct request_queue *q, bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @d_blkg through the descendants of @p_blkg.  Must be used with RCU
+ * read locked.  If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs.  The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
+					      (p_blkg)->q, false)))
+
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
@@ -127,6 +147,17 @@ err_free:
 	return NULL;
 }
 
+/**
+ * __blkg_lookup - internal version of blkg_lookup()
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ * @update_hint: whether to update lookup hint with the result or not
+ *
+ * This is internal version and shouldn't be used by policy
+ * implementations.  Looks up blkgs for the @blkcg - @q pair regardless of
+ * @q's bypass state.  If @update_hint is %true, the caller should be
+ * holding @q->queue_lock and lookup hint is updated on success.
+ */
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 				      struct request_queue *q, bool update_hint)
 {
@@ -568,6 +599,104 @@ u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
 
 /**
+ * blkg_stat_collect - collect hierarchical blkg_stat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Collect the blkg_stat specified by @off from @pd and all its descendants
+ * and return the sum.
+ */
+static u64 blkg_stat_collect(struct blkg_policy_data *pd, int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	u64 sum;
+
+	sum = blkg_stat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_stat *stat = (void *)pos_pd + off;
+
+		sum += blkg_stat_read(stat);
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+
+/**
+ * blkg_rwstat_collect - collect hierarchical blkg_rwstat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_rwstat in @pd
+ *
+ * Collect the blkg_rwstat specified by @off from @pd and all its
+ * descendants and return the sum.
+ */
+static struct blkg_rwstat blkg_rwstat_collect(struct blkg_policy_data *pd,
+					      int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	struct blkg_rwstat sum;
+	int i;
+
+	sum = blkg_rwstat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_rwstat *rwstat = (void *)pos_pd + off;
+		struct blkg_rwstat tmp;
+
+		tmp = blkg_rwstat_read(rwstat);
+
+		for (i = 0; i < BLKG_RWSTAT_NR; i++)
+			sum.cnt[i] += tmp.cnt[i];
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+
+/**
+ * blkg_prfill_stat_recursive - recursive prfill callback for blkg_stat
+ * @sf: seq_file to print to
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Recursive prfill callback for printing a blkg_stat.  Sums the blkg_stat
+ * specified by @off of @pd and its descendants and prints it.
+ */
+u64 blkg_prfill_stat_recursive(struct seq_file *sf,
+			       struct blkg_policy_data *pd, int off)
+{
+	return __blkg_prfill_u64(sf, pd, blkg_stat_collect(pd, off));
+}
+EXPORT_SYMBOL_GPL(blkg_prfill_stat_recursive);
+
+/**
+ * blkg_prfill_rwstat_recursive - recursive prfill callback for blkg_rwstat
+ * @sf: seq_file to print to
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_rwstat in @pd
+ *
+ * Recursive prfill callback for printing a blkg_rwstat.  Sums the
+ * blkg_rwstat specified by @off of @pd and its descendants and prints it.
+ */
+u64 blkg_prfill_rwstat_recursive(struct seq_file *sf,
+				 struct blkg_policy_data *pd, int off)
+{
+	struct blkg_rwstat sum = blkg_rwstat_collect(pd, off);
+
+	return __blkg_prfill_rwstat(sf, pd, &sum);
+}
+EXPORT_SYMBOL_GPL(blkg_prfill_rwstat_recursive);
+
+/**
  * blkg_conf_prep - parse and prepare for per-blkg config update
  * @blkcg: target block cgroup
  * @pol: target policy
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 40f5b97..747bcc5 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -156,6 +156,10 @@ u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 		       int off);
+u64 blkg_prfill_stat_recursive(struct seq_file *sf,
+			       struct blkg_policy_data *pd, int off);
+u64 blkg_prfill_rwstat_recursive(struct seq_file *sf,
+				 struct blkg_policy_data *pd, int off);
 
 struct blkg_conf_ctx {
 	struct gendisk			*disk;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (10 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 11/12] blkcg: implement blkg_prfill_[rw]stat_recursive() Tejun Heo
@ 2012-12-14 22:41 ` Tejun Heo
  2012-12-18 19:11   ` Vivek Goyal
  2012-12-17 16:52 ` [PATCHSET] block: implement blkcg hierarchy support in cfq Vivek Goyal
  12 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-14 22:41 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Unfortunately, at this point, there's no way to make the existing
statistics hierarchical without creating nasty surprises for the
existing users.  Just create recursive counterparts of the existing
stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ceade6e..15cb97e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1669,6 +1669,26 @@ static int cfqg_print_rwstat(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static int cfqg_print_stat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				     struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, blkg_prfill_stat_recursive,
+			  &blkcg_policy_cfq, cft->private, false);
+	return 0;
+}
+
+static int cfqg_print_rwstat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				       struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, blkg_prfill_rwstat_recursive,
+			  &blkcg_policy_cfq, cft->private, true);
+	return 0;
+}
+
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 static u64 cfqg_prfill_avg_queue_size(struct seq_file *sf,
 				      struct blkg_policy_data *pd, int off)
@@ -1740,6 +1760,7 @@ static struct cftype cfq_blkcg_files[] = {
 		.write_u64 = cfq_set_leaf_weight,
 	},
 
+	/* statistics, covering only the tasks in the cfqg */
 	{
 		.name = "time",
 		.private = offsetof(struct cfq_group, stats.time),
@@ -1780,6 +1801,48 @@ static struct cftype cfq_blkcg_files[] = {
 		.private = offsetof(struct cfq_group, stats.queued),
 		.read_seq_string = cfqg_print_rwstat,
 	},
+
+	/* the same statistics which cover the cfqg and its descendants */
+	{
+		.name = "time_recursive",
+		.private = offsetof(struct cfq_group, stats.time),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "sectors_recursive",
+		.private = offsetof(struct cfq_group, stats.sectors),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "io_service_bytes_recursive",
+		.private = offsetof(struct cfq_group, stats.service_bytes),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_serviced_recursive",
+		.private = offsetof(struct cfq_group, stats.serviced),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_service_time_recursive",
+		.private = offsetof(struct cfq_group, stats.service_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_wait_time_recursive",
+		.private = offsetof(struct cfq_group, stats.wait_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_merged_recursive",
+		.private = offsetof(struct cfq_group, stats.merged),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_queued_recursive",
+		.private = offsetof(struct cfq_group, stats.queued),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	{
 		.name = "avg_queue_size",
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq
  2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
                   ` (11 preceding siblings ...)
  2012-12-14 22:41 ` [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
@ 2012-12-17 16:52 ` Vivek Goyal
  2012-12-17 17:38   ` Tejun Heo
  12 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 16:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni,
	Peter Zijlstra, peterz

On Fri, Dec 14, 2012 at 02:41:13PM -0800, Tejun Heo wrote:
> Hello,
> 
> cfq-iosched is currently utterly broken in how it handles cgroup
> hierarchy.  It ignores the hierarchy structure and just treats every
> blkcgs equally.  This is simply broken.  This breakage makes blkcg
> behave very differently from other properly-hierarchical controllers
> and makes it impossible to give any uniform interpretation to the
> hierarchy, which in turn makes it impossible to implement unified
> hierarchy.
> 
> Given the relative simplicity of cfqg scheduling, implementing proper
> hierarchy support isn't that difficult.  All that's necessary is
> determining how much fraction each cfqg on the service tree has claim
> to considering the hierarchy.  The calculation can be done by
> maintaining the sum of active weights at each level and compounding
> the ratios from the cfqg in question to root.  The overhead isn't
> significant.  Tree traversals happen only when cfqgs are added or
> removed from the service tree and they are from the cfqg being
> modified to the root.
> 
> There are some design choices which are worth mentioning.
> 
> * Internal (non-leaf) cfqgs w/ tasks treat the tasks as a single unit
>   competeting against the children cfqgs.  New config knobs -
>   blkio.leaf_weight[_device] - are added to configure the weight of

[ CC peterz ]

Hi Tejun,

I am wondering if blkio.task_group_weight[_device] will make more
sense. It is easier to think in terms of a hidden task group of a cfqg
instead of whether it is a leaf node or not.

>   these tasks.  Another way to look at it is that each cfqg has a
>   hidden leaf child node attached to it which hosts all tasks and
>   leaf_weight controls the weight of that hidden node.
> 
>   Treating cfqqs and cfqgs as equals doesn't make much sense to me and
>   is hairy - we need to establish ioprio to weight mapping and the
>   weights fluctuate as processes fork and exit.

So weights of tasks (io_contexts) or blkcg weights don't fluctuate with
task fork/exit. It is just the weight on the service tree which fluctuates.

> This becomes hairier
>   when considering multiple controllers, Such mappings can't be
>   established consistently across different controllers and the
>   weights are given out differently - ie. blkcg give weights out to
>   io_contexts while cpu to tasks, which may share io_contexts.  It's
>   difficult to make sense of what's going on.

We already have that issue, don't we? Cpu does task scheduling and
CFQ does io_context scheduling. (Nobody seems to be complaining though.)

> 
>   The goal is to bring cpu, currently the only other controller which
>   implements weight based resource allocation, to similar behavior.

I think we first need to have some kind of buy-in from cpu controller
guys that yes in long term they will change it. Otherwise we don't want
to be stuck in a situation where cpu and blkio behave entirely
differently.

In fact we need to revisit what makes more sense. To me treating tasks
and groups at the same level is not necessarily bad as it gives more
flexibility. And we can leave it to the user to create another subgroup
and launch all the tasks there if they want to emulate the behavior of
a hidden sub-group.

If you look at it, systemd already puts services in a separate group.
They always wanted to put user sessions also in a separate group. So
effectively the hierarchy looks as follows (for the cpu controller).

			  root
		        / | \  \ 
		      T1 T2 usr system

So T1 and T2 here are basically kernel threads (all user sessions and
services have been moved out to their respective cgroups).

I think I am fine with not limiting kernel threads to a subgroup of
their own. In fact I think there was a patch where we could not move
kernel threads out of the root cgroup. If that makes sense, then it does
not make sense to limit kernel threads to a subgroup of their own
by default (it is equivalent to moving these threads to a cgroup of
their own).

So though I don't mind the notion of these hidden cgroups, given the
fact that we have implemented things the other way and left it to user
space to manage based on its needs, I am not sure what the fundamental
reason is to change that assumption now.

And even if we decide to do it, we need to have other controllers
on board (especially cpu).

I think we will have similar issues with other components too. In blkio
throttling support, we will have to put some kind of throttling limits
on internal groups too. I guess one can raise similar concerns for the
memory controller too, where there are no internal limits on the child
tasks of a cgroup but there are limits on a child group.

			parent (mem.limit_in_bytes = 200)
			  /   |   \
			 T1  T2   child-grp (mem.limit_in_bytes = 100)
					|
					T3

Now there are no guarantees that T3 will get its share of 100 bytes of
memory allocation, as T1 and T2 might have already exhausted the
parent's quota of 200 bytes.

So should we create an internal group there too to limit the share of
T1 and T2? I thought if somebody wants that, it is best to leave it to
user space rather than have the kernel enforce it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq
  2012-12-17 16:52 ` [PATCHSET] block: implement blkcg hierarchy support in cfq Vivek Goyal
@ 2012-12-17 17:38   ` Tejun Heo
  2012-12-17 18:50     ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 17:38 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni,
	Peter Zijlstra, peterz

Hello, Vivek.

On Mon, Dec 17, 2012 at 11:52:28AM -0500, Vivek Goyal wrote:
> I am wondering if blkio.task_group_weight[_device] will make more
> sense. It is easier to think in terms of a hidden task group of a cfqg
> instead of whether it is a leaf node or not.

As long as we stick to something consistent and short (one word
please), I don't think it'll matter all that much, and "leaf_weight"
stands for the weight of the hidden leaf node.

> >   these tasks.  Another way to look at it is that each cfqg has a
> >   hidden leaf child node attached to it which hosts all tasks and
> >   leaf_weight controls the weight of that hidden node.
> > 
> >   Treating cfqqs and cfqgs as equals doesn't make much sense to me and
> >   is hairy - we need to establish ioprio to weight mapping and the
> >   weights fluctuate as processes fork and exit.
> 
> So weights of tasks (io_contexts) or blkcg weights don't fluctuate with
> task fork/exit. It is just the weight on the service tree which fluctuates.

Why would the weight on service tree fluctuate?

> > This becomes hairier
> >   when considering multiple controllers, Such mappings can't be
> >   established consistently across different controllers and the
> >   weights are given out differently - ie. blkcg give weights out to
> >   io_contexts while cpu to tasks, which may share io_contexts.  It's
> >   difficult to make sense of what's going on.
> 
> We already have that issue, don't we? Cpu does task scheduling and
> CFQ does io_context scheduling. (Nobody seems to be complaining though.)

We don't have that issue yet because cfq doesn't do hierarchical
weighting yet and people are not co-mounting cpu and blkio.

> I think we first need to have some kind of buy-in from cpu controller
> guys that yes in long term they will change it. Otherwise we don't want
> to be stuck in a situation where cpu and blkio behave entirely
> differently.

Sure, I was planning to work on that once blkio is in place but it's
not like we can be consistent in any other way and I don't think
making cpu support this behavior would be too difficult.  It's just
dealing with an extra leaf node after all.  Peter?

> In fact we need to revisit what makes more sense. To me treating tasks
> and groups at the same level is not necessarily bad as it gives more
> flexibility. And we can leave it to the user to create another subgroup
> and launch all the tasks there if they want to emulate the behavior of
> a hidden sub-group.

We'll have to disagree here.

> If you look at it, systemd already puts services in a separate group.
> They always wanted to put user sessions also in a separate group. So
> effectively the hierarchy looks as follows (for the cpu controller).
> 
> 			  root
> 		        / | \  \ 
> 		      T1 T2 usr system
> 
> So T1 and T2 here are basically kernel threads (all user sessions and
> services have been moved out to their respective cgroups).
> 
> I think I am fine with not limiting kernel threads to a subgroup of
> their own. In fact I think there was a patch where we could not move
> kernel threads out of the root cgroup. If that makes sense, then it does
> not make sense to limit kernel threads to a subgroup of their own
> by default (it is equivalent to moving these threads to a cgroup of
> their own).

How is that relevant to the discussion at hand?  That's about having
no limit for the root cgroup.  For anything weight-based, you can't have
"no" weight.

> So though I don't mind the notion of these hidden cgroups, given the
> fact that we have implemented things the other way and left it to user
> space to manage based on its needs, I am not sure what the fundamental
> reason is to change that assumption now.

Hmmm?  blkio doesn't work like that *at all*.  Currently, it basically
treats the root cgroup as a leaf group, so I'm kinda lost why you're
talking about "changing the assumption" because the proposed patchset
maintains the existing behavior (at least for 1-level hierarchy) while
what you're suggesting would change the behavior fundamentally.

So, in terms of compatibility, I don't think there's a clear better
way here.  cpu and blkio are already doing things differently, and we
need to pick one to unify the behavior.  I think having a separate
weight for tasks in an internal node is the better one because

* Configuration lives in cgroup proper.  No need to somehow map
  per-schedule-entity attribute to cgroup weight, which is hairy and
  non-obvious.

* Different controllers deal with different scheduling-entities and it
  becomes very difficult to tell how the weight is actually being
  distributed.  It's just nasty.

> And even if we decide to do it, we need to have other controllers
> on board (especially cpu).

Only cpu, actually.  That's the only other controller which implements
weight based control, and I'm fairly sure we can implement such
behavior in cpu in a backward compatible way.

> I think we will have similar issues with other components too. In blkio
> throttling support, we will have to put some kind of throttling limits
> on internal groups too. I guess one can raise similar concerns for the
> memory controller too, where there are no internal limits on the child
> tasks of a cgroup but there are limits on a child group.

I don't think so.  We need some way of assigning weights between tasks
of an internal cgroup and children.  No such issue exists for
non-weight based controllers.  I don't see any reason to change that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq
  2012-12-17 17:38   ` Tejun Heo
@ 2012-12-17 18:50     ` Vivek Goyal
  2012-12-17 18:59       ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 18:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni,
	Peter Zijlstra, peterz

On Mon, Dec 17, 2012 at 09:38:00AM -0800, Tejun Heo wrote:

[..]
> > >   Treating cfqqs and cfqgs as equals doesn't make much sense to me and
> > >   is hairy - we need to establish ioprio to weight mapping and the
> > >   weights fluctuate as processes fork and exit.
> > 
> > So weights of tasks (io_contexts) or blkcg weights don't fluctuate with
> > task fork/exit. It is just the weight on the service tree which fluctuates.
> 
> Why would the weight on service tree fluctuate?

Because tasks come and go and get queued on the service tree. I am
referring to the total_weight on the service tree and not the weight of
an individual entity. That will change only if the ioprio of a task
changes or blkio.weight is updated.

[..]
> > I think we first need to have some kind of buy-in from cpu controller
> > guys that yes in long term they will change it. Otherwise we don't want
> > to be stuck in a situation where cpu and blkio behave entirely
> > differently.
> 
> Sure, I was planning to work on that once blkio is in place but it's
> not like we can be consistent in any other way and I don't think
> making cpu support this behavior would be too difficult.  It's just
> dealing with an extra leaf node after all.  Peter?

I am not concerned about the implementation. I am only worried about
having agreement that a hidden group is a better thing to do compared
to what we have now.

[..]
> > So though I don't mind the notion of these hidden cgroups, given the
> > fact that we have implemented things the other way and left it to user
> > space to manage based on its needs, I am not sure what the fundamental
> > reason is to change that assumption now.
> 
> Hmmm?  blkio doesn't work like that *at all*.  Currently, it basically
> treats the root cgroup as a leaf group, so I'm kinda lost why you're
> talking about "changing the assumption" because the proposed patchset
> maintains the existing behavior (at least for 1-level hierarchy) while
> what you're suggesting would change the behavior fundamentally.

I am comparing the change of behavior w.r.t. the cpu controller.
Initially we had implemented a full hierarchical controller (cpu like).
It was a lot of code and never went anywhere, so we ended up writing a
flat controller.

> 
> So, in terms of compatibility, I don't think there's a clear better
> way here.  cpu and blkio are already doing things differently, and we
> need to pick one to unify the behavior.  I think having a separate
> weight for tasks in an internal node is the better one because
> 
> * Configuration lives in cgroup proper.  No need to somehow map
>   per-schedule-entity attribute to cgroup weight, which is hairy and
>   non-obvious.
> 
> * Different controllers deal with different scheduling-entities and it
>   becomes very difficult to tell how the weight is actually being
>   distributed.  It's just nasty.
> 

Ok, so you want more predictability and don't want to rely on task
prio or ioprio, so that when you co-mount cpu and blkio you don't
have to worry about different behaviors, and just by looking at the
cgroup configuration you can tell what % of a resource a group will
get. Makes sense.

[..]
> > I think we will have similar issues with other components too. In blkio
> > throttling support, we will have to put some kind of throttling limits
> > on internal groups too. I guess one can raise similar concerns for the
> > memory controller too, where there are no internal limits on the child
> > tasks of a cgroup but there are limits on a child group.
> 
> I don't think so.  We need some way of assigning weights between tasks
> of an internal cgroup and children.  No such issue exists for
> non-weight based controllers.  I don't see any reason to change that.

I am not sure about that. The general idea is how the resources of
a group are distributed among its children. I am not sure why you are
dismissing this notion for max-limit controllers.

For example, if the parent has a 100MB/s limit and it has 4 children
(T1, T2, T3 and G1), then either all children can get 25MB/s each, or
T1/T2/T3 collectively get 50MB/s and G1 gets 50MB/s. So to me the
question of a hidden group and its share w.r.t. sibling entities is
very much valid here too.

Having said that, what you are doing for CFQ should make hierarchical
blk-throttle easier. We just need to queue all IO from all tasks of
a group in a single entity and just round robin between this entity
and sibling groups. Otherwise making throttling hierarchical will become
tricky, as we shall have to maintain per-task queues in the block
throttling layer too.
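
To make the idea concrete, here is a purely hypothetical sketch of that
queuing scheme (the types and names are made up for illustration, this
is not an existing blk-throttle interface): bios from a group's own
tasks are funneled into one hidden entity which round-robins against
the child-group entities.

#include <linux/list.h>
#include <linux/bio.h>

struct tentity {
	struct list_head node;		/* position on the parent's rr list */
	struct bio_list bios;		/* bios queued for this entity */
};

struct tgroup {
	struct tentity tasks;		/* hidden entity for the group's own tasks */
	struct list_head children;	/* child-group entities plus &tasks.node */
	struct list_head *cursor;	/* current round-robin position */
};

/* pick the next entity to dispatch from; assumes ->children is non-empty */
static struct tentity *tg_next(struct tgroup *tg)
{
	tg->cursor = tg->cursor->next;
	if (tg->cursor == &tg->children)	/* skip the list head */
		tg->cursor = tg->cursor->next;
	return list_entry(tg->cursor, struct tentity, node);
}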

Well, I don't mind treating all tasks as a sub-group and letting that
sub-group compete with sibling groups. Just want to make sure the cpu
controller guys are on board.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq
  2012-12-17 18:50     ` Vivek Goyal
@ 2012-12-17 18:59       ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 18:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni,
	Peter Zijlstra, peterz

Hey, Vivek.

On Mon, Dec 17, 2012 at 01:50:14PM -0500, Vivek Goyal wrote:
> > > So weights of tasks (io_contexts) or blkcg weights don't fluctuate with
> > > task fork/exit. It is just the weight on the service tree which fluctuates.
> > 
> > Why would the weight on service tree fluctuate?
> 
> Because tasks come and go and get queued on the service tree. I am
> referring to the total_weight on the service tree and not the weight of
> an individual entity. That will change only if the ioprio of a task
> changes or blkio.weight is updated.

Ah, okay.  Yeah, that's how the weight based distribution works after
all.  BTW, if you view the vfraction as the effective weight, the
total always remains close to 1 after the patches.
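
To illustrate the compounding, a toy userland model of the scheme (the
struct and function are made up for the example, this is not the kernel
code):

struct grp {
	unsigned int weight;		/* weight vs. siblings */
	unsigned int leaf_weight;	/* weight of the hidden leaf node */
	unsigned int children_weight;	/* sum of active weights one level down */
	struct grp *parent;
};

/* fraction of the device an active group's own tasks are entitled to */
static double vfraction(struct grp *g)
{
	double vfr = (double)g->leaf_weight / g->children_weight;
	struct grp *pos;

	for (pos = g; pos->parent; pos = pos->parent)
		vfr = vfr * pos->weight / pos->parent->children_weight;
	return vfr;
}

Summing vfraction() over all active groups comes out to ~1, which is
what lets a single service tree schedule directly on the effective
weight.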

> > Hmmm?  blkio doesn't work like that *at all*.  Currently, it basically
> > treats the root cgroup as a leaf group, so I'm kinda lost why you're
> > talking about "changing the assumption" because the proposed patchset
> > maintains the existing behavior (at least for 1-level hierarchy) while
> > what you're suggesting would change the behavior fundamentally.
> 
> I am comparing the change of behavior w.r.t. the cpu controller.
> Initially we had implemented a full hierarchical controller (cpu like).
> It was a lot of code and never went anywhere, so we ended up writing a
> flat controller.

I see.  Yeah, but we have to change the behavior of one or the other to
make them consistent.  I think introducing leaf weight is both more
desirable and easier to do.

> > So, in terms of compatibility, I don't think there's a clear better
> > way here.  cpu and blkio are already doing things differently, and we
> > need to pick one to unify the behavior.  I think having a separate
> > weight for tasks in an internal node is the better one because
> > 
> > * Configuration lives in cgroup proper.  No need to somehow map
> >   per-schedule-entity attribute to cgroup weight, which is hairy and
> >   non-obvious.
> > 
> > * Different controllers deal with different scheduling-entities and it
> >   becomes very difficult to tell how the weight is actually being
> >   distributed.  It's just nasty.
> > 
> 
> > Ok, so you want more predictability and don't want to rely on task
> > prio or ioprio, so that when you co-mount cpu and blkio you don't
> > have to worry about different behaviors, and just by looking at the
> > cgroup configuration you can tell what % of a resource a group will
> > get. Makes sense.

Yeap, pretty much.

> > I don't think so.  We need some way of assigning weights between tasks
> > of an internal cgroup and children.  No such issue exists for
> > non-weight based controllers.  I don't see any reason to change that.
> 
> > I am not sure about that. The general idea is how the resources of
> > a group are distributed among its children. I am not sure why you are
> > dismissing this notion for max-limit controllers.

The thing is that for weight-based ones, it's essential.  You have to
decide the ratio somehow.  There's no default no-config way to fall
back to.  For limit-based ones, it isn't essential and none of the
current controllers implements such an internal-node limit.  It *could*
be useful but there hasn't been any direct call for such limits, so I
just don't see a good reason to push that.

> For example, if the parent has a 100MB/s limit and it has 4 children
> (T1, T2, T3 and G1), then either all children can get 25MB/s each, or
> T1/T2/T3 collectively get 50MB/s and G1 gets 50MB/s. So to me the
> question of a hidden group and its share w.r.t. sibling entities is
> very much valid here too.

I'm not saying that there aren't any situations where such a limit
would be useful, but I'd like to have a stronger rationale before I go
implementing new features.

> Having said that, what you are doing for CFQ should make hierarchical
> blk-throttle easier. We just need to queue all IO from all tasks of
> a group in a single entity and just round robin between this entity
> and sibling groups. Otherwise making throttling hierarchical will become
> tricky, as we shall have to maintain per-task queues in the block
> throttling layer too.
> 
> Well, I don't mind treating all tasks as a sub-group and letting that
> sub-group compete with sibling groups. Just want to make sure the cpu
> controller guys are on board.

Sure.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 01/12] blkcg: fix minor bug in blkg_alloc()
  2012-12-14 22:41 ` [PATCH 01/12] blkcg: fix minor bug in blkg_alloc() Tejun Heo
@ 2012-12-17 19:10   ` Vivek Goyal
  0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 19:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:14PM -0800, Tejun Heo wrote:
> blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
> The latter test should have been on whether pol->pd_init_fn() exists.
> This doesn't cause actual problems because both blkcg policies
> implement pol->pd_init_fn().  Fix it.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> ---
>  block/blk-cgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3f6d39d..feda49f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -114,7 +114,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  		pd->blkg = blkg;
>  
>  		/* invoke per-policy init */
> -		if (blkcg_policy_enabled(blkg->q, pol))
> +		if (pol->pd_init_fn)
>  			pol->pd_init_fn(blkg);
>  	}
>  
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends
  2012-12-14 22:41 ` [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends Tejun Heo
@ 2012-12-17 19:28   ` Vivek Goyal
  0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 19:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:15PM -0800, Tejun Heo wrote:
> Reorganize such that
> 
> * __blkg_lookup() takes bool param @update_hint to determine whether
>   to update hint.
> 
> * __blkg_lookup_create() no longer performs lookup before trying to
>   create.  Renamed to blkg_create().
> 
> * blkg_lookup_create() now performs lookup and then invokes
>   blkg_create() if lookup fails.
> 
> * root_blkg creation in blkcg_activate_policy() updated accordingly.
>   Note that blkcg_activate_policy() no longer updates lookup hint if
>   root_blkg already exists.
> 
> Except for the last lookup hint bit which is immaterial, this is pure
> reorganization and doesn't introduce any visible behavior change.
> This is to prepare for proper hierarchy support.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good to me. I particularly like the cleanup of
__blkg_lookup_create(), which freed new_blkg if a blkg was already
found.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> ---
>  block/blk-cgroup.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 52 insertions(+), 23 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index feda49f..ffbd237 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -126,7 +126,7 @@ err_free:
>  }
>  
>  static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
> -				      struct request_queue *q)
> +				      struct request_queue *q, bool update_hint)
>  {
>  	struct blkcg_gq *blkg;
>  
> @@ -135,14 +135,19 @@ static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
>  		return blkg;
>  
>  	/*
> -	 * Hint didn't match.  Look up from the radix tree.  Note that we
> -	 * may not be holding queue_lock and thus are not sure whether
> -	 * @blkg from blkg_tree has already been removed or not, so we
> -	 * can't update hint to the lookup result.  Leave it to the caller.
> +	 * Hint didn't match.  Look up from the radix tree.  Note that the
> +	 * hint can only be updated under queue_lock as otherwise @blkg
> +	 * could have already been removed from blkg_tree.  The caller is
> +	 * responsible for grabbing queue_lock if @update_hint.
>  	 */
>  	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
> -	if (blkg && blkg->q == q)
> +	if (blkg && blkg->q == q) {
> +		if (update_hint) {
> +			lockdep_assert_held(q->queue_lock);
> +			rcu_assign_pointer(blkcg->blkg_hint, blkg);
> +		}
>  		return blkg;
> +	}
>  
>  	return NULL;
>  }
> @@ -162,7 +167,7 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
>  
>  	if (unlikely(blk_queue_bypass(q)))
>  		return NULL;
> -	return __blkg_lookup(blkcg, q);
> +	return __blkg_lookup(blkcg, q, false);
>  }
>  EXPORT_SYMBOL_GPL(blkg_lookup);
>  
> @@ -170,9 +175,9 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
>   * If @new_blkg is %NULL, this function tries to allocate a new one as
>   * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
>   */
> -static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
> -					     struct request_queue *q,
> -					     struct blkcg_gq *new_blkg)
> +static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> +				    struct request_queue *q,
> +				    struct blkcg_gq *new_blkg)
>  {
>  	struct blkcg_gq *blkg;
>  	int ret;
> @@ -180,13 +185,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
>  	WARN_ON_ONCE(!rcu_read_lock_held());
>  	lockdep_assert_held(q->queue_lock);
>  
> -	/* lookup and update hint on success, see __blkg_lookup() for details */
> -	blkg = __blkg_lookup(blkcg, q);
> -	if (blkg) {
> -		rcu_assign_pointer(blkcg->blkg_hint, blkg);
> -		goto out_free;
> -	}
> -
>  	/* blkg holds a reference to blkcg */
>  	if (!css_tryget(&blkcg->css)) {
>  		blkg = ERR_PTR(-EINVAL);
> @@ -223,16 +221,39 @@ out_free:
>  	return blkg;
>  }
>  
> +/**
> + * blkg_lookup_create - lookup blkg, try to create one if not there
> + * @blkcg: blkcg of interest
> + * @q: request_queue of interest
> + *
> + * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
> + * create one.  This function should be called under RCU read lock and
> + * @q->queue_lock.
> + *
> + * Returns pointer to the looked up or created blkg on success, ERR_PTR()
> + * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
> + * dead and bypassing, returns ERR_PTR(-EBUSY).
> + */
>  struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>  				    struct request_queue *q)
>  {
> +	struct blkcg_gq *blkg;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	lockdep_assert_held(q->queue_lock);
> +
>  	/*
>  	 * This could be the first entry point of blkcg implementation and
>  	 * we shouldn't allow anything to go through for a bypassing queue.
>  	 */
>  	if (unlikely(blk_queue_bypass(q)))
>  		return ERR_PTR(blk_queue_dead(q) ? -EINVAL : -EBUSY);
> -	return __blkg_lookup_create(blkcg, q, NULL);
> +
> +	blkg = __blkg_lookup(blkcg, q, true);
> +	if (blkg)
> +		return blkg;
> +
> +	return blkg_create(blkcg, q, NULL);
>  }
>  EXPORT_SYMBOL_GPL(blkg_lookup_create);
>  
> @@ -777,7 +798,7 @@ int blkcg_activate_policy(struct request_queue *q,
>  			  const struct blkcg_policy *pol)
>  {
>  	LIST_HEAD(pds);
> -	struct blkcg_gq *blkg;
> +	struct blkcg_gq *blkg, *new_blkg;
>  	struct blkg_policy_data *pd, *n;
>  	int cnt = 0, ret;
>  	bool preloaded;
> @@ -786,19 +807,27 @@ int blkcg_activate_policy(struct request_queue *q,
>  		return 0;
>  
>  	/* preallocations for root blkg */
> -	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
> -	if (!blkg)
> +	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
> +	if (!new_blkg)
>  		return -ENOMEM;
>  
>  	preloaded = !radix_tree_preload(GFP_KERNEL);
>  
>  	blk_queue_bypass_start(q);
>  
> -	/* make sure the root blkg exists and count the existing blkgs */
> +	/*
> +	 * Make sure the root blkg exists and count the existing blkgs.  As
> +	 * @q is bypassing at this point, blkg_lookup_create() can't be
> +	 * used.  Open code it.
> +	 */
>  	spin_lock_irq(q->queue_lock);
>  
>  	rcu_read_lock();
> -	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
> +	blkg = __blkg_lookup(&blkcg_root, q, false);
> +	if (blkg)
> +		blkg_free(new_blkg);
> +	else
> +		blkg = blkg_create(&blkcg_root, q, new_blkg);
>  	rcu_read_unlock();
>  
>  	if (preloaded)
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 03/12] blkcg: cosmetic updates to blkg_create()
  2012-12-14 22:41 ` [PATCH 03/12] blkcg: cosmetic updates to blkg_create() Tejun Heo
@ 2012-12-17 19:37   ` Vivek Goyal
  0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 19:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:16PM -0800, Tejun Heo wrote:
> * Rename out_* labels to err_*.
> 
> * Do ERR_PTR() conversion once in the error return path.
> 
> This patch is cosmetic and to prepare for the hierarchy support.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> ---
>  block/blk-cgroup.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index ffbd237..fde2286 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -187,16 +187,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  
>  	/* blkg holds a reference to blkcg */
>  	if (!css_tryget(&blkcg->css)) {
> -		blkg = ERR_PTR(-EINVAL);
> -		goto out_free;
> +		ret = -EINVAL;
> +		goto err_free_blkg;
>  	}
>  
>  	/* allocate */
>  	if (!new_blkg) {
>  		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
>  		if (unlikely(!new_blkg)) {
> -			blkg = ERR_PTR(-ENOMEM);
> -			goto out_put;
> +			ret = -ENOMEM;
> +			goto err_put_css;
>  		}
>  	}
>  	blkg = new_blkg;
> @@ -213,12 +213,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  	if (!ret)
>  		return blkg;
>  
> -	blkg = ERR_PTR(ret);
> -out_put:
> +err_put_css:
>  	css_put(&blkcg->css);
> -out_free:
> +err_free_blkg:
>  	blkg_free(new_blkg);
> -	return blkg;
> +	return ERR_PTR(ret);
>  }
>  
>  /**
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 04/12] blkcg: make blkcg_gq's hierarchical
  2012-12-14 22:41 ` [PATCH 04/12] blkcg: make blkcg_gq's hierarchical Tejun Heo
@ 2012-12-17 20:04   ` Vivek Goyal
  0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 20:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:17PM -0800, Tejun Heo wrote:
> Currently a child blkg (blkcg_gq) can be created even if its parent
> doesn't exist.  ie. Given a blkg, it's not guaranteed that its
> ancestors will exist.  This makes it difficult to implement proper
> hierarchy support for blkcg policies.
> 
> Always create blkgs recursively and make a child blkg hold a reference
> to its parent.  blkg->parent is added so that finding the parent is
> easy.  blkcg_parent() is also added in the process.
> 
> This change can be visible to userland.  e.g. while issuing IO in a
> nested cgroup previously didn't affect the ancestors at all, now it
> will initialize all ancestor blkgs, and zero stats for the
> request_queue will always appear on them.  While this is userland
> visible, this shouldn't cause any functional difference.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> ---
>  block/blk-cgroup.c | 42 +++++++++++++++++++++++++++++++++++++-----
>  block/blk-cgroup.h | 18 ++++++++++++++++++
>  2 files changed, 55 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fde2286..c858628 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -201,7 +201,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  	}
>  	blkg = new_blkg;
>  
> -	/* insert */
> +	/* link parent and insert */
> +	if (blkcg_parent(blkcg)) {
> +		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
> +		if (WARN_ON_ONCE(!blkg->parent)) {
> +			blkg = ERR_PTR(-EINVAL);
> +			goto err_put_css;
> +		}
> +		blkg_get(blkg->parent);
> +	}
> +
>  	spin_lock(&blkcg->lock);
>  	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
>  	if (likely(!ret)) {
> @@ -213,6 +222,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  	if (!ret)
>  		return blkg;
>  
> +	/* @blkg failed to initialize fully, use the usual release path */
> +	blkg_put(blkg);
> +	return ERR_PTR(ret);
> +
>  err_put_css:
>  	css_put(&blkcg->css);
>  err_free_blkg:
> @@ -226,8 +239,9 @@ err_free_blkg:
>   * @q: request_queue of interest
>   *
>   * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
> - * create one.  This function should be called under RCU read lock and
> - * @q->queue_lock.
> + * create one.  blkg creation is performed recursively from blkcg_root such
> + * that all non-root blkg's have access to the parent blkg.  This function
> + * should be called under RCU read lock and @q->queue_lock.
>   *
>   * Returns pointer to the looked up or created blkg on success, ERR_PTR()
>   * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
> @@ -252,7 +266,23 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
>  	if (blkg)
>  		return blkg;
>  
> -	return blkg_create(blkcg, q, NULL);
> +	/*
> +	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
> +	 * non-root blkgs have access to their parents.
> +	 */
> +	while (true) {
> +		struct blkcg *pos = blkcg;
> +		struct blkcg *parent = blkcg_parent(blkcg);
> +
> +		while (parent && !__blkg_lookup(parent, q, false)) {
> +			pos = parent;
> +			parent = blkcg_parent(parent);
> +		}
> +
> +		blkg = blkg_create(pos, q, NULL);
> +		if (pos == blkcg || IS_ERR(blkg))
> +			return blkg;
> +	}
>  }
>  EXPORT_SYMBOL_GPL(blkg_lookup_create);
>  
> @@ -321,8 +351,10 @@ static void blkg_rcu_free(struct rcu_head *rcu_head)
>  
>  void __blkg_release(struct blkcg_gq *blkg)
>  {
> -	/* release the extra blkcg reference this blkg has been holding */
> +	/* release the blkcg and parent blkg refs this blkg has been holding */
>  	css_put(&blkg->blkcg->css);
> +	if (blkg->parent)
> +		blkg_put(blkg->parent);
>  
>  	/*
>  	 * A group is freed in rcu manner. But having an rcu lock does not
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 2459730..b26ed58 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -94,8 +94,13 @@ struct blkcg_gq {
>  	struct list_head		q_node;
>  	struct hlist_node		blkcg_node;
>  	struct blkcg			*blkcg;
> +
> +	/* all non-root blkcg_gq's are guaranteed to have access to parent */
> +	struct blkcg_gq			*parent;
> +
>  	/* request allocation list for this blkcg-q pair */
>  	struct request_list		rl;
> +
>  	/* reference count */
>  	int				refcnt;
>  
> @@ -181,6 +186,19 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
>  }
>  
>  /**
> + * blkcg_parent - get the parent of a blkcg
> + * @blkcg: blkcg of interest
> + *
> + * Return the parent blkcg of @blkcg.  Can be called anytime.
> + */
> +static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
> +{
> +	struct cgroup *pcg = blkcg->css.cgroup->parent;
> +
> +	return pcg ? cgroup_to_blkcg(pcg) : NULL;
> +}
> +
> +/**
>   * blkg_to_pdata - get policy private data
>   * @blkg: blkg of interest
>   * @pol: policy of interest
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight
  2012-12-14 22:41 ` [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight Tejun Heo
@ 2012-12-17 20:46   ` Vivek Goyal
  2012-12-17 21:15     ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 20:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:19PM -0800, Tejun Heo wrote:
> To prepare for blkcg hierarchy support, add cfqg->nr_active and
> ->level_weight.  cfqg->nr_active counts the number of active cfqgs at
> the cfqg's level and ->level_weight is sum of weights of those cfqgs.
> The level covers itself (cfqg->leaf_weight) and immediate children.

This notion of level is really confusing. If one says "at cfqg's level"
I immediately associate it with cfqg's siblings and not with cfqg's children.

I think the confusion happens because we are overloading the definition of
cfqg. It is competing with its siblings at the same time it is competing
against its child groups (on behalf of its child tasks).

Thanks
Vivek

> 
> The two values are updated when a cfqg enters and leaves the group
> service tree.  Unless the hierarchy is very deep, the added overhead
> should be negligible.
> 
> Currently, the parent is determined using cfqg_flat_parent() which
> makes the root cfqg the parent of all other cfqgs.  This is to make
> the transition to hierarchy-aware scheduling gradual.  Scheduling
> logic will be converted to use cfqg->level_weight without actually
> changing the behavior.  When everything is ready,
> cfqg_flat_parent() will be replaced with a proper parent function.
> 
> This patch doesn't introduce any behavior change.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/cfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 5f23763..eb290a0 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -225,6 +225,18 @@ struct cfq_group {
>  	u64 vdisktime;
>  
>  	/*
> +	 * The number of active cfqgs and sum of their weights under this
> +	 * cfqg.  This covers this cfqg's leaf_weight and all children's
> +	 * weights, but does not cover weights of further descendants.
> +	 *
> +	 * If a cfqg is on the service tree, it's active.  An active cfqg
> +	 * also activates its parent and contributes to the level_weight of
> +	 * the parent.
> +	 */
> +	int nr_active;
> +	unsigned int level_weight;
> +
> +	/*
>  	 * There are two weights - (internal) weight is the weight of this
> +	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
>  	 * this cfqg against the child cfqgs.  For the root cfqg, both
> @@ -583,6 +595,22 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
>  	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
>  }
>  
> +/*
> + * Determine the parent cfqg for weight calculation.  Currently, cfqg
> + * scheduling is flat and the root is the parent of everyone else.
> + */
> +static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
> +{
> +	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
> +	struct cfq_group *root;
> +
> +	while (blkg->parent)
> +		blkg = blkg->parent;
> +	root = blkg_to_cfqg(blkg);
> +
> +	return root != cfqg ? root : NULL;
> +}
> +
>  static inline void cfqg_get(struct cfq_group *cfqg)
>  {
>  	return blkg_get(cfqg_to_blkg(cfqg));
> @@ -683,6 +711,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
>  
>  #else	/* CONFIG_CFQ_GROUP_IOSCHED */
>  
> +static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
>  static inline void cfqg_get(struct cfq_group *cfqg) { }
>  static inline void cfqg_put(struct cfq_group *cfqg) { }
>  
> @@ -1208,11 +1237,33 @@ cfq_update_group_weight(struct cfq_group *cfqg)
>  static void
>  cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  {
> +	struct cfq_group *pos = cfqg;
> +	bool propagate;
> +
> +	/* add to the service tree */
>  	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
>  
>  	cfq_update_group_weight(cfqg);
>  	__cfq_group_service_tree_add(st, cfqg);
>  	st->total_weight += cfqg->weight;
> +
> +	/*
> +	 * Activate @cfqg and propagate activation upwards until we meet an
> +	 * already activated node or reach root.
> +	 */
> +	propagate = !pos->nr_active++;
> +	pos->level_weight += pos->leaf_weight;
> +
> +	while (propagate) {
> +		struct cfq_group *parent = cfqg_flat_parent(pos);
> +
> +		if (!parent)
> +			break;
> +
> +		propagate = !parent->nr_active++;
> +		parent->level_weight += pos->weight;
> +		pos = parent;
> +	}
>  }
>  
>  static void
> @@ -1243,6 +1294,31 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  static void
>  cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  {
> +	struct cfq_group *pos = cfqg;
> +	bool propagate;
> +
> +	/*
> +	 * Undo activation from cfq_group_service_tree_add().  Deactivate
> +	 * @cfqg and propagate deactivation upwards.
> +	 */
> +	propagate = !--pos->nr_active;
> +	pos->level_weight -= pos->leaf_weight;
> +
> +	while (propagate) {
> +		struct cfq_group *parent = cfqg_flat_parent(pos);
> +
> +		/* @pos has 0 nr_active at this point */
> +		WARN_ON_ONCE(pos->level_weight);
> +
> +		if (!parent)
> +			break;
> +
> +		propagate = !--parent->nr_active;
> +		parent->level_weight -= pos->weight;
> +		pos = parent;
> +	}
> +
> +	/* remove from the service tree */
>  	st->total_weight -= cfqg->weight;
>  	if (!RB_EMPTY_NODE(&cfqg->rb_node))
>  		cfq_rb_erase(&cfqg->rb_node, st);
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-14 22:41 ` [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling Tejun Heo
@ 2012-12-17 20:53   ` Vivek Goyal
  2012-12-17 21:17     ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 20:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:20PM -0800, Tejun Heo wrote:
> Currently, cfqg charges are scaled directly according to cfqg->weight.
> Regardless of the number of active cfqgs or the amount of active
> weights, a given weight value always scales charge the same way.  This
> works fine as long as all cfqgs are treated equally regardless of
> their positions in the hierarchy, which is what cfq currently
> implements.  It can't work in hierarchical settings because the
> interpretation of a given weight value depends on where the weight is
> located in the hierarchy.

I did not understand this. Why will the current scheme not work with
hierarchy?

We calculate the vdisktime with the help of CFQ_DEFAULT_WEIGHT and
cfqg->weight: we scale the used time slice in proportion to
CFQ_DEFAULT_WEIGHT/cfqg->weight. So the higher the weight, the smaller
the charge and the cfqg gets scheduled again sooner; the lower the
weight, the higher the vdisktime and the cfqg gets scheduled less
frequently.
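
I.e., roughly the following (a simplified sketch of that scaling, not
the exact cfq code; the default-weight value is assumed here just for
illustration):

#include <linux/types.h>
#include <linux/math64.h>

#define CFQ_DEFAULT_WEIGHT	500	/* assumed default, for illustration */

/* higher weight => smaller charge => the cfqg is scheduled again sooner */
static inline u64 scale_charge(u64 slice_used, unsigned int weight)
{
	return div_u64(slice_used * CFQ_DEFAULT_WEIGHT, weight);
}

/* cfqg->vdisktime += scale_charge(slice_used, cfqg->weight); */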

As every cfqg does the same thing on the service tree, they
automatically get a fair share w.r.t. their weights.

And this mechanism should not be impacted by the hierarchy, because we
have a separate service tree at each level. This stops working only if
you come up with one compressed tree, in which case the weights will
have to be adjusted. If we have a separate service tree in each group
then it should work just fine.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight
  2012-12-17 20:46   ` Vivek Goyal
@ 2012-12-17 21:15     ` Tejun Heo
  2012-12-17 21:18       ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 21:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hello, Vivek.

On Mon, Dec 17, 2012 at 03:46:09PM -0500, Vivek Goyal wrote:
> On Fri, Dec 14, 2012 at 02:41:19PM -0800, Tejun Heo wrote:
> > To prepare for blkcg hierarchy support, add cfqg->nr_active and
> > ->level_weight.  cfqg->nr_active counts the number of active cfqgs at
> > the cfqg's level and ->level_weight is sum of weights of those cfqgs.
> > The level covers itself (cfqg->leaf_weight) and immediate children.
> 
> This notion of level is really confusing. If one says "at cfqg's level"
> I immediately associate it with cfqg's siblings and not with cfqg's children.

We can explicitly say at children's level but I think it should be
enough to explain it clearly in the comment where the field is
defined.

> I think the confusion happens because we are overloading the definition of
> cfqg. It is competing with its siblings at the same time it is competing
> against its child groups (on behalf of its child tasks).

While I agree that part is a bit tricky, I can't think of a much
better way to describe it.  Any better ideas?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-17 20:53   ` Vivek Goyal
@ 2012-12-17 21:17     ` Tejun Heo
  2012-12-17 21:27       ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 21:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hello,

On Mon, Dec 17, 2012 at 03:53:18PM -0500, Vivek Goyal wrote:
> On Fri, Dec 14, 2012 at 02:41:20PM -0800, Tejun Heo wrote:
> > Currently, cfqg charges are scaled directly according to cfqg->weight.
> > Regardless of the number of active cfqgs or the amount of active
> > weights, a given weight value always scales charge the same way.  This
> > works fine as long as all cfqgs are treated equally regardless of
> > their positions in the hierarchy, which is what cfq currently
> > implements.  It can't work in hierarchical settings because the
> > interpretation of a given weight value depends on where the weight is
> > located in the hierarchy.
> 
> I did not understand this. Why will the current scheme not work with
> hierarchy?

Because the meaning of a weight changes depending on where the weight
exists in the hierarchy?

> We calculate the vdisktime with the help of CFQ_DEFAULT_WEIGHT and
> cfqg->weight: we scale the used time slice in proportion to
> CFQ_DEFAULT_WEIGHT/cfqg->weight. So the higher the weight, the smaller
> the charge and the cfqg gets scheduled again sooner; the lower the
> weight, the higher the vdisktime and the cfqg gets scheduled less
> frequently.
> 
> As every cfqg does the same thing on the service tree, they
> automatically get a fair share w.r.t. their weights.
> 
> And this mechanism should not be impacted by the hierarchy, because we
> have a separate service tree at each level. This stops working only if
> you come up with one compressed tree, in which case the weights will
> have to be adjusted. If we have a separate service tree in each group
> then it should work just fine.

Why would you create N service trees when you can almost trivially use
one by calculating the effective weight?  You would have to be
adjusting all trees above whenever something changes in a child.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight
  2012-12-17 21:15     ` Tejun Heo
@ 2012-12-17 21:18       ` Vivek Goyal
  2012-12-17 21:20         ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 21:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Mon, Dec 17, 2012 at 01:15:17PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Dec 17, 2012 at 03:46:09PM -0500, Vivek Goyal wrote:
> > On Fri, Dec 14, 2012 at 02:41:19PM -0800, Tejun Heo wrote:
> > > To prepare for blkcg hierarchy support, add cfqg->nr_active and
> > > ->level_weight.  cfqg->nr_active counts the number of active cfqgs at
> > > the cfqg's level and ->level_weight is sum of weights of those cfqgs.
> > > The level covers itself (cfqg->leaf_weight) and immediate children.
> > 
> > This notion of level is really confusing. If one says "at cfqg's level"
> > I immediately associate it with cfqg's siblings and not with cfqg's children.
> 
> We can explicitly say at children's level but I think it should be
> enough to explain it clearly in the comment where the field is
> defined.
> 
> > I think the confusion happens because we are overloading the definition of
> > cfqg. It is competing with its siblings at the same time it is competing
> > against its child groups (on behalf of its child tasks).
> 
> While I agree that part is a bit tricky, I can't think of a much
> better way to describe it.  Any better ideas?

Can we call it cfqg->children_weight instead of cfqg->level_weight?


Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight
  2012-12-17 21:18       ` Vivek Goyal
@ 2012-12-17 21:20         ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 21:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Mon, Dec 17, 2012 at 04:18:43PM -0500, Vivek Goyal wrote:
> > > I think the confusion happens because we are overloading the definition
> > > of cfqg.  It is competing with its siblings and, at the same time,
> > > competing against its child groups (on behalf of its child tasks).
> > 
> > While I agree that part is a bit tricky, I can't think of a much
> > better way to describe it.  Any better ideas?
> 
> Can we call it cfqg->children_weight instead of cfqg->level_weight?

Hmmm... yeah, why not?  I'll rename it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-17 21:17     ` Tejun Heo
@ 2012-12-17 21:27       ` Vivek Goyal
  2012-12-17 21:33         ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 21:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Mon, Dec 17, 2012 at 01:17:38PM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Dec 17, 2012 at 03:53:18PM -0500, Vivek Goyal wrote:
> > On Fri, Dec 14, 2012 at 02:41:20PM -0800, Tejun Heo wrote:
> > > Currently, cfqg charges are scaled directly according to cfqg->weight.
> > > Regardless of the number of active cfqgs or the amount of active
> > > weights, a given weight value always scales the charge the same way.
> > > This works fine as long as all cfqgs are treated equally regardless of
> > > their positions in the hierarchy, which is what cfq currently
> > > implements.  It can't work in hierarchical settings because the
> > > interpretation of a given weight value depends on where the weight is
> > > located in the hierarchy.
> > 
> > I did not understand this.  Why will the current scheme not work with
> > hierarchy?
> 
> Because the meaning of a weight changes depending on where the weight
> exists in the hierarchy?
> 
> > The vdisktime is calculated with the help of CFQ_DEFAULT_WEIGHT and
> > cfqg->weight: we scale the used time slice in proportion to
> > CFQ_DEFAULT_WEIGHT/cfqg->weight.  So the higher the weight, the smaller
> > the charge, and the cfqg gets scheduled again sooner; the lower the
> > weight, the higher the vdisktime, and the cfqg gets scheduled less
> > frequently.
> > 
> > As every cfqg does the same thing on the service tree, they all
> > automatically get their fair share w.r.t. their weights.
> > 
> > And this mechanism should not be impacted by the hierarchy, because we
> > have a separate service tree at each level.  It would break only if you
> > came up with one compressed tree, where the weights would then have to
> > be adjusted.  If we have a separate service tree in each group, it
> > should work just fine.
> 
> Why would you create N service trees when you can almost trivially use
> one by calculating the effective weight?  You would have to keep
> adjusting all the trees above whenever something changed in a child.

One of the reasons I can think of is accuracy.  If a task/group is
added to a service tree, it mostly does not change the fraction of the
parent and does not change the fractions of the parent's siblings.

By making everything flat, any addition/removal of an entity changes
the fraction of everything on the tree.

Not that I am bothered by it, because we do not focus that strictly on
fairness, so I would not worry about it.

What I do care about is at least being able to read and understand the
code easily.  Right now it is hard to understand, and I am still
struggling to wrap my head around it.

For example, while adding a group to the service tree we calculate
cfqg->vfraction as follows.

vfr = vfr * pos->leaf_weight / pos->level_weight;

and then

vfr = vfr * pos->weight / parent->level_weight;

cfqg->vfraction = max_t(unsigned, vfr, 1);

If cfqg->vfraction is about the cfqg, then why should we take
leaf_weight and level_weight into account?  We should just be worried
about pos->weight and parent->level_weight, and those should determine
the vfraction of the cfqg.
 
Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-17 21:27       ` Vivek Goyal
@ 2012-12-17 21:33         ` Tejun Heo
  2012-12-17 21:49           ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 21:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hello,

On Mon, Dec 17, 2012 at 04:27:36PM -0500, Vivek Goyal wrote:
> What I do care about is at least being able to read and understand the
> code easily.  Right now it is hard to understand, and I am still
> struggling to wrap my head around it.

Hmm... I thought it was really simple.  Maybe I'm just too familiar
with it.  You just walk up the tree multiplying the fraction you have
at each level.  It really doesn't get much simpler than that.

> For example, while adding a group to the service tree we calculate
> cfqg->vfraction as follows.
> 
> vfr = vfr * pos->leaf_weight / pos->level_weight;
> 
> and then
> 
> vfr = vfr * pos->weight / parent->level_weight;
> 
> cfqg->vfraction = max_t(unsigned, vfr, 1);
> 
> If cfqg->vfraction is about the cfqg, then why should we take
> leaf_weight and level_weight into account?  We should just be worried
> about pos->weight and parent->level_weight, and those should determine
> the vfraction of the cfqg.

Eh?  Then how would it compete with the children cfqgs?  You can
consider it as the hidden leaf node competing with the children cfqgs,
right?  So, you take the leaf weight and divide it by the total active
weight of all the children (including the hidden leaf node).  It isn't
different from the rest of the calculation.  It just looks different
because leaf_weight is stored elsewhere and we look down there and
then walk up.  The calculation being done is exactly the same one.
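
As a concrete (made-up) example: if a cfqg P has leaf_weight 200 and
two active children with weights 300 and 500, then P->level_weight is
200 + 300 + 500 = 1000, and P's tasks get 200/1000 of whatever fraction
P itself holds at its parent's level, which is exactly the same step a
child takes when it gets 300/1000 or 500/1000 of it.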

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-17 21:33         ` Tejun Heo
@ 2012-12-17 21:49           ` Vivek Goyal
  2012-12-17 22:12             ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-17 21:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Mon, Dec 17, 2012 at 01:33:14PM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Dec 17, 2012 at 04:27:36PM -0500, Vivek Goyal wrote:
> > What I do care about is at least being able to read and understand the
> > code easily.  Right now it is hard to understand, and I am still
> > struggling to wrap my head around it.
> 
> Hmm... I thought it was really simple.  Maybe I'm just too familiar
> with it.  You just walk up the tree multiplying the fraction you have
> at each level.  It really doesn't get much simpler than that.
> 
> > For example, while adding a group to the service tree we calculate
> > cfqg->vfraction as follows.
> > 
> > vfr = vfr * pos->leaf_weight / pos->level_weight;
> > 
> > and then
> > 
> > vfr = vfr * pos->weight / parent->level_weight;
> > 
> > cfqg->vfraction = max_t(unsigned, vfr, 1);
> > 
> > If cfqg->vfraction is about the cfqg, then why should we take
> > leaf_weight and level_weight into account?  We should just be worried
> > about pos->weight and parent->level_weight, and those should determine
> > the vfraction of the cfqg.
> 
> Eh?  Then how would it compete with the children cfqgs?  You can
> consider it as the hidden leaf node competing with the children cfqgs,
> right?  So, you take the leaf weight and divide it by the total active
> weight of all the children (including the hidden leaf node).  It isn't
> different from the rest of the calculation.  It just looks different
> because leaf_weight is stored elsewhere and we look down there and
> then walk up.  The calculation being done is exactly the same one.

Again, it is coming from multiplexing cfqg and cfqg->task_group, so
effectively we are calculating the vfraction of the task_group inside
the cfqg.  Once you say cfqg->vfraction, I immediately think of the
total share of the group (including the task_group and all children
cgroups).

Maybe cfqg->task_group_vfraction (or cfqg->tg_fraction) is a better name.

Also the description says:

/* vfraction  the fraction of vdisktime that a cfqg is entitled to */

Maybe we can clarify that it is the share of the cfqg's task_group.

I guess I am too familiar with how cpu has done it.  I think I am
getting confused between properties which belong to the cfqg and
properties which belong to cfqg->task_group.

I guess if we prefix the fields belonging to the task_group with some
suitable string, it might become a little easier to understand.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-17 21:49           ` Vivek Goyal
@ 2012-12-17 22:12             ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-17 22:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hey, Vivek.

On Mon, Dec 17, 2012 at 04:49:11PM -0500, Vivek Goyal wrote:
> > Eh?  Then how would it compete with the children cfqgs?  You can
> > consider it as the hidden leaf node competing with the children cfqgs,
> > right?  So, you take the leaf weight and divide it by the total active
> > weight of all the children (including the hidden leaf node).  It isn't
> > different from the rest of the calculation.  It just looks different
> > because leaf_weight is stored elsewhere and we look down there and
> > then walk up.  The calculation being done is exactly the same one.
> 
> Again, it is coming from multiplexing cfqg and cfqg->task_group, so
> effectively we are calculating the vfraction of the task_group inside
> the cfqg.

Or calculating the fraction of the hidden leaf node.

> Once you say cfqg->vfraction, I immediately think of the total share
> of the group (including the task_group and all children cgroups).

Any tasks should be treated as members of the hidden leaf node, so...

> Maybe cfqg->task_group_vfraction (or cfqg->tg_fraction) is a better name.

I don't know.  Wouldn't just explaining it clearly in the comment be
enough?  I think the name "vfraction" is already weird enough to
indicate that it's something you need to read an explanation of.  If
that explanation is readily available as a comment, I think it should
be okay.

> Also the description says:
> 
> /* vfraction  the fraction of vdisktime that a cfqg is entitled to */
> 
> Maybe we can clarify that it is the share of the cfqg's task_group.

Yeah, no objection.  Let's beef up the explanation there.

> I guess I am too familiar with how cpu has done it.  I think I am
> getting confused between properties which belong to the cfqg and
> properties which belong to cfqg->task_group.
> 
> I guess if we prefix the fields belonging to the task_group with some
> suitable string, it might become a little easier to understand.

I usually am not a big fan of adding wordy prefixes.  I think they
tend to obfuscate/frustrate more than help.  Let's put a weird (that
is, non-generic) name on something weird and explain it well in the
comment.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support
  2012-12-14 22:41 ` [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
@ 2012-12-18 18:40   ` Vivek Goyal
  2012-12-18 19:10     ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-18 18:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:22PM -0800, Tejun Heo wrote:
> With the previous two patches, all cfqg scheduling decisions are based
> on vfraction and ready for hierarchy support.  The only thing which
> keeps the behavior flat is cfqg_flat_parent(), which makes the vfraction
> calculation treat all non-root cfqgs as children of the root cfqg.
> 
> Replace it with cfqg_parent() which returns the real parent.  This
> enables full blkcg hierarchy support for cfq-iosched.  For example,
> consider the following hierarchy.
> 
>         root
>       /      \
>    A:500      B:250
>   /     \
>  AA:500  AB:1000
> 
> For simplicity, let's say all the leaf nodes have active tasks and are
> on the service tree.  For each leaf node, the vfraction would be
> 
>  AA: (500  / 1500) * (500 / 750) =~ 0.2222
>  AB: (1000 / 1500) * (500 / 750) =~ 0.4444
>   B:                 (250 / 750) =~ 0.3333
> 
> and vdisktime will be distributed accordingly.

Can we update the Documentation/blkio-controller.txt file to explain
all this?  It would also be nice to explain the case of a group having
both tasks and groups as children, and how the shares are calculated
then.  There we can also explain the notion of group->weight and
group->leaf_weight.

So in the above picture, assume group A has some tasks, say T1 and T2,
and the root group has task T3.  Also assume that A->weight = 500,
A->leaf_weight = 750, and root->weight = root->leaf_weight = 125.

                root
           /     |      \
       A:500   B:250   root-auto-group:125
      /   |   \                  |
    AA    AB   A-auto-group:750  T3
  (500) (1000)     /  \
                  T1  T2

Now the vfractions (% shares) should look as follows:

root-auto-group: 125/875                 =~ .1428 =~ 14%
B:               250/875                 =~ .2857 =~ 28%
A-auto-group:    (750/2250)  * (500/875) =~ .1904 =~ 19%
AA:              (500/2250)  * (500/875) =~ .1269 =~ 12%
AB:              (1000/2250) * (500/875) =~ .2539 =~ 25%
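
These shares can be sanity-checked with a tiny userspace program
(illustrative only; nothing kernel-specific about it):

  #include <stdio.h>

  int main(void)
  {
          /* sums of active weights at the root level and at A's level */
          double root_level = 500 + 250 + 125;  /* A + B + root's hidden leaf */
          double a_level = 500 + 1000 + 750;    /* AA + AB + A's hidden leaf */

          double root_auto = 125 / root_level;
          double b         = 250 / root_level;
          double a_auto    = (750 / a_level) * (500 / root_level);
          double aa        = (500 / a_level) * (500 / root_level);
          double ab        = (1000 / a_level) * (500 / root_level);

          /* the five leaves split the whole disk, so this prints sum=1.0000 */
          printf("%.4f %.4f %.4f %.4f %.4f sum=%.4f\n", root_auto, b,
                 a_auto, aa, ab, root_auto + b + a_auto + aa + ab);
          return 0;
  }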

Thanks
Vivek

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/cfq-iosched.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index fd2f4b4..ceade6e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -603,20 +603,11 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
>  	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
>  }
>  
> -/*
> - * Determine the parent cfqg for weight calculation.  Currently, cfqg
> - * scheduling is flat and the root is the parent of everyone else.
> - */
> -static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
> +static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
>  {
> -	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
> -	struct cfq_group *root;
> -
> -	while (blkg->parent)
> -		blkg = blkg->parent;
> -	root = blkg_to_cfqg(blkg);
> +	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
>  
> -	return root != cfqg ? root : NULL;
> +	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
>  }
>  
>  static inline void cfqg_get(struct cfq_group *cfqg)
> @@ -719,7 +710,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
>  
>  #else	/* CONFIG_CFQ_GROUP_IOSCHED */
>  
> -static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
> +static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
>  static inline void cfqg_get(struct cfq_group *cfqg) { }
>  static inline void cfqg_put(struct cfq_group *cfqg) { }
>  
> @@ -1286,7 +1277,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  	 * activated node is met.  vfraction calculation should always
>  	 * continue to the root.
>  	 */
> -	while ((parent = cfqg_flat_parent(pos))) {
> +	while ((parent = cfqg_parent(pos))) {
>  		if (propagate) {
>  			propagate = !parent->nr_active++;
>  			parent->level_weight += pos->weight;
> @@ -1337,7 +1328,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  	pos->level_weight -= pos->leaf_weight;
>  
>  	while (propagate) {
> -		struct cfq_group *parent = cfqg_flat_parent(pos);
> +		struct cfq_group *parent = cfqg_parent(pos);
>  
>  		/* @pos has 0 nr_active at this point */
>  		WARN_ON_ONCE(pos->level_weight);
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support
  2012-12-18 18:40   ` Vivek Goyal
@ 2012-12-18 19:10     ` Tejun Heo
  2012-12-18 19:16       ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-18 19:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hey, Vivek.

On Tue, Dec 18, 2012 at 01:40:23PM -0500, Vivek Goyal wrote:
> Can we update the Documentation/blkio-controller.txt file to explain
> all this?  It would also be nice to explain the case of a group having
> both tasks and groups as children, and how the shares are calculated
> then.  There we can also explain the notion of group->weight and
> group->leaf_weight.

Yeah, sure thing.  Other than documentation / field names, do things
look okay to you?  If so, I'll post an updated round once -rc1 happens.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-14 22:41 ` [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
@ 2012-12-18 19:11   ` Vivek Goyal
  2012-12-18 19:14     ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-18 19:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 14, 2012 at 02:41:25PM -0800, Tejun Heo wrote:
> Unfortunately, at this point, there's no way to make the existing
> statistics hierarchical without creating nasty surprises for the
> existing users.  Just create a recursive counterpart of the existing
> stats.
> 

Hi Tejun,

All these stats need to be mentioned in the blkio-controller.txt file
to keep that file up to date.

I think it also needs another word about the nature of the hierarchical
stats.  That is, they represent the current view of the system and
don't store history.  So if a cgroup was created, did some IO, and was
then removed, we lose that history; the deleted cgroup's parent will
have no record of the deleted cgroup's stats.

Hence these stats can't be used for things like billing purposes.

IIRC, this is different from the way we collect hierarchical stats for
the memory controller.

But I kind of like this, because the stat update overhead does not
increase with the depth of the hierarchy.  It is primarily the stat
reader who pays the price of traversing through all the stats.
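
To spell that trade-off out, here is a minimal userspace sketch (made-up
struct and names, not the actual blkcg code): the writer only bumps its
own group's counter, and the reader walks the subtree.

  struct grp {
          unsigned long long stat;   /* this group's own counter */
          struct grp *first_child;
          struct grp *next_sibling;
  };

  /* the hot path stays O(1): charge only the issuing group */
  static void charge(struct grp *g, unsigned long long delta)
  {
          g->stat += delta;
  }

  /* the reader pays: sum over the whole live subtree on every read */
  static unsigned long long read_recursive(struct grp *g)
  {
          unsigned long long sum = g->stat;
          struct grp *c;

          for (c = g->first_child; c; c = c->next_sibling)
                  sum += read_recursive(c);
          return sum;
  }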

Thanks
Vivek

 

> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/cfq-iosched.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index ceade6e..15cb97e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1669,6 +1669,26 @@ static int cfqg_print_rwstat(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static int cfqg_print_stat_recursive(struct cgroup *cgrp, struct cftype *cft,
> +				     struct seq_file *sf)
> +{
> +	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
> +
> +	blkcg_print_blkgs(sf, blkcg, blkg_prfill_stat_recursive,
> +			  &blkcg_policy_cfq, cft->private, false);
> +	return 0;
> +}
> +
> +static int cfqg_print_rwstat_recursive(struct cgroup *cgrp, struct cftype *cft,
> +				       struct seq_file *sf)
> +{
> +	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
> +
> +	blkcg_print_blkgs(sf, blkcg, blkg_prfill_rwstat_recursive,
> +			  &blkcg_policy_cfq, cft->private, true);
> +	return 0;
> +}
> +
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  static u64 cfqg_prfill_avg_queue_size(struct seq_file *sf,
>  				      struct blkg_policy_data *pd, int off)
> @@ -1740,6 +1760,7 @@ static struct cftype cfq_blkcg_files[] = {
>  		.write_u64 = cfq_set_leaf_weight,
>  	},
>  
> +	/* statistics, covers only the tasks in the cfqg */
>  	{
>  		.name = "time",
>  		.private = offsetof(struct cfq_group, stats.time),
> @@ -1780,6 +1801,48 @@ static struct cftype cfq_blkcg_files[] = {
>  		.private = offsetof(struct cfq_group, stats.queued),
>  		.read_seq_string = cfqg_print_rwstat,
>  	},
> +
> +	/* the same statistics which cover the cfqg and its descendants */
> +	{
> +		.name = "time_recursive",
> +		.private = offsetof(struct cfq_group, stats.time),
> +		.read_seq_string = cfqg_print_stat_recursive,
> +	},
> +	{
> +		.name = "sectors_recursive",
> +		.private = offsetof(struct cfq_group, stats.sectors),
> +		.read_seq_string = cfqg_print_stat_recursive,
> +	},
> +	{
> +		.name = "io_service_bytes_recursive",
> +		.private = offsetof(struct cfq_group, stats.service_bytes),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_serviced_recursive",
> +		.private = offsetof(struct cfq_group, stats.serviced),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_service_time_recursive",
> +		.private = offsetof(struct cfq_group, stats.service_time),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_wait_time_recursive",
> +		.private = offsetof(struct cfq_group, stats.wait_time),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_merged_recursive",
> +		.private = offsetof(struct cfq_group, stats.merged),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_queued_recursive",
> +		.private = offsetof(struct cfq_group, stats.queued),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  	{
>  		.name = "avg_queue_size",
> -- 
> 1.7.11.7

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-18 19:11   ` Vivek Goyal
@ 2012-12-18 19:14     ` Tejun Heo
  2012-12-18 19:18       ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-18 19:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hello, Vivek.

On Tue, Dec 18, 2012 at 02:11:17PM -0500, Vivek Goyal wrote:
> All these stats need to be mentioned in the blkio-controller.txt file
> to keep that file up to date.
> 
> I think it also needs another word about the nature of the hierarchical
> stats.  That is, they represent the current view of the system and
> don't store history.  So if a cgroup was created, did some IO, and was
> then removed, we lose that history; the deleted cgroup's parent will
> have no record of the deleted cgroup's stats.

Haven't thought about that.  That's nasty.

> Hence these stats can't be used for things like billing purposes.
> 
> IIRC, this is different from the way we collect hierarchical stats for
> the memory controller.
> 
> But I kind of like this, because the stat update overhead does not
> increase with the depth of the hierarchy.  It is primarily the stat
> reader who pays the price of traversing through all the stats.

Yeah, hmmm, maybe we should add another set of counters to carry stats
from dead ones?  Avoiding hierarchical accounting overhead in the hot
path while remembering bygones shouldn't be that hard.  Will work on
that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support
  2012-12-18 19:10     ` Tejun Heo
@ 2012-12-18 19:16       ` Vivek Goyal
  2012-12-18 19:17         ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-18 19:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Tue, Dec 18, 2012 at 11:10:55AM -0800, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Tue, Dec 18, 2012 at 01:40:23PM -0500, Vivek Goyal wrote:
> > Can we update the Documentation/blkio-controller.txt file to explain
> > all this?  It would also be nice to explain the case of a group having
> > both tasks and groups as children, and how the shares are calculated
> > then.  There we can also explain the notion of group->weight and
> > group->leaf_weight.
> 
> Yeah, sure thing.  Other than documentation / field names, do things
> look okay to you?  If so, I'll post an updated round once -rc1 happens.

Hi Tejun,

Yes, in general things look good to me with this patch.

/me is wondering how to get a response from cpu controller folks.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support
  2012-12-18 19:16       ` Vivek Goyal
@ 2012-12-18 19:17         ` Tejun Heo
  0 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2012-12-18 19:17 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hey,

On Tue, Dec 18, 2012 at 02:16:45PM -0500, Vivek Goyal wrote:
> > Yeah, sure thing.  Other than documentation / field names, do things
> > look okay to you?  If so, I'll post an updated round once -rc1 happens.
> 
> Yes, in general things look good to me with this patch.

Cool.

> /me is wondering how to get a response from cpu controller folks.

Well, I pinged Peter last time about it and didn't get any response.
Prolly best to ping w/ patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-18 19:14     ` Tejun Heo
@ 2012-12-18 19:18       ` Vivek Goyal
  2012-12-18 19:21         ` Tejun Heo
  0 siblings, 1 reply; 40+ messages in thread
From: Vivek Goyal @ 2012-12-18 19:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Tue, Dec 18, 2012 at 11:14:25AM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Tue, Dec 18, 2012 at 02:11:17PM -0500, Vivek Goyal wrote:
> > All these stats need to be mentioned in the blkio-controller.txt file
> > to keep that file up to date.
> > 
> > I think it also needs another word about the nature of the hierarchical
> > stats.  That is, they represent the current view of the system and
> > don't store history.  So if a cgroup was created, did some IO, and was
> > then removed, we lose that history; the deleted cgroup's parent will
> > have no record of the deleted cgroup's stats.
> 
> Haven't thought about that.  That's nasty.
> 
> > Hence these stats can't be used for things like billing purposes.
> > 
> > IIRC, this is different from the way we collect hierarchical stats for
> > the memory controller.
> > 
> > But I kind of like this, because the stat update overhead does not
> > increase with the depth of the hierarchy.  It is primarily the stat
> > reader who pays the price of traversing through all the stats.
> 
> Yeah, hmmm, maybe we should add another set of counters to carry stats
> from dead ones?  Avoiding hierarchical accounting overhead in the hot
> path while remembering bygones shouldn't be that hard.  Will work on
> that.

So are you planning to migrate the stats to the parent when a cgroup is
being deleted?  That should make sure we don't do hierarchical updates.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-18 19:18       ` Vivek Goyal
@ 2012-12-18 19:21         ` Tejun Heo
  2012-12-18 19:26           ` Vivek Goyal
  0 siblings, 1 reply; 40+ messages in thread
From: Tejun Heo @ 2012-12-18 19:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Tue, Dec 18, 2012 at 02:18:54PM -0500, Vivek Goyal wrote:
> > Yeah, hmmm, maybe we should add another set of counters to carry stats
> > from dead ones?  Avoiding hierarchical accounting overhead in the hot
> > path while remembering bygones shouldn't be that hard.  Will work on
> > that.
> 
> So are you planning to migrate the stats to the parent when a cgroup is
> being deleted?  That should make sure we don't do hierarchical updates.

Yeah, something like that.  The current statistics being !hierarchical
means that we need to keep them separate from the actual stats, so
it'll end up with another set of stats.  Kinda nasty but I don't see
any better way out.
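
Roughly something like the following, with made-up names (just a sketch
of the idea, not an actual implementation): fold a dying group's totals
into an extra per-parent counter set once, at removal time, so the hot
path stays untouched.

  struct grp_stats {
          unsigned long long own;    /* IO done by this group itself */
          unsigned long long dead;   /* carried over from removed descendants */
  };

  /* called once when @child is being destroyed */
  static void fold_into_parent(struct grp_stats *parent,
                               const struct grp_stats *child)
  {
          parent->dead += child->own + child->dead;
  }

  /* a recursive read then sums own + dead over the live subtree */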

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics
  2012-12-18 19:21         ` Tejun Heo
@ 2012-12-18 19:26           ` Vivek Goyal
  0 siblings, 0 replies; 40+ messages in thread
From: Vivek Goyal @ 2012-12-18 19:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Tue, Dec 18, 2012 at 11:21:55AM -0800, Tejun Heo wrote:
> On Tue, Dec 18, 2012 at 02:18:54PM -0500, Vivek Goyal wrote:
> > > Yeah, hmmm, maybe we should add another set of counters to carry stats
> > > from dead ones?  Avoiding hierarchical accounting overhead in the hot
> > > path while remembering bygones shouldn't be that hard.  Will work on
> > > that.
> > 
> > So are you planning to migrate the stats to the parent when a cgroup is
> > being deleted?  That should make sure we don't do hierarchical updates.
> 
> Yeah, something like that.  The current statistics being !hierarchical
> means that we need to keep them separate from the actual stats, so
> it'll end up with another set of stats.  Kinda nasty but I don't see
> any better way out.

Separate stats for hierarchical migration make sense to me.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2012-12-18 19:26 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-14 22:41 [PATCHSET] block: implement blkcg hierarchy support in cfq Tejun Heo
2012-12-14 22:41 ` [PATCH 01/12] blkcg: fix minor bug in blkg_alloc() Tejun Heo
2012-12-17 19:10   ` Vivek Goyal
2012-12-14 22:41 ` [PATCH 02/12] blkcg: reorganize blkg_lookup_create() and friends Tejun Heo
2012-12-17 19:28   ` Vivek Goyal
2012-12-14 22:41 ` [PATCH 03/12] blkcg: cosmetic updates to blkg_create() Tejun Heo
2012-12-17 19:37   ` Vivek Goyal
2012-12-14 22:41 ` [PATCH 04/12] blkcg: make blkcg_gq's hierarchical Tejun Heo
2012-12-17 20:04   ` Vivek Goyal
2012-12-14 22:41 ` [PATCH 05/12] cfq-iosched: add leaf_weight Tejun Heo
2012-12-14 22:41 ` [PATCH 06/12] cfq-iosched: implement cfq_group->nr_active and ->level_weight Tejun Heo
2012-12-17 20:46   ` Vivek Goyal
2012-12-17 21:15     ` Tejun Heo
2012-12-17 21:18       ` Vivek Goyal
2012-12-17 21:20         ` Tejun Heo
2012-12-14 22:41 ` [PATCH 07/12] cfq-iosched: implement hierarchy-ready cfq_group charge scaling Tejun Heo
2012-12-17 20:53   ` Vivek Goyal
2012-12-17 21:17     ` Tejun Heo
2012-12-17 21:27       ` Vivek Goyal
2012-12-17 21:33         ` Tejun Heo
2012-12-17 21:49           ` Vivek Goyal
2012-12-17 22:12             ` Tejun Heo
2012-12-14 22:41 ` [PATCH 08/12] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction Tejun Heo
2012-12-14 22:41 ` [PATCH 09/12] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
2012-12-18 18:40   ` Vivek Goyal
2012-12-18 19:10     ` Tejun Heo
2012-12-18 19:16       ` Vivek Goyal
2012-12-18 19:17         ` Tejun Heo
2012-12-14 22:41 ` [PATCH 10/12] blkcg: add blkg_policy_data->plid Tejun Heo
2012-12-14 22:41 ` [PATCH 11/12] blkcg: implement blkg_prfill_[rw]stat_recursive() Tejun Heo
2012-12-14 22:41 ` [PATCH 12/12] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
2012-12-18 19:11   ` Vivek Goyal
2012-12-18 19:14     ` Tejun Heo
2012-12-18 19:18       ` Vivek Goyal
2012-12-18 19:21         ` Tejun Heo
2012-12-18 19:26           ` Vivek Goyal
2012-12-17 16:52 ` [PATCHSET] block: implement blkcg hierarchy support in cfq Vivek Goyal
2012-12-17 17:38   ` Tejun Heo
2012-12-17 18:50     ` Vivek Goyal
2012-12-17 18:59       ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).