* [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal; +Cc: containers, cgroups, linux-kernel, ctalbott, rni

Hello,

This is the second iteration to implement blkcg hierarchy support in
cfq-iosched.  Changes from the first take[L] are:

* Vivek's cfq cleanup patches are included in the series for
  convenience.

* The divide-by-zero bug with !CONFIG_CFQ_GROUP_IOSCHED reported by
  Fengguang is fixed.

* Updated to reflect Vivek's reviews - renames & documentation.

* Recursive stats no longer forget stats from dead descendants.  This
  turned out to be more complex than I wished, involving the
  implementation of policy on/offline callbacks.

cfq-iosched is currently utterly broken in how it handles cgroup
hierarchy.  It ignores the hierarchy structure and just treats all
blkcgs equally.  This is simply broken.  This breakage makes blkcg
behave very differently from other properly-hierarchical controllers
and makes it impossible to give any uniform interpretation to the
hierarchy, which in turn makes it impossible to implement unified
hierarchy.

Given the relative simplicity of cfqg scheduling, implementing proper
hierarchy support isn't that difficult.  All that's necessary is
determining what fraction each cfqg on the service tree can claim
once the hierarchy is taken into account.  The calculation can be
done by maintaining the sum of active weights at each level and
compounding the ratios from the cfqg in question up to the root.
The overhead isn't significant.  Tree traversals happen only when
cfqgs are added to or removed from the service tree, and they walk
only from the cfqg being modified to the root.

There are some design choices which are worth mentioning.

* Internal (non-leaf) cfqgs w/ tasks treat the tasks as a single unit
  competing against the children cfqgs.  New config knobs -
  blkio.leaf_weight[_device] - are added to configure the weight of
  these tasks.  Another way to look at it is that each cfqg has a
  hidden leaf child node attached to it which hosts all tasks and
  leaf_weight controls the weight of that hidden node.

  Treating cfqqs and cfqgs as equals doesn't make much sense to me and
  is hairy - we need to establish ioprio to weight mapping and the
  weights fluctuate as processes fork and exit.  This becomes hairier
  when considering multiple controllers: such mappings can't be
  established consistently across different controllers and the
  weights are given out differently - ie. blkcg gives weights to
  io_contexts while cpu gives them to tasks, which may share
  io_contexts.  It's difficult to make sense of what's going on.

  The goal is to bring cpu, currently the only other controller which
  implements weight based resource allocation, to similar behavior.

* The existing stats aren't converted to hierarchical but new
  hierarchical ones are added.  There's no way to convert the existing
  stats w/o introducing nasty silent surprises to the existing flat
  hierarchy users, so, while a bit clumsy, I can't see a better way.

* I based it on top of Vivek's cleanup patchset[1] but not the cfqq,
  cfqg scheduling unification patchset.  I don't think it's necessary
  or beneficial to mix the two and would really like to avoid messing
  with !blkcg scheduling logic.

The hierarchical scheduling itself is fairly simple.  The cfq part is
only ~260 lines with ~60 lines being comments, and the hierarchical
weight scaling is really straightforward.

This patchset contains the following 24 patches.

 0001-cfq-iosched-Properly-name-all-references-to-IO-class.patch
 0002-cfq-iosched-More-renaming-to-better-represent-wl_cla.patch
 0003-cfq-iosched-Rename-service_tree-to-st-at-some-places.patch
 0004-cfq-iosched-Rename-few-functions-related-to-selectin.patch
 0005-cfq-iosched-Get-rid-of-unnecessary-local-variable.patch
 0006-cfq-iosched-Print-sync-noidle-information-in-blktrac.patch
 0007-blkcg-fix-minor-bug-in-blkg_alloc.patch
 0008-blkcg-reorganize-blkg_lookup_create-and-friends.patch
 0009-blkcg-cosmetic-updates-to-blkg_create.patch
 0010-blkcg-make-blkcg_gq-s-hierarchical.patch
 0011-cfq-iosched-add-leaf_weight.patch
 0012-cfq-iosched-implement-cfq_group-nr_active-and-childr.patch
 0013-cfq-iosched-implement-hierarchy-ready-cfq_group-char.patch
 0014-cfq-iosched-convert-cfq_group_slice-to-use-cfqg-vfra.patch
 0015-cfq-iosched-enable-full-blkcg-hierarchy-support.patch
 0016-blkcg-add-blkg_policy_data-plid.patch
 0017-blkcg-implement-blkcg_policy-on-offline_pd_fn-and-bl.patch
 0018-blkcg-s-blkg_rwstat_sum-blkg_rwstat_total.patch
 0019-blkcg-implement-blkg_-rw-stat_recursive_sum-and-blkg.patch
 0020-block-RCU-free-request_queue.patch
 0021-blkcg-make-blkcg_print_blkgs-grab-q-locks-instead-of.patch
 0022-cfq-iosched-separate-out-cfqg_stats_reset-from-cfq_p.patch
 0023-cfq-iosched-collect-stats-from-dead-cfqgs.patch
 0024-cfq-iosched-add-hierarchical-cfq_group-statistics.patch

0001-0006 are Vivek's cfq cleanup patches.

0007-0009 are prep patches.

0010 makes the blkcg core always allocate non-leaf blkgs so that any
given blkg is guaranteed to have all its ancestor blkgs up to the root.

0011-0012 prepare for hierarchical scheduling.

0013-0014 implement hierarchy-ready cfqg scheduling.

0015 enables hierarchical scheduling.

0016-0022 prepare for hierarchical stats.

0023-0024 implement hierarchical stats.

This patchset is on top of linus#master (ecccd1248d ("mm: fix null
pointer dereference in wait_iff_congested()")) and is available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy

Thanks.

 Documentation/block/cfq-iosched.txt |   58 +++
 block/blk-cgroup.c                  |  276 +++++++++++++--
 block/blk-cgroup.h                  |   68 +++
 block/blk-sysfs.c                   |    9 
 block/cfq-iosched.c                 |  627 +++++++++++++++++++++++++++++-------
 include/linux/blkdev.h              |    2 
 6 files changed, 877 insertions(+), 163 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel.cgroups/5440
[1] https://lkml.org/lkml/2012/10/3/502


* [PATCH 01/24] cfq-iosched: Properly name all references to IO class
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: ctalbott, rni, containers, linux-kernel, Tejun Heo, cgroups

From: Vivek Goyal <vgoyal@redhat.com>

Currently CFQ has three IO classes: RT, BE and IDLE. In many places we
refer to workloads belonging to these classes as "prio". This gets
very confusing as one starts to associate it with ioprio.

So this patch just does a bunch of renaming so that the code becomes
easier to read. All references to the RT, BE and IDLE workloads use the
keyword "class", and all references to the subclasses SYNC, SYNC-IDLE
and ASYNC use the keyword "type".

This makes me feel much better while I am reading the code. There is no
functionality change due to this patch.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 67 +++++++++++++++++++++++++++--------------------------
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e62e920..7646dfd 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -155,7 +155,7 @@ struct cfq_queue {
  * First index in the service_trees.
  * IDLE is handled separately, so it has negative index
  */
-enum wl_prio_t {
+enum wl_class_t {
 	BE_WORKLOAD = 0,
 	RT_WORKLOAD = 1,
 	IDLE_WORKLOAD = 2,
@@ -250,7 +250,7 @@ struct cfq_group {
 
 	unsigned long saved_workload_slice;
 	enum wl_type_t saved_workload;
-	enum wl_prio_t saved_serving_prio;
+	enum wl_class_t saved_serving_class;
 
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
@@ -280,7 +280,7 @@ struct cfq_data {
 	/*
 	 * The priority currently being served
 	 */
-	enum wl_prio_t serving_prio;
+	enum wl_class_t serving_class;
 	enum wl_type_t serving_type;
 	unsigned long workload_expires;
 	struct cfq_group *serving_group;
@@ -354,16 +354,16 @@ struct cfq_data {
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 
 static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
-					    enum wl_prio_t prio,
+					    enum wl_class_t class,
 					    enum wl_type_t type)
 {
 	if (!cfqg)
 		return NULL;
 
-	if (prio == IDLE_WORKLOAD)
+	if (class == IDLE_WORKLOAD)
 		return &cfqg->service_tree_idle;
 
-	return &cfqg->service_trees[prio][type];
+	return &cfqg->service_trees[class][type];
 }
 
 enum cfqq_state_flags {
@@ -732,7 +732,7 @@ static inline bool iops_mode(struct cfq_data *cfqd)
 		return false;
 }
 
-static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
+static inline enum wl_class_t cfqq_class(struct cfq_queue *cfqq)
 {
 	if (cfq_class_idle(cfqq))
 		return IDLE_WORKLOAD;
@@ -751,16 +751,16 @@ static enum wl_type_t cfqq_type(struct cfq_queue *cfqq)
 	return SYNC_WORKLOAD;
 }
 
-static inline int cfq_group_busy_queues_wl(enum wl_prio_t wl,
+static inline int cfq_group_busy_queues_wl(enum wl_class_t wl_class,
 					struct cfq_data *cfqd,
 					struct cfq_group *cfqg)
 {
-	if (wl == IDLE_WORKLOAD)
+	if (wl_class == IDLE_WORKLOAD)
 		return cfqg->service_tree_idle.count;
 
-	return cfqg->service_trees[wl][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[wl][SYNC_NOIDLE_WORKLOAD].count
-		+ cfqg->service_trees[wl][SYNC_WORKLOAD].count;
+	return cfqg->service_trees[wl_class][ASYNC_WORKLOAD].count
+		+ cfqg->service_trees[wl_class][SYNC_NOIDLE_WORKLOAD].count
+		+ cfqg->service_trees[wl_class][SYNC_WORKLOAD].count;
 }
 
 static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
@@ -1304,7 +1304,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 		cfqg->saved_workload_slice = cfqd->workload_expires
 						- jiffies;
 		cfqg->saved_workload = cfqd->serving_type;
-		cfqg->saved_serving_prio = cfqd->serving_prio;
+		cfqg->saved_serving_class = cfqd->serving_class;
 	} else
 		cfqg->saved_workload_slice = 0;
 
@@ -1616,7 +1616,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	int left;
 	int new_cfqq = 1;
 
-	service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
+	service_tree = service_tree_for(cfqq->cfqg, cfqq_class(cfqq),
 						cfqq_type(cfqq));
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
@@ -2030,8 +2030,8 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 				   struct cfq_queue *cfqq)
 {
 	if (cfqq) {
-		cfq_log_cfqq(cfqd, cfqq, "set_active wl_prio:%d wl_type:%d",
-				cfqd->serving_prio, cfqd->serving_type);
+		cfq_log_cfqq(cfqd, cfqq, "set_active wl_class:%d wl_type:%d",
+				cfqd->serving_class, cfqd->serving_type);
 		cfqg_stats_update_avg_queue_size(cfqq->cfqg);
 		cfqq->slice_start = 0;
 		cfqq->dispatch_start = jiffies;
@@ -2118,7 +2118,7 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
 	struct cfq_rb_root *service_tree =
-		service_tree_for(cfqd->serving_group, cfqd->serving_prio,
+		service_tree_for(cfqd->serving_group, cfqd->serving_class,
 					cfqd->serving_type);
 
 	if (!cfqd->rq_queued)
@@ -2285,7 +2285,7 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 
 static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	enum wl_prio_t prio = cfqq_prio(cfqq);
+	enum wl_class_t wl_class = cfqq_class(cfqq);
 	struct cfq_rb_root *service_tree = cfqq->service_tree;
 
 	BUG_ON(!service_tree);
@@ -2295,7 +2295,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 		return false;
 
 	/* We never do for idle class queues. */
-	if (prio == IDLE_WORKLOAD)
+	if (wl_class == IDLE_WORKLOAD)
 		return false;
 
 	/* We do for queues that were marked with idle window flag. */
@@ -2495,7 +2495,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
 }
 
 static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
-				struct cfq_group *cfqg, enum wl_prio_t prio)
+			struct cfq_group *cfqg, enum wl_class_t wl_class)
 {
 	struct cfq_queue *queue;
 	int i;
@@ -2505,7 +2505,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 
 	for (i = 0; i <= SYNC_WORKLOAD; ++i) {
 		/* select the one with lowest rb_key */
-		queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
+		queue = cfq_rb_first(service_tree_for(cfqg, wl_class, i));
 		if (queue &&
 		    (!key_valid || time_before(queue->rb_key, lowest_key))) {
 			lowest_key = queue->rb_key;
@@ -2523,20 +2523,20 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	unsigned count;
 	struct cfq_rb_root *st;
 	unsigned group_slice;
-	enum wl_prio_t original_prio = cfqd->serving_prio;
+	enum wl_class_t original_class = cfqd->serving_class;
 
 	/* Choose next priority. RT > BE > IDLE */
 	if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_prio = RT_WORKLOAD;
+		cfqd->serving_class = RT_WORKLOAD;
 	else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_prio = BE_WORKLOAD;
+		cfqd->serving_class = BE_WORKLOAD;
 	else {
-		cfqd->serving_prio = IDLE_WORKLOAD;
+		cfqd->serving_class = IDLE_WORKLOAD;
 		cfqd->workload_expires = jiffies + 1;
 		return;
 	}
 
-	if (original_prio != cfqd->serving_prio)
+	if (original_class != cfqd->serving_class)
 		goto new_workload;
 
 	/*
@@ -2544,7 +2544,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	 * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
 	 * expiration time
 	 */
-	st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+	st = service_tree_for(cfqg, cfqd->serving_class, cfqd->serving_type);
 	count = st->count;
 
 	/*
@@ -2556,8 +2556,8 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 new_workload:
 	/* otherwise select new workload type */
 	cfqd->serving_type =
-		cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
-	st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+		cfq_choose_wl(cfqd, cfqg, cfqd->serving_class);
+	st = service_tree_for(cfqg, cfqd->serving_class, cfqd->serving_type);
 	count = st->count;
 
 	/*
@@ -2568,8 +2568,9 @@ new_workload:
 	group_slice = cfq_group_slice(cfqd, cfqg);
 
 	slice = group_slice * count /
-		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_prio],
-		      cfq_group_busy_queues_wl(cfqd->serving_prio, cfqd, cfqg));
+		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_class],
+		      cfq_group_busy_queues_wl(cfqd->serving_class, cfqd,
+					cfqg));
 
 	if (cfqd->serving_type == ASYNC_WORKLOAD) {
 		unsigned int tmp;
@@ -2620,7 +2621,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
 	if (cfqg->saved_workload_slice) {
 		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
 		cfqd->serving_type = cfqg->saved_workload;
-		cfqd->serving_prio = cfqg->saved_serving_prio;
+		cfqd->serving_class = cfqg->saved_serving_class;
 	} else
 		cfqd->workload_expires = jiffies - 1;
 
@@ -3645,7 +3646,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			service_tree = cfqq->service_tree;
 		else
 			service_tree = service_tree_for(cfqq->cfqg,
-				cfqq_prio(cfqq), cfqq_type(cfqq));
+				cfqq_class(cfqq), cfqq_type(cfqq));
 		service_tree->ttime.last_end_request = now;
 		if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
 			cfqd->last_delayed_sync = now;
-- 
1.8.0.2


* [PATCH 02/24] cfq-iosched: More renaming to better represent wl_class and wl_type
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

From: Vivek Goyal <vgoyal@redhat.com>

Some more renaming. Again, this makes the code uniform w.r.t. the use of
wl_class/class to represent the IO class (RT, BE, IDLE) and of
wl_type/type to represent the subclass (SYNC, SYNC-NOIDLE, ASYNC).

In places this patch shortens the string "workload" to "wl":
"saved_workload" becomes "saved_wl_type" and "saved_serving_class"
becomes "saved_wl_class".

For uniformity with "saved_wl_*" variables, renamed "serving_class"
to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".

Again, this is only to improve code uniformity and readability.
No functional change.

v2:
- Restored the usage of keyword "service" based on Jeff Moyer's feedback.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 64 +++++++++++++++++++++++++++--------------------------
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7646dfd..8f890bf 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -248,9 +248,9 @@ struct cfq_group {
 	struct cfq_rb_root service_trees[2][3];
 	struct cfq_rb_root service_tree_idle;
 
-	unsigned long saved_workload_slice;
-	enum wl_type_t saved_workload;
-	enum wl_class_t saved_serving_class;
+	unsigned long saved_wl_slice;
+	enum wl_type_t saved_wl_type;
+	enum wl_class_t saved_wl_class;
 
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
@@ -280,8 +280,8 @@ struct cfq_data {
 	/*
 	 * The priority currently being served
 	 */
-	enum wl_class_t serving_class;
-	enum wl_type_t serving_type;
+	enum wl_class_t serving_wl_class;
+	enum wl_type_t serving_wl_type;
 	unsigned long workload_expires;
 	struct cfq_group *serving_group;
 
@@ -1241,7 +1241,7 @@ cfq_group_notify_queue_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 	cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
 	cfq_group_service_tree_del(st, cfqg);
-	cfqg->saved_workload_slice = 0;
+	cfqg->saved_wl_slice = 0;
 	cfqg_stats_update_dequeue(cfqg);
 }
 
@@ -1301,12 +1301,12 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 
 	/* This group is being expired. Save the context */
 	if (time_after(cfqd->workload_expires, jiffies)) {
-		cfqg->saved_workload_slice = cfqd->workload_expires
+		cfqg->saved_wl_slice = cfqd->workload_expires
 						- jiffies;
-		cfqg->saved_workload = cfqd->serving_type;
-		cfqg->saved_serving_class = cfqd->serving_class;
+		cfqg->saved_wl_type = cfqd->serving_wl_type;
+		cfqg->saved_wl_class = cfqd->serving_wl_class;
 	} else
-		cfqg->saved_workload_slice = 0;
+		cfqg->saved_wl_slice = 0;
 
 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
@@ -2031,7 +2031,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 {
 	if (cfqq) {
 		cfq_log_cfqq(cfqd, cfqq, "set_active wl_class:%d wl_type:%d",
-				cfqd->serving_class, cfqd->serving_type);
+				cfqd->serving_wl_class, cfqd->serving_wl_type);
 		cfqg_stats_update_avg_queue_size(cfqq->cfqg);
 		cfqq->slice_start = 0;
 		cfqq->dispatch_start = jiffies;
@@ -2118,8 +2118,8 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
 	struct cfq_rb_root *service_tree =
-		service_tree_for(cfqd->serving_group, cfqd->serving_class,
-					cfqd->serving_type);
+		service_tree_for(cfqd->serving_group, cfqd->serving_wl_class,
+						cfqd->serving_wl_type);
 
 	if (!cfqd->rq_queued)
 		return NULL;
@@ -2523,20 +2523,20 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	unsigned count;
 	struct cfq_rb_root *st;
 	unsigned group_slice;
-	enum wl_class_t original_class = cfqd->serving_class;
+	enum wl_class_t original_class = cfqd->serving_wl_class;
 
 	/* Choose next priority. RT > BE > IDLE */
 	if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_class = RT_WORKLOAD;
+		cfqd->serving_wl_class = RT_WORKLOAD;
 	else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
-		cfqd->serving_class = BE_WORKLOAD;
+		cfqd->serving_wl_class = BE_WORKLOAD;
 	else {
-		cfqd->serving_class = IDLE_WORKLOAD;
+		cfqd->serving_wl_class = IDLE_WORKLOAD;
 		cfqd->workload_expires = jiffies + 1;
 		return;
 	}
 
-	if (original_class != cfqd->serving_class)
+	if (original_class != cfqd->serving_wl_class)
 		goto new_workload;
 
 	/*
@@ -2544,7 +2544,8 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	 * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
 	 * expiration time
 	 */
-	st = service_tree_for(cfqg, cfqd->serving_class, cfqd->serving_type);
+	st = service_tree_for(cfqg, cfqd->serving_wl_class,
+					cfqd->serving_wl_type);
 	count = st->count;
 
 	/*
@@ -2555,9 +2556,10 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 new_workload:
 	/* otherwise select new workload type */
-	cfqd->serving_type =
-		cfq_choose_wl(cfqd, cfqg, cfqd->serving_class);
-	st = service_tree_for(cfqg, cfqd->serving_class, cfqd->serving_type);
+	cfqd->serving_wl_type = cfq_choose_wl(cfqd, cfqg,
+					cfqd->serving_wl_class);
+	st = service_tree_for(cfqg, cfqd->serving_wl_class,
+					cfqd->serving_wl_type);
 	count = st->count;
 
 	/*
@@ -2568,11 +2570,11 @@ new_workload:
 	group_slice = cfq_group_slice(cfqd, cfqg);
 
 	slice = group_slice * count /
-		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_class],
-		      cfq_group_busy_queues_wl(cfqd->serving_class, cfqd,
+		max_t(unsigned, cfqg->busy_queues_avg[cfqd->serving_wl_class],
+		      cfq_group_busy_queues_wl(cfqd->serving_wl_class, cfqd,
 					cfqg));
 
-	if (cfqd->serving_type == ASYNC_WORKLOAD) {
+	if (cfqd->serving_wl_type == ASYNC_WORKLOAD) {
 		unsigned int tmp;
 
 		/*
@@ -2618,10 +2620,10 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
 	cfqd->serving_group = cfqg;
 
 	/* Restore the workload type data */
-	if (cfqg->saved_workload_slice) {
-		cfqd->workload_expires = jiffies + cfqg->saved_workload_slice;
-		cfqd->serving_type = cfqg->saved_workload;
-		cfqd->serving_class = cfqg->saved_serving_class;
+	if (cfqg->saved_wl_slice) {
+		cfqd->workload_expires = jiffies + cfqg->saved_wl_slice;
+		cfqd->serving_wl_type = cfqg->saved_wl_type;
+		cfqd->serving_wl_class = cfqg->saved_wl_class;
 	} else
 		cfqd->workload_expires = jiffies - 1;
 
@@ -3404,7 +3406,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 		return true;
 
 	/* Allow preemption only if we are idling on sync-noidle tree */
-	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
+	if (cfqd->serving_wl_type == SYNC_NOIDLE_WORKLOAD &&
 	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
 	    new_cfqq->service_tree->count == 2 &&
 	    RB_EMPTY_ROOT(&cfqq->sort_list))
@@ -3456,7 +3458,7 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * doesn't happen
 	 */
 	if (old_type != cfqq_type(cfqq))
-		cfqq->cfqg->saved_workload_slice = 0;
+		cfqq->cfqg->saved_wl_slice = 0;
 
 	/*
 	 * Put the new queue at the front of the of the current list,
-- 
1.8.0.2



* [PATCH 03/24] cfq-iosched: Rename "service_tree" to "st" at some places
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

From: Vivek Goyal <vgoyal@redhat.com>

In quite a few places we use the identifier "service_tree". In some places,
especially for local variables, it is abbreviated to "st".

Also, in a couple of places, the binary operator "+" is moved from the
beginning of a line to the end of the previous line, per Tejun's feedback.

v2:
 Reverted most of the service tree name change based on Jeff Moyer's feedback.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 77 +++++++++++++++++++++++++----------------------------
 1 file changed, 36 insertions(+), 41 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8f890bf..db4a1a5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -353,7 +353,7 @@ struct cfq_data {
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 
-static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
+static struct cfq_rb_root *st_for(struct cfq_group *cfqg,
 					    enum wl_class_t class,
 					    enum wl_type_t type)
 {
@@ -758,16 +758,16 @@ static inline int cfq_group_busy_queues_wl(enum wl_class_t wl_class,
 	if (wl_class == IDLE_WORKLOAD)
 		return cfqg->service_tree_idle.count;
 
-	return cfqg->service_trees[wl_class][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[wl_class][SYNC_NOIDLE_WORKLOAD].count
-		+ cfqg->service_trees[wl_class][SYNC_WORKLOAD].count;
+	return cfqg->service_trees[wl_class][ASYNC_WORKLOAD].count +
+		cfqg->service_trees[wl_class][SYNC_NOIDLE_WORKLOAD].count +
+		cfqg->service_trees[wl_class][SYNC_WORKLOAD].count;
 }
 
 static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 					struct cfq_group *cfqg)
 {
-	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count
-		+ cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
+	return cfqg->service_trees[RT_WORKLOAD][ASYNC_WORKLOAD].count +
+		cfqg->service_trees[BE_WORKLOAD][ASYNC_WORKLOAD].count;
 }
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
@@ -1612,15 +1612,14 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
-	struct cfq_rb_root *service_tree;
+	struct cfq_rb_root *st;
 	int left;
 	int new_cfqq = 1;
 
-	service_tree = service_tree_for(cfqq->cfqg, cfqq_class(cfqq),
-						cfqq_type(cfqq));
+	st = st_for(cfqq->cfqg, cfqq_class(cfqq), cfqq_type(cfqq));
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&service_tree->rb);
+		parent = rb_last(&st->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
@@ -1638,7 +1637,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		cfqq->slice_resid = 0;
 	} else {
 		rb_key = -HZ;
-		__cfqq = cfq_rb_first(service_tree);
+		__cfqq = cfq_rb_first(st);
 		rb_key += __cfqq ? __cfqq->rb_key : jiffies;
 	}
 
@@ -1647,8 +1646,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key &&
-		    cfqq->service_tree == service_tree)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == st)
 			return;
 
 		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
@@ -1657,8 +1655,8 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	left = 1;
 	parent = NULL;
-	cfqq->service_tree = service_tree;
-	p = &service_tree->rb.rb_node;
+	cfqq->service_tree = st;
+	p = &st->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
 
@@ -1679,12 +1677,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 
 	if (left)
-		service_tree->left = &cfqq->rb_node;
+		st->left = &cfqq->rb_node;
 
 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
-	service_tree->count++;
+	rb_insert_color(&cfqq->rb_node, &st->rb);
+	st->count++;
 	if (add_front || !new_cfqq)
 		return;
 	cfq_group_notify_queue_add(cfqd, cfqq->cfqg);
@@ -2117,19 +2115,18 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, bool timed_out)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	struct cfq_rb_root *service_tree =
-		service_tree_for(cfqd->serving_group, cfqd->serving_wl_class,
-						cfqd->serving_wl_type);
+	struct cfq_rb_root *st = st_for(cfqd->serving_group,
+			cfqd->serving_wl_class, cfqd->serving_wl_type);
 
 	if (!cfqd->rq_queued)
 		return NULL;
 
 	/* There is nothing to dispatch */
-	if (!service_tree)
+	if (!st)
 		return NULL;
-	if (RB_EMPTY_ROOT(&service_tree->rb))
+	if (RB_EMPTY_ROOT(&st->rb))
 		return NULL;
-	return cfq_rb_first(service_tree);
+	return cfq_rb_first(st);
 }
 
 static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
@@ -2286,10 +2283,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	enum wl_class_t wl_class = cfqq_class(cfqq);
-	struct cfq_rb_root *service_tree = cfqq->service_tree;
+	struct cfq_rb_root *st = cfqq->service_tree;
 
-	BUG_ON(!service_tree);
-	BUG_ON(!service_tree->count);
+	BUG_ON(!st);
+	BUG_ON(!st->count);
 
 	if (!cfqd->cfq_slice_idle)
 		return false;
@@ -2307,11 +2304,10 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * Otherwise, we do only if they are the last ones
 	 * in their service tree.
 	 */
-	if (service_tree->count == 1 && cfq_cfqq_sync(cfqq) &&
-	   !cfq_io_thinktime_big(cfqd, &service_tree->ttime, false))
+	if (st->count == 1 && cfq_cfqq_sync(cfqq) &&
+	   !cfq_io_thinktime_big(cfqd, &st->ttime, false))
 		return true;
-	cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d",
-			service_tree->count);
+	cfq_log_cfqq(cfqd, cfqq, "Not idling. st->count:%d", st->count);
 	return false;
 }
 
@@ -2505,7 +2501,7 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 
 	for (i = 0; i <= SYNC_WORKLOAD; ++i) {
 		/* select the one with lowest rb_key */
-		queue = cfq_rb_first(service_tree_for(cfqg, wl_class, i));
+		queue = cfq_rb_first(st_for(cfqg, wl_class, i));
 		if (queue &&
 		    (!key_valid || time_before(queue->rb_key, lowest_key))) {
 			lowest_key = queue->rb_key;
@@ -2544,8 +2540,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	 * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
 	 * expiration time
 	 */
-	st = service_tree_for(cfqg, cfqd->serving_wl_class,
-					cfqd->serving_wl_type);
+	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;
 
 	/*
@@ -2558,8 +2553,7 @@ new_workload:
 	/* otherwise select new workload type */
 	cfqd->serving_wl_type = cfq_choose_wl(cfqd, cfqg,
 					cfqd->serving_wl_class);
-	st = service_tree_for(cfqg, cfqd->serving_wl_class,
-					cfqd->serving_wl_type);
+	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;
 
 	/*
@@ -3640,16 +3634,17 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 
 	if (sync) {
-		struct cfq_rb_root *service_tree;
+		struct cfq_rb_root *st;
 
 		RQ_CIC(rq)->ttime.last_end_request = now;
 
 		if (cfq_cfqq_on_rr(cfqq))
-			service_tree = cfqq->service_tree;
+			st = cfqq->service_tree;
 		else
-			service_tree = service_tree_for(cfqq->cfqg,
-				cfqq_class(cfqq), cfqq_type(cfqq));
-		service_tree->ttime.last_end_request = now;
+			st = st_for(cfqq->cfqg, cfqq_class(cfqq),
+					cfqq_type(cfqq));
+
+		st->ttime.last_end_request = now;
 		if (!time_after(rq->start_time + cfqd->cfq_fifo_expire[1], now))
 			cfqd->last_delayed_sync = now;
 	}
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 04/24] cfq-iosched: Rename a few functions related to selecting workload
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

choose_service_tree() selects/sets both wl_class and wl_type.  Rename it to
choose_wl_class_and_type() to make it very clear.

cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
it with choose_st(). So rename it to cfq_choose_wl_type() to make
it clear what it does.

Just renaming. No functionality change.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index db4a1a5..e34e142 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2490,7 +2490,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
 	}
 }
 
-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
+static enum wl_type_t cfq_choose_wl_type(struct cfq_data *cfqd,
 			struct cfq_group *cfqg, enum wl_class_t wl_class)
 {
 	struct cfq_queue *queue;
@@ -2513,7 +2513,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 	return cur_best;
 }
 
-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void
+choose_wl_class_and_type(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	unsigned slice;
 	unsigned count;
@@ -2551,7 +2552,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 new_workload:
 	/* otherwise select new workload type */
-	cfqd->serving_wl_type = cfq_choose_wl(cfqd, cfqg,
+	cfqd->serving_wl_type = cfq_choose_wl_type(cfqd, cfqg,
 					cfqd->serving_wl_class);
 	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;
@@ -2621,7 +2622,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
 	} else
 		cfqd->workload_expires = jiffies - 1;
 
-	choose_service_tree(cfqd, cfqg);
+	choose_wl_class_and_type(cfqd, cfqg);
 }
 
 /*
-- 
1.8.0.2


* [PATCH 04/24] cfq-iosched: Rename a few functions related to selecting workload
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

From: Vivek Goyal <vgoyal@redhat.com>

choose_service_tree() selects/sets both wl_class and wl_type.  Rename it to
choose_wl_class_and_type() to make it very clear.

cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
it with choose_st(). So rename it to cfq_choose_wl_type() to make
it clear what it does.

Just renaming. No functionality change.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index db4a1a5..e34e142 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2490,7 +2490,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
 	}
 }
 
-static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
+static enum wl_type_t cfq_choose_wl_type(struct cfq_data *cfqd,
 			struct cfq_group *cfqg, enum wl_class_t wl_class)
 {
 	struct cfq_queue *queue;
@@ -2513,7 +2513,8 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
 	return cur_best;
 }
 
-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void
+choose_wl_class_and_type(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
 	unsigned slice;
 	unsigned count;
@@ -2551,7 +2552,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 new_workload:
 	/* otherwise select new workload type */
-	cfqd->serving_wl_type = cfq_choose_wl(cfqd, cfqg,
+	cfqd->serving_wl_type = cfq_choose_wl_type(cfqd, cfqg,
 					cfqd->serving_wl_class);
 	st = st_for(cfqg, cfqd->serving_wl_class, cfqd->serving_wl_type);
 	count = st->count;
@@ -2621,7 +2622,7 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
 	} else
 		cfqd->workload_expires = jiffies - 1;
 
-	choose_service_tree(cfqd, cfqg);
+	choose_wl_class_and_type(cfqd, cfqg);
 }
 
 /*
-- 
1.8.0.2



* [PATCH 05/24] cfq-iosched: Get rid of unnecessary local variable
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Use of the local variable "n" is unnecessary. Remove it. This brings
it in line with function __cfq_group_st_add(), which performs the
similar operation of adding a group to an rb tree.

No functionality change here.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e34e142..5ad4cae 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1658,8 +1658,6 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfqq->service_tree = st;
 	p = &st->rb.rb_node;
 	while (*p) {
-		struct rb_node **n;
-
 		parent = *p;
 		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
@@ -1667,13 +1665,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		 * sort by key, that represents service time.
 		 */
 		if (time_before(rb_key, __cfqq->rb_key))
-			n = &(*p)->rb_left;
+			p = &parent->rb_left;
 		else {
-			n = &(*p)->rb_right;
+			p = &parent->rb_right;
 			left = 0;
 		}
-
-		p = n;
 	}
 
 	if (left)
-- 
1.8.0.2


* [PATCH 05/24] cfq-iosched: Get rid of unnecessary local variable
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

From: Vivek Goyal <vgoyal@redhat.com>

Use of the local variable "n" is unnecessary. Remove it. This brings
it in line with function __cfq_group_st_add(), which performs the
similar operation of adding a group to an rb tree.

No functionality change here.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e34e142..5ad4cae 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1658,8 +1658,6 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	cfqq->service_tree = st;
 	p = &st->rb.rb_node;
 	while (*p) {
-		struct rb_node **n;
-
 		parent = *p;
 		__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 
@@ -1667,13 +1665,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		 * sort by key, that represents service time.
 		 */
 		if (time_before(rb_key, __cfqq->rb_key))
-			n = &(*p)->rb_left;
+			p = &parent->rb_left;
 		else {
-			n = &(*p)->rb_right;
+			p = &parent->rb_right;
 			left = 0;
 		}
-
-		p = n;
 	}
 
 	if (left)
-- 
1.8.0.2



* [PATCH 06/24] cfq-iosched: Print sync-noidle information in blktrace messages
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Currently we attach a character "S" or "A" to cfqq<pid> to represent
whether the queue is sync or async. Add one more character "N" to
represent whether it is a sync-noidle queue or a plain sync queue. The
three different types of queues now look as follows.

cfq1234S   --> sync queue
cfq1234SN  --> sync-noidle queue
cfq1234A   --> async queue

Previously the S/A classification was printed only if group scheduling
was enabled. This patch also makes sure that the classification is
displayed even if group scheduling is disabled.

Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5ad4cae..bc076f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -586,8 +586,9 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 	char __pbuf[128];						\
 									\
 	blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
-			  cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
 			  __pbuf, ##args);				\
 } while (0)
 
@@ -675,7 +676,10 @@ static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c " fmt, (cfqq)->pid,	\
+			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
+				##args)
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)		do {} while (0)
 
 static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
-- 
1.8.0.2


* [PATCH 06/24] cfq-iosched: Print sync-noidle information in blktrace messages
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

From: Vivek Goyal <vgoyal@redhat.com>

Currently we attach a character "S" or "A" to cfqq<pid> to represent
whether the queue is sync or async. Add one more character "N" to
represent whether it is a sync-noidle queue or a plain sync queue. The
three different types of queues now look as follows.

cfq1234S   --> sync queue
cfq1234SN  --> sync-noidle queue
cfq1234A   --> async queue

Previously the S/A classification was printed only if group scheduling
was enabled. This patch also makes sure that the classification is
displayed even if group scheduling is disabled.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5ad4cae..bc076f4 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -586,8 +586,9 @@ static inline void cfqg_put(struct cfq_group *cfqg)
 	char __pbuf[128];						\
 									\
 	blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf));	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d%c %s " fmt, (cfqq)->pid, \
-			  cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
 			  __pbuf, ##args);				\
 } while (0)
 
@@ -675,7 +676,10 @@ static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
 #define cfq_log_cfqq(cfqd, cfqq, fmt, args...)	\
-	blk_add_trace_msg((cfqd)->queue, "cfq%d " fmt, (cfqq)->pid, ##args)
+	blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c " fmt, (cfqq)->pid,	\
+			cfq_cfqq_sync((cfqq)) ? 'S' : 'A',		\
+			cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
+				##args)
 #define cfq_log_cfqg(cfqd, cfqg, fmt, args...)		do {} while (0)
 
 static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
-- 
1.8.0.2



* [PATCH 07/24] blkcg: fix minor bug in blkg_alloc()
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
The latter test should have been on whether pol->pd_init_fn() exists.
This doesn't cause actual problems because both blkcg policies
implement pol->pd_init_fn().  Fix it.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..7ef747b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -114,7 +114,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		pd->blkg = blkg;
 
 		/* invoke per-policy init */
-		if (blkcg_policy_enabled(blkg->q, pol))
+		if (pol->pd_init_fn)
 			pol->pd_init_fn(blkg);
 	}
 
-- 
1.8.0.2


* [PATCH 07/24] blkcg: fix minor bug in blkg_alloc()
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
The latter test should have been on whether pol->pd_init_fn() exists.
This doesn't cause actual problems because both blkcg policies
implement pol->pd_init_fn().  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b8858fb..7ef747b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -114,7 +114,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 		pd->blkg = blkg;
 
 		/* invoke per-policy init */
-		if (blkcg_policy_enabled(blkg->q, pol))
+		if (pol->pd_init_fn)
 			pol->pd_init_fn(blkg);
 	}
 
-- 
1.8.0.2



* [PATCH 08/24] blkcg: reorganize blkg_lookup_create() and friends
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Reorganize such that

* __blkg_lookup() takes bool param @update_hint to determine whether
  to update hint.

* __blkg_lookup_create() no longer performs lookup before trying to
  create.  Renamed to blkg_create().

* blkg_lookup_create() now performs lookup and then invokes
  blkg_create() if lookup fails.

* root_blkg creation in blkcg_activate_policy() updated accordingly.
  Note that blkcg_activate_policy() no longer updates lookup hint if
  root_blkg already exists.

Except for the last lookup hint bit which is immaterial, this is pure
reorganization and doesn't introduce any visible behavior change.
This is to prepare for proper hierarchy support.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7ef747b..2012754 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -126,7 +126,7 @@ err_free:
 }
 
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q)
+				      struct request_queue *q, bool update_hint)
 {
 	struct blkcg_gq *blkg;
 
@@ -135,14 +135,19 @@ static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 		return blkg;
 
 	/*
-	 * Hint didn't match.  Look up from the radix tree.  Note that we
-	 * may not be holding queue_lock and thus are not sure whether
-	 * @blkg from blkg_tree has already been removed or not, so we
-	 * can't update hint to the lookup result.  Leave it to the caller.
+	 * Hint didn't match.  Look up from the radix tree.  Note that the
+	 * hint can only be updated under queue_lock as otherwise @blkg
+	 * could have already been removed from blkg_tree.  The caller is
+	 * responsible for grabbing queue_lock if @update_hint.
 	 */
 	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
-	if (blkg && blkg->q == q)
+	if (blkg && blkg->q == q) {
+		if (update_hint) {
+			lockdep_assert_held(q->queue_lock);
+			rcu_assign_pointer(blkcg->blkg_hint, blkg);
+		}
 		return blkg;
+	}
 
 	return NULL;
 }
@@ -162,7 +167,7 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
 
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
-	return __blkg_lookup(blkcg, q);
+	return __blkg_lookup(blkcg, q, false);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
 
@@ -170,9 +175,9 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
  * If @new_blkg is %NULL, this function tries to allocate a new one as
  * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
  */
-static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q,
-					     struct blkcg_gq *new_blkg)
+static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
+				    struct request_queue *q,
+				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -180,13 +185,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
-	/* lookup and update hint on success, see __blkg_lookup() for details */
-	blkg = __blkg_lookup(blkcg, q);
-	if (blkg) {
-		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		goto out_free;
-	}
-
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
 		blkg = ERR_PTR(-EINVAL);
@@ -223,16 +221,39 @@ out_free:
 	return blkg;
 }
 
+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
+ * create one.  This function should be called under RCU read lock and
+ * @q->queue_lock.
+ *
+ * Returns pointer to the looked up or created blkg on success, ERR_PTR()
+ * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
+ * dead and bypassing, returns ERR_PTR(-EBUSY).
+ */
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 				    struct request_queue *q)
 {
+	struct blkcg_gq *blkg;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
+
 	/*
 	 * This could be the first entry point of blkcg implementation and
 	 * we shouldn't allow anything to go through for a bypassing queue.
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dying(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q, NULL);
+
+	blkg = __blkg_lookup(blkcg, q, true);
+	if (blkg)
+		return blkg;
+
+	return blkg_create(blkcg, q, NULL);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -777,7 +798,7 @@ int blkcg_activate_policy(struct request_queue *q,
 			  const struct blkcg_policy *pol)
 {
 	LIST_HEAD(pds);
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg, *new_blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
 	bool preloaded;
@@ -786,19 +807,27 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	/* preallocations for root blkg */
-	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
-	if (!blkg)
+	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!new_blkg)
 		return -ENOMEM;
 
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	blk_queue_bypass_start(q);
 
-	/* make sure the root blkg exists and count the existing blkgs */
+	/*
+	 * Make sure the root blkg exists and count the existing blkgs.  As
+	 * @q is bypassing at this point, blkg_lookup_create() can't be
+	 * used.  Open code it.
+	 */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
+	blkg = __blkg_lookup(&blkcg_root, q, false);
+	if (blkg)
+		blkg_free(new_blkg);
+	else
+		blkg = blkg_create(&blkcg_root, q, new_blkg);
 	rcu_read_unlock();
 
 	if (preloaded)
-- 
1.8.0.2


* [PATCH 08/24] blkcg: reorganize blkg_lookup_create() and friends
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Reorganize such that

* __blkg_lookup() takes bool param @update_hint to determine whether
  to update hint.

* __blkg_lookup_create() no longer performs lookup before trying to
  create.  Renamed to blkg_create().

* blkg_lookup_create() now performs lookup and then invokes
  blkg_create() if lookup fails.

* root_blkg creation in blkcg_activate_policy() updated accordingly.
  Note that blkcg_activate_policy() no longer updates lookup hint if
  root_blkg already exists.

Except for the last lookup hint bit which is immaterial, this is pure
reorganization and doesn't introduce any visible behavior change.
This is to prepare for proper hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c | 75 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 23 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7ef747b..2012754 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -126,7 +126,7 @@ err_free:
 }
 
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
-				      struct request_queue *q)
+				      struct request_queue *q, bool update_hint)
 {
 	struct blkcg_gq *blkg;
 
@@ -135,14 +135,19 @@ static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 		return blkg;
 
 	/*
-	 * Hint didn't match.  Look up from the radix tree.  Note that we
-	 * may not be holding queue_lock and thus are not sure whether
-	 * @blkg from blkg_tree has already been removed or not, so we
-	 * can't update hint to the lookup result.  Leave it to the caller.
+	 * Hint didn't match.  Look up from the radix tree.  Note that the
+	 * hint can only be updated under queue_lock as otherwise @blkg
+	 * could have already been removed from blkg_tree.  The caller is
+	 * responsible for grabbing queue_lock if @update_hint.
 	 */
 	blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
-	if (blkg && blkg->q == q)
+	if (blkg && blkg->q == q) {
+		if (update_hint) {
+			lockdep_assert_held(q->queue_lock);
+			rcu_assign_pointer(blkcg->blkg_hint, blkg);
+		}
 		return blkg;
+	}
 
 	return NULL;
 }
@@ -162,7 +167,7 @@ struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q)
 
 	if (unlikely(blk_queue_bypass(q)))
 		return NULL;
-	return __blkg_lookup(blkcg, q);
+	return __blkg_lookup(blkcg, q, false);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup);
 
@@ -170,9 +175,9 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
  * If @new_blkg is %NULL, this function tries to allocate a new one as
  * necessary using %GFP_ATOMIC.  @new_blkg is always consumed on return.
  */
-static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
-					     struct request_queue *q,
-					     struct blkcg_gq *new_blkg)
+static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
+				    struct request_queue *q,
+				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
 	int ret;
@@ -180,13 +185,6 @@ static struct blkcg_gq *__blkg_lookup_create(struct blkcg *blkcg,
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
-	/* lookup and update hint on success, see __blkg_lookup() for details */
-	blkg = __blkg_lookup(blkcg, q);
-	if (blkg) {
-		rcu_assign_pointer(blkcg->blkg_hint, blkg);
-		goto out_free;
-	}
-
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
 		blkg = ERR_PTR(-EINVAL);
@@ -223,16 +221,39 @@ out_free:
 	return blkg;
 }
 
+/**
+ * blkg_lookup_create - lookup blkg, try to create one if not there
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ *
+ * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
+ * create one.  This function should be called under RCU read lock and
+ * @q->queue_lock.
+ *
+ * Returns pointer to the looked up or created blkg on success, ERR_PTR()
+ * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
+ * dead and bypassing, returns ERR_PTR(-EBUSY).
+ */
 struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 				    struct request_queue *q)
 {
+	struct blkcg_gq *blkg;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	lockdep_assert_held(q->queue_lock);
+
 	/*
 	 * This could be the first entry point of blkcg implementation and
 	 * we shouldn't allow anything to go through for a bypassing queue.
 	 */
 	if (unlikely(blk_queue_bypass(q)))
 		return ERR_PTR(blk_queue_dying(q) ? -EINVAL : -EBUSY);
-	return __blkg_lookup_create(blkcg, q, NULL);
+
+	blkg = __blkg_lookup(blkcg, q, true);
+	if (blkg)
+		return blkg;
+
+	return blkg_create(blkcg, q, NULL);
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -777,7 +798,7 @@ int blkcg_activate_policy(struct request_queue *q,
 			  const struct blkcg_policy *pol)
 {
 	LIST_HEAD(pds);
-	struct blkcg_gq *blkg;
+	struct blkcg_gq *blkg, *new_blkg;
 	struct blkg_policy_data *pd, *n;
 	int cnt = 0, ret;
 	bool preloaded;
@@ -786,19 +807,27 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	/* preallocations for root blkg */
-	blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
-	if (!blkg)
+	new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+	if (!new_blkg)
 		return -ENOMEM;
 
 	preloaded = !radix_tree_preload(GFP_KERNEL);
 
 	blk_queue_bypass_start(q);
 
-	/* make sure the root blkg exists and count the existing blkgs */
+	/*
+	 * Make sure the root blkg exists and count the existing blkgs.  As
+	 * @q is bypassing at this point, blkg_lookup_create() can't be
+	 * used.  Open code it.
+	 */
 	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
-	blkg = __blkg_lookup_create(&blkcg_root, q, blkg);
+	blkg = __blkg_lookup(&blkcg_root, q, false);
+	if (blkg)
+		blkg_free(new_blkg);
+	else
+		blkg = blkg_create(&blkcg_root, q, new_blkg);
 	rcu_read_unlock();
 
 	if (preloaded)
-- 
1.8.0.2



* [PATCH 09/24] blkcg: cosmetic updates to blkg_create()
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

* Rename out_* labels to err_*.

* Do ERR_PTR() conversion once in the error return path.

This patch is cosmetic and prepares for the hierarchy support.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2012754..18ae480 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -187,16 +187,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
-		blkg = ERR_PTR(-EINVAL);
-		goto out_free;
+		ret = -EINVAL;
+		goto err_free_blkg;
 	}
 
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
-			blkg = ERR_PTR(-ENOMEM);
-			goto out_put;
+			ret = -ENOMEM;
+			goto err_put_css;
 		}
 	}
 	blkg = new_blkg;
@@ -213,12 +213,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (!ret)
 		return blkg;
 
-	blkg = ERR_PTR(ret);
-out_put:
+err_put_css:
 	css_put(&blkcg->css);
-out_free:
+err_free_blkg:
 	blkg_free(new_blkg);
-	return blkg;
+	return ERR_PTR(ret);
 }
 
 /**
-- 
1.8.0.2


* [PATCH 09/24] blkcg: cosmetic updates to blkg_create()
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

* Rename out_* labels to err_*.

* Do ERR_PTR() conversion once in the error return path.

This patch is cosmetic and prepares for the hierarchy support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2012754..18ae480 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -187,16 +187,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 
 	/* blkg holds a reference to blkcg */
 	if (!css_tryget(&blkcg->css)) {
-		blkg = ERR_PTR(-EINVAL);
-		goto out_free;
+		ret = -EINVAL;
+		goto err_free_blkg;
 	}
 
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
 		if (unlikely(!new_blkg)) {
-			blkg = ERR_PTR(-ENOMEM);
-			goto out_put;
+			ret = -ENOMEM;
+			goto err_put_css;
 		}
 	}
 	blkg = new_blkg;
@@ -213,12 +213,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (!ret)
 		return blkg;
 
-	blkg = ERR_PTR(ret);
-out_put:
+err_put_css:
 	css_put(&blkcg->css);
-out_free:
+err_free_blkg:
 	blkg_free(new_blkg);
-	return blkg;
+	return ERR_PTR(ret);
 }
 
 /**
-- 
1.8.0.2


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 10/24] blkcg: make blkcg_gq's hierarchical
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Currently a child blkg (blkcg_gq) can be created even if its parent
doesn't exist.  i.e. given a blkg, it's not guaranteed that its
ancestors will exist.  This makes it difficult to implement proper
hierarchy support for blkcg policies.

Always create blkgs recursively and make a child blkg hold a reference
to its parent.  blkg->parent is added so that finding the parent is
easy.  blkcg_parent() is also added in the process.

This change is visible to userland: e.g. whereas issuing IO in a
nested cgroup previously didn't affect the ancestors at all, it now
initializes all ancestor blkgs, so zero stats for the request_queue
will always appear on them.  While this is userland visible, it
shouldn't cause any functional difference.
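The top-down creation order this enforces can be illustrated with a toy model (a hypothetical userland sketch, not kernel code; `parent[]`, `created[]` and `lookup_create()` are made-up names): repeatedly find the topmost ancestor that doesn't have an instance yet and create it, so a child is never instantiated before its parent.

```c
#include <assert.h>

#define NR 8

/* toy cgroup tree: parent[i] is the parent index, -1 for the root */
static int parent[NR];
static int created[NR];       /* does a "blkg" exist for this node? */
static int order[NR], norder; /* creation order, for inspection */

static void create_one(int cg)
{
	/* a real blkg_create() would also take a ref on the parent */
	created[cg] = 1;
	order[norder++] = cg;
}

/* mirrors the loop added to blkg_lookup_create() */
static void lookup_create(int cg)
{
	if (created[cg])
		return;
	while (1) {
		int pos = cg;
		int par = parent[cg];

		/* walk up to the topmost ancestor not created yet */
		while (par >= 0 && !created[par]) {
			pos = par;
			par = parent[par];
		}

		create_one(pos);
		if (pos == cg)
			return;
	}
}
```

For a chain root → A → B → C where only the root exists, `lookup_create(C)` creates A, then B, then C, so every non-root node finds its parent in place.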

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/blk-cgroup.c | 42 +++++++++++++++++++++++++++++++++++++-----
 block/blk-cgroup.h | 18 ++++++++++++++++++
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 18ae480..942f344 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -201,7 +201,16 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	}
 	blkg = new_blkg;
 
-	/* insert */
+	/* link parent and insert */
+	if (blkcg_parent(blkcg)) {
+		blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
+		if (WARN_ON_ONCE(!blkg->parent)) {
+			blkg = ERR_PTR(-EINVAL);
+			goto err_put_css;
+		}
+		blkg_get(blkg->parent);
+	}
+
 	spin_lock(&blkcg->lock);
 	ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg);
 	if (likely(!ret)) {
@@ -213,6 +222,10 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (!ret)
 		return blkg;
 
+	/* @blkg failed fully initialized, use the usual release path */
+	blkg_put(blkg);
+	return ERR_PTR(ret);
+
 err_put_css:
 	css_put(&blkcg->css);
 err_free_blkg:
@@ -226,8 +239,9 @@ err_free_blkg:
  * @q: request_queue of interest
  *
  * Lookup blkg for the @blkcg - @q pair.  If it doesn't exist, try to
- * create one.  This function should be called under RCU read lock and
- * @q->queue_lock.
+ * create one.  blkg creation is performed recursively from blkcg_root such
+ * that all non-root blkg's have access to the parent blkg.  This function
+ * should be called under RCU read lock and @q->queue_lock.
  *
  * Returns pointer to the looked up or created blkg on success, ERR_PTR()
  * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
@@ -252,7 +266,23 @@ struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
 	if (blkg)
 		return blkg;
 
-	return blkg_create(blkcg, q, NULL);
+	/*
+	 * Create blkgs walking down from blkcg_root to @blkcg, so that all
+	 * non-root blkgs have access to their parents.
+	 */
+	while (true) {
+		struct blkcg *pos = blkcg;
+		struct blkcg *parent = blkcg_parent(blkcg);
+
+		while (parent && !__blkg_lookup(parent, q, false)) {
+			pos = parent;
+			parent = blkcg_parent(parent);
+		}
+
+		blkg = blkg_create(pos, q, NULL);
+		if (pos == blkcg || IS_ERR(blkg))
+			return blkg;
+	}
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
@@ -321,8 +351,10 @@ static void blkg_rcu_free(struct rcu_head *rcu_head)
 
 void __blkg_release(struct blkcg_gq *blkg)
 {
-	/* release the extra blkcg reference this blkg has been holding */
+	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
+	if (blkg->parent)
+		blkg_put(blkg->parent);
 
 	/*
 	 * A group is freed in rcu manner. But having an rcu lock does not
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2459730..b26ed58 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -94,8 +94,13 @@ struct blkcg_gq {
 	struct list_head		q_node;
 	struct hlist_node		blkcg_node;
 	struct blkcg			*blkcg;
+
+	/* all non-root blkcg_gq's are guaranteed to have access to parent */
+	struct blkcg_gq			*parent;
+
 	/* request allocation list for this blkcg-q pair */
 	struct request_list		rl;
+
 	/* reference count */
 	int				refcnt;
 
@@ -181,6 +186,19 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
 }
 
 /**
+ * blkcg_parent - get the parent of a blkcg
+ * @blkcg: blkcg of interest
+ *
+ * Return the parent blkcg of @blkcg.  Can be called anytime.
+ */
+static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
+{
+	struct cgroup *pcg = blkcg->css.cgroup->parent;
+
+	return pcg ? cgroup_to_blkcg(pcg) : NULL;
+}
+
+/**
  * blkg_to_pdata - get policy private data
  * @blkg: blkg of interest
  * @pol: policy of interest
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 11/24] cfq-iosched: add leaf_weight
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (9 preceding siblings ...)
  2012-12-28 20:35     ` Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 12/24] cfq-iosched: implement cfq_group->nr_active and ->children_weight Tejun Heo
                     ` (14 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

cfq blkcg is about to grow proper hierarchy handling, where a child
blkg's weight would nest inside the parent's.  This makes tasks in a
blkg compete against both the tasks in sibling blkgs and the tasks
of child blkgs.

We're gonna use the existing weight as the group weight which decides
the blkg's weight against its siblings.  This patch introduces a new
weight - leaf_weight - which decides the weight of a blkg against the
child blkgs.

It's named leaf_weight because another way to look at it is that each
internal blkg node has a hidden child leaf node which contains all
its tasks; leaf_weight is the weight of that leaf node and is handled
the same way as the weights of the child blkgs.

This patch only adds the leaf_weight fields and exposes them to
userland.  The new weight isn't actually used anywhere yet.  Note that
cfq-iosched currently officially supports only a single level of
hierarchy and root blkgs compete with the first level blkgs - i.e. the
root weight is basically being used as leaf_weight.  For root blkgs,
the two weights are kept in sync for backward compatibility.

v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing a divide by zero when
    !CONFIG_CFQ_GROUP_IOSCHED.  Fix it.  Reported by Fengguang.
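Numerically, the split leaf_weight enables works out as follows (a hypothetical sketch of the intended semantics, not code from the patch; `toy_group` and `leaf_share_permille` are illustrative names): an internal group's own tasks get leaf_weight / (leaf_weight + sum of active children's weights) of that group's share.

```c
#include <assert.h>

/* toy internal group: its own tasks weigh leaf_weight, each active
 * child group weighs child_weight[i] */
struct toy_group {
	unsigned int leaf_weight;
	unsigned int child_weight[4];
	int nr_children;
};

/* per-mille share of the group's disk time its own tasks receive */
static unsigned int leaf_share_permille(const struct toy_group *g)
{
	unsigned int total = g->leaf_weight;
	int i;

	for (i = 0; i < g->nr_children; i++)
		total += g->child_weight[i];

	return 1000u * g->leaf_weight / total;
}
```

E.g. with leaf_weight 500 and two children of weight 250 each, the group's own tasks get half of the group's share; lowering leaf_weight shifts time toward the children without touching their relative weights.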

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 block/blk-cgroup.c  |   4 +-
 block/blk-cgroup.h  |   1 +
 block/cfq-iosched.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 130 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 942f344..10e1df9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -26,7 +26,8 @@
 
 static DEFINE_MUTEX(blkcg_pol_mutex);
 
-struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT };
+struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT,
+			    .cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
 EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
@@ -710,6 +711,7 @@ static struct cgroup_subsys_state *blkcg_css_alloc(struct cgroup *cgroup)
 		return ERR_PTR(-ENOMEM);
 
 	blkcg->cfq_weight = CFQ_WEIGHT_DEFAULT;
+	blkcg->cfq_leaf_weight = CFQ_WEIGHT_DEFAULT;
 	blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
 done:
 	spin_lock_init(&blkcg->lock);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index b26ed58..2446225 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -54,6 +54,7 @@ struct blkcg {
 
 	/* TODO: per-policy storage in blkcg */
 	unsigned int			cfq_weight;	/* belongs to cfq */
+	unsigned int			cfq_leaf_weight;
 };
 
 struct blkg_stat {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index bc076f4..175218d6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -223,10 +223,21 @@ struct cfq_group {
 
 	/* group service_tree key */
 	u64 vdisktime;
+
+	/*
+	 * There are two weights - (internal) weight is the weight of this
+	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
+	 * this cfqg against the child cfqgs.  For the root cfqg, both
+	 * weights are kept in sync for backward compatibility.
+	 */
 	unsigned int weight;
 	unsigned int new_weight;
 	unsigned int dev_weight;
 
+	unsigned int leaf_weight;
+	unsigned int new_leaf_weight;
+	unsigned int dev_leaf_weight;
+
 	/* number of cfqq currently on this group */
 	int nr_cfqq;
 
@@ -1182,10 +1193,16 @@ static void
 cfq_update_group_weight(struct cfq_group *cfqg)
 {
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
+
 	if (cfqg->new_weight) {
 		cfqg->weight = cfqg->new_weight;
 		cfqg->new_weight = 0;
 	}
+
+	if (cfqg->new_leaf_weight) {
+		cfqg->leaf_weight = cfqg->new_leaf_weight;
+		cfqg->new_leaf_weight = 0;
+	}
 }
 
 static void
@@ -1348,6 +1365,7 @@ static void cfq_pd_init(struct blkcg_gq *blkg)
 
 	cfq_init_cfqg_base(cfqg);
 	cfqg->weight = blkg->blkcg->cfq_weight;
+	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
 }
 
 /*
@@ -1404,6 +1422,26 @@ static int cfqg_print_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 cfqg_prfill_leaf_weight_device(struct seq_file *sf,
+					  struct blkg_policy_data *pd, int off)
+{
+	struct cfq_group *cfqg = pd_to_cfqg(pd);
+
+	if (!cfqg->dev_leaf_weight)
+		return 0;
+	return __blkg_prfill_u64(sf, pd, cfqg->dev_leaf_weight);
+}
+
+static int cfqg_print_leaf_weight_device(struct cgroup *cgrp,
+					 struct cftype *cft,
+					 struct seq_file *sf)
+{
+	blkcg_print_blkgs(sf, cgroup_to_blkcg(cgrp),
+			  cfqg_prfill_leaf_weight_device, &blkcg_policy_cfq, 0,
+			  false);
+	return 0;
+}
+
 static int cfq_print_weight(struct cgroup *cgrp, struct cftype *cft,
 			    struct seq_file *sf)
 {
@@ -1411,8 +1449,16 @@ static int cfq_print_weight(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
-static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
-				  const char *buf)
+static int cfq_print_leaf_weight(struct cgroup *cgrp, struct cftype *cft,
+				 struct seq_file *sf)
+{
+	seq_printf(sf, "%u\n",
+		   cgroup_to_blkcg(cgrp)->cfq_leaf_weight);
+	return 0;
+}
+
+static int __cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				    const char *buf, bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkg_conf_ctx ctx;
@@ -1426,8 +1472,13 @@ static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	ret = -EINVAL;
 	cfqg = blkg_to_cfqg(ctx.blkg);
 	if (!ctx.v || (ctx.v >= CFQ_WEIGHT_MIN && ctx.v <= CFQ_WEIGHT_MAX)) {
-		cfqg->dev_weight = ctx.v;
-		cfqg->new_weight = cfqg->dev_weight ?: blkcg->cfq_weight;
+		if (!is_leaf_weight) {
+			cfqg->dev_weight = ctx.v;
+			cfqg->new_weight = ctx.v ?: blkcg->cfq_weight;
+		} else {
+			cfqg->dev_leaf_weight = ctx.v;
+			cfqg->new_leaf_weight = ctx.v ?: blkcg->cfq_leaf_weight;
+		}
 		ret = 0;
 	}
 
@@ -1435,7 +1486,20 @@ static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
 	return ret;
 }
 
-static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+static int cfqg_set_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				  const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, false);
+}
+
+static int cfqg_set_leaf_weight_device(struct cgroup *cgrp, struct cftype *cft,
+				       const char *buf)
+{
+	return __cfqg_set_weight_device(cgrp, cft, buf, true);
+}
+
+static int __cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val,
+			    bool is_leaf_weight)
 {
 	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
 	struct blkcg_gq *blkg;
@@ -1445,19 +1509,41 @@ static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		return -EINVAL;
 
 	spin_lock_irq(&blkcg->lock);
-	blkcg->cfq_weight = (unsigned int)val;
+
+	if (!is_leaf_weight)
+		blkcg->cfq_weight = val;
+	else
+		blkcg->cfq_leaf_weight = val;
 
 	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		struct cfq_group *cfqg = blkg_to_cfqg(blkg);
 
-		if (cfqg && !cfqg->dev_weight)
-			cfqg->new_weight = blkcg->cfq_weight;
+		if (!cfqg)
+			continue;
+
+		if (!is_leaf_weight) {
+			if (!cfqg->dev_weight)
+				cfqg->new_weight = blkcg->cfq_weight;
+		} else {
+			if (!cfqg->dev_leaf_weight)
+				cfqg->new_leaf_weight = blkcg->cfq_leaf_weight;
+		}
 	}
 
 	spin_unlock_irq(&blkcg->lock);
 	return 0;
 }
 
+static int cfq_set_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, false);
+}
+
+static int cfq_set_leaf_weight(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	return __cfq_set_weight(cgrp, cft, val, true);
+}
+
 static int cfqg_print_stat(struct cgroup *cgrp, struct cftype *cft,
 			   struct seq_file *sf)
 {
@@ -1518,6 +1604,37 @@ static struct cftype cfq_blkcg_files[] = {
 		.read_seq_string = cfq_print_weight,
 		.write_u64 = cfq_set_weight,
 	},
+
+	/* on root, leaf_weight is mapped to weight */
+	{
+		.name = "leaf_weight_device",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfqg_print_weight_device,
+		.write_string = cfqg_set_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "leaf_weight",
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfq_print_weight,
+		.write_u64 = cfq_set_weight,
+	},
+
+	/* no such mapping necessary for !roots */
+	{
+		.name = "leaf_weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
+		.max_write_len = 256,
+	},
+	{
+		.name = "leaf_weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
+	},
+
 	{
 		.name = "time",
 		.private = offsetof(struct cfq_group, stats.time),
@@ -3992,6 +4109,7 @@ static int cfq_init_queue(struct request_queue *q)
 	cfq_init_cfqg_base(cfqd->root_group);
 #endif
 	cfqd->root_group->weight = 2 * CFQ_WEIGHT_DEFAULT;
+	cfqd->root_group->leaf_weight = 2 * CFQ_WEIGHT_DEFAULT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 12/24] cfq-iosched: implement cfq_group->nr_active and ->children_weight
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (10 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 11/24] cfq-iosched: add leaf_weight Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35     ` Tejun Heo
                     ` (13 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

To prepare for blkcg hierarchy support, add cfqg->nr_active and
->children_weight.  cfqg->nr_active counts the number of active cfqgs
at the cfqg's level and ->children_weight is the sum of the weights of
those cfqgs.  The level covers itself (cfqg->leaf_weight) and its
immediate children.

The two values are updated when a cfqg enters and leaves the group
service tree.  Unless the hierarchy is very deep, the added overhead
should be negligible.
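The bookkeeping described above can be sketched as a simplified userland model (`toy_cfqg`, `toy_activate`/`toy_deactivate` are illustrative names; the real code propagates activation upward until it meets an already-active ancestor, while this sketch does a single level, matching the still-flat cfqg_flat_parent() topology):

```c
#include <assert.h>
#include <stddef.h>

struct toy_cfqg {
	struct toy_cfqg *parent;
	unsigned int weight;		/* weight vs sibling groups */
	unsigned int leaf_weight;	/* weight of own tasks vs children */
	int nr_active;
	unsigned int children_weight;
	int on_st;			/* on the service tree? */
};

/* accounting when a group with queued tasks joins the service tree:
 * its own tasks appear at its level as a leaf entry with leaf_weight,
 * and the group itself appears at the parent's level with ->weight */
static void toy_activate(struct toy_cfqg *cfqg)
{
	if (cfqg->on_st)
		return;
	cfqg->on_st = 1;

	cfqg->nr_active++;
	cfqg->children_weight += cfqg->leaf_weight;

	if (cfqg->parent) {
		cfqg->parent->nr_active++;
		cfqg->parent->children_weight += cfqg->weight;
	}
}

/* inverse accounting when the group leaves the service tree */
static void toy_deactivate(struct toy_cfqg *cfqg)
{
	if (!cfqg->on_st)
		return;
	cfqg->on_st = 0;

	cfqg->nr_active--;
	cfqg->children_weight -= cfqg->leaf_weight;

	if (cfqg->parent) {
		cfqg->parent->nr_active--;
		cfqg->parent->children_weight -= cfqg->weight;
	}
}
```

Because updates happen only on service-tree add/remove and touch just the path toward the root, the cost stays proportional to tree depth, which is the "negligible overhead" claim above.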

Currently, the parent is determined using cfqg_flat_parent(), which
makes the root cfqg the parent of all other cfqgs.  This is to make
the transition to hierarchy-aware scheduling gradual.  Scheduling
logic will be converted to use cfqg->children_weight without actually
changing the behavior.  When everything is ready, cfqg_flat_parent()
will be replaced with the proper parent function.

This patch doesn't introduce any behavior change.
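The propagation described above can be sketched as a standalone toy
model (the toy_* names are hypothetical, for illustration only; the
real code walks blkcg_gq parents via cfqg_flat_parent()):

```c
#include <stddef.h>
#include <assert.h>

/*
 * Toy model of the activation propagation in
 * cfq_group_service_tree_add(): the first activation at a level bumps
 * the parent's nr_active and children_weight, and the walk stops as
 * soon as an already-active ancestor is met.
 */
struct toy_cfqg {
	struct toy_cfqg *parent;
	int nr_active;			/* active cfqgs at this level */
	unsigned int children_weight;	/* leaf_weight + children's weights */
	unsigned int weight;		/* weight against siblings */
	unsigned int leaf_weight;	/* weight of this cfqg's own tasks */
};

static void toy_activate(struct toy_cfqg *cfqg)
{
	struct toy_cfqg *pos = cfqg;
	int propagate = !pos->nr_active++;

	/* this cfqg's tasks compete at its own level via leaf_weight */
	pos->children_weight += pos->leaf_weight;

	while (propagate && pos->parent) {
		struct toy_cfqg *parent = pos->parent;

		propagate = !parent->nr_active++;
		parent->children_weight += pos->weight;
		pos = parent;
	}
}
```

Activating a second sibling still increments the shared parent's
nr_active and children_weight, but propagation stops there since the
parent was already active.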

v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 175218d6..7701c3f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -225,6 +225,18 @@ struct cfq_group {
 	u64 vdisktime;
 
 	/*
+	 * The number of active cfqgs and sum of their weights under this
+	 * cfqg.  This covers this cfqg's leaf_weight and all children's
+	 * weights, but does not cover weights of further descendants.
+	 *
+	 * If a cfqg is on the service tree, it's active.  An active cfqg
+	 * also activates its parent and contributes to the children_weight
+	 * of the parent.
+	 */
+	int nr_active;
+	unsigned int children_weight;
+
+	/*
 	 * There are two weights - (internal) weight is the weight of this
 	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
 	 * this cfqg against the child cfqgs.  For the root cfqg, both
@@ -583,6 +595,22 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
+/*
+ * Determine the parent cfqg for weight calculation.  Currently, cfqg
+ * scheduling is flat and the root is the parent of everyone else.
+ */
+static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+{
+	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
+	struct cfq_group *root;
+
+	while (blkg->parent)
+		blkg = blkg->parent;
+	root = blkg_to_cfqg(blkg);
+
+	return root != cfqg ? root : NULL;
+}
+
 static inline void cfqg_get(struct cfq_group *cfqg)
 {
 	return blkg_get(cfqg_to_blkg(cfqg));
@@ -683,6 +711,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
+static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1208,11 +1237,33 @@ cfq_update_group_weight(struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	struct cfq_group *pos = cfqg;
+	bool propagate;
+
+	/* add to the service tree */
 	BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
 
 	cfq_update_group_weight(cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
 	st->total_weight += cfqg->weight;
+
+	/*
+	 * Activate @cfqg and propagate activation upwards until we meet an
+	 * already activated node or reach root.
+	 */
+	propagate = !pos->nr_active++;
+	pos->children_weight += pos->leaf_weight;
+
+	while (propagate) {
+		struct cfq_group *parent = cfqg_flat_parent(pos);
+
+		if (!parent)
+			break;
+
+		propagate = !parent->nr_active++;
+		parent->children_weight += pos->weight;
+		pos = parent;
+	}
 }
 
 static void
@@ -1243,6 +1294,31 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	struct cfq_group *pos = cfqg;
+	bool propagate;
+
+	/*
+	 * Undo activation from cfq_group_service_tree_add().  Deactivate
+	 * @cfqg and propagate deactivation upwards.
+	 */
+	propagate = !--pos->nr_active;
+	pos->children_weight -= pos->leaf_weight;
+
+	while (propagate) {
+		struct cfq_group *parent = cfqg_flat_parent(pos);
+
+		/* @pos has 0 nr_active at this point */
+		WARN_ON_ONCE(pos->children_weight);
+
+		if (!parent)
+			break;
+
+		propagate = !--parent->nr_active;
+		parent->children_weight -= pos->weight;
+		pos = parent;
+	}
+
+	/* remove from the service tree */
 	st->total_weight -= cfqg->weight;
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 13/24] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Currently, cfqg charges are scaled directly according to cfqg->weight.
Regardless of the number of active cfqgs or the total active weight,
a given weight value always scales a charge the same way.  This
works fine as long as all cfqgs are treated equally regardless of
their positions in the hierarchy, which is what cfq currently
implements.  It can't work in hierarchical settings because the
interpretation of a given weight value depends on where the weight is
located in the hierarchy.

This patch reimplements cfqg charge scaling so that it can be used to
support hierarchy properly.  The scheme is fairly simple and
light-weight.

* When a cfqg is added to the service tree, its vfraction is
  calculated.  The code walks up the tree to the root, calculating
  the fraction the cfqg has at each level, which is

    cfqg->weight / parent->children_weight

  By compounding these, the global fraction of vdisktime the cfqg has
  claim to - vfraction - can be determined.

* When the cfqg needs to be charged, the charge is scaled in inverse
  proportion to the vfraction.

The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
representation as before; however, the smallest scaling factor is now
1 (ie. 1 << CFQ_SERVICE_SHIFT).  This differs from before, where 1
corresponded to CFQ_WEIGHT_DEFAULT and a higher weight resulted in a
smaller scaling factor.

While this shifts the global scale of vdisktime a bit, it doesn't
change the relative relationships among cfqgs and the scheduling
result isn't different.
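The two steps above can be sketched in userspace, assuming a 12-bit
service shift (CFQ_SERVICE_SHIFT is 12 in cfq-iosched at this time);
all toy_* names are hypothetical stand-ins:

```c
#include <assert.h>

#define TOY_SERVICE_SHIFT 12	/* stand-in for CFQ_SERVICE_SHIFT */

/*
 * Compound the per-level ratios (weight / parent children_weight) into
 * a fixed-point vfraction, as cfq_group_service_tree_add() does while
 * walking towards the root.
 */
static unsigned int toy_vfraction(const unsigned int *weights,
				  const unsigned int *children_weights,
				  int levels)
{
	unsigned int vfr = 1u << TOY_SERVICE_SHIFT;	/* start with 1.0 */
	int i;

	for (i = 0; i < levels; i++)
		vfr = (unsigned int)((unsigned long long)vfr * weights[i] /
				     children_weights[i]);
	return vfr ? vfr : 1;	/* smallest scaling factor is 1 */
}

/*
 * Scale a disk-time charge inversely by the vfraction, mirroring
 * cfqg_scale_charge(): scaled = charge / vfraction, in fixed point.
 */
static unsigned long long toy_scale_charge(unsigned long charge,
					   unsigned int vfr)
{
	unsigned long long c = (unsigned long long)charge << TOY_SERVICE_SHIFT;

	return (c << TOY_SERVICE_SHIFT) / vfr;
}
```

A cfqg owning half the active weight at each of two levels ends up
with vfraction 1/4 and is charged four times its raw disk time.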

cfq_group_notify_queue_add() uses a fixed CFQ_IDLE_DELAY when
appending a new cfqg to the service tree.  The specific value of
CFQ_IDLE_DELAY had no relevance to vdisktime before and is unlikely to
cause any visible behavior difference now, especially as the scale
shift isn't that large.

As the new scheme now makes proper distinction between cfqg->weight
and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
root, both weights are now mapped to ->leaf_weight instead of the
other way around.

Because we're still using cfqg_flat_parent(), this patch shouldn't
change the scheduling behavior in any noticeable way.

v2: Beefed up comments on vfraction as requested by Vivek.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 block/cfq-iosched.c | 107 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 30 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 7701c3f..b24acf6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -237,6 +237,18 @@ struct cfq_group {
 	unsigned int children_weight;
 
 	/*
+	 * vfraction is the fraction of vdisktime that the tasks in this
+	 * cfqg are entitled to.  This is determined by compounding the
+	 * ratios walking up from this cfqg to the root.
+	 *
+	 * It is in fixed point w/ CFQ_SERVICE_SHIFT and the sum of all
+	 * vfractions on a service tree is approximately 1.  The sum may
+	 * deviate a bit due to rounding errors and fluctuations caused by
+	 * cfqgs entering and leaving the service tree.
+	 */
+	unsigned int vfraction;
+
+	/*
 	 * There are two weights - (internal) weight is the weight of this
 	 * cfqg against the sibling cfqgs.  leaf_weight is the weight of
 	 * this cfqg against the child cfqgs.  For the root cfqg, both
@@ -891,13 +903,27 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+/**
+ * cfqg_scale_charge - scale disk time charge according to cfqg weight
+ * @charge: disk time being charged
+ * @vfraction: vfraction of the cfqg, fixed point w/ CFQ_SERVICE_SHIFT
+ *
+ * Scale @charge according to @vfraction, which is in range (0, 1].  The
+ * scaling is inversely proportional.
+ *
+ * scaled = charge / vfraction
+ *
+ * The result is also in fixed point w/ CFQ_SERVICE_SHIFT.
+ */
+static inline u64 cfqg_scale_charge(unsigned long charge,
+				    unsigned int vfraction)
 {
-	u64 d = delta << CFQ_SERVICE_SHIFT;
+	u64 c = charge << CFQ_SERVICE_SHIFT;	/* make it fixed point */
 
-	d = d * CFQ_WEIGHT_DEFAULT;
-	do_div(d, cfqg->weight);
-	return d;
+	/* charge / vfraction */
+	c <<= CFQ_SERVICE_SHIFT;
+	do_div(c, vfraction);
+	return c;
 }
 
 static inline u64 max_vdisktime(u64 min_vdisktime, u64 vdisktime)
@@ -1237,7 +1263,9 @@ cfq_update_group_weight(struct cfq_group *cfqg)
 static void
 cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 {
+	unsigned int vfr = 1 << CFQ_SERVICE_SHIFT;	/* start with 1 */
 	struct cfq_group *pos = cfqg;
+	struct cfq_group *parent;
 	bool propagate;
 
 	/* add to the service tree */
@@ -1248,22 +1276,34 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	st->total_weight += cfqg->weight;
 
 	/*
-	 * Activate @cfqg and propagate activation upwards until we meet an
-	 * already activated node or reach root.
+	 * Activate @cfqg and calculate the portion of vfraction @cfqg is
+	 * entitled to.  vfraction is calculated by walking the tree
+	 * towards the root calculating the fraction it has at each level.
+	 * The compounded ratio is how much vfraction @cfqg owns.
+	 *
+	 * Start with the proportion tasks in this cfqg has against active
+	 * children cfqgs - its leaf_weight against children_weight.
 	 */
 	propagate = !pos->nr_active++;
 	pos->children_weight += pos->leaf_weight;
+	vfr = vfr * pos->leaf_weight / pos->children_weight;
 
-	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
-
-		if (!parent)
-			break;
-
-		propagate = !parent->nr_active++;
-		parent->children_weight += pos->weight;
+	/*
+	 * Compound ->weight walking up the tree.  Both activation and
+	 * vfraction calculation are done in the same loop.  Propagation
+	 * stops once an already activated node is met.  vfraction
+	 * calculation should always continue to the root.
+	 */
+	while ((parent = cfqg_flat_parent(pos))) {
+		if (propagate) {
+			propagate = !parent->nr_active++;
+			parent->children_weight += pos->weight;
+		}
+		vfr = vfr * pos->weight / parent->children_weight;
 		pos = parent;
 	}
+
+	cfqg->vfraction = max_t(unsigned, vfr, 1);
 }
 
 static void
@@ -1309,6 +1349,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->children_weight);
+		pos->vfraction = 0;
 
 		if (!parent)
 			break;
@@ -1381,6 +1422,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	unsigned int used_sl, charge, unaccounted_sl = 0;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
+	unsigned int vfr;
 
 	BUG_ON(nr_sync < 0);
 	used_sl = charge = cfq_cfqq_slice_usage(cfqq, &unaccounted_sl);
@@ -1390,10 +1432,15 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;
 
-	/* Can't update vdisktime while group is on service tree */
+	/*
+	 * Can't update vdisktime while on service tree and cfqg->vfraction
+	 * is valid only while on it.  Cache vfr, leave the service tree,
+	 * update vdisktime and go back on.  The re-addition to the tree
+	 * will also update the weights as necessary.
+	 */
+	vfr = cfqg->vfraction;
 	cfq_group_service_tree_del(st, cfqg);
-	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
-	/* If a new weight was requested, update now, off tree */
+	cfqg->vdisktime += cfqg_scale_charge(charge, vfr);
 	cfq_group_service_tree_add(st, cfqg);
 
 	/* This group is being expired. Save the context */
@@ -1669,44 +1716,44 @@ static int cfqg_print_avg_queue_size(struct cgroup *cgrp, struct cftype *cft,
 #endif	/* CONFIG_DEBUG_BLK_CGROUP */
 
 static struct cftype cfq_blkcg_files[] = {
+	/* on root, weight is mapped to leaf_weight */
 	{
 		.name = "weight_device",
-		.read_seq_string = cfqg_print_weight_device,
-		.write_string = cfqg_set_weight_device,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfqg_print_leaf_weight_device,
+		.write_string = cfqg_set_leaf_weight_device,
 		.max_write_len = 256,
 	},
 	{
 		.name = "weight",
-		.read_seq_string = cfq_print_weight,
-		.write_u64 = cfq_set_weight,
+		.flags = CFTYPE_ONLY_ON_ROOT,
+		.read_seq_string = cfq_print_leaf_weight,
+		.write_u64 = cfq_set_leaf_weight,
 	},
 
-	/* on root, leaf_weight is mapped to weight */
+	/* no such mapping necessary for !roots */
 	{
-		.name = "leaf_weight_device",
-		.flags = CFTYPE_ONLY_ON_ROOT,
+		.name = "weight_device",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfqg_print_weight_device,
 		.write_string = cfqg_set_weight_device,
 		.max_write_len = 256,
 	},
 	{
-		.name = "leaf_weight",
-		.flags = CFTYPE_ONLY_ON_ROOT,
+		.name = "weight",
+		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfq_print_weight,
 		.write_u64 = cfq_set_weight,
 	},
 
-	/* no such mapping necessary for !roots */
 	{
 		.name = "leaf_weight_device",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfqg_print_leaf_weight_device,
 		.write_string = cfqg_set_leaf_weight_device,
 		.max_write_len = 256,
 	},
 	{
 		.name = "leaf_weight",
-		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_seq_string = cfq_print_leaf_weight,
 		.write_u64 = cfq_set_leaf_weight,
 	},
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 14/24] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (12 preceding siblings ...)
  2012-12-28 20:35     ` Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 15/24] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
                     ` (11 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

cfq_group_slice() calculates slice by taking a fraction of
cfq_target_latency according to the ratio of cfqg->weight against
service_tree->total_weight.  This currently works only because all
cfqgs are treated as being at the same level.

To prepare for proper hierarchy support, convert cfq_group_slice() to
base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
a fraction of 1 and represents the fraction allocated to the cfqg with
hierarchy considered, the slice can be simply calculated by
multiplying cfqg->vfraction to cfq_target_latency (with fixed point
shift factored in).

As vfraction calculation currently treats all non-root cfqgs as
children of the root cfqg, this patch doesn't introduce noticeable
behavior difference.
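The converted calculation is just a multiply and shift; a minimal
sketch, again assuming a 12-bit service shift and using hypothetical
toy_* names:

```c
#include <assert.h>

#define TOY_SERVICE_SHIFT 12	/* stand-in for CFQ_SERVICE_SHIFT */

/*
 * Mirror of the converted cfq_group_slice(): the slice is the target
 * latency multiplied by the group's fixed-point vfraction, with the
 * fixed-point shift folded back out.
 */
static unsigned int toy_group_slice(unsigned int target_latency,
				    unsigned int vfraction)
{
	return target_latency * vfraction >> TOY_SERVICE_SHIFT;
}
```

A group entitled to the whole hierarchy (vfraction 1.0) gets the full
target latency; a group entitled to a quarter gets a quarter of it.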

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b24acf6..ee34282 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -85,7 +85,6 @@ struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
 	unsigned count;
-	unsigned total_weight;
 	u64 min_vdisktime;
 	struct cfq_ttime ttime;
 };
@@ -979,9 +978,7 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
 static inline unsigned
 cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-
-	return cfqd->cfq_target_latency * cfqg->weight / st->total_weight;
+	return cfqd->cfq_target_latency * cfqg->vfraction >> CFQ_SERVICE_SHIFT;
 }
 
 static inline unsigned
@@ -1273,7 +1270,6 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 
 	cfq_update_group_weight(cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
-	st->total_weight += cfqg->weight;
 
 	/*
 	 * Activate @cfqg and calculate the portion of vfraction @cfqg is
@@ -1360,7 +1356,6 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	}
 
 	/* remove from the service tree */
-	st->total_weight -= cfqg->weight;
 	if (!RB_EMPTY_NODE(&cfqg->rb_node))
 		cfq_rb_erase(&cfqg->rb_node, st);
 }
-- 
1.8.0.2

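As a userspace sketch (not kernel code), the new slice computation reduces to a single fixed-point multiply and shift. CFQ_SERVICE_SHIFT is assumed here to be 12, matching cfq-iosched.c at this time; the function name is illustrative.

```c
#include <assert.h>

/* Fixed-point shift used for vfraction; 12 in cfq-iosched.c. */
#define CFQ_SERVICE_SHIFT 12

/*
 * vfraction is the cfqg's share of the whole hierarchy, scaled by
 * 1 << CFQ_SERVICE_SHIFT, so the slice is just a scaled fraction of
 * the target latency.
 */
static unsigned cfq_group_slice_sketch(unsigned target_latency,
				       unsigned vfraction)
{
	return target_latency * vfraction >> CFQ_SERVICE_SHIFT;
}
```

With the default 300ms target latency, a cfqg entitled to half the device (vfraction == 1 << 11) would get a 150ms slice, and one entitled to the whole device the full 300ms.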
^ permalink raw reply related	[flat|nested] 131+ messages in thread


* [PATCH 15/24] cfq-iosched: enable full blkcg hierarchy support
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (13 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 14/24] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 16/24] blkcg: add blkg_policy_data->plid Tejun Heo
                     ` (10 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support.  The only thing which
keeps the behavior flat is cfqg_flat_parent(), which makes the vfraction
calculation treat all non-root cfqgs as children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent.  This
enables full blkcg hierarchy support for cfq-iosched.  For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on the service tree.  For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly.  For more detail,
please refer to Documentation/block/cfq-iosched.txt.

v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 Documentation/block/cfq-iosched.txt | 58 +++++++++++++++++++++++++++++++++++++
 block/cfq-iosched.c                 | 21 ++++----------
 2 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
index d89b4fe..a5eb7d1 100644
--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -102,6 +102,64 @@ processing of request. Therefore, increasing the value can imporve the
 performace although this can cause the latency of some I/O to increase due
 to more number of requests.
 
+CFQ Group scheduling
+====================
+
+CFQ supports blkio cgroup and has "blkio." prefixed files in each
+blkio cgroup directory. It is weight-based and there are four knobs
+for configuration - weight[_device] and leaf_weight[_device].
+Internal cgroup nodes (the ones with children) can also have tasks in
+them, so the former two configure how much proportion the cgroup as a
+whole is entitled to at its parent's level while the latter two
+configure how much proportion the tasks in the cgroup have compared to
+its direct children.
+
+Another way to think about it is assuming that each internal node has
+an implicit leaf child node which hosts all the tasks whose weight is
+configured by leaf_weight[_device]. Let's assume a blkio hierarchy
+composed of five cgroups - root, A, B, AA and AB - with the following
+weights where the names represent the hierarchy.
+
+        weight leaf_weight
+ root :  125    125
+ A    :  500    750
+ B    :  250    500
+ AA   :  500    500
+ AB   : 1000    500
+
+root never has a parent, making its weight meaningless. For backward
+compatibility, weight is always kept in sync with leaf_weight. B, AA
+and AB have no children and thus their tasks have no child cgroups to
+compete with. They always get 100% of what the cgroup won at the
+parent level. Considering only the weights which matter, the hierarchy
+looks like the following.
+
+          root
+       /    |   \
+      A     B    leaf
+     500   250   125
+   /  |  \
+  AA  AB  leaf
+ 500 1000 750
+
+If all cgroups have active IOs and are competing with each other, disk
+time will be distributed like the following.
+
+Distribution below root. The total active weight at this level is
+A:500 + B:250 + root-leaf:125 = 875.
+
+ root-leaf :   125 /  875      =~ 14%
+ A         :   500 /  875      =~ 57%
+ B(-leaf)  :   250 /  875      =~ 28%
+
+A has children and further distributes its 57% among the children and
+the implicit leaf node. The total active weight at this level is
+AA:500 + AB:1000 + A-leaf:750 = 2250.
+
+ A-leaf    : ( 750 / 2250) * A =~ 19%
+ AA(-leaf) : ( 500 / 2250) * A =~ 12%
+ AB(-leaf) : (1000 / 2250) * A =~ 25%
+
 CFQ IOPS Mode for group scheduling
 ===================================
 Basic CFQ design is to provide priority based time slices. Higher priority
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ee34282..e8f3106 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -606,20 +606,11 @@ static inline struct cfq_group *blkg_to_cfqg(struct blkcg_gq *blkg)
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
-/*
- * Determine the parent cfqg for weight calculation.  Currently, cfqg
- * scheduling is flat and the root is the parent of everyone else.
- */
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
 {
-	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
-	struct cfq_group *root;
-
-	while (blkg->parent)
-		blkg = blkg->parent;
-	root = blkg_to_cfqg(blkg);
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
 
-	return root != cfqg ? root : NULL;
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
 }
 
 static inline void cfqg_get(struct cfq_group *cfqg)
@@ -722,7 +713,7 @@ static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1290,7 +1281,7 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	 * stops once an already activated node is met.  vfraction
 	 * calculation should always continue to the root.
 	 */
-	while ((parent = cfqg_flat_parent(pos))) {
+	while ((parent = cfqg_parent(pos))) {
 		if (propagate) {
 			propagate = !parent->nr_active++;
 			parent->children_weight += pos->weight;
@@ -1341,7 +1332,7 @@ cfq_group_service_tree_del(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	pos->children_weight -= pos->leaf_weight;
 
 	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
+		struct cfq_group *parent = cfqg_parent(pos);
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->children_weight);
-- 
1.8.0.2

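The vfraction compounding described above (the ratio at each level multiplied up to the root) can be sketched in userspace as follows. The struct and function names are illustrative, not the kernel's; CFQ_SERVICE_SHIFT is assumed to be 12 as in cfq-iosched.c.

```c
#include <assert.h>

#define CFQ_SERVICE_SHIFT 12

/* One hierarchy level on the path from a cfqg to the root:
 * this node's weight and the sum of its parent's children weights. */
struct level {
	unsigned weight;
	unsigned children_weight;
};

/*
 * Walk from the cfqg toward the root, scaling the fixed-point fraction
 * by weight / children_weight at each level.  Starts at 1.0
 * (1 << CFQ_SERVICE_SHIFT).
 */
static unsigned compound_vfraction(const struct level *path, int nr_levels)
{
	unsigned vfr = 1 << CFQ_SERVICE_SHIFT;
	int i;

	for (i = 0; i < nr_levels; i++)
		vfr = vfr * path[i].weight / path[i].children_weight;
	return vfr;
}
```

For AA in the example hierarchy, the path is (500/1500) at A's level and (500/750) at root's level, giving roughly 0.2222 of the device; B gets (250/750) =~ 0.3333.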
^ permalink raw reply related	[flat|nested] 131+ messages in thread


* [PATCH 16/24] blkcg: add blkg_policy_data->plid
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (14 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 15/24] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35     ` Tejun Heo
                     ` (9 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Add pd->plid so that the policy a pd belongs to can be identified
easily.  This will be used to implement hierarchical blkg_[rw]stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c | 2 ++
 block/blk-cgroup.h | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 10e1df9..3a8de32 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -113,6 +113,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 
 		blkg->pd[i] = pd;
 		pd->blkg = blkg;
+		pd->plid = i;
 
 		/* invoke per-policy init */
 		if (pol->pd_init_fn)
@@ -908,6 +909,7 @@ int blkcg_activate_policy(struct request_queue *q,
 
 		blkg->pd[pol->plid] = pd;
 		pd->blkg = blkg;
+		pd->plid = pol->plid;
 		pol->pd_init_fn(blkg);
 
 		spin_unlock(&blkg->blkcg->lock);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 2446225..40f5b97 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -81,8 +81,9 @@ struct blkg_rwstat {
  * beginning and pd_size can't be smaller than pd.
  */
 struct blkg_policy_data {
-	/* the blkg this per-policy data belongs to */
+	/* the blkg and policy id this per-policy data belongs to */
 	struct blkcg_gq			*blkg;
+	int				plid;
 
 	/* used during policy activation */
 	struct list_head		alloc_node;
-- 
1.8.0.2

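The idea of recording the policy id in each policy data can be sketched in userspace like this. The struct layouts are simplified stand-ins, not the kernel's definitions, and BLKCG_MAX_POLS is given an illustrative value.

```c
#include <assert.h>

#define BLKCG_MAX_POLS 2	/* illustrative value */

/* Minimal stand-in for blkg_policy_data; only the new plid field shown. */
struct blkg_policy_data {
	int plid;
};

/* Minimal stand-in for blkcg_gq: one pd slot per registered policy. */
struct blkcg_gq_sketch {
	struct blkg_policy_data *pd[BLKCG_MAX_POLS];
};

/* Mirrors blkg_alloc(): record in each pd which policy slot it occupies. */
static void init_pds(struct blkcg_gq_sketch *blkg,
		     struct blkg_policy_data *pds)
{
	int i;

	for (i = 0; i < BLKCG_MAX_POLS; i++) {
		blkg->pd[i] = &pds[i];
		pds[i].plid = i;
	}
}

/* With plid recorded, a pd can identify its own policy without any
 * external context - the property the hierarchical stats code needs. */
static int pd_policy_id(const struct blkcg_gq_sketch *blkg,
			const struct blkg_policy_data *pd)
{
	return blkg->pd[pd->plid] == pd ? pd->plid : -1;
}
```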
^ permalink raw reply related	[flat|nested] 131+ messages in thread


* [PATCH 17/24] blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
which are invoked as the policy_data gets activated and deactivated
while holding both blkcg and q locks.

Also, add a blkcg_gq->online bool, which is set and cleared as the
blkcg_gq gets activated and deactivated.  This flag is also toggled
while holding both blkcg and q locks.

These will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c | 21 ++++++++++++++++++++-
 block/blk-cgroup.h |  7 +++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3a8de32..4d625d2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,7 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
-	int ret;
+	int i, ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
@@ -218,7 +218,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 	if (likely(!ret)) {
 		hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 		list_add(&blkg->q_node, &q->blkg_list);
+
+		for (i = 0; i < BLKCG_MAX_POLS; i++) {
+			struct blkcg_policy *pol = blkcg_policy[i];
+
+			if (blkg->pd[i] && pol->pd_online_fn)
+				pol->pd_online_fn(blkg);
+		}
 	}
+	blkg->online = true;
 	spin_unlock(&blkcg->lock);
 
 	if (!ret)
@@ -291,6 +299,7 @@ EXPORT_SYMBOL_GPL(blkg_lookup_create);
 static void blkg_destroy(struct blkcg_gq *blkg)
 {
 	struct blkcg *blkcg = blkg->blkcg;
+	int i;
 
 	lockdep_assert_held(blkg->q->queue_lock);
 	lockdep_assert_held(&blkcg->lock);
@@ -299,6 +308,14 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
 	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 
+	for (i = 0; i < BLKCG_MAX_POLS; i++) {
+		struct blkcg_policy *pol = blkcg_policy[i];
+
+		if (blkg->pd[i] && pol->pd_offline_fn)
+			pol->pd_offline_fn(blkg);
+	}
+	blkg->online = false;
+
 	radix_tree_delete(&blkcg->blkg_tree, blkg->q->id);
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
@@ -956,6 +973,8 @@ void blkcg_deactivate_policy(struct request_queue *q,
 		/* grab blkcg lock too while removing @pd from @blkg */
 		spin_lock(&blkg->blkcg->lock);
 
+		if (pol->pd_offline_fn)
+			pol->pd_offline_fn(blkg);
 		if (pol->pd_exit_fn)
 			pol->pd_exit_fn(blkg);
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 40f5b97..678e89e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -106,12 +106,17 @@ struct blkcg_gq {
 	/* reference count */
 	int				refcnt;
 
+	/* is this blkg online? protected by both blkcg and q locks */
+	bool				online;
+
 	struct blkg_policy_data		*pd[BLKCG_MAX_POLS];
 
 	struct rcu_head			rcu_head;
 };
 
 typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
 typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
 typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
 
@@ -124,6 +129,8 @@ struct blkcg_policy {
 
 	/* operations */
 	blkcg_pol_init_pd_fn		*pd_init_fn;
+	blkcg_pol_online_pd_fn		*pd_online_fn;
+	blkcg_pol_offline_pd_fn		*pd_offline_fn;
 	blkcg_pol_exit_pd_fn		*pd_exit_fn;
 	blkcg_pol_reset_pd_stats_fn	*pd_reset_stats_fn;
 };
-- 
1.8.0.2

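The callback dispatch pattern used in blkg_create()/blkg_destroy() above (invoke a hook only if the policy has data on this blkg and actually provides the hook) can be sketched in userspace. All names here are illustrative; a NULL slot stands in for a policy with no pd on the blkg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a policy with the two new optional hooks. */
struct policy {
	void (*pd_online_fn)(int *counter);
	void (*pd_offline_fn)(int *counter);
};

static void count_hook(int *counter)
{
	(*counter)++;
}

/* Mirrors the loops in blkg_create()/blkg_destroy(): skip empty slots
 * and policies that don't implement the relevant hook. */
static void notify_pols(struct policy *pols[], int n, bool online,
			int *counter)
{
	int i;

	for (i = 0; i < n; i++) {
		struct policy *pol = pols[i];

		if (!pol)
			continue;	/* no policy data for this slot */
		if (online && pol->pd_online_fn)
			pol->pd_online_fn(counter);
		else if (!online && pol->pd_offline_fn)
			pol->pd_offline_fn(counter);
	}
}
```

A policy may implement neither, either, or both hooks; the dispatch loop simply skips whatever is absent.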
^ permalink raw reply related	[flat|nested] 131+ messages in thread


* [PATCH 18/24] blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Rename blkg_rwstat_sum() to blkg_rwstat_total().  The name "sum" will
be used for summing up stats from multiple blkgs.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.h  | 4 ++--
 block/cfq-iosched.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 678e89e..586c0ac 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -461,14 +461,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
 }
 
 /**
- * blkg_rwstat_sum - read the total count of a blkg_rwstat
+ * blkg_rwstat_total - read the total count of a blkg_rwstat
  * @rwstat: blkg_rwstat to read
  *
  * Return the total count of @rwstat regardless of the IO direction.  This
  * function can be called without synchronization and takes care of u64
  * atomicity.
  */
-static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
+static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
 {
 	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e8f3106..d43145cc 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -536,7 +536,7 @@ static void cfqg_stats_set_start_empty_time(struct cfq_group *cfqg)
 {
 	struct cfqg_stats *stats = &cfqg->stats;
 
-	if (blkg_rwstat_sum(&stats->queued))
+	if (blkg_rwstat_total(&stats->queued))
 		return;
 
 	/*
@@ -580,7 +580,7 @@ static void cfqg_stats_update_avg_queue_size(struct cfq_group *cfqg)
 	struct cfqg_stats *stats = &cfqg->stats;
 
 	blkg_stat_add(&stats->avg_queue_size_sum,
-		      blkg_rwstat_sum(&stats->queued));
+		      blkg_rwstat_total(&stats->queued));
 	blkg_stat_add(&stats->avg_queue_size_samples, 1);
 	cfqg_stats_update_group_wait_time(stats);
 }
-- 
1.8.0.2

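The renamed helper's semantics (total count regardless of IO direction) can be shown with a simplified userspace stand-in. The real blkg_rwstat also tracks sync/async counters and handles u64 atomicity, which this sketch omits; the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for blkg_rwstat: one counter per IO direction. */
enum { RWSTAT_READ, RWSTAT_WRITE, RWSTAT_NR };

struct rwstat_sketch {
	uint64_t cnt[RWSTAT_NR];
};

/* Equivalent of blkg_rwstat_total(): sum both directions. */
static uint64_t rwstat_total(const struct rwstat_sketch *rwstat)
{
	return rwstat->cnt[RWSTAT_READ] + rwstat->cnt[RWSTAT_WRITE];
}
```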
^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 18/24] blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
summing up stats from multiple blkgs.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.h  | 4 ++--
 block/cfq-iosched.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 678e89e..586c0ac 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -461,14 +461,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
 }
 
 /**
- * blkg_rwstat_sum - read the total count of a blkg_rwstat
+ * blkg_rwstat_total - read the total count of a blkg_rwstat
  * @rwstat: blkg_rwstat to read
  *
  * Return the total count of @rwstat regardless of the IO direction.  This
  * function can be called without synchronization and takes care of u64
  * atomicity.
  */
-static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
+static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
 {
 	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e8f3106..d43145cc 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -536,7 +536,7 @@ static void cfqg_stats_set_start_empty_time(struct cfq_group *cfqg)
 {
 	struct cfqg_stats *stats = &cfqg->stats;
 
-	if (blkg_rwstat_sum(&stats->queued))
+	if (blkg_rwstat_total(&stats->queued))
 		return;
 
 	/*
@@ -580,7 +580,7 @@ static void cfqg_stats_update_avg_queue_size(struct cfq_group *cfqg)
 	struct cfqg_stats *stats = &cfqg->stats;
 
 	blkg_stat_add(&stats->avg_queue_size_sum,
-		      blkg_rwstat_sum(&stats->queued));
+		      blkg_rwstat_total(&stats->queued));
 	blkg_stat_add(&stats->avg_queue_size_samples, 1);
 	cfqg_stats_update_group_wait_time(stats);
 }
-- 
1.8.0.2



* [PATCH 19/24] blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (17 preceding siblings ...)
  2012-12-28 20:35     ` Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 20/24] block: RCU free request_queue Tejun Heo
                     ` (6 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
The former two collect the [rw]stats designated by the target policy
data and offset from the pd's subtree.  The latter two add one
[rw]stat to another.

Note that the recursive sum functions require the queue lock to be
held on entry to make blkg online test reliable.  This is necessary to
properly handle stats of a dying blkg.

These will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-cgroup.h |  35 ++++++++++++++++++
 2 files changed, 142 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4d625d2..a1a4b97 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -32,6 +32,26 @@ EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
+static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
+				      struct request_queue *q, bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @d_blkg through the descendants of @p_blkg.  Must be used with RCU
+ * read locked.  If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs.  The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip a
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
+					      (p_blkg)->q, false)))
+
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
@@ -127,6 +147,17 @@ err_free:
 	return NULL;
 }
 
+/**
+ * __blkg_lookup - internal version of blkg_lookup()
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ * @update_hint: whether to update lookup hint with the result or not
+ *
+ * This is the internal version and shouldn't be used by policy
+ * implementations.  Looks up blkgs for the @blkcg - @q pair regardless of
+ * @q's bypass state.  If @update_hint is %true, the caller should be
+ * holding @q->queue_lock and lookup hint is updated on success.
+ */
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 				      struct request_queue *q, bool update_hint)
 {
@@ -585,6 +616,82 @@ u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
 
 /**
+ * blkg_stat_recursive_sum - collect hierarchical blkg_stat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Collect the blkg_stat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	u64 sum;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_stat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_stat *stat = (void *)pos_pd + off;
+
+		if (pos_blkg->online)
+			sum += blkg_stat_read(stat);
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum);
+
+/**
+ * blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_rwstat in @pd
+ *
+ * Collect the blkg_rwstat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	struct blkg_rwstat sum;
+	int i;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_rwstat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_rwstat *rwstat = (void *)pos_pd + off;
+		struct blkg_rwstat tmp;
+
+		if (!pos_blkg->online)
+			continue;
+
+		tmp = blkg_rwstat_read(rwstat);
+
+		for (i = 0; i < BLKG_RWSTAT_NR; i++)
+			sum.cnt[i] += tmp.cnt[i];
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
+
+/**
  * blkg_conf_prep - parse and prepare for per-blkg config update
  * @blkcg: target block cgroup
  * @pol: target policy
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 586c0ac..f2b2929 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -164,6 +164,10 @@ u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 		       int off);
 
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off);
+
 struct blkg_conf_ctx {
 	struct gendisk			*disk;
 	struct blkcg_gq			*blkg;
@@ -414,6 +418,18 @@ static inline void blkg_stat_reset(struct blkg_stat *stat)
 }
 
 /**
+ * blkg_stat_merge - merge a blkg_stat into another
+ * @to: the destination blkg_stat
+ * @from: the source
+ *
+ * Add @from's count to @to.
+ */
+static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
+{
+	blkg_stat_add(to, blkg_stat_read(from));
+}
+
+/**
  * blkg_rwstat_add - add a value to a blkg_rwstat
  * @rwstat: target blkg_rwstat
  * @rw: mask of REQ_{WRITE|SYNC}
@@ -484,6 +500,25 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
 	memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
 }
 
+/**
+ * blkg_rwstat_merge - merge a blkg_rwstat into another
+ * @to: the destination blkg_rwstat
+ * @from: the source
+ *
+ * Add @from's counts to @to.
+ */
+static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
+				     struct blkg_rwstat *from)
+{
+	struct blkg_rwstat v = blkg_rwstat_read(from);
+	int i;
+
+	u64_stats_update_begin(&to->syncp);
+	for (i = 0; i < BLKG_RWSTAT_NR; i++)
+		to->cnt[i] += v.cnt[i];
+	u64_stats_update_end(&to->syncp);
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct cgroup;
-- 
1.8.0.2


* [PATCH 19/24] blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  2012-12-28 20:35 ` Tejun Heo
                   ` (5 preceding siblings ...)
  (?)
@ 2012-12-28 20:35 ` Tejun Heo
       [not found]   ` <1356726946-26037-20-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  -1 siblings, 1 reply; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
The former two collect the [rw]stats designated by the target policy
data and offset from the pd's subtree.  The latter two add one
[rw]stat to another.

Note that the recursive sum functions require the queue lock to be
held on entry to make blkg online test reliable.  This is necessary to
properly handle stats of a dying blkg.

These will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-cgroup.h |  35 ++++++++++++++++++
 2 files changed, 142 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4d625d2..a1a4b97 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -32,6 +32,26 @@ EXPORT_SYMBOL_GPL(blkcg_root);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
+static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
+				      struct request_queue *q, bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_cgrp: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @d_blkg through the descendants of @p_blkg.  Must be used with RCU
+ * read locked.  If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs.  The caller may
+ * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip a
+ * subtree.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
+	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
+		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
+					      (p_blkg)->q, false)))
+
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
 {
@@ -127,6 +147,17 @@ err_free:
 	return NULL;
 }
 
+/**
+ * __blkg_lookup - internal version of blkg_lookup()
+ * @blkcg: blkcg of interest
+ * @q: request_queue of interest
+ * @update_hint: whether to update lookup hint with the result or not
+ *
+ * This is the internal version and shouldn't be used by policy
+ * implementations.  Looks up blkgs for the @blkcg - @q pair regardless of
+ * @q's bypass state.  If @update_hint is %true, the caller should be
+ * holding @q->queue_lock and lookup hint is updated on success.
+ */
 static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
 				      struct request_queue *q, bool update_hint)
 {
@@ -585,6 +616,82 @@ u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
 
 /**
+ * blkg_stat_recursive_sum - collect hierarchical blkg_stat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_stat in @pd
+ *
+ * Collect the blkg_stat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	u64 sum;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_stat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_stat *stat = (void *)pos_pd + off;
+
+		if (pos_blkg->online)
+			sum += blkg_stat_read(stat);
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum);
+
+/**
+ * blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat
+ * @pd: policy private data of interest
+ * @off: offset to the blkg_rwstat in @pd
+ *
+ * Collect the blkg_rwstat specified by @off from @pd and all its online
+ * descendants and return the sum.  The caller must be holding the queue
+ * lock for online tests.
+ */
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off)
+{
+	struct blkcg_policy *pol = blkcg_policy[pd->plid];
+	struct blkcg_gq *pos_blkg;
+	struct cgroup *pos_cgrp;
+	struct blkg_rwstat sum;
+	int i;
+
+	lockdep_assert_held(pd->blkg->q->queue_lock);
+
+	sum = blkg_rwstat_read((void *)pd + off);
+
+	rcu_read_lock();
+	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
+		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
+		struct blkg_rwstat *rwstat = (void *)pos_pd + off;
+		struct blkg_rwstat tmp;
+
+		if (!pos_blkg->online)
+			continue;
+
+		tmp = blkg_rwstat_read(rwstat);
+
+		for (i = 0; i < BLKG_RWSTAT_NR; i++)
+			sum.cnt[i] += tmp.cnt[i];
+	}
+	rcu_read_unlock();
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
+
+/**
  * blkg_conf_prep - parse and prepare for per-blkg config update
  * @blkcg: target block cgroup
  * @pol: target policy
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 586c0ac..f2b2929 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -164,6 +164,10 @@ u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
 u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
 		       int off);
 
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+					     int off);
+
 struct blkg_conf_ctx {
 	struct gendisk			*disk;
 	struct blkcg_gq			*blkg;
@@ -414,6 +418,18 @@ static inline void blkg_stat_reset(struct blkg_stat *stat)
 }
 
 /**
+ * blkg_stat_merge - merge a blkg_stat into another
+ * @to: the destination blkg_stat
+ * @from: the source
+ *
+ * Add @from's count to @to.
+ */
+static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
+{
+	blkg_stat_add(to, blkg_stat_read(from));
+}
+
+/**
  * blkg_rwstat_add - add a value to a blkg_rwstat
  * @rwstat: target blkg_rwstat
  * @rw: mask of REQ_{WRITE|SYNC}
@@ -484,6 +500,25 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
 	memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
 }
 
+/**
+ * blkg_rwstat_merge - merge a blkg_rwstat into another
+ * @to: the destination blkg_rwstat
+ * @from: the source
+ *
+ * Add @from's counts to @to.
+ */
+static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
+				     struct blkg_rwstat *from)
+{
+	struct blkg_rwstat v = blkg_rwstat_read(from);
+	int i;
+
+	u64_stats_update_begin(&to->syncp);
+	for (i = 0; i < BLKG_RWSTAT_NR; i++)
+		to->cnt[i] += v.cnt[i];
+	u64_stats_update_end(&to->syncp);
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct cgroup;
-- 
1.8.0.2



* [PATCH 20/24] block: RCU free request_queue
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (18 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 19/24] blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35     ` Tejun Heo
                     ` (5 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

RCU free request_queue so that blkcg_gq->q can be dereferenced under
RCU lock.  This will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-sysfs.c      | 9 ++++++++-
 include/linux/blkdev.h | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7881477..6206a93 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 	return res;
 }
 
+static void blk_free_queue_rcu(struct rcu_head *rcu_head)
+{
+	struct request_queue *q = container_of(rcu_head, struct request_queue,
+					       rcu_head);
+	kmem_cache_free(blk_requestq_cachep, q);
+}
+
 /**
  * blk_release_queue: - release a &struct request_queue when it is no longer needed
  * @kobj:    the kobj belonging to the request queue to be released
@@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj)
 	bdi_destroy(&q->backing_dev_info);
 
 	ida_simple_remove(&blk_queue_ida, q->id);
-	kmem_cache_free(blk_requestq_cachep, q);
+	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
 static const struct sysfs_ops queue_sysfs_ops = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94bc83..406343c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -19,6 +19,7 @@
 #include <linux/gfp.h>
 #include <linux/bsg.h>
 #include <linux/smp.h>
+#include <linux/rcupdate.h>
 
 #include <asm/scatterlist.h>
 
@@ -437,6 +438,7 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+	struct rcu_head		rcu_head;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
1.8.0.2


* [PATCH 20/24] block: RCU free request_queue
  2012-12-28 20:35 ` Tejun Heo
                   ` (6 preceding siblings ...)
  (?)
@ 2012-12-28 20:35 ` Tejun Heo
  2013-01-08 18:05     ` Vivek Goyal
       [not found]   ` <1356726946-26037-21-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  -1 siblings, 2 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

RCU free request_queue so that blkcg_gq->q can be dereferenced under
RCU lock.  This will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-sysfs.c      | 9 ++++++++-
 include/linux/blkdev.h | 2 ++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7881477..6206a93 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
 	return res;
 }
 
+static void blk_free_queue_rcu(struct rcu_head *rcu_head)
+{
+	struct request_queue *q = container_of(rcu_head, struct request_queue,
+					       rcu_head);
+	kmem_cache_free(blk_requestq_cachep, q);
+}
+
 /**
  * blk_release_queue: - release a &struct request_queue when it is no longer needed
  * @kobj:    the kobj belonging to the request queue to be released
@@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj)
 	bdi_destroy(&q->backing_dev_info);
 
 	ida_simple_remove(&blk_queue_ida, q->id);
-	kmem_cache_free(blk_requestq_cachep, q);
+	call_rcu(&q->rcu_head, blk_free_queue_rcu);
 }
 
 static const struct sysfs_ops queue_sysfs_ops = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f94bc83..406343c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -19,6 +19,7 @@
 #include <linux/gfp.h>
 #include <linux/bsg.h>
 #include <linux/smp.h>
+#include <linux/rcupdate.h>
 
 #include <asm/scatterlist.h>
 
@@ -437,6 +438,7 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+	struct rcu_head		rcu_head;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
-- 
1.8.0.2



* [PATCH 21/24] blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  2012-12-28 20:35 ` Tejun Heo
@ 2012-12-28 20:35     ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Instead of holding blkcg->lock while walking ->blkg_list and executing
prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
executing prfill().  This makes prfill() implementations easier as
stats are mostly protected by queue lock.

This will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/blk-cgroup.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a1a4b97..22f75d1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -504,8 +504,9 @@ static const char *blkg_dev_name(struct blkcg_gq *blkg)
  *
  * This function invokes @prfill on each blkg of @blkcg if pd for the
  * policy specified by @pol exists.  @prfill is invoked with @sf, the
- * policy data and @data.  If @show_total is %true, the sum of the return
- * values from @prfill is printed with "Total" label at the end.
+ * policy data and @data and the matching queue lock held.  If @show_total
+ * is %true, the sum of the return values from @prfill is printed with
+ * "Total" label at the end.
  *
  * This is to be used to construct print functions for
  * cftype->read_seq_string method.
@@ -520,11 +521,14 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
 	struct hlist_node *n;
 	u64 total = 0;
 
-	spin_lock_irq(&blkcg->lock);
-	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+		spin_lock_irq(blkg->q->queue_lock);
 		if (blkcg_policy_enabled(blkg->q, pol))
 			total += prfill(sf, blkg->pd[pol->plid], data);
-	spin_unlock_irq(&blkcg->lock);
+		spin_unlock_irq(blkg->q->queue_lock);
+	}
+	rcu_read_unlock();
 
 	if (show_total)
 		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
-- 
1.8.0.2


* [PATCH 21/24] blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
@ 2012-12-28 20:35     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Instead of holding blkcg->lock while walking ->blkg_list and executing
prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
executing prfill().  This makes prfill() implementations easier as
stats are mostly protected by queue lock.

This will be used to implement hierarchical stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index a1a4b97..22f75d1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -504,8 +504,9 @@ static const char *blkg_dev_name(struct blkcg_gq *blkg)
  *
  * This function invokes @prfill on each blkg of @blkcg if pd for the
  * policy specified by @pol exists.  @prfill is invoked with @sf, the
- * policy data and @data.  If @show_total is %true, the sum of the return
- * values from @prfill is printed with "Total" label at the end.
+ * policy data and @data and the matching queue lock held.  If @show_total
+ * is %true, the sum of the return values from @prfill is printed with
+ * "Total" label at the end.
  *
  * This is to be used to construct print functions for
  * cftype->read_seq_string method.
@@ -520,11 +521,14 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
 	struct hlist_node *n;
 	u64 total = 0;
 
-	spin_lock_irq(&blkcg->lock);
-	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+		spin_lock_irq(blkg->q->queue_lock);
 		if (blkcg_policy_enabled(blkg->q, pol))
 			total += prfill(sf, blkg->pd[pol->plid], data);
-	spin_unlock_irq(&blkcg->lock);
+		spin_unlock_irq(blkg->q->queue_lock);
+	}
+	rcu_read_unlock();
 
 	if (show_total)
 		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
-- 
1.8.0.2



* [PATCH 22/24] cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (20 preceding siblings ...)
  2012-12-28 20:35     ` Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs Tejun Heo
                     ` (3 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
cfq_pd_reset_stats() and move the latter to where other pd methods are
defined.  cfqg_stats_reset() will be used to implement hierarchical
stats.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d43145cc..f8b34bb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -688,11 +688,9 @@ static inline void cfqg_stats_update_completion(struct cfq_group *cfqg,
 				io_start_time - start_time);
 }
 
-static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+/* @stats = 0 */
+static void cfqg_stats_reset(struct cfqg_stats *stats)
 {
-	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
-	struct cfqg_stats *stats = &cfqg->stats;
-
 	/* queued stats shouldn't be cleared */
 	blkg_rwstat_reset(&stats->service_bytes);
 	blkg_rwstat_reset(&stats->serviced);
@@ -1477,6 +1475,13 @@ static void cfq_pd_init(struct blkcg_gq *blkg)
 	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
 }
 
+static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+{
+	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
+
+	cfqg_stats_reset(&cfqg->stats);
+}
+
 /*
  * Search for the cfq group current task belongs to. request_queue lock must
  * be held.
-- 
1.8.0.2


* [PATCH 22/24] cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  2012-12-28 20:35 ` Tejun Heo
                   ` (7 preceding siblings ...)
  (?)
@ 2012-12-28 20:35 ` Tejun Heo
       [not found]   ` <1356726946-26037-23-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  -1 siblings, 1 reply; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
cfq_pd_reset_stats() and move the latter to where other pd methods are
defined.  cfqg_stats_reset() will be used to implement hierarchical
stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d43145cc..f8b34bb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -688,11 +688,9 @@ static inline void cfqg_stats_update_completion(struct cfq_group *cfqg,
 				io_start_time - start_time);
 }
 
-static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+/* @stats = 0 */
+static void cfqg_stats_reset(struct cfqg_stats *stats)
 {
-	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
-	struct cfqg_stats *stats = &cfqg->stats;
-
 	/* queued stats shouldn't be cleared */
 	blkg_rwstat_reset(&stats->service_bytes);
 	blkg_rwstat_reset(&stats->serviced);
@@ -1477,6 +1475,13 @@ static void cfq_pd_init(struct blkcg_gq *blkg)
 	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
 }
 
+static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
+{
+	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
+
+	cfqg_stats_reset(&cfqg->stats);
+}
+
 /*
  * Search for the cfq group current task belongs to. request_queue lock must
  * be held.
-- 
1.8.0.2



* [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (21 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 22/24] cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 20:35   ` [PATCH 24/24] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
                     ` (2 subsequent siblings)
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA

To support hierarchical stats, it's necessary to remember stats from
dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
its stats to the parent's dead-stats.

The transfer happens from ->pd_offline_fn() and it is possible that
there are some residual IOs completing afterwards.  Currently, we lose
these stats.  Given that cgroup removal isn't a very high frequency
operation and the amount of residual IOs on offline are likely to be
nil or small, this shouldn't be a big deal and the complexity needed
to handle residual IOs - another callback and rather elaborate
synchronization to reach and lock the matching q - doesn't seem
justified.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 block/cfq-iosched.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f8b34bb..4d75b79 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -289,7 +289,8 @@ struct cfq_group {
 	/* number of requests that are on the dispatch list or inside driver */
 	int dispatched;
 	struct cfq_ttime ttime;
-	struct cfqg_stats stats;
+	struct cfqg_stats stats;	/* stats for this cfqg */
+	struct cfqg_stats dead_stats;	/* stats pushed from dead children */
 };
 
 struct cfq_io_cq {
@@ -709,6 +710,47 @@ static void cfqg_stats_reset(struct cfqg_stats *stats)
 #endif
 }
 
+/* @to += @from */
+static void cfqg_stats_merge(struct cfqg_stats *to, struct cfqg_stats *from)
+{
+	/* queued stats shouldn't be cleared */
+	blkg_rwstat_merge(&to->service_bytes, &from->service_bytes);
+	blkg_rwstat_merge(&to->serviced, &from->serviced);
+	blkg_rwstat_merge(&to->merged, &from->merged);
+	blkg_rwstat_merge(&to->service_time, &from->service_time);
+	blkg_rwstat_merge(&to->wait_time, &from->wait_time);
+	blkg_stat_merge(&to->time, &from->time);
+#ifdef CONFIG_DEBUG_BLK_CGROUP
+	blkg_stat_merge(&to->unaccounted_time, &from->unaccounted_time);
+	blkg_stat_merge(&to->avg_queue_size_sum, &from->avg_queue_size_sum);
+	blkg_stat_merge(&to->avg_queue_size_samples, &from->avg_queue_size_samples);
+	blkg_stat_merge(&to->dequeue, &from->dequeue);
+	blkg_stat_merge(&to->group_wait_time, &from->group_wait_time);
+	blkg_stat_merge(&to->idle_time, &from->idle_time);
+	blkg_stat_merge(&to->empty_time, &from->empty_time);
+#endif
+}
+
+/*
+ * Transfer @cfqg's stats to its parent's dead_stats so that the ancestors'
+ * recursive stats can still account for the amount used by this cfqg after
+ * it's gone.
+ */
+static void cfqg_stats_xfer_dead(struct cfq_group *cfqg)
+{
+	struct cfq_group *parent = cfqg_parent(cfqg);
+
+	lockdep_assert_held(cfqg_to_blkg(cfqg)->q->queue_lock);
+
+	if (unlikely(!parent))
+		return;
+
+	cfqg_stats_merge(&parent->dead_stats, &cfqg->stats);
+	cfqg_stats_merge(&parent->dead_stats, &cfqg->dead_stats);
+	cfqg_stats_reset(&cfqg->stats);
+	cfqg_stats_reset(&cfqg->dead_stats);
+}
+
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
 static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
@@ -1475,11 +1517,23 @@ static void cfq_pd_init(struct blkcg_gq *blkg)
 	cfqg->leaf_weight = blkg->blkcg->cfq_leaf_weight;
 }
 
+static void cfq_pd_offline(struct blkcg_gq *blkg)
+{
+	/*
+	 * @blkg is going offline and will be ignored by
+	 * blkg_[rw]stat_recursive_sum().  Transfer stats to the parent so
+	 * that they don't get lost.  If IOs complete after this point, the
+	 * stats for them will be lost.  Oh well...
+	 */
+	cfqg_stats_xfer_dead(blkg_to_cfqg(blkg));
+}
+
 static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 {
 	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
 
 	cfqg_stats_reset(&cfqg->stats);
+	cfqg_stats_reset(&cfqg->dead_stats);
 }
 
 /*
@@ -4408,6 +4462,7 @@ static struct blkcg_policy blkcg_policy_cfq = {
 	.cftypes		= cfq_blkcg_files,
 
 	.pd_init_fn		= cfq_pd_init,
+	.pd_offline_fn		= cfq_pd_offline,
 	.pd_reset_stats_fn	= cfq_pd_reset_stats,
 };
 #endif
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 24/24] cfq-iosched: add hierarchical cfq_group statistics
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (22 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs Tejun Heo
@ 2012-12-28 20:35   ` Tejun Heo
  2012-12-28 23:18   ` [PATCH 18.5/24] blkcg: export __blkg_prfill_rwstat() take#2 Tejun Heo
  2013-01-02 18:20   ` [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2 Vivek Goyal
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Tejun Heo

Unfortunately, at this point, there's no way to make the existing
statistics hierarchical without creating nasty surprises for the
existing users.  Just create recursive counterparts of the existing
stats.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/cfq-iosched.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 4d75b79..b66365b 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1528,6 +1528,32 @@ static void cfq_pd_offline(struct blkcg_gq *blkg)
 	cfqg_stats_xfer_dead(blkg_to_cfqg(blkg));
 }
 
+/* offset delta from cfqg->stats to cfqg->dead_stats */
+static const int dead_stats_off_delta = offsetof(struct cfq_group, dead_stats) -
+					offsetof(struct cfq_group, stats);
+
+/* to be used by recursive prfill, sums live and dead stats recursively */
+static u64 cfqg_stat_pd_recursive_sum(struct blkg_policy_data *pd, int off)
+{
+	u64 sum = 0;
+
+	sum += blkg_stat_recursive_sum(pd, off);
+	sum += blkg_stat_recursive_sum(pd, off + dead_stats_off_delta);
+	return sum;
+}
+
+/* to be used by recursive prfill, sums live and dead rwstats recursively */
+static struct blkg_rwstat cfqg_rwstat_pd_recursive_sum(struct blkg_policy_data *pd,
+						       int off)
+{
+	struct blkg_rwstat a, b;
+
+	a = blkg_rwstat_recursive_sum(pd, off);
+	b = blkg_rwstat_recursive_sum(pd, off + dead_stats_off_delta);
+	blkg_rwstat_merge(&a, &b);
+	return a;
+}
+
 static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
 {
 	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
@@ -1732,6 +1758,42 @@ static int cfqg_print_rwstat(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 cfqg_prfill_stat_recursive(struct seq_file *sf,
+				      struct blkg_policy_data *pd, int off)
+{
+	u64 sum = cfqg_stat_pd_recursive_sum(pd, off);
+
+	return __blkg_prfill_u64(sf, pd, sum);
+}
+
+static u64 cfqg_prfill_rwstat_recursive(struct seq_file *sf,
+					struct blkg_policy_data *pd, int off)
+{
+	struct blkg_rwstat sum = cfqg_rwstat_pd_recursive_sum(pd, off);
+
+	return __blkg_prfill_rwstat(sf, pd, &sum);
+}
+
+static int cfqg_print_stat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				     struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_stat_recursive,
+			  &blkcg_policy_cfq, cft->private, false);
+	return 0;
+}
+
+static int cfqg_print_rwstat_recursive(struct cgroup *cgrp, struct cftype *cft,
+				       struct seq_file *sf)
+{
+	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
+
+	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_rwstat_recursive,
+			  &blkcg_policy_cfq, cft->private, true);
+	return 0;
+}
+
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 static u64 cfqg_prfill_avg_queue_size(struct seq_file *sf,
 				      struct blkg_policy_data *pd, int off)
@@ -1803,6 +1865,7 @@ static struct cftype cfq_blkcg_files[] = {
 		.write_u64 = cfq_set_leaf_weight,
 	},
 
+	/* statistics, covers only the tasks in the cfqg */
 	{
 		.name = "time",
 		.private = offsetof(struct cfq_group, stats.time),
@@ -1843,6 +1906,48 @@ static struct cftype cfq_blkcg_files[] = {
 		.private = offsetof(struct cfq_group, stats.queued),
 		.read_seq_string = cfqg_print_rwstat,
 	},
+
+	/* the same statistics which cover the cfqg and its descendants */
+	{
+		.name = "time_recursive",
+		.private = offsetof(struct cfq_group, stats.time),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "sectors_recursive",
+		.private = offsetof(struct cfq_group, stats.sectors),
+		.read_seq_string = cfqg_print_stat_recursive,
+	},
+	{
+		.name = "io_service_bytes_recursive",
+		.private = offsetof(struct cfq_group, stats.service_bytes),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_serviced_recursive",
+		.private = offsetof(struct cfq_group, stats.serviced),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_service_time_recursive",
+		.private = offsetof(struct cfq_group, stats.service_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_wait_time_recursive",
+		.private = offsetof(struct cfq_group, stats.wait_time),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_merged_recursive",
+		.private = offsetof(struct cfq_group, stats.merged),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
+	{
+		.name = "io_queued_recursive",
+		.private = offsetof(struct cfq_group, stats.queued),
+		.read_seq_string = cfqg_print_rwstat_recursive,
+	},
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	{
 		.name = "avg_queue_size",
-- 
1.8.0.2

^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH 18.5/24] blkcg: export __blkg_prfill_rwstat() take#2
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (23 preceding siblings ...)
  2012-12-28 20:35   ` [PATCH 24/24] cfq-iosched: add hierarchical cfq_group statistics Tejun Heo
@ 2012-12-28 23:18   ` Tejun Heo
  2013-01-02 18:20   ` [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2 Vivek Goyal
  25 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 23:18 UTC (permalink / raw)
  To: lizefan, axboe, vgoyal
  Cc: containers, cgroups, linux-kernel, ctalbott, rni, Fengguang Wu

Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
Export it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
---
Fengguang's build test discovered that cfq now uses
__blkg_prfill_rwstat(), which isn't exported, leading to build failures
when cfq is built as a module.  Export it.  This doesn't affect
!module builds.  Git branch updated accordingly.

Thanks.

 block/blk-cgroup.c |    1 +
 1 file changed, 1 insertion(+)

--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -552,6 +552,7 @@ u64 __blkg_prfill_rwstat(struct seq_file
 	seq_printf(sf, "%s Total %llu\n", dname, (unsigned long long)v);
 	return v;
 }
+EXPORT_SYMBOL_GPL(__blkg_prfill_rwstat);
 
 /**
  * blkg_prfill_stat - prfill callback for blkg_stat

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
       [not found]   ` <1356726946-26037-24-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-02 16:24     ` Vivek Goyal
  2013-01-08 18:12     ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 16:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

On Fri, Dec 28, 2012 at 12:35:45PM -0800, Tejun Heo wrote:
> To support hierarchical stats, it's necessary to remember stats from
> dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
> its stats to the parent's dead-stats.

Hi Tejun,

Why not directly transfer stats to cfqg->stats?  IOW, what's the advantage
of maintaining dead_stats separately?

[..]
> + * Transfer @cfqg's stats to its parent's dead_stats so that the ancestors'
> + * recursive stats can still account for the amount used by this cfqg after
> + * it's gone.
> + */
> +static void cfqg_stats_xfer_dead(struct cfq_group *cfqg)
> +{
> +	struct cfq_group *parent = cfqg_parent(cfqg);
> +
> +	lockdep_assert_held(cfqg_to_blkg(cfqg)->q->queue_lock);
> +
> +	if (unlikely(!parent))
> +		return;
> +
> +	cfqg_stats_merge(&parent->dead_stats, &cfqg->stats);
> +	cfqg_stats_merge(&parent->dead_stats, &cfqg->dead_stats);
> +	cfqg_stats_reset(&cfqg->stats);
> +	cfqg_stats_reset(&cfqg->dead_stats);

Anyway, the group will be marked offline and later freed, so resetting
stats might not be required.

In fact, if we have a reliable way of resetting stats, then the
online/offline infrastructure might not be required?  I think per-cpu
stats will be a problem though, and that's why we probably require logic
to online/offline the group?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
       [not found]     ` <20130102162415.GA4306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-02 16:30       ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-02 16:30 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: lizefan, axboe, containers, cgroups, linux-kernel, ctalbott, rni

Hey, Vivek.

On Wed, Jan 02, 2013 at 11:24:15AM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2012 at 12:35:45PM -0800, Tejun Heo wrote:
> > To support hierarchical stats, it's necessary to remember stats from
> > dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
> > its stats to the parent's dead-stats.
> 
> Hi Tejun,
> 
> Why not directly transfer stats to cfqg->stats. IOW, what's the advantage
> of maintaining dead_stats separately.

Backward compatibility?  The existing stat cgroupfs files expect to
see non-recursive stats.

> > +	cfqg_stats_merge(&parent->dead_stats, &cfqg->stats);
> > +	cfqg_stats_merge(&parent->dead_stats, &cfqg->dead_stats);
> > +	cfqg_stats_reset(&cfqg->stats);
> > +	cfqg_stats_reset(&cfqg->dead_stats);
> 
> Anyway group will be marked offline and later freed. So resetting stats
> might not be required.

Yeah, it isn't strictly necessary.  I initially tried to transfer the
residual IOs between offline and release, hence the resetting.  I kinda
like them there tho.

> In fact if we have a realiable way of resetting status then online/offline
> infrastructure might not be required? I think per cpu stats will be a
> problem though and that's why we probably require logic to online/offline
> the group?

Hmmm?  What do you mean?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
  2013-01-02 16:30       ` Tejun Heo
@ 2013-01-02 16:44           ` Vivek Goyal
  -1 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 16:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 02, 2013 at 11:30:10AM -0500, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Wed, Jan 02, 2013 at 11:24:15AM -0500, Vivek Goyal wrote:
> > On Fri, Dec 28, 2012 at 12:35:45PM -0800, Tejun Heo wrote:
> > > To support hierarchical stats, it's necessary to remember stats from
> > > dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
> > > its stats to the parent's dead-stats.
> > 
> > Hi Tejun,
> > 
> > Why not directly transfer stats to cfqg->stats. IOW, what's the advantage
> > of maintaining dead_stats separately.
> 
> Backward compatibility?  The existing stat cgroupfs files expect to
> see non-recursive stats.

Oh yes. Missed that. So dead_stats makes sense.

[..]
> > In fact if we have a reliable way of resetting stats then online/offline
> > infrastructure might not be required? I think per cpu stats will be a
> > problem though and that's why we probably require logic to online/offline
> > the group?
> 
> Hmmm?  What do you mean?

I mean that if we had a reliable way of resetting stats after
transferring, we would not need to keep track of whether a group is
online or offline.  We could merge everything, and merging zeroed stats
would not change anything.  In fact, it would also take care of residual
IO (IO which happened after the transfer of stats).

Or maybe I have missed the real reason why we have the group
online/offline infrastructure.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
       [not found]           ` <20130102164415.GB4306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-02 16:52             ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-02 16:52 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Vivek.

On Wed, Jan 02, 2013 at 11:44:15AM -0500, Vivek Goyal wrote:
> > > In fact if we have a reliable way of resetting stats then online/offline
> > > infrastructure might not be required? I think per cpu stats will be a
> > > problem though and that's why we probably require logic to online/offline
> > > the group?
> > 
> > Hmmm?  What do you mean?
> 
> I mean if we had a reliable way of resetting stats after transferring
> then we would not need to keep a track of whether group is online/offline.
> We could add everything and adding zero will not change anything. In fact
> it will also take care of residual IO (IO which happened after transfer
> of stats).

Ah... yeah, if we can do an atomic transfer, we might be able to do
away with on/offlining.  I couldn't think of a way to do that without
incurring overhead on hot paths.

> Or I missed the real reason of why do we have group online/offline
> infrastructure.

But given that on/offline state is something common to cgroup, I don't
think adding the states to blkcg is a bad idea.  We need it for
reliable iterations anyways.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2
       [not found] ` <1356726946-26037-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                     ` (24 preceding siblings ...)
  2012-12-28 23:18   ` [PATCH 18.5/24] blkcg: export __blkg_prfill_rwstat() take#2 Tejun Heo
@ 2013-01-02 18:20   ` Vivek Goyal
  25 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 18:20 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:22PM -0800, Tejun Heo wrote:

[..]
> 
> * Updated to reflect Vivek's reviews - renames & documentation.

Hi Tejun,

You forgot to update blkio-controller.txt.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 20/24] block: RCU free request_queue
  2012-12-28 20:35 ` [PATCH 20/24] block: RCU free request_queue Tejun Heo
@ 2013-01-02 18:48       ` Vivek Goyal
       [not found]   ` <1356726946-26037-21-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 18:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:42PM -0800, Tejun Heo wrote:
> RCU free request_queue so that blkcg_gq->q can be dereferenced under
> RCU lock.  This will be used to implement hierarchical stats.

Can we just have the blkg take a reference on ->q in blkg_alloc() and
drop that reference in blkg_free(), instead of RCU-freeing the queue?

Thanks
Vivek

> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
>  block/blk-sysfs.c      | 9 ++++++++-
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7881477..6206a93 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
>  	return res;
>  }
>  
> +static void blk_free_queue_rcu(struct rcu_head *rcu_head)
> +{
> +	struct request_queue *q = container_of(rcu_head, struct request_queue,
> +					       rcu_head);
> +	kmem_cache_free(blk_requestq_cachep, q);
> +}
> +
>  /**
>   * blk_release_queue: - release a &struct request_queue when it is no longer needed
>   * @kobj:    the kobj belonging to the request queue to be released
> @@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj)
>  	bdi_destroy(&q->backing_dev_info);
>  
>  	ida_simple_remove(&blk_queue_ida, q->id);
> -	kmem_cache_free(blk_requestq_cachep, q);
> +	call_rcu(&q->rcu_head, blk_free_queue_rcu);
>  }
>  
>  static const struct sysfs_ops queue_sysfs_ops = {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f94bc83..406343c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -19,6 +19,7 @@
>  #include <linux/gfp.h>
>  #include <linux/bsg.h>
>  #include <linux/smp.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/scatterlist.h>
>  
> @@ -437,6 +438,7 @@ struct request_queue {
>  	/* Throttle data */
>  	struct throtl_data *td;
>  #endif
> +	struct rcu_head		rcu_head;
>  };
>  
>  #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 21/24] blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
       [not found]     ` <1356726946-26037-22-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-02 19:27       ` Vivek Goyal
  2013-01-08 18:08       ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 19:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:43PM -0800, Tejun Heo wrote:
> Instead of holding blkcg->lock while walking ->blkg_list and executing
> prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
> executing prfill().  This makes prfill() implementations easier as
> stats are mostly protected by queue lock.
> 
> This will be used to implement hierarchical stats.
> 

Hi Tejun,

I think dropping blkcg->lock might be a problem. Using RCU we have made
sure that the blkg and q are around. But what about
blkg->q->backing_dev_info.dev?

We follow the bdi->dev pointer in blkg_dev_name(). I am not sure whether
we ever clear it from the queue when the device goes away.

Thanks
Vivek

> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
>  block/blk-cgroup.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index a1a4b97..22f75d1 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -504,8 +504,9 @@ static const char *blkg_dev_name(struct blkcg_gq *blkg)
>   *
>   * This function invokes @prfill on each blkg of @blkcg if pd for the
>   * policy specified by @pol exists.  @prfill is invoked with @sf, the
> - * policy data and @data.  If @show_total is %true, the sum of the return
> - * values from @prfill is printed with "Total" label at the end.
> + * policy data and @data and the matching queue lock held.  If @show_total
> + * is %true, the sum of the return values from @prfill is printed with
> + * "Total" label at the end.
>   *
>   * This is to be used to construct print functions for
>   * cftype->read_seq_string method.
> @@ -520,11 +521,14 @@ void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
>  	struct hlist_node *n;
>  	u64 total = 0;
>  
> -	spin_lock_irq(&blkcg->lock);
> -	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node)
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
> +		spin_lock_irq(blkg->q->queue_lock);
>  		if (blkcg_policy_enabled(blkg->q, pol))
>  			total += prfill(sf, blkg->pd[pol->plid], data);
> -	spin_unlock_irq(&blkcg->lock);
> +		spin_unlock_irq(blkg->q->queue_lock);
> +	}
> +	rcu_read_unlock();
>  
>  	if (show_total)
>  		seq_printf(sf, "Total %llu\n", (unsigned long long)total);
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 17/24] blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online
       [not found]     ` <1356726946-26037-18-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-02 19:38       ` Vivek Goyal
  2013-01-08 16:58         ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-02 19:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:39PM -0800, Tejun Heo wrote:

[..]
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 40f5b97..678e89e 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -106,12 +106,17 @@ struct blkcg_gq {
>  	/* reference count */
>  	int				refcnt;
>  
> +	/* is this blkg online? protected by both blkcg and q locks */
> +	bool				online;
> +

Hi Tejun,

What does the above mean? Does one need to take one lock or both
locks to check/modify blkg->online?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 17/24] blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online
       [not found]       ` <20130102193828.GE4306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-02 20:37         ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-02 20:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 02, 2013 at 02:38:28PM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2012 at 12:35:39PM -0800, Tejun Heo wrote:
> 
> [..]
> > diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> > index 40f5b97..678e89e 100644
> > --- a/block/blk-cgroup.h
> > +++ b/block/blk-cgroup.h
> > @@ -106,12 +106,17 @@ struct blkcg_gq {
> >  	/* reference count */
> >  	int				refcnt;
> >  
> > +	/* is this blkg online? protected by both blkcg and q locks */
> > +	bool				online;
> > +
> 
> Hi Tejun,
> 
> What does the above mean? Does one need to take one lock or both
> locks to check/modify blkg->online?

Needs both locks to modify, so either lock is enough for checking.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 20/24] block: RCU free request_queue
  2013-01-02 18:48       ` Vivek Goyal
@ 2013-01-02 20:43           ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-02 20:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 02, 2013 at 01:48:15PM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2012 at 12:35:42PM -0800, Tejun Heo wrote:
> > RCU free request_queue so that blkcg_gq->q can be dereferenced under
> > RCU lock.  This will be used to implement hierarchical stats.
> 
> Can we just take a blkg reference on ->q in blkg_alloc() and drop that
> reference in blkg_free(), instead of RCU freeing up queue.

I don't think we can invoke blk_release_queue() from RCU free context,
so we can't put request_queue from blkg_rcu_free().  Maybe we can go
through a work item but that's uglier.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 21/24] blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
       [not found]       ` <20130102192700.GA9552-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-02 20:45         ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-02 20:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 02, 2013 at 02:27:00PM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2012 at 12:35:43PM -0800, Tejun Heo wrote:
> > Instead of holding blkcg->lock while walking ->blkg_list and executing
> > prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
> > executing prfill().  This makes prfill() implementations easier as
> > stats are mostly protected by queue lock.
> > 
> > This will be used to implement hierarchical stats.
> > 
> 
> Hi Tejun,
> 
> I think dropping blkcg->lock might be a problem. Using RCU we have made
> sure that blkg and q are around. But what about blkg->q.backing_dev_info.dev.
> 
> We can follow bdi->dev pointer in blkg_dev_name(). I am not sure if we
> ever clear it from q or not when device goes away.

If the queue is dead, it wouldn't have the policy enabled bit set,
which is tested while holding the queue lock, so I don't think it's
gonna be a problem.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH UPDATED 15/24] cfq-iosched: enable full blkcg hierarchy support
  2012-12-28 20:35 ` [PATCH 15/24] cfq-iosched: enable full blkcg hierarchy support Tejun Heo
@ 2013-01-07 16:34       ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-07 16:34 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ctalbott-hpIqsD4AKlfQT0dZR+AlfA

With the previous two patches, all cfqg scheduling decisions are based
on vfraction and ready for hierarchy support.  The only thing which
keeps the behavior flat is cfqg_flat_parent() which makes the vfraction
calculation treat all non-root cfqgs as children of the root cfqg.

Replace it with cfqg_parent() which returns the real parent.  This
enables full blkcg hierarchy support for cfq-iosched.  For example,
consider the following hierarchy.

        root
      /      \
   A:500      B:250
  /     \
 AA:500  AB:1000

For simplicity, let's say all the leaf nodes have active tasks and are
on service tree.  For each leaf node, vfraction would be

 AA: (500  / 1500) * (500 / 750) =~ 0.2222
 AB: (1000 / 1500) * (500 / 750) =~ 0.4444
  B:                 (250 / 750) =~ 0.3333

and vdisktime will be distributed accordingly.  For more detail,
please refer to Documentation/block/cfq-iosched.txt.

v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

v3: blkio-controller.txt updated.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
blkio-controller.txt updated as per Vivek.

Thanks.

 Documentation/block/cfq-iosched.txt        |   58 +++++++++++++++++++++++++++++
 Documentation/cgroups/blkio-controller.txt |   37 ++++++++++++------
 block/cfq-iosched.c                        |   21 +++-------
 3 files changed, 89 insertions(+), 27 deletions(-)

--- a/Documentation/block/cfq-iosched.txt
+++ b/Documentation/block/cfq-iosched.txt
@@ -102,6 +102,64 @@ processing of request. Therefore, increa
 performace although this can cause the latency of some I/O to increase due
 to more number of requests.
 
+CFQ Group scheduling
+====================
+
+CFQ supports blkio cgroup and has "blkio." prefixed files in each
+blkio cgroup directory. It is weight-based and there are four knobs
+for configuration - weight[_device] and leaf_weight[_device].
+Internal cgroup nodes (the ones with children) can also have tasks in
+them, so the former two configure how much proportion the cgroup as a
+whole is entitled to at its parent's level while the latter two
+configure how much proportion the tasks in the cgroup have compared to
+its direct children.
+
+Another way to think about it is assuming that each internal node has
+an implicit leaf child node which hosts all the tasks whose weight is
+configured by leaf_weight[_device]. Let's assume a blkio hierarchy
+composed of five cgroups - root, A, B, AA and AB - with the following
+weights where the names represent the hierarchy.
+
+        weight leaf_weight
+ root :  125    125
+ A    :  500    750
+ B    :  250    500
+ AA   :  500    500
+ AB   : 1000    500
+
+root never has a parent, making its weight meaningless. For backward
+compatibility, weight is always kept in sync with leaf_weight. B, AA
+and AB have no children, and thus their tasks have no child cgroups to
+compete with. They always get 100% of what the cgroup won at the
+parent level. Considering only the weights which matter, the hierarchy
+looks like the following.
+
+          root
+       /    |   \
+      A     B    leaf
+     500   250   125
+   /  |  \
+  AA  AB  leaf
+ 500 1000 750
+
+If all cgroups have active IOs and are competing with each other, disk
+time will be distributed like the following.
+
+Distribution below root. The total active weight at this level is
+A:500 + B:250 + root-leaf:125 = 875.
+
+ root-leaf :   125 /  875      =~ 14%
+ A         :   500 /  875      =~ 57%
+ B(-leaf)  :   250 /  875      =~ 28%
+
+A has children and further distributes its 57% among the children and
+the implicit leaf node. The total active weight at this level is
+AA:500 + AB:1000 + A-leaf:750 = 2250.
+
+ A-leaf    : ( 750 / 2250) * A =~ 19%
+ AA(-leaf) : ( 500 / 2250) * A =~ 12%
+ AB(-leaf) : (1000 / 2250) * A =~ 25%
+
 CFQ IOPS Mode for group scheduling
 ===================================
 Basic CFQ design is to provide priority based time slices. Higher priority
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -94,13 +94,11 @@ Throttling/Upper Limit policy
 
 Hierarchical Cgroups
 ====================
-- Currently none of the IO control policy supports hierarchical groups. But
-  cgroup interface does allow creation of hierarchical cgroups and internally
-  IO policies treat them as flat hierarchy.
-
-  So this patch will allow creation of cgroup hierarchcy but at the backend
-  everything will be treated as flat. So if somebody created a hierarchy like
-  as follows.
+- Currently only CFQ supports hierarchical groups. For throttling,
+  the cgroup interface does allow creation of hierarchical cgroups,
+  but internally it treats them as a flat hierarchy.
+
+  Suppose somebody created a hierarchy as follows.
 
 			root
 			/  \
@@ -108,16 +106,20 @@ Hierarchical Cgroups
 			|
 		     test3
 
-  CFQ and throttling will practically treat all groups at same level.
+  CFQ will handle the hierarchy correctly but throttling will
+  practically treat all groups as being at the same level. For details
+  on CFQ hierarchy support, refer to Documentation/block/cfq-iosched.txt.
+  Throttling will treat the hierarchy as if it looked like the
+  following.
 
 				pivot
 			     /  /   \  \
 			root  test1 test2  test3
 
-  Down the line we can implement hierarchical accounting/control support
-  and also introduce a new cgroup file "use_hierarchy" which will control
-  whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
-  This is how memory controller also has implemented the things.
+  Nesting cgroups, while allowed, isn't officially supported and blkio
+  generates a warning when cgroups nest. Once throttling implements
+  hierarchy support, hierarchy will be fully supported and the warning
+  will be removed.
 
 Various user visible config options
 ===================================
@@ -172,6 +174,12 @@ Proportional weight policy files
 	  dev     weight
 	  8:16    300
 
+- blkio.leaf_weight[_device]
+	- Equivalents of blkio.weight[_device] for the purpose of
+          deciding how much weight tasks in the given cgroup have while
+          competing with the cgroup's child cgroups. For details,
+          please refer to Documentation/block/cfq-iosched.txt.
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
@@ -279,6 +287,11 @@ Proportional weight policy files
 	  and minor number of the device and third field specifies the number
 	  of times a group was dequeued from a particular device.
 
+- blkio.*_recursive
+	- Recursive versions of various stats. These files show the
+          same information as their non-recursive counterparts but
+          include stats from all the descendant cgroups.
+
 Throttling/Upper limit policy files
 -----------------------------------
 - blkio.throttle.read_bps_device
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -606,20 +606,11 @@ static inline struct cfq_group *blkg_to_
 	return pd_to_cfqg(blkg_to_pd(blkg, &blkcg_policy_cfq));
 }
 
-/*
- * Determine the parent cfqg for weight calculation.  Currently, cfqg
- * scheduling is flat and the root is the parent of everyone else.
- */
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg)
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
 {
-	struct blkcg_gq *blkg = cfqg_to_blkg(cfqg);
-	struct cfq_group *root;
-
-	while (blkg->parent)
-		blkg = blkg->parent;
-	root = blkg_to_cfqg(blkg);
+	struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;
 
-	return root != cfqg ? root : NULL;
+	return pblkg ? blkg_to_cfqg(pblkg) : NULL;
 }
 
 static inline void cfqg_get(struct cfq_group *cfqg)
@@ -722,7 +713,7 @@ static void cfq_pd_reset_stats(struct bl
 
 #else	/* CONFIG_CFQ_GROUP_IOSCHED */
 
-static inline struct cfq_group *cfqg_flat_parent(struct cfq_group *cfqg) { return NULL; }
+static inline struct cfq_group *cfqg_parent(struct cfq_group *cfqg) { return NULL; }
 static inline void cfqg_get(struct cfq_group *cfqg) { }
 static inline void cfqg_put(struct cfq_group *cfqg) { }
 
@@ -1290,7 +1281,7 @@ cfq_group_service_tree_add(struct cfq_rb
 	 * stops once an already activated node is met.  vfraction
 	 * calculation should always continue to the root.
 	 */
-	while ((parent = cfqg_flat_parent(pos))) {
+	while ((parent = cfqg_parent(pos))) {
 		if (propagate) {
 			propagate = !parent->nr_active++;
 			parent->children_weight += pos->weight;
@@ -1341,7 +1332,7 @@ cfq_group_service_tree_del(struct cfq_rb
 	pos->children_weight -= pos->leaf_weight;
 
 	while (propagate) {
-		struct cfq_group *parent = cfqg_flat_parent(pos);
+		struct cfq_group *parent = cfqg_parent(pos);
 
 		/* @pos has 0 nr_active at this point */
 		WARN_ON_ONCE(pos->children_weight);

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2
       [not found]   ` <20130102182037.GC4306-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-07 16:34     ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-07 16:34 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 02, 2013 at 01:20:38PM -0500, Vivek Goyal wrote:
> On Fri, Dec 28, 2012 at 12:35:22PM -0800, Tejun Heo wrote:
> 
> [..]
> > 
> > * Updated to reflect Vivek's reviews - renames & documentation.
> 
> Hi Tejun,
> 
> You forgot to update blkio-controller.txt.

Just updated.  Does everything else look okay to you?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH UPDATED 15/24] cfq-iosched: enable full blkcg hierarchy support
  2013-01-07 16:34       ` Tejun Heo
@ 2013-01-08 14:42           ` Vivek Goyal
  -1 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 14:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Jan 07, 2013 at 08:34:05AM -0800, Tejun Heo wrote:

[..]
> +        weight leaf_weight
> + root :  125    125
> + A    :  500    750
> + B    :  250    500
> + AA   :  500    500
> + AB   : 1000    500
> +
> +root never has a parent making its weight is meaningless. For backward
> +compatibility, weight is always kept in sync with leaf_weight. B, AA
> +and AB have no child and thus its tasks have no children cgroup to
> +compete with. They always get 100% of what the cgroup won at the
> +parent level. Considering only the weights which matter, the hierarchy
> +looks like the following.
> +
> +          root
> +       /    |   \
> +      A     B    leaf
> +     500   250   125
> +   /  |  \
> +  AA  AB  leaf
> + 500 1000 750
> +
> +If all cgroups have active IOs and competing with each other, disk
> +time will be distributed like the following.
> +
> +Distribution below root. The total active weight at this level is
> +A:500 + B:250 + C:125 = 875.
> +
> + root-leaf :   125 /  875      =~ 14%
> + A         :   500 /  875      =~ 57%
> + B(-leaf)  :   250 /  875      =~ 28%
> +
> +A has children and further distributes its 57% among the children and
> +the implicit leaf node. The total active weight at this level is
> +AA:500 + AB:1000 + A-leaf:750 = 2250.
> +
> + A-leaf    : ( 750 / 2250) * A =~ 19%
> + AA(-leaf) : ( 500 / 2250) * A =~ 12%
> + AB(-leaf) : (1000 / 2250) * A =~ 25%

Hi Tejun,

What is (-leaf) supposed to signify? I can understand that A-leaf
tells the share of A's tasks, which are effectively in the A-leaf group.

Would just plain AA and AB be clearer?

Rest looks good to me. Thanks for updating blkio-controller.txt too.

Thanks
Vivek
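
The arithmetic in the quoted example is easy to sanity-check outside
the kernel. A minimal userspace sketch (toy code, not cfq internals;
share() and compound() are invented names):

```c
#include <assert.h>

/* Toy model of the hierarchical distribution described above: at each
 * level a node's share is weight / sum-of-active-weights, and shares
 * compound multiplicatively from the root down. */
static double share(unsigned int weight, unsigned int total)
{
	return (double)weight / total;
}

/* A child's overall fraction of disk time is its parent's overall
 * fraction scaled by the child's share at the parent's level. */
static double compound(double parent_fraction, unsigned int weight,
		       unsigned int total)
{
	return parent_fraction * share(weight, total);
}
```

With the weights from the example, share(500, 875) gives A's ~57% and
compound(share(500, 875), 1000, 2250) gives AB's ~25%.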

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 11/24] cfq-iosched: add leaf_weight
  2012-12-28 20:35 ` [PATCH 11/24] cfq-iosched: add leaf_weight Tejun Heo
@ 2013-01-08 15:34       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 15:34 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

On Fri, Dec 28, 2012 at 12:35:33PM -0800, Tejun Heo wrote:

[..]
> +
> +	/* on root, leaf_weight is mapped to weight */
> +	{
> +		.name = "leaf_weight_device",
> +		.flags = CFTYPE_ONLY_ON_ROOT,
> +		.read_seq_string = cfqg_print_weight_device,
> +		.write_string = cfqg_set_weight_device,
> +		.max_write_len = 256,
> +	},
> +	{
> +		.name = "leaf_weight",
> +		.flags = CFTYPE_ONLY_ON_ROOT,
> +		.read_seq_string = cfq_print_weight,
> +		.write_u64 = cfq_set_weight,
> +	},

Hi Tejun,

How does it help to map leaf_weight to weight in the root group? Old
programs don't know about leaf_weight anyway, so nobody is going to
update it. And if they do update it, they had better know what it does.

I think we just need to map "weight" to "leaf_weight" once you switch
to making use of leaf_weight; at that point a user will expect that
updating weight retains the old behavior. I think you have done that
in a later patch.

But mapping leaf_weight to weight seems unnecessary, at least from a
backward-compatibility point of view.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 12/24] cfq-iosched: implement cfq_group->nr_active and ->children_weight
  2012-12-28 20:35 ` [PATCH 12/24] cfq-iosched: implement cfq_group->nr_active and ->children_weight Tejun Heo
@ 2013-01-08 15:51       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 15:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:34PM -0800, Tejun Heo wrote:
> To prepare for blkcg hierarchy support, add cfqg->nr_active and
> ->children_weight.  cfqg->nr_active counts the number of active cfqgs
> at the cfqg's level and ->children_weight is the sum of the weights of those
> cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
> children.
> 
> The two values are updated when a cfqg enters and leaves the group
> service tree.  Unless the hierarchy is very deep, the added overhead
> should be negligible.
> 
> Currently, the parent is determined using cfqg_flat_parent() which
> makes the root cfqg the parent of all other cfqgs.  This is to make
> the transition to hierarchy-aware scheduling gradual.  Scheduling
> logic will be converted to use cfqg->children_weight without actually
> changing the behavior.  When everything is ready,
> blkcg_weight_parent() will be replaced with proper parent function.
> 
> This patch doesn't introduce any behavior change.
> 
> v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Looks good to me.

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek
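
For readers without the patch in front of them, the bookkeeping being
acked here amounts to roughly the following (an illustrative sketch;
toy_cfqg, toy_group_add etc. are invented names, not the kernel's):

```c
#include <assert.h>

/* When a group enters the service tree, bump the parent's count of
 * active children and fold the child's weight into ->children_weight;
 * undo both when it leaves. */
struct toy_cfqg {
	struct toy_cfqg *parent;
	int nr_active;			/* active cfqgs at this level */
	unsigned int children_weight;	/* sum of their weights */
	unsigned int weight;
	unsigned int leaf_weight;
};

static void toy_group_add(struct toy_cfqg *cfqg)
{
	struct toy_cfqg *parent = cfqg->parent;

	if (parent) {
		parent->nr_active++;
		parent->children_weight += cfqg->weight;
	}
}

static void toy_group_del(struct toy_cfqg *cfqg)
{
	struct toy_cfqg *parent = cfqg->parent;

	if (parent) {
		parent->nr_active--;
		parent->children_weight -= cfqg->weight;
	}
}
```

The real code also accounts the cfqg's own leaf_weight at its level;
the sketch only shows the parent-side update.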

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 13/24] cfq-iosched: implement hierarchy-ready cfq_group charge scaling
  2012-12-28 20:35     ` Tejun Heo
@ 2013-01-08 16:16         ` Vivek Goyal
  -1 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:35PM -0800, Tejun Heo wrote:
> Currently, cfqg charges are scaled directly according to cfqg->weight.
> Regardless of the number of active cfqgs or the amount of active
> weights, a given weight value always scales charge the same way.  This
> works fine as long as all cfqgs are treated equally regardless of
> their positions in the hierarchy, which is what cfq currently
> implements.  It can't work in hierarchical settings because the
> interpretation of a given weight value depends on where the weight is
> located in the hierarchy.
> 
> This patch reimplements cfqg charge scaling so that it can be used to
> support hierarchy properly.  The scheme is fairly simple and
> light-weight.
> 
> * When a cfqg is added to the service tree, v(disktime)weight is
>   calculated.  It walks up the tree to root calculating the fraction
>   it has in the hierarchy.  At each level, the fraction can be
>   calculated as
> 
>     cfqg->weight / parent->level_weight
> 
>   By compounding these, the global fraction of vdisktime the cfqg has
>   claim to - vfraction - can be determined.
> 
> * When the cfqg needs to be charged, the charge is scaled inversely
>   proportionally to the vfraction.
> 
> The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
> representation as before; however, the smallest scaling factor is now
> 1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
> was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
> scaling factor.
> 
> While this shifts the global scale of vdisktime a bit, it doesn't
> change the relative relationships among cfqgs and the scheduling
> result isn't different.
> 
> cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
> new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
> didn't have any relevance to vdisktime before and is unlikely to cause
> any visible behavior difference now especially as the scale shift
> isn't that large.
> 
> As the new scheme now makes proper distinction between cfqg->weight
> and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
> root, both weights are now mapped to ->leaf_weight instead of the
> other way around.
> 
> Because we're still using cfqg_flat_parent(), this patch shouldn't
> change the scheduling behavior in any noticeable way.
> 
> v2: Beefed up comments on vfraction as requested by Vivek.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Looks good to me.

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek
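
The compounding-and-scaling scheme in the commit message can be
sketched in a few lines of standalone C (assumed shift value and
invented names; the real code uses CFQ_SERVICE_SHIFT and walks the
actual cfqg tree):

```c
#include <assert.h>

#define TOY_SERVICE_SHIFT 12

struct toy_node {
	struct toy_node *parent;
	unsigned int weight;
	unsigned int children_weight;	/* sum of active weights below */
};

/* Walk from the node to the root, multiplying weight/children_weight
 * at each level; the result is the node's fraction of 1, kept in
 * fixed point (1 << TOY_SERVICE_SHIFT == 1.0). */
static unsigned int toy_vfraction(struct toy_node *node)
{
	unsigned int vfr = 1U << TOY_SERVICE_SHIFT;

	while (node->parent) {
		vfr = (unsigned int)((unsigned long long)vfr * node->weight /
				     node->parent->children_weight);
		node = node->parent;
	}
	return vfr;
}

/* Charges are scaled inversely to the fraction: a group owning a small
 * slice of the disk sees its vdisktime advance faster. */
static unsigned long long toy_scale_charge(unsigned long long charge,
					   unsigned int vfr)
{
	return (charge << TOY_SERVICE_SHIFT) / vfr;
}
```

With the weights from the earlier example (A:500 under a level whose
active weight is 875), toy_vfraction() yields 4096 * 500 / 875.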

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 14/24] cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction
       [not found]   ` <1356726946-26037-15-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-08 16:42     ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:36PM -0800, Tejun Heo wrote:
> cfq_group_slice() calculates slice by taking a fraction of
> cfq_target_latency according to the ratio of cfqg->weight against
> service_tree->total_weight.  This currently works only because all
> cfqgs are treated to be at the same level.
> 
> To prepare for proper hierarchy support, convert cfq_group_slice() to
> base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
> a fraction of 1 and represents the fraction allocated to the cfqg with
> hierarchy considered, the slice can be simply calculated by
> multiplying cfqg->vfraction by cfq_target_latency (with fixed point
> shift factored in).
> 
> As vfraction calculation currently treats all non-root cfqgs as
> children of the root cfqg, this patch doesn't introduce noticeable
> behavior difference.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek
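
The slice computation described in the commit message reduces to one
multiply and shift (a sketch under assumed names; the real function is
cfq_group_slice() and the real shift is CFQ_SERVICE_SHIFT):

```c
#include <assert.h>

#define TOY_SERVICE_SHIFT 12

/* vfraction is a fixed-point fraction of 1 (1 << TOY_SERVICE_SHIFT
 * == 1.0), so the group's slice is just that fraction of the target
 * latency with the fixed point shifted back out. */
static unsigned int toy_group_slice(unsigned int target_latency_ms,
				    unsigned int vfraction)
{
	return (target_latency_ms * vfraction) >> TOY_SERVICE_SHIFT;
}
```

A group with vfraction 1.0 gets the whole target latency; a group with
half the hierarchy's weight gets half of it.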

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 16/24] blkcg: add blkg_policy_data->plid
       [not found]   ` <1356726946-26037-17-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-08 16:51     ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:38PM -0800, Tejun Heo wrote:
> Add pd->plid so that the policy a pd belongs to can be identified
> easily.  This will be used to implement hierarchical blkg_[rw]stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> ---
>  block/blk-cgroup.c | 2 ++
>  block/blk-cgroup.h | 3 ++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 10e1df9..3a8de32 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -113,6 +113,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  
>  		blkg->pd[i] = pd;
>  		pd->blkg = blkg;
> +		pd->plid = i;
>  
>  		/* invoke per-policy init */
>  		if (pol->pd_init_fn)
> @@ -908,6 +909,7 @@ int blkcg_activate_policy(struct request_queue *q,
>  
>  		blkg->pd[pol->plid] = pd;
>  		pd->blkg = blkg;
> +		pd->plid = pol->plid;
>  		pol->pd_init_fn(blkg);
>  
>  		spin_unlock(&blkg->blkcg->lock);
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 2446225..40f5b97 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -81,8 +81,9 @@ struct blkg_rwstat {
>   * beginning and pd_size can't be smaller than pd.
>   */
>  struct blkg_policy_data {
> -	/* the blkg this per-policy data belongs to */
> +	/* the blkg and policy id this per-policy data belongs to */
>  	struct blkcg_gq			*blkg;
> +	int				plid;
>  
>  	/* used during policy activation */
>  	struct list_head		alloc_node;
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 17/24] blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online
  2012-12-28 20:35     ` Tejun Heo
@ 2013-01-08 16:58         ` Vivek Goyal
  -1 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:39PM -0800, Tejun Heo wrote:
> Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
> which are invoked as the policy_data gets activated and deactivated
> while holding both blkcg and q locks.
> 
> Also, add blkcg_gq->online bool, which is set and cleared as the
> blkcg_gq gets activated and deactivated.  This flag also is toggled
> while holding both blkcg and q locks.
> 
> These will be used to implement hierarchical stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 18/24] blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/
       [not found]     ` <1356726946-26037-19-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-08 16:59       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:40PM -0800, Tejun Heo wrote:
> Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
> summing up stats from multiple blkgs.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> ---
>  block/blk-cgroup.h  | 4 ++--
>  block/cfq-iosched.c | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 678e89e..586c0ac 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -461,14 +461,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
>  }
>  
>  /**
> - * blkg_rwstat_sum - read the total count of a blkg_rwstat
> + * blkg_rwstat_total - read the total count of a blkg_rwstat
>   * @rwstat: blkg_rwstat to read
>   *
>   * Return the total count of @rwstat regardless of the IO direction.  This
>   * function can be called without synchronization and takes care of u64
>   * atomicity.
>   */
> -static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
> +static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
>  {
>  	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
>  
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index e8f3106..d43145cc 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -536,7 +536,7 @@ static void cfqg_stats_set_start_empty_time(struct cfq_group *cfqg)
>  {
>  	struct cfqg_stats *stats = &cfqg->stats;
>  
> -	if (blkg_rwstat_sum(&stats->queued))
> +	if (blkg_rwstat_total(&stats->queued))
>  		return;
>  
>  	/*
> @@ -580,7 +580,7 @@ static void cfqg_stats_update_avg_queue_size(struct cfq_group *cfqg)
>  	struct cfqg_stats *stats = &cfqg->stats;
>  
>  	blkg_stat_add(&stats->avg_queue_size_sum,
> -		      blkg_rwstat_sum(&stats->queued));
> +		      blkg_rwstat_total(&stats->queued));
>  	blkg_stat_add(&stats->avg_queue_size_samples, 1);
>  	cfqg_stats_update_group_wait_time(stats);
>  }
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 18/24] blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/
@ 2013-01-08 16:59       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 16:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ctalbott-hpIqsD4AKlfQT0dZR+AlfA, rni-hpIqsD4AKlfQT0dZR+AlfA

On Fri, Dec 28, 2012 at 12:35:40PM -0800, Tejun Heo wrote:
> Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
> summing up stats from multiple blkgs.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> ---
>  block/blk-cgroup.h  | 4 ++--
>  block/cfq-iosched.c | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 678e89e..586c0ac 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -461,14 +461,14 @@ static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
>  }
>  
>  /**
> - * blkg_rwstat_sum - read the total count of a blkg_rwstat
> + * blkg_rwstat_total - read the total count of a blkg_rwstat
>   * @rwstat: blkg_rwstat to read
>   *
>   * Return the total count of @rwstat regardless of the IO direction.  This
>   * function can be called without synchronization and takes care of u64
>   * atomicity.
>   */
> -static inline uint64_t blkg_rwstat_sum(struct blkg_rwstat *rwstat)
> +static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
>  {
>  	struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
>  
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index e8f3106..d43145cc 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -536,7 +536,7 @@ static void cfqg_stats_set_start_empty_time(struct cfq_group *cfqg)
>  {
>  	struct cfqg_stats *stats = &cfqg->stats;
>  
> -	if (blkg_rwstat_sum(&stats->queued))
> +	if (blkg_rwstat_total(&stats->queued))
>  		return;
>  
>  	/*
> @@ -580,7 +580,7 @@ static void cfqg_stats_update_avg_queue_size(struct cfq_group *cfqg)
>  	struct cfqg_stats *stats = &cfqg->stats;
>  
>  	blkg_stat_add(&stats->avg_queue_size_sum,
> -		      blkg_rwstat_sum(&stats->queued));
> +		      blkg_rwstat_total(&stats->queued));
>  	blkg_stat_add(&stats->avg_queue_size_samples, 1);
>  	cfqg_stats_update_group_wait_time(stats);
>  }
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH UPDATED 15/24] cfq-iosched: enable full blkcg hierarchy support
  2013-01-08 14:42           ` Vivek Goyal
@ 2013-01-08 17:19               ` Tejun Heo
  -1 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-08 17:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Vivek.

On Tue, Jan 08, 2013 at 09:42:40AM -0500, Vivek Goyal wrote:
> > +A has children and further distributes its 57% among the children and
> > +the implicit leaf node. The total active weight at this level is
> > +AA:500 + AB:1000 + A-leaf:750 = 2250.
> > +
> > + A-leaf    : ( 750 / 2250) * A =~ 19%
> > + AA(-leaf) : ( 500 / 2250) * A =~ 12%
> > + AB(-leaf) : (1000 / 2250) * A =~ 25%
> 
> Hi Tejun,
> 
> What is (-leaf) supposed to signify? I can understand that A-leaf
> tells the share of A's tasks which are effectively in the A-leaf group. 

It's saying that the weight for AA and AA-leaf are the same.

> Will just plain AA and AB be more clear?

I don't know, maybe.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 11/24] cfq-iosched: add leaf_weight
       [not found]       ` <20130108153448.GB29635-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-01-08 17:24         ` Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2013-01-08 17:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Fengguang Wu

Hello, Vivek.

On Tue, Jan 08, 2013 at 10:34:48AM -0500, Vivek Goyal wrote:
> How does it help to map leaf_weight to weight in the root group? Old programs
> anyway don't know about leaf_weight, so nobody is going to update it. And
> if they update it, they had better know what it does.

Because what cfq does before this patchset in flat hierarchy is
basically treating root weight as the leaf weight.

       R:500
      /     \
     A:250  B:750

Ratio of R is computed as 500 / (500 + 250 + 750).  The root blkcg's
tasks are competing with the children blkcgs.  This is currently the
only supported config and we don't want to break existing users, so
keeping root's weight and leaf_weight synchronized and ignoring root's
weight, which BTW can't mean anything anyway as it doesn't have a
parent or siblings, keeps the behavior unchanged for flat configs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 19/24] blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  2012-12-28 20:35 ` [PATCH 19/24] blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() Tejun Heo
@ 2013-01-08 18:03       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:41PM -0800, Tejun Heo wrote:
> Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
> The former two collect the [rw]stats designated by the target policy
> data and offset from the pd's subtree.  The latter two add one
> [rw]stat to another.
> 
> Note that the recursive sum functions require the queue lock to be
> held on entry to make blkg online test reliable.  This is necessary to
> properly handle stats of a dying blkg.
> 
> These will be used to implement hierarchical stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> ---
>  block/blk-cgroup.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-cgroup.h |  35 ++++++++++++++++++
>  2 files changed, 142 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 4d625d2..a1a4b97 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -32,6 +32,26 @@ EXPORT_SYMBOL_GPL(blkcg_root);
>  
>  static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
>  
> +static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
> +				      struct request_queue *q, bool update_hint);
> +
> +/**
> + * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
> + * @d_blkg: loop cursor pointing to the current descendant
> + * @pos_cgrp: used for iteration
> + * @p_blkg: target blkg to walk descendants of
> + *
> + * Walk @c_blkg through the descendants of @p_blkg.  Must be used with RCU
> + * read locked.  If called under either blkcg or queue lock, the iteration
> + * is guaranteed to include all and only online blkgs.  The caller may
> + * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip
> + * subtree.
> + */
> +#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg)		\
> +	cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \
> +		if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \
> +					      (p_blkg)->q, false)))
> +
>  static bool blkcg_policy_enabled(struct request_queue *q,
>  				 const struct blkcg_policy *pol)
>  {
> @@ -127,6 +147,17 @@ err_free:
>  	return NULL;
>  }
>  
> +/**
> + * __blkg_lookup - internal version of blkg_lookup()
> + * @blkcg: blkcg of interest
> + * @q: request_queue of interest
> + * @update_hint: whether to update lookup hint with the result or not
> + *
> + * This is internal version and shouldn't be used by policy
> + * implementations.  Looks up blkgs for the @blkcg - @q pair regardless of
> + * @q's bypass state.  If @update_hint is %true, the caller should be
> + * holding @q->queue_lock and lookup hint is updated on success.
> + */
>  static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
>  				      struct request_queue *q, bool update_hint)
>  {
> @@ -585,6 +616,82 @@ u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
>  EXPORT_SYMBOL_GPL(blkg_prfill_rwstat);
>  
>  /**
> + * blkg_stat_recursive_sum - collect hierarchical blkg_stat
> + * @pd: policy private data of interest
> + * @off: offset to the blkg_stat in @pd
> + *
> + * Collect the blkg_stat specified by @off from @pd and all its online
> + * descendants and return the sum.  The caller must be holding the queue
> + * lock for online tests.
> + */
> +u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off)
> +{
> +	struct blkcg_policy *pol = blkcg_policy[pd->plid];
> +	struct blkcg_gq *pos_blkg;
> +	struct cgroup *pos_cgrp;
> +	u64 sum;
> +
> +	lockdep_assert_held(pd->blkg->q->queue_lock);
> +
> +	sum = blkg_stat_read((void *)pd + off);
> +
> +	rcu_read_lock();
> +	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
> +		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
> +		struct blkg_stat *stat = (void *)pos_pd + off;
> +
> +		if (pos_blkg->online)
> +			sum += blkg_stat_read(stat);
> +	}
> +	rcu_read_unlock();
> +
> +	return sum;
> +}
> +EXPORT_SYMBOL_GPL(blkg_stat_recursive_sum);
> +
> +/**
> + * blkg_rwstat_recursive_sum - collect hierarchical blkg_rwstat
> + * @pd: policy private data of interest
> + * @off: offset to the blkg_stat in @pd
> + *
> + * Collect the blkg_rwstat specified by @off from @pd and all its online
> + * descendants and return the sum.  The caller must be holding the queue
> + * lock for online tests.
> + */
> +struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
> +					     int off)
> +{
> +	struct blkcg_policy *pol = blkcg_policy[pd->plid];
> +	struct blkcg_gq *pos_blkg;
> +	struct cgroup *pos_cgrp;
> +	struct blkg_rwstat sum;
> +	int i;
> +
> +	lockdep_assert_held(pd->blkg->q->queue_lock);
> +
> +	sum = blkg_rwstat_read((void *)pd + off);
> +
> +	rcu_read_lock();
> +	blkg_for_each_descendant_pre(pos_blkg, pos_cgrp, pd_to_blkg(pd)) {
> +		struct blkg_policy_data *pos_pd = blkg_to_pd(pos_blkg, pol);
> +		struct blkg_rwstat *rwstat = (void *)pos_pd + off;
> +		struct blkg_rwstat tmp;
> +
> +		if (!pos_blkg->online)
> +			continue;
> +
> +		tmp = blkg_rwstat_read(rwstat);
> +
> +		for (i = 0; i < BLKG_RWSTAT_NR; i++)
> +			sum.cnt[i] += tmp.cnt[i];
> +	}
> +	rcu_read_unlock();
> +
> +	return sum;
> +}
> +EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum);
> +
> +/**
>   * blkg_conf_prep - parse and prepare for per-blkg config update
>   * @blkcg: target block cgroup
>   * @pol: target policy
> diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> index 586c0ac..f2b2929 100644
> --- a/block/blk-cgroup.h
> +++ b/block/blk-cgroup.h
> @@ -164,6 +164,10 @@ u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
>  u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
>  		       int off);
>  
> +u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
> +struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
> +					     int off);
> +
>  struct blkg_conf_ctx {
>  	struct gendisk			*disk;
>  	struct blkcg_gq			*blkg;
> @@ -414,6 +418,18 @@ static inline void blkg_stat_reset(struct blkg_stat *stat)
>  }
>  
>  /**
> + * blkg_stat_merge - merge a blkg_stat into another
> + * @to: the destination blkg_stat
> + * @from: the source
> + *
> + * Add @from's count to @to.
> + */
> +static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
> +{
> +	blkg_stat_add(to, blkg_stat_read(from));
> +}
> +
> +/**
>   * blkg_rwstat_add - add a value to a blkg_rwstat
>   * @rwstat: target blkg_rwstat
>   * @rw: mask of REQ_{WRITE|SYNC}
> @@ -484,6 +500,25 @@ static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
>  	memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
>  }
>  
> +/**
> + * blkg_rwstat_merge - merge a blkg_rwstat into another
> + * @to: the destination blkg_rwstat
> + * @from: the source
> + *
> + * Add @from's counts to @to.
> + */
> +static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
> +				     struct blkg_rwstat *from)
> +{
> +	struct blkg_rwstat v = blkg_rwstat_read(from);
> +	int i;
> +
> +	u64_stats_update_begin(&to->syncp);
> +	for (i = 0; i < BLKG_RWSTAT_NR; i++)
> +		to->cnt[i] += v.cnt[i];
> +	u64_stats_update_end(&to->syncp);
> +}
> +
>  #else	/* CONFIG_BLK_CGROUP */
>  
>  struct cgroup;
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 20/24] block: RCU free request_queue
       [not found]   ` <1356726946-26037-21-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-01-02 18:48       ` Vivek Goyal
@ 2013-01-08 18:05     ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:42PM -0800, Tejun Heo wrote:
> RCU free request_queue so that blkcg_gq->q can be dereferenced under
> RCU lock.  This will be used to implement hierarchical stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> ---
>  block/blk-sysfs.c      | 9 ++++++++-
>  include/linux/blkdev.h | 2 ++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7881477..6206a93 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -497,6 +497,13 @@ queue_attr_store(struct kobject *kobj, struct attribute *attr,
>  	return res;
>  }
>  
> +static void blk_free_queue_rcu(struct rcu_head *rcu_head)
> +{
> +	struct request_queue *q = container_of(rcu_head, struct request_queue,
> +					       rcu_head);
> +	kmem_cache_free(blk_requestq_cachep, q);
> +}
> +
>  /**
>   * blk_release_queue: - release a &struct request_queue when it is no longer needed
>   * @kobj:    the kobj belonging to the request queue to be released
> @@ -538,7 +545,7 @@ static void blk_release_queue(struct kobject *kobj)
>  	bdi_destroy(&q->backing_dev_info);
>  
>  	ida_simple_remove(&blk_queue_ida, q->id);
> -	kmem_cache_free(blk_requestq_cachep, q);
> +	call_rcu(&q->rcu_head, blk_free_queue_rcu);
>  }
>  
>  static const struct sysfs_ops queue_sysfs_ops = {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f94bc83..406343c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -19,6 +19,7 @@
>  #include <linux/gfp.h>
>  #include <linux/bsg.h>
>  #include <linux/smp.h>
> +#include <linux/rcupdate.h>
>  
>  #include <asm/scatterlist.h>
>  
> @@ -437,6 +438,7 @@ struct request_queue {
>  	/* Throttle data */
>  	struct throtl_data *td;
>  #endif
> +	struct rcu_head		rcu_head;
>  };
>  
>  #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 21/24] blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
       [not found]     ` <1356726946-26037-22-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-01-02 19:27       ` Vivek Goyal
@ 2013-01-08 18:08       ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:43PM -0800, Tejun Heo wrote:
> Instead of holding blkcg->lock while walking ->blkg_list and executing
> prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
> executing prfill().  This makes prfill() implementations easier as
> stats are mostly protected by queue lock.
> 
> This will be used to implement hierarchical stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread
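
The locking change above can be caricatured in user space: the walk over ->blkg_list no longer holds one blkcg-wide lock, instead each blkg's queue lock is taken just around prfill(). Locks are modeled as plain flags here purely to show the scoping; in the kernel the list walk itself is protected by RCU, and all names are illustrative.

```c
#include <assert.h>

struct blkg_stub {
	struct blkg_stub *next;
	int queue_locked;  /* stand-in for the blkg's queue lock */
	int stat;
};

static int prfill(struct blkg_stub *g)
{
	assert(g->queue_locked);  /* stats are protected by the queue lock */
	return g->stat;
}

static int print_blkgs(struct blkg_stub *head)
{
	int total = 0;

	/* rcu_read_lock() would begin here in the kernel */
	for (struct blkg_stub *g = head; g; g = g->next) {
		g->queue_locked = 1;   /* per-blkg lock, not blkcg->lock */
		total += prfill(g);
		g->queue_locked = 0;
	}
	/* rcu_read_unlock() */
	return total;
}
```
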

* Re: [PATCH 22/24] cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  2012-12-28 20:35 ` [PATCH 22/24] cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() Tejun Heo
@ 2013-01-08 18:09       ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:44PM -0800, Tejun Heo wrote:
> Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
> cfq_pd_reset_stats() and move the latter to where other pd methods are
> defined.  cfqg_stats_reset() will be used to implement hierarchical
> stats.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH 23/24] cfq-iosched: collect stats from dead cfqgs
       [not found]   ` <1356726946-26037-24-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2013-01-02 16:24     ` Vivek Goyal
@ 2013-01-08 18:12     ` Vivek Goyal
  1 sibling, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:45PM -0800, Tejun Heo wrote:
> To support hierarchical stats, it's necessary to remember stats from
> dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
> its stats to the parent's dead-stats.
> 
> The transfer happens from ->pd_offline_fn() and it is possible that
> there are some residual IOs completing afterwards.  Currently, we lose
> these stats.  Given that cgroup removal isn't a very high frequency
> operation and the amount of residual IOs on offline are likely to be
> nil or small, this shouldn't be a big deal and the complexity needed
> to handle residual IOs - another callback and rather elaborate
> synchronization to reach and lock the matching q - doesn't seem
> justified.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread
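
The dead-stats transfer described above amounts to folding a dying group's live and inherited counters into its parent's dead_stats, so recursive sums keep accounting for removed descendants. A minimal userspace model, with simplified stand-in structures and names:

```c
#include <assert.h>

struct grp_stats {
	unsigned long long time;
	unsigned long long serviced;
};

struct grp {
	struct grp *parent;
	struct grp_stats stats;       /* counters for the group itself */
	struct grp_stats dead_stats;  /* accumulated from dead children */
};

static void grp_stats_merge(struct grp_stats *to, const struct grp_stats *from)
{
	to->time += from->time;
	to->serviced += from->serviced;
}

/* Called when @g goes offline: fold everything into the parent's
 * dead_stats, mirroring what cfqg_stats_xfer_dead() does. */
static void grp_stats_xfer_dead(struct grp *g)
{
	struct grp *parent = g->parent;

	if (!parent)
		return;
	grp_stats_merge(&parent->dead_stats, &g->stats);
	grp_stats_merge(&parent->dead_stats, &g->dead_stats);
}
```
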

* Re: [PATCH 24/24] cfq-iosched: add hierarchical cfq_group statistics
       [not found]   ` <1356726946-26037-25-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2013-01-08 18:27     ` Vivek Goyal
  0 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Dec 28, 2012 at 12:35:46PM -0800, Tejun Heo wrote:
> Unfortunately, at this point, there's no way to make the existing
> statistics hierarchical without creating nasty surprises for the
> existing users.  Just create recursive counterpart of the existing
> stats.
> 

No recursive counterparts for stats under DEBUG? Well, I would not 
complain. There are too many stats already and if somebody needs
recursive stats for debug stats, let them do it.

Acked-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Vivek

> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
>  block/cfq-iosched.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 4d75b79..b66365b 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1528,6 +1528,32 @@ static void cfq_pd_offline(struct blkcg_gq *blkg)
>  	cfqg_stats_xfer_dead(blkg_to_cfqg(blkg));
>  }
>  
> +/* offset delta from cfqg->stats to cfqg->dead_stats */
> +static const int dead_stats_off_delta = offsetof(struct cfq_group, dead_stats) -
> +					offsetof(struct cfq_group, stats);
> +
> +/* to be used by recursive prfill, sums live and dead stats recursively */
> +static u64 cfqg_stat_pd_recursive_sum(struct blkg_policy_data *pd, int off)
> +{
> +	u64 sum = 0;
> +
> +	sum += blkg_stat_recursive_sum(pd, off);
> +	sum += blkg_stat_recursive_sum(pd, off + dead_stats_off_delta);
> +	return sum;
> +}
> +
> +/* to be used by recursive prfill, sums live and dead rwstats recursively */
> +static struct blkg_rwstat cfqg_rwstat_pd_recursive_sum(struct blkg_policy_data *pd,
> +						       int off)
> +{
> +	struct blkg_rwstat a, b;
> +
> +	a = blkg_rwstat_recursive_sum(pd, off);
> +	b = blkg_rwstat_recursive_sum(pd, off + dead_stats_off_delta);
> +	blkg_rwstat_merge(&a, &b);
> +	return a;
> +}
> +
>  static void cfq_pd_reset_stats(struct blkcg_gq *blkg)
>  {
>  	struct cfq_group *cfqg = blkg_to_cfqg(blkg);
> @@ -1732,6 +1758,42 @@ static int cfqg_print_rwstat(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +static u64 cfqg_prfill_stat_recursive(struct seq_file *sf,
> +				      struct blkg_policy_data *pd, int off)
> +{
> +	u64 sum = cfqg_stat_pd_recursive_sum(pd, off);
> +
> +	return __blkg_prfill_u64(sf, pd, sum);
> +}
> +
> +static u64 cfqg_prfill_rwstat_recursive(struct seq_file *sf,
> +					struct blkg_policy_data *pd, int off)
> +{
> +	struct blkg_rwstat sum = cfqg_rwstat_pd_recursive_sum(pd, off);
> +
> +	return __blkg_prfill_rwstat(sf, pd, &sum);
> +}
> +
> +static int cfqg_print_stat_recursive(struct cgroup *cgrp, struct cftype *cft,
> +				     struct seq_file *sf)
> +{
> +	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
> +
> +	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_stat_recursive,
> +			  &blkcg_policy_cfq, cft->private, false);
> +	return 0;
> +}
> +
> +static int cfqg_print_rwstat_recursive(struct cgroup *cgrp, struct cftype *cft,
> +				       struct seq_file *sf)
> +{
> +	struct blkcg *blkcg = cgroup_to_blkcg(cgrp);
> +
> +	blkcg_print_blkgs(sf, blkcg, cfqg_prfill_rwstat_recursive,
> +			  &blkcg_policy_cfq, cft->private, true);
> +	return 0;
> +}
> +
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  static u64 cfqg_prfill_avg_queue_size(struct seq_file *sf,
>  				      struct blkg_policy_data *pd, int off)
> @@ -1803,6 +1865,7 @@ static struct cftype cfq_blkcg_files[] = {
>  		.write_u64 = cfq_set_leaf_weight,
>  	},
>  
> +	/* statistics, covers only the tasks in the cfqg */
>  	{
>  		.name = "time",
>  		.private = offsetof(struct cfq_group, stats.time),
> @@ -1843,6 +1906,48 @@ static struct cftype cfq_blkcg_files[] = {
>  		.private = offsetof(struct cfq_group, stats.queued),
>  		.read_seq_string = cfqg_print_rwstat,
>  	},
> +
> +	/* the same statistics which cover the cfqg and its descendants */
> +	{
> +		.name = "time_recursive",
> +		.private = offsetof(struct cfq_group, stats.time),
> +		.read_seq_string = cfqg_print_stat_recursive,
> +	},
> +	{
> +		.name = "sectors_recursive",
> +		.private = offsetof(struct cfq_group, stats.sectors),
> +		.read_seq_string = cfqg_print_stat_recursive,
> +	},
> +	{
> +		.name = "io_service_bytes_recursive",
> +		.private = offsetof(struct cfq_group, stats.service_bytes),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_serviced_recursive",
> +		.private = offsetof(struct cfq_group, stats.serviced),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_service_time_recursive",
> +		.private = offsetof(struct cfq_group, stats.service_time),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_wait_time_recursive",
> +		.private = offsetof(struct cfq_group, stats.wait_time),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_merged_recursive",
> +		.private = offsetof(struct cfq_group, stats.merged),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
> +	{
> +		.name = "io_queued_recursive",
> +		.private = offsetof(struct cfq_group, stats.queued),
> +		.read_seq_string = cfqg_print_rwstat_recursive,
> +	},
>  #ifdef CONFIG_DEBUG_BLK_CGROUP
>  	{
>  		.name = "avg_queue_size",
> -- 
> 1.8.0.2

^ permalink raw reply	[flat|nested] 131+ messages in thread
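
The offset-delta trick in the patch above lets one prfill callback sum both the live and the dead instance of a counter: given the byte offset of a field inside ->stats, adding the compile-time delta between ->dead_stats and ->stats yields the offset of the matching field inside ->dead_stats. A self-contained sketch with illustrative structure names:

```c
#include <stddef.h>
#include <assert.h>

struct io_stats {
	unsigned long long time;
	unsigned long long sectors;
};

struct io_group {
	struct io_stats stats;       /* live counters */
	struct io_stats dead_stats;  /* folded in from dead children */
};

/* Constant delta between the two embedded stats blocks. */
static const int dead_off_delta = offsetof(struct io_group, dead_stats) -
				  offsetof(struct io_group, stats);

static unsigned long long stat_at(struct io_group *g, int off)
{
	return *(unsigned long long *)((char *)g + off);
}

/* Sum of the live and dead instances of the counter at @off. */
static unsigned long long live_plus_dead(struct io_group *g, int off)
{
	return stat_at(g, off) + stat_at(g, off + dead_off_delta);
}
```
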

* Re: [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2
  2013-01-07 16:34     ` Tejun Heo
@ 2013-01-08 18:28         ` Vivek Goyal
  -1 siblings, 0 replies; 131+ messages in thread
From: Vivek Goyal @ 2013-01-08 18:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw, ctalbott-hpIqsD4AKlfQT0dZR+AlfA,
	rni-hpIqsD4AKlfQT0dZR+AlfA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Jan 07, 2013 at 08:34:37AM -0800, Tejun Heo wrote:
> On Wed, Jan 02, 2013 at 01:20:38PM -0500, Vivek Goyal wrote:
> > On Fri, Dec 28, 2012 at 12:35:22PM -0800, Tejun Heo wrote:
> > 
> > [..]
> > > 
> > > * Updated to reflect Vivek's reviews - renames & documentation.
> > 
> > Hi Tejun,
> > 
> > You forgot to update blkio-controller.txt.
> 
> Just updated.  Does everything else look okay to you?

Yep, everything looks good to me. Thanks.

Vivek

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCHSET] block: implement blkcg hierarchy support in cfq, take#2
@ 2012-12-28 20:35 Tejun Heo
  0 siblings, 0 replies; 131+ messages in thread
From: Tejun Heo @ 2012-12-28 20:35 UTC (permalink / raw)
  To: lizefan-hv44wF8Li93QT0dZR+AlfA, axboe-tSWWG44O7X1aa/9Udqfwiw,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	rni-hpIqsD4AKlfQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ctalbott-hpIqsD4AKlfQT0dZR+AlfA

Hello,

This is the second iteration to implement blkcg hierarchy support in
cfq-iosched.  Changes from the first take[L] are

* Vivek's cfq cleanup patches are included in the series for
  convenience.

* Divide by zero bug when !CONFIG_CFQ_GROUP_IOSCHED reported by
  Fengguang fixed.

* Updated to reflect Vivek's reviews - renames & documentation.

* Recursive stats no longer forget stats from dead descendants.  This
  turned out to be more complex than I wished involving implementing
  policy on/offline callbacks.

cfq-iosched is currently utterly broken in how it handles cgroup
hierarchy.  It ignores the hierarchy structure and just treats all
blkcgs equally.  This is simply broken.  This breakage makes blkcg
behave very differently from other properly-hierarchical controllers
and makes it impossible to give any uniform interpretation to the
hierarchy, which in turn makes it impossible to implement unified
hierarchy.

Given the relative simplicity of cfqg scheduling, implementing proper
hierarchy support isn't that difficult.  All that's necessary is
determining what fraction of the device each cfqg on the service tree
can claim, considering the hierarchy.  The calculation can be done by
maintaining the sum of active weights at each level and compounding
the ratios from the cfqg in question to the root.  The overhead isn't
significant.  Tree traversals happen only when cfqgs are added to or
removed from the service tree, and they only walk from the cfqg being
modified to the root.

There are some design choices which are worth mentioning.

* Internal (non-leaf) cfqgs w/ tasks treat the tasks as a single unit
  competing against the children cfqgs.  New config knobs -
  blkio.leaf_weight[_device] - are added to configure the weight of
  these tasks.  Another way to look at it is that each cfqg has a
  hidden leaf child node attached to it which hosts all tasks and
  leaf_weight controls the weight of that hidden node.

  Treating cfqqs and cfqgs as equals doesn't make much sense to me and
  is hairy - we need to establish an ioprio-to-weight mapping, and the
  weights fluctuate as processes fork and exit.  This becomes hairier
  when considering multiple controllers.  Such mappings can't be
  established consistently across different controllers, and the
  weights are given out differently - ie. blkcg gives weights out to
  io_contexts while cpu gives them to tasks, which may share
  io_contexts.  It's difficult to make sense of what's going on.

  The goal is to bring cpu, currently the only other controller which
  implements weight based resource allocation, to similar behavior.

* The existing stats aren't converted to hierarchical; new
  hierarchical ones are added instead.  There isn't a way to do the
  conversion w/o introducing nasty silent surprises for the existing
  flat hierarchy users, so, while a bit clumsy, I can't see a better
  way.

* I based it on top of Vivek's cleanup patchset[1] but not the cfqq,
  cfqg scheduling unification patchset.  I don't think it's necessary
  or beneficial to mix the two and would really like to avoid messing
  with !blkcg scheduling logic.

The hierarchical scheduling itself is fairly simple.  The cfq part is
only ~260 lines, ~60 of which are comments, and the hierarchical
weight scaling is really straightforward.

This patchset contains the following 24 patches.

 0001-cfq-iosched-Properly-name-all-references-to-IO-class.patch
 0002-cfq-iosched-More-renaming-to-better-represent-wl_cla.patch
 0003-cfq-iosched-Rename-service_tree-to-st-at-some-places.patch
 0004-cfq-iosched-Rename-few-functions-related-to-selectin.patch
 0005-cfq-iosched-Get-rid-of-unnecessary-local-variable.patch
 0006-cfq-iosched-Print-sync-noidle-information-in-blktrac.patch
 0007-blkcg-fix-minor-bug-in-blkg_alloc.patch
 0008-blkcg-reorganize-blkg_lookup_create-and-friends.patch
 0009-blkcg-cosmetic-updates-to-blkg_create.patch
 0010-blkcg-make-blkcg_gq-s-hierarchical.patch
 0011-cfq-iosched-add-leaf_weight.patch
 0012-cfq-iosched-implement-cfq_group-nr_active-and-childr.patch
 0013-cfq-iosched-implement-hierarchy-ready-cfq_group-char.patch
 0014-cfq-iosched-convert-cfq_group_slice-to-use-cfqg-vfra.patch
 0015-cfq-iosched-enable-full-blkcg-hierarchy-support.patch
 0016-blkcg-add-blkg_policy_data-plid.patch
 0017-blkcg-implement-blkcg_policy-on-offline_pd_fn-and-bl.patch
 0018-blkcg-s-blkg_rwstat_sum-blkg_rwstat_total.patch
 0019-blkcg-implement-blkg_-rw-stat_recursive_sum-and-blkg.patch
 0020-block-RCU-free-request_queue.patch
 0021-blkcg-make-blkcg_print_blkgs-grab-q-locks-instead-of.patch
 0022-cfq-iosched-separate-out-cfqg_stats_reset-from-cfq_p.patch
 0023-cfq-iosched-collect-stats-from-dead-cfqgs.patch
 0024-cfq-iosched-add-hierarchical-cfq_group-statistics.patch

0001-0006 are Vivek's cfq cleanup patches.

0007-0009 are prep patches.

0010 makes blkcg core always allocate non-leaf blkgs so that any given
blkg is guaranteed to have all its ancestor blkgs to the root.

0011-0012 prepare for hierarchical scheduling.

0013-0014 implement hierarchy-ready cfqg scheduling.

0015 enables hierarchical scheduling.

0016-0022 prepare for hierarchical stats.

0023-0024 implement hierarchical stats.

This patchset is on top of linus#master (ecccd1248d ("mm: fix null
pointer dereference in wait_iff_congested()")) and is available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git blkcg-cfq-hierarchy

Thanks.

 Documentation/block/cfq-iosched.txt |   58 +++
 block/blk-cgroup.c                  |  276 +++++++++++++--
 block/blk-cgroup.h                  |   68 +++
 block/blk-sysfs.c                   |    9 
 block/cfq-iosched.c                 |  627 +++++++++++++++++++++++++++++-------
 include/linux/blkdev.h              |    2 
 6 files changed, 877 insertions(+), 163 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel.cgroups/5440
[1] https://lkml.org/lkml/2012/10/3/502
