linux-kernel.vger.kernel.org archive mirror
* [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface
@ 2010-10-21  2:32 Gui Jianfeng
  2010-10-21  2:34 ` [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support Gui Jianfeng
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-21  2:32 UTC
  To: Vivek Goyal, Jens Axboe
  Cc: Nauman Rafique, Chad Talbott, Divyesh Shah, linux kernel mailing list

Hi All,

This patchset enables hierarchical cfq group scheduling and adds a new blkio
cgroup interface, "use_hierarchy", to switch between hierarchical mode and
flat mode (a small usage sketch follows the patch list).


[PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
[PATCH 2/4] blkio-cgroup: Add a new interface use_hierarchy
[PATCH 3/4] cfq-iosched: Enable both hierarchical mode and flat mode for cfq group scheduling
[PATCH 4/4] blkio-cgroup: Documents for use_hierarchy interface
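
As a quick illustration (not part of the patchset), here is a minimal
userspace sketch of driving the new interface. It assumes the blkio
controller is mounted at /cgroup/blkio, which may differ on your system:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* "1" selects hierarchical mode, "0" flat mode; the write is only
	 * accepted while the root blkio cgroup has no children. */
	int fd = open("/cgroup/blkio/blkio.use_hierarchy", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "1", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}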


* [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-21  2:32 [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface Gui Jianfeng
@ 2010-10-21  2:34 ` Gui Jianfeng
  2010-10-22 20:54   ` Vivek Goyal
  2010-10-21  2:36 ` [PATCH 2/4] blkio-cgroup: Add a new interface use_hierarchy Gui Jianfeng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-21  2:34 UTC
  To: Vivek Goyal, Jens Axboe
  Cc: Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list, Gui Jianfeng

This patch enables cfq group hierarchical scheduling.

With this patch, you can create cgroup directories deeper than level 1.
I/O bandwidth is now distributed in a hierarchical way. For example,
we create cgroup directories as follows (the numbers represent weights):

            Root grp
           /       \
       grp_1(100) grp_2(400)
       /    \ 
  grp_3(200) grp_4(300)

If grp_2, grp_3 and grp_4 are contending for I/O bandwidth,
grp_2 will get 80% of the total bandwidth.
For the sub-groups, grp_3 gets 8% (20% * 40%) and grp_4 gets 12% (20% * 60%).
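
The shares above can be checked with a small standalone sketch (plain
userspace C, not kernel code): a group's effective share is its weight
divided by the total weight of its contending siblings, scaled by its
parent's share:

#include <stdio.h>

int main(void)
{
	/* Level 1: grp_1(100) vs grp_2(400). */
	double grp_2 = 400.0 / (100 + 400);		/* 0.80 */
	double grp_1 = 100.0 / (100 + 400);		/* 0.20 */
	/* Level 2 under grp_1: grp_3(200) vs grp_4(300). */
	double grp_3 = grp_1 * 200.0 / (200 + 300);	/* 0.08 */
	double grp_4 = grp_1 * 300.0 / (200 + 300);	/* 0.12 */

	printf("grp_2=%.0f%% grp_3=%.0f%% grp_4=%.0f%%\n",
	       grp_2 * 100, grp_3 * 100, grp_4 * 100);
	return 0;
}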

Design:
  o Each cfq group has its own group service tree.
  o Each cfq group contains a "group schedule entity" (gse) that
    schedules on its parent cfq group's service tree.
  o Each cfq group contains a "queue schedule entity" (qse) that
    represents all cfqqs located in this cfq group. It schedules
    on this group's service tree. For the time being, the root group
    qse's weight is 1000, and a subgroup qse's weight is 500.
  o All gses and the qse that belong to the same cfq group are
    scheduled on the same group service tree.
  o cfq groups are allocated recursively: when a cfq group needs to
    be allocated, its upper-level cfq groups are allocated as well.
  o When a cfq group is served, not only this cfq group but also its
    ancestors are charged (see the sketch below).
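
A simplified standalone sketch of the last point (condensed from
cfq_group_served() in the diff below; the weights here are illustrative):
serving a group charges its qse and then walks up, charging each ancestor
gse. vdisktime grows inversely with weight, as in cfq_scale_slice(); the
root gse has no parent and is never charged:

#include <stdio.h>

struct se_sketch {
	unsigned long long vdisktime;
	unsigned int weight;
	struct se_sketch *parent;
};

static unsigned long long scale(unsigned long charge, unsigned int weight)
{
	/* Heavier entities accumulate vdisktime more slowly. */
	return (unsigned long long)charge * 500 / weight;
}

static void charge_hierarchy(struct se_sketch *qse, unsigned long charge)
{
	struct se_sketch *se;

	qse->vdisktime += scale(charge, qse->weight);
	/* Walk up through the gses; stop before the root entity. */
	for (se = qse->parent; se && se->parent; se = se->parent)
		se->vdisktime += scale(charge, se->weight);
}

int main(void)
{
	struct se_sketch root_gse = { 0, 1000, NULL };
	struct se_sketch gse = { 0, 500, &root_gse };
	struct se_sketch qse = { 0, 500, &gse };

	charge_hierarchy(&qse, 8);
	printf("qse=%llu gse=%llu root=%llu\n",
	       qse.vdisktime, gse.vdisktime, root_gse.vdisktime);
	return 0;
}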

Changes v1 -> v2:
o Renamed some structs and variables according to Chad's comments.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/blk-cgroup.c  |    4 -
 block/cfq-iosched.c |  483 ++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 359 insertions(+), 128 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b1febd0..455768a 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1452,10 +1452,6 @@ blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 		goto done;
 	}
 
-	/* Currently we do not support hierarchy deeper than two level (0,1) */
-	if (parent != cgroup->top_cgroup)
-		return ERR_PTR(-EPERM);
-
 	blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
 	if (!blkcg)
 		return ERR_PTR(-ENOMEM);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ee9cd3a..5c3953d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -73,7 +73,8 @@ static DEFINE_IDA(cic_index_ida);
 #define cfq_class_rt(cfqq)	((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
 
 #define sample_valid(samples)	((samples) > 80)
-#define rb_entry_cfqg(node)	rb_entry((node), struct cfq_group, rb_node)
+#define rb_entry_se(node)	\
+	rb_entry((node), struct io_sched_entity, rb_node)
 
 /*
  * Most of our rbtree usage is for sorting with min extraction, so
@@ -171,19 +172,40 @@ enum wl_type_t {
 	SYNC_WORKLOAD = 2
 };
 
-/* This is per cgroup per device grouping structure */
-struct cfq_group {
+/*
+ * This is the schedule entity which is scheduled on a group service tree.
+ * It represents the schedule structure of a cfq group, or a bundle of cfq
+ * queues located in a cfq group.
+ */
+struct io_sched_entity {
+	struct cfq_rb_root *st;
+
 	/* group service_tree member */
 	struct rb_node rb_node;
-
-	/* group service_tree key */
 	u64 vdisktime;
 	unsigned int weight;
 	bool on_st;
+	bool is_group_se;
+	struct io_sched_entity *parent;
+};
+
+/* This is per cgroup per device grouping structure */
+struct cfq_group {
+	/* cfq group sched entity */
+	struct io_sched_entity group_se;
+
+	/* cfq queue sched entity */
+	struct io_sched_entity queue_se;
+
+	/* Service tree for the set of cfq_groups and cfqqs */
+	struct cfq_rb_root grp_service_tree;
 
 	/* number of cfqq currently on this group */
 	int nr_cfqq;
 
+	/* number of sub cfq groups */
+	int nr_subgp;
+
 	/* Per group busy queus average. Useful for workload slice calc. */
 	unsigned int busy_queues_avg[2];
 	/*
@@ -210,8 +232,6 @@ struct cfq_group {
  */
 struct cfq_data {
 	struct request_queue *queue;
-	/* Root service tree for cfq_groups */
-	struct cfq_rb_root grp_service_tree;
 	struct cfq_group root_group;
 
 	/*
@@ -398,6 +418,24 @@ static inline bool iops_mode(struct cfq_data *cfqd)
 		return false;
 }
 
+static inline struct cfq_group *
+cfqg_of_group_entity(struct io_sched_entity *se)
+{
+	if (se->is_group_se)
+		return container_of(se, struct cfq_group, group_se);
+	else
+		return NULL;
+}
+
+static inline struct cfq_group *
+cfqg_of_queue_entity(struct io_sched_entity *se)
+{
+	if (!se->is_group_se)
+		return container_of(se, struct cfq_group, queue_se);
+	else
+		return NULL;
+}
+
 static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
 {
 	if (cfq_class_idle(cfqq))
@@ -521,12 +559,13 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+static inline u64 cfq_scale_slice(unsigned long delta,
+				  struct io_sched_entity *se)
 {
 	u64 d = delta << CFQ_SERVICE_SHIFT;
 
 	d = d * BLKIO_WEIGHT_DEFAULT;
-	do_div(d, cfqg->weight);
+	do_div(d, se->weight);
 	return d;
 }
 
@@ -551,16 +590,16 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
 static void update_min_vdisktime(struct cfq_rb_root *st)
 {
 	u64 vdisktime = st->min_vdisktime;
-	struct cfq_group *cfqg;
+	struct io_sched_entity *se;
 
 	if (st->active) {
-		cfqg = rb_entry_cfqg(st->active);
-		vdisktime = cfqg->vdisktime;
+		se = rb_entry_se(st->active);
+		vdisktime = se->vdisktime;
 	}
 
 	if (st->left) {
-		cfqg = rb_entry_cfqg(st->left);
-		vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+		se = rb_entry_se(st->left);
+		vdisktime = min_vdisktime(vdisktime, se->vdisktime);
 	}
 
 	st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
@@ -590,9 +629,10 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
 static inline unsigned
 cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+	struct cfq_rb_root *st = queue_entity->st;
 
-	return cfq_target_latency * cfqg->weight / st->total_weight;
+	return cfq_target_latency * queue_entity->weight / st->total_weight;
 }
 
 static inline void
@@ -755,13 +795,13 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
 	return NULL;
 }
 
-static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+static struct io_sched_entity *cfq_rb_first_se(struct cfq_rb_root *root)
 {
 	if (!root->left)
 		root->left = rb_first(&root->rb);
 
 	if (root->left)
-		return rb_entry_cfqg(root->left);
+		return rb_entry_se(root->left);
 
 	return NULL;
 }
@@ -818,25 +858,25 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 }
 
 static inline s64
-cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+se_key(struct cfq_rb_root *st, struct io_sched_entity *se)
 {
-	return cfqg->vdisktime - st->min_vdisktime;
+	return se->vdisktime - st->min_vdisktime;
 }
 
 static void
-__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+__io_sched_entity_add(struct cfq_rb_root *st, struct io_sched_entity *se)
 {
 	struct rb_node **node = &st->rb.rb_node;
 	struct rb_node *parent = NULL;
-	struct cfq_group *__cfqg;
-	s64 key = cfqg_key(st, cfqg);
+	struct io_sched_entity *__se;
+	s64 key = se_key(st, se);
 	int left = 1;
 
 	while (*node != NULL) {
 		parent = *node;
-		__cfqg = rb_entry_cfqg(parent);
+		__se = rb_entry_se(parent);
 
-		if (key < cfqg_key(st, __cfqg))
+		if (key < se_key(st, __se))
 			node = &parent->rb_left;
 		else {
 			node = &parent->rb_right;
@@ -845,47 +885,82 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	}
 
 	if (left)
-		st->left = &cfqg->rb_node;
+		st->left = &se->rb_node;
 
-	rb_link_node(&cfqg->rb_node, parent, node);
-	rb_insert_color(&cfqg->rb_node, &st->rb);
+	rb_link_node(&se->rb_node, parent, node);
+	rb_insert_color(&se->rb_node, &st->rb);
 }
 
-static void
-cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void io_sched_entity_add(struct cfq_rb_root *st,
+				struct io_sched_entity *se)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-	struct cfq_group *__cfqg;
 	struct rb_node *n;
+	struct io_sched_entity *__se;
 
-	cfqg->nr_cfqq++;
-	if (cfqg->on_st)
-		return;
 
+	if (se->on_st)
+		return;
 	/*
 	 * Currently put the group at the end. Later implement something
 	 * so that groups get lesser vtime based on their weights, so that
 	 * if group does not loose all if it was not continously backlogged.
 	 */
-	n = rb_last(&st->rb);
+	n = rb_last(&se->st->rb);
 	if (n) {
-		__cfqg = rb_entry_cfqg(n);
-		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+		__se = rb_entry_se(n);
+		se->vdisktime = __se->vdisktime + CFQ_IDLE_DELAY;
 	} else
-		cfqg->vdisktime = st->min_vdisktime;
+		se->vdisktime = st->min_vdisktime;
 
-	__cfq_group_service_tree_add(st, cfqg);
-	cfqg->on_st = true;
-	st->total_weight += cfqg->weight;
+	__io_sched_entity_add(se->st, se);
+	se->on_st = true;
+	st->total_weight += se->weight;
+}
+
+static void
+cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+	struct io_sched_entity *group_entity = &cfqg->group_se;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+	struct cfq_group *__cfqg;
+
+	cfqg->nr_cfqq++;
+
+	io_sched_entity_add(queue_entity->st, queue_entity);
+
+	while (group_entity && group_entity->parent) {
+		if (group_entity->on_st)
+			return;
+		io_sched_entity_add(group_entity->st, group_entity);
+		group_entity = group_entity->parent;
+		__cfqg = cfqg_of_group_entity(group_entity);
+		__cfqg->nr_subgp++;
+	}
+}
+
+static void io_sched_entity_del(struct io_sched_entity *se)
+{
+	if (!RB_EMPTY_NODE(&se->rb_node))
+		cfq_rb_erase(&se->rb_node, se->st);
+
+	se->on_st = false;
+	se->st->total_weight -= se->weight;
 }
 
 static void
 cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
+	struct io_sched_entity *group_entity = &cfqg->group_se;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+	struct cfq_group *__cfqg, *p_cfqg;
+
+	if (group_entity->st &&
+	    group_entity->st->active == &group_entity->rb_node)
+		group_entity->st->active = NULL;
 
-	if (st->active == &cfqg->rb_node)
-		st->active = NULL;
+	if (queue_entity->st &&
+	    queue_entity->st->active == &queue_entity->rb_node)
+		queue_entity->st->active = NULL;
 
 	BUG_ON(cfqg->nr_cfqq < 1);
 	cfqg->nr_cfqq--;
@@ -894,13 +969,25 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	if (cfqg->nr_cfqq)
 		return;
 
-	cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
-	cfqg->on_st = false;
-	st->total_weight -= cfqg->weight;
-	if (!RB_EMPTY_NODE(&cfqg->rb_node))
-		cfq_rb_erase(&cfqg->rb_node, st);
-	cfqg->saved_workload_slice = 0;
-	cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
+	/* dequeue queue se from group */
+	io_sched_entity_del(queue_entity);
+
+	if (cfqg->nr_subgp)
+		return;
+
+	/* never dequeue the root group */
+	while (group_entity && group_entity->parent) {
+		__cfqg = cfqg_of_group_entity(group_entity);
+		p_cfqg = cfqg_of_group_entity(group_entity->parent);
+		io_sched_entity_del(group_entity);
+		cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
+		cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
+		__cfqg->saved_workload_slice = 0;
+		group_entity = group_entity->parent;
+		p_cfqg->nr_subgp--;
+		if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+			return;
+	}
 }
 
 static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -932,7 +1019,8 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
 static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 				struct cfq_queue *cfqq)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
+	struct io_sched_entity *group_entity = &cfqg->group_se;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
 	unsigned int used_sl, charge;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
@@ -945,10 +1033,26 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;
 
-	/* Can't update vdisktime while group is on service tree */
-	cfq_rb_erase(&cfqg->rb_node, st);
-	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
-	__cfq_group_service_tree_add(st, cfqg);
+	/*
+	 *  update queue se's vdisktime.
+	 *  Can't update vdisktime while group is on service tree.
+	 */
+
+	cfq_rb_erase(&queue_entity->rb_node, queue_entity->st);
+	queue_entity->vdisktime += cfq_scale_slice(charge, queue_entity);
+	__io_sched_entity_add(queue_entity->st, queue_entity);
+	if (&queue_entity->rb_node == queue_entity->st->active)
+		queue_entity->st->active = NULL;
+
+	while (group_entity && group_entity->parent) {
+		cfq_rb_erase(&group_entity->rb_node, group_entity->st);
+		group_entity->vdisktime += cfq_scale_slice(charge,
+							   group_entity);
+		__io_sched_entity_add(group_entity->st, group_entity);
+		if (&group_entity->rb_node == group_entity->st->active)
+			group_entity->st->active = NULL;
+		group_entity = group_entity->parent;
+	}
 
 	/* This group is being expired. Save the context */
 	if (time_after(cfqd->workload_expires, jiffies)) {
@@ -959,8 +1063,10 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	} else
 		cfqg->saved_workload_slice = 0;
 
-	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
-					st->min_vdisktime);
+	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
+		     group_entity->vdisktime,
+		     group_entity->st->min_vdisktime);
+
 	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
 			" sect=%u", used_sl, cfqq->slice_dispatch, charge,
 			iops_mode(cfqd), cfqq->nr_sectors);
@@ -976,39 +1082,84 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
 	return NULL;
 }
 
+static void cfq_put_cfqg(struct cfq_group *cfqg)
+{
+	struct cfq_rb_root *st;
+	int i, j;
+	struct io_sched_entity *group_entity;
+	struct cfq_group *p_cfqg;
+
+	BUG_ON(atomic_read(&cfqg->ref) <= 0);
+	if (!atomic_dec_and_test(&cfqg->ref))
+		return;
+	for_each_cfqg_st(cfqg, i, j, st)
+		BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
+
+	group_entity = &cfqg->group_se;
+	if (group_entity->parent) {
+		p_cfqg = cfqg_of_group_entity(group_entity->parent);
+		/* Drop the reference taken by children */
+		atomic_dec(&p_cfqg->ref);
+	}
+
+	kfree(cfqg);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+	/* Something wrong if we are trying to remove same group twice */
+	BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
+
+	hlist_del_init(&cfqg->cfqd_node);
+
+	/*
+	 * Put the reference taken at the time of creation so that when all
+	 * queues are gone, group can be destroyed.
+	 */
+	cfq_put_cfqg(cfqg);
+}
+
+static void init_group_queue_entity(struct blkio_cgroup *blkcg,
+				    struct cfq_group *cfqg)
+{
+	struct io_sched_entity *group_entity = &cfqg->group_se;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+
+	group_entity->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+	RB_CLEAR_NODE(&group_entity->rb_node);
+	group_entity->is_group_se = true;
+	group_entity->on_st = false;
+
+	/* Set to 500 for the time being */
+	queue_entity->weight = 500;
+	RB_CLEAR_NODE(&queue_entity->rb_node);
+	queue_entity->is_group_se = false;
+	queue_entity->on_st = false;
+}
+
 void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
 					unsigned int weight)
 {
-	cfqg_of_blkg(blkg)->weight = weight;
+	struct cfq_group *cfqg;
+	struct io_sched_entity *group_entity;
+
+	cfqg = cfqg_of_blkg(blkg);
+	group_entity = &cfqg->group_se;
+	group_entity->weight = weight;
 }
 
-static struct cfq_group *
-cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
+		      struct cfq_group *cfqg)
 {
-	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
-	struct cfq_group *cfqg = NULL;
-	void *key = cfqd;
 	int i, j;
 	struct cfq_rb_root *st;
-	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	unsigned int major, minor;
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 
-	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
-	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfqg->blkg.dev = MKDEV(major, minor);
-		goto done;
-	}
-	if (cfqg || !create)
-		goto done;
-
-	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
-	if (!cfqg)
-		goto done;
+	cfqg->grp_service_tree = CFQ_RB_ROOT;
 
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
-	RB_CLEAR_NODE(&cfqg->rb_node);
 
 	/*
 	 * Take the initial reference that will be released on destroy
@@ -1022,11 +1173,102 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 	cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
 					MKDEV(major, minor));
-	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
-
+	init_group_queue_entity(blkcg, cfqg);
 	/* Add group on cfqd list */
 	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+}
+
+static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+	if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
+		cfq_destroy_cfqg(cfqd, cfqg);
+}
+
+static void cfqg_set_parent(struct cfq_group *cfqg, struct cfq_group *p_cfqg)
+{
+	struct io_sched_entity *group_entity = &cfqg->group_se;
+	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+	struct io_sched_entity *p_group_entity = &p_cfqg->group_se;
+
+	group_entity->parent = p_group_entity;
+	group_entity->st = &p_cfqg->grp_service_tree;
+
+	queue_entity->parent = group_entity;
+	queue_entity->st = &cfqg->grp_service_tree;
+
+	/* child reference */
+	atomic_inc(&p_cfqg->ref);
+}
+
+int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
+{
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+	struct blkio_cgroup *p_blkcg;
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	unsigned int major, minor;
+	struct cfq_group *cfqg, *p_cfqg;
+	void *key = cfqd;
+	int ret;
+
+	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+	if (cfqg) {
+		if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+			sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+			cfqg->blkg.dev = MKDEV(major, minor);
+		}
+		/* chain has already been built */
+		return 0;
+	}
+
+	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+	if (!cfqg)
+		return -1;
+
+	init_cfqg(cfqd, blkcg, cfqg);
+
+	/* Already at the top group */
+	if (!cgroup->parent)
+		return 0;
+
+	ret = cfqg_chain_alloc(cfqd, cgroup->parent);
+	if (ret == -1) {
+		uninit_cfqg(cfqd, cfqg);
+		return -1;
+	}
+
+	p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+	p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
+	BUG_ON(p_cfqg == NULL);
+
+	cfqg_set_parent(cfqg, p_cfqg);
+	return 0;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+	struct cfq_group *cfqg = NULL;
+	void *key = cfqd;
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	unsigned int major, minor;
+	int ret;
 
+	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		cfqg->blkg.dev = MKDEV(major, minor);
+		goto done;
+	}
+	if (cfqg || !create)
+		goto done;
+
+	ret = cfqg_chain_alloc(cfqd, cgroup);
+	if (!ret) {
+		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+		BUG_ON(cfqg == NULL);
+		goto done;
+	}
 done:
 	return cfqg;
 }
@@ -1066,33 +1308,6 @@ static void cfq_link_cfqq_cfqg(struct cfq_queue *cfqq, struct cfq_group *cfqg)
 	atomic_inc(&cfqq->cfqg->ref);
 }
 
-static void cfq_put_cfqg(struct cfq_group *cfqg)
-{
-	struct cfq_rb_root *st;
-	int i, j;
-
-	BUG_ON(atomic_read(&cfqg->ref) <= 0);
-	if (!atomic_dec_and_test(&cfqg->ref))
-		return;
-	for_each_cfqg_st(cfqg, i, j, st)
-		BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
-	kfree(cfqg);
-}
-
-static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
-{
-	/* Something wrong if we are trying to remove same group twice */
-	BUG_ON(hlist_unhashed(&cfqg->cfqd_node));
-
-	hlist_del_init(&cfqg->cfqd_node);
-
-	/*
-	 * Put the reference taken at the time of creation so that when all
-	 * queues are gone, group can be destroyed.
-	 */
-	cfq_put_cfqg(cfqg);
-}
-
 static void cfq_release_cfq_groups(struct cfq_data *cfqd)
 {
 	struct hlist_node *pos, *n;
@@ -1667,9 +1882,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (cfqq == cfqd->active_queue)
 		cfqd->active_queue = NULL;
 
-	if (&cfqq->cfqg->rb_node == cfqd->grp_service_tree.active)
-		cfqd->grp_service_tree.active = NULL;
-
 	if (cfqd->active_cic) {
 		put_io_context(cfqd->active_cic->ioc);
 		cfqd->active_cic = NULL;
@@ -2171,17 +2383,26 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	cfqd->workload_expires = jiffies + slice;
 }
 
+
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
 {
-	struct cfq_rb_root *st = &cfqd->grp_service_tree;
+	struct cfq_group *root_group = &cfqd->root_group;
+	struct cfq_rb_root *st = &root_group->grp_service_tree;
 	struct cfq_group *cfqg;
+	struct io_sched_entity *se;
 
-	if (RB_EMPTY_ROOT(&st->rb))
-		return NULL;
-	cfqg = cfq_rb_first_group(st);
-	st->active = &cfqg->rb_node;
-	update_min_vdisktime(st);
-	return cfqg;
+	do {
+		se = cfq_rb_first_se(st);
+		if (!se)
+			return NULL;
+		st->active = &se->rb_node;
+		update_min_vdisktime(st);
+		cfqg = cfqg_of_queue_entity(se);
+		if (cfqg)
+			return cfqg;
+		cfqg = cfqg_of_group_entity(se);
+		st = &cfqg->grp_service_tree;
+	} while (1);
 }
 
 static void cfq_choose_cfqg(struct cfq_data *cfqd)
@@ -2213,15 +2434,18 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	if (!cfqq)
 		goto new_queue;
 
+
 	if (!cfqd->rq_queued)
 		return NULL;
 
+
 	/*
 	 * We were waiting for group to get backlogged. Expire the queue
 	 */
 	if (cfq_cfqq_wait_busy(cfqq) && !RB_EMPTY_ROOT(&cfqq->sort_list))
 		goto expire;
 
+
 	/*
 	 * The active queue has run out of time, expire it and select new.
 	 */
@@ -2243,6 +2467,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 			goto check_group_idle;
 	}
 
+
 	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
@@ -2250,6 +2475,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
 		goto keep_queue;
 
+
 	/*
 	 * If another queue has a request waiting within our mean seek
 	 * distance, let it run.  The expire code will check for close
@@ -2503,6 +2729,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	}
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
+
 	return 1;
 }
 
@@ -3844,17 +4071,25 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->cic_index = i;
 
-	/* Init root service tree */
-	cfqd->grp_service_tree = CFQ_RB_ROOT;
-
 	/* Init root group */
 	cfqg = &cfqd->root_group;
+	cfqg->grp_service_tree = CFQ_RB_ROOT;
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
-	RB_CLEAR_NODE(&cfqg->rb_node);
-
+	cfqg->group_se.is_group_se = true;
+	RB_CLEAR_NODE(&cfqg->group_se.rb_node);
+	cfqg->group_se.on_st = false;
 	/* Give preference to root group over other groups */
-	cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+	cfqg->group_se.weight = 2*BLKIO_WEIGHT_DEFAULT;
+	cfqg->group_se.parent = NULL;
+	cfqg->group_se.st = NULL;
+
+	cfqg->queue_se.is_group_se = false;
+	RB_CLEAR_NODE(&cfqg->queue_se.rb_node);
+	cfqg->queue_se.on_st = false;
+	cfqg->queue_se.weight = 2*BLKIO_WEIGHT_DEFAULT;
+	cfqg->queue_se.parent = &cfqg->group_se;
+	cfqg->queue_se.st = &cfqg->grp_service_tree;
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	/*
-- 1.6.5.2 


* [PATCH 2/4] blkio-cgroup: Add a new interface use_hierarchy
  2010-10-21  2:32 [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface Gui Jianfeng
  2010-10-21  2:34 ` [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support Gui Jianfeng
@ 2010-10-21  2:36 ` Gui Jianfeng
  2010-10-21  2:36 ` [PATCH 3/4] cfq-iosched: Enable both hierarchical mode and flat mode for cfq group scheduling Gui Jianfeng
  2010-10-21  2:37 ` [PATCH 4/4] blkio-cgroup: Documents for use_hierarchy interface Gui Jianfeng
  3 siblings, 0 replies; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-21  2:36 UTC
  To: Vivek Goyal, Jens Axboe
  Cc: Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list, Gui Jianfeng

This patch just adds a new interface, use_hierarchy, without enabling any
functionality. Currently, "use_hierarchy" only appears in the root cgroup.
A later patch will make use of this interface to switch between hierarchical
mode and flat mode for cfq group scheduling.
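
To make the intended semantics concrete, here is a toy userspace model
(an illustration, not the kernel code below) of the write-side checks:
only 0 and 1 are accepted, and the mode can only be changed while the
root cgroup has no children:

#include <stdio.h>

struct toy_cgroup {
	int nr_children;
	unsigned long long use_hierarchy;
};

static int use_hierarchy_write(struct toy_cgroup *cg, unsigned long long val)
{
	if (val > 1 || cg->nr_children)
		return -22;			/* -EINVAL */
	if (cg->use_hierarchy == val)
		return 0;			/* nothing to do */
	cg->use_hierarchy = val;		/* then notify the policies */
	return 0;
}

int main(void)
{
	struct toy_cgroup root = { .nr_children = 0, .use_hierarchy = 1 };

	printf("write 2 -> %d\n", use_hierarchy_write(&root, 2));  /* -22 */
	root.nr_children = 1;
	printf("write 0 -> %d\n", use_hierarchy_write(&root, 0));  /* -22 */
	root.nr_children = 0;
	printf("write 0 -> %d\n", use_hierarchy_write(&root, 0));  /* 0 */
	return 0;
}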

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/blk-cgroup.c  |   74 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk-cgroup.h  |    5 +++-
 block/cfq-iosched.c |   26 +++++++++++++++++-
 3 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..5ff5b60 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -25,7 +25,10 @@
 static DEFINE_SPINLOCK(blkio_list_lock);
 static LIST_HEAD(blkio_list);
 
-struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
+struct blkio_cgroup blkio_root_cgroup = {
+		.weight = 2*BLKIO_WEIGHT_DEFAULT,
+		.use_hierarchy = 1,
+	};
 EXPORT_SYMBOL_GPL(blkio_root_cgroup);
 
 static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
@@ -433,6 +436,8 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid)
 {
+	struct hlist_node *n;
+
 	unsigned long flags;
 
 	spin_lock_irqsave(&blkcg->lock, flags);
@@ -1385,10 +1390,73 @@ struct cftype blkio_files[] = {
 #endif
 };
 
+static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
+				      struct cftype *cftype)
+{
+	struct blkio_cgroup *blkcg;
+
+	blkcg = cgroup_to_blkio_cgroup(cgroup);
+	return (u64)blkcg->use_hierarchy;
+}
+
+static int
+blkiocg_use_hierarchy_write(struct cgroup *cgroup,
+			    struct cftype *cftype, u64 val)
+{
+	struct blkio_cgroup *blkcg;
+	struct blkio_group *blkg;
+	struct hlist_node *n;
+	struct blkio_policy_type *blkiop;
+
+	blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+	if (val > 1 || !list_empty(&cgroup->children))
+		return -EINVAL;
+
+	if (blkcg->use_hierarchy == val)
+		return 0;
+
+	spin_lock(&blkio_list_lock);
+	blkcg->use_hierarchy = val;
+
+	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
+		list_for_each_entry(blkiop, &blkio_list, list) {
+			/*
+			 * If this policy does not own the blkg, do not
+			 * change cfq group scheduling mode.
+			 */
+			if (blkiop->plid != blkg->plid)
+				continue;
+
+			if (blkiop->ops.blkio_update_use_hierarchy_fn)
+				blkiop->ops.blkio_update_use_hierarchy_fn(
+					blkg, val);
+		}
+	}
+	spin_unlock(&blkio_list_lock);
+	return 0;
+}
+
+static struct cftype blkio_use_hierarchy = {
+	.name = "use_hierarchy",
+	.read_u64 = blkiocg_use_hierarchy_read,
+	.write_u64 = blkiocg_use_hierarchy_write,
+};
+
 static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 {
-	return cgroup_add_files(cgroup, subsys, blkio_files,
-				ARRAY_SIZE(blkio_files));
+	int ret;
+
+	ret = cgroup_add_files(cgroup, subsys, blkio_files,
+			      ARRAY_SIZE(blkio_files));
+	if (ret)
+		return ret;
+
+	/* use_hierarchy is in root cgroup only. */
+	if (!cgroup->parent)
+		ret = cgroup_add_file(cgroup, subsys, &blkio_use_hierarchy);
+
+	return ret;
 }
 
 static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ea4861b..c8caf4e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -105,6 +105,7 @@ enum blkcg_file_name_throtl {
 struct blkio_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int weight;
+	bool use_hierarchy;
 	spinlock_t lock;
 	struct hlist_head blkg_list;
 	struct list_head policy_list; /* list of blkio_policy_node */
@@ -200,7 +201,8 @@ typedef void (blkio_update_group_read_iops_fn) (void *key,
 			struct blkio_group *blkg, unsigned int read_iops);
 typedef void (blkio_update_group_write_iops_fn) (void *key,
 			struct blkio_group *blkg, unsigned int write_iops);
-
+typedef void (blkio_update_use_hierarchy_fn) (struct blkio_group *blkg,
+					      bool val);
 struct blkio_policy_ops {
 	blkio_unlink_group_fn *blkio_unlink_group_fn;
 	blkio_update_group_weight_fn *blkio_update_group_weight_fn;
@@ -208,6 +210,7 @@ struct blkio_policy_ops {
 	blkio_update_group_write_bps_fn *blkio_update_group_write_bps_fn;
 	blkio_update_group_read_iops_fn *blkio_update_group_read_iops_fn;
 	blkio_update_group_write_iops_fn *blkio_update_group_write_iops_fn;
+	blkio_update_use_hierarchy_fn *blkio_update_use_hierarchy_fn;
 };
 
 struct blkio_policy_type {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5c3953d..f781e4d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -200,6 +200,9 @@ struct cfq_group {
 	/* Service tree for the set of cfq_groups and cfqqs */
 	struct cfq_rb_root grp_service_tree;
 
+	/* parent cfq_data */
+	struct cfq_data *cfqd;
+
 	/* number of cfqq currently on this group */
 	int nr_cfqq;
 
@@ -234,6 +237,9 @@ struct cfq_data {
 	struct request_queue *queue;
 	struct cfq_group root_group;
 
+	/* cfq groups schedule in a flat or hierarchical manner. */
+	bool use_hierarchy;
+
 	/*
 	 * The priority currently being served
 	 */
@@ -854,7 +860,7 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 	 * just an approximation, should be ok.
 	 */
 	return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
-		       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
+		   cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
 }
 
 static inline s64
@@ -1119,6 +1125,15 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	cfq_put_cfqg(cfqg);
 }
 
+void
+cfq_update_blkio_use_hierarchy(struct blkio_group *blkg, bool val)
+{
+	struct cfq_group *cfqg;
+
+	cfqg = cfqg_of_blkg(blkg);
+	cfqg->cfqd->use_hierarchy = val;
+}
+
 static void init_group_queue_entity(struct blkio_cgroup *blkcg,
 				    struct cfq_group *cfqg)
 {
@@ -1169,6 +1184,9 @@ static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
 	 */
 	atomic_set(&cfqg->ref, 1);
 
+	/* Setup cfq data for cfq group */
+	cfqg->cfqd = cfqd;
+
 	/* Add group onto cgroup list */
 	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 	cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
@@ -4073,6 +4091,7 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	/* Init root group */
 	cfqg = &cfqd->root_group;
+	cfqg->cfqd = cfqd;
 	cfqg->grp_service_tree = CFQ_RB_ROOT;
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
@@ -4142,6 +4161,10 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_latency = 1;
 	cfqd->cfq_group_isolation = 0;
 	cfqd->hw_tag = -1;
+
+	/* hierarchical scheduling for cfq group by default */
+	cfqd->use_hierarchy = 1;
+
 	/*
 	 * we optimistically start assuming sync ops weren't delayed in last
 	 * second, in order to have larger depth for async operations.
@@ -4304,6 +4327,7 @@ static struct blkio_policy_type blkio_policy_cfq = {
 	.ops = {
 		.blkio_unlink_group_fn =	cfq_unlink_blkio_group,
 		.blkio_update_group_weight_fn =	cfq_update_blkio_group_weight,
+		.blkio_update_use_hierarchy_fn = cfq_update_blkio_use_hierarchy,
 	},
 	.plid = BLKIO_POLICY_PROP,
 };
-- 1.6.5.2 


* [PATCH 3/4] cfq-iosched: Enable both hierarchical mode and flat mode for cfq group scheduling
  2010-10-21  2:32 [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface Gui Jianfeng
  2010-10-21  2:34 ` [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support Gui Jianfeng
  2010-10-21  2:36 ` [PATCH 2/4] blkio-cgroup: Add a new interface use_hierarchy Gui Jianfeng
@ 2010-10-21  2:36 ` Gui Jianfeng
  2010-10-21  2:37 ` [PATCH 4/4] blkio-cgroup: Documents for use_hierarchy interface Gui Jianfeng
  3 siblings, 0 replies; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-21  2:36 UTC
  To: Vivek Goyal, Jens Axboe
  Cc: Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list, Gui Jianfeng

This patch enables both hierarchical mode and flat mode for cfq group
scheduling. Users can switch between the two modes via the "use_hierarchy"
interface in the blkio cgroup.
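
The core difference between the two modes is how the next group is picked;
here is a toy standalone model (not the kernel code, condensed from
cfq_get_next_cfqg() in the diff below): flat mode takes the leftmost group
entity from one global tree, while hierarchical mode descends per-group
trees until it reaches a queue entity:

#include <stddef.h>
#include <stdio.h>

struct service_tree;

struct entity {
	const char *name;
	int is_group;			/* group entity vs queue entity */
	struct service_tree *child;	/* sub-tree, valid for groups */
};

struct service_tree {
	struct entity *leftmost;	/* stand-in for rb_first() */
};

static struct entity *pick_hierarchical(struct service_tree *st)
{
	struct entity *se;

	while ((se = st->leftmost) != NULL) {
		if (!se->is_group)
			return se;	/* queue entity: dispatch its cfqqs */
		st = se->child;		/* descend into the sub-group tree */
	}
	return NULL;
}

int main(void)
{
	struct entity qse = { "qse:grp_3", 0, NULL };
	struct service_tree sub = { &qse };
	struct entity gse = { "gse:grp_1", 1, &sub };
	struct service_tree root = { &gse };

	/* Flat mode would simply return root.leftmost. */
	printf("picked %s\n", pick_hierarchical(&root)->name);
	return 0;
}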

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 block/cfq-iosched.c |  256 +++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 196 insertions(+), 60 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index f781e4d..98c9191 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -240,6 +240,9 @@ struct cfq_data {
 	/* cfq groups schedule in a flat or hierarchical manner. */
 	bool use_hierarchy;
 
+	/* Service tree for cfq group flat scheduling mode. */
+	struct cfq_rb_root grp_service_tree;
+
 	/*
 	 * The priority currently being served
 	 */
@@ -635,10 +638,20 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
 static inline unsigned
 cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
-	struct io_sched_entity *queue_entity = &cfqg->queue_se;
-	struct cfq_rb_root *st = queue_entity->st;
+	struct cfq_rb_root *st;
+	unsigned int weight;
+
+	if (cfqd->use_hierarchy) {
+		struct io_sched_entity *queue_entity = &cfqg->queue_se;
+		st = queue_entity->st;
+		weight = queue_entity->weight;
+	} else {
+		struct io_sched_entity *group_entity = &cfqg->group_se;
+		st = &cfqd->grp_service_tree;
+		weight = group_entity->weight;
+	}
 
-	return cfq_target_latency * queue_entity->weight / st->total_weight;
+	return cfq_target_latency * weight / st->total_weight;
 }
 
 static inline void
@@ -932,16 +945,30 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 	cfqg->nr_cfqq++;
 
-	io_sched_entity_add(queue_entity->st, queue_entity);
+	if (cfqd->use_hierarchy) {
+		io_sched_entity_add(queue_entity->st, queue_entity);
 
-	while (group_entity && group_entity->parent) {
+		while (group_entity && group_entity->parent) {
+			if (group_entity->on_st)
+				return;
+			io_sched_entity_add(group_entity->st, group_entity);
+			group_entity = group_entity->parent;
+			__cfqg = cfqg_of_group_entity(group_entity);
+			__cfqg->nr_subgp++;
+		}
+	} else {
 		if (group_entity->on_st)
 			return;
+
+		/*
+		 * For flat mode, all cfq groups schedule on the global
+		 * service tree (cfqd->grp_service_tree).
+		 */
 		io_sched_entity_add(group_entity->st, group_entity);
-		group_entity = group_entity->parent;
-		__cfqg = cfqg_of_group_entity(group_entity);
-		__cfqg->nr_subgp++;
+
 	}
+
+
 }
 
 static void io_sched_entity_del(struct io_sched_entity *se)
@@ -975,24 +1002,32 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	if (cfqg->nr_cfqq)
 		return;
 
-	/* dequeue queue se from group */
-	io_sched_entity_del(queue_entity);
+	/* For the cfq group hierarchical scheduling case */
+	if (cfqd->use_hierarchy) {
+		/* dequeue queue se from group */
+		io_sched_entity_del(queue_entity);
 
-	if (cfqg->nr_subgp)
-		return;
+		if (cfqg->nr_subgp)
+			return;
 
-	/* never dequeue the root group */
-	while (group_entity && group_entity->parent) {
-		__cfqg = cfqg_of_group_entity(group_entity);
-		p_cfqg = cfqg_of_group_entity(group_entity->parent);
+		/* never dequeue the root group */
+		while (group_entity && group_entity->parent) {
+			__cfqg = cfqg_of_group_entity(group_entity);
+			p_cfqg = cfqg_of_group_entity(group_entity->parent);
+			io_sched_entity_del(group_entity);
+			cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
+			cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
+			__cfqg->saved_workload_slice = 0;
+			group_entity = group_entity->parent;
+			p_cfqg->nr_subgp--;
+			if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+				return;
+		}
+	} else {
+		cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
 		io_sched_entity_del(group_entity);
-		cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
-		cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
-		__cfqg->saved_workload_slice = 0;
-		group_entity = group_entity->parent;
-		p_cfqg->nr_subgp--;
-		if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
-			return;
+		cfqg->saved_workload_slice = 0;
+		cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
 	}
 }
 
@@ -1026,7 +1061,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 				struct cfq_queue *cfqq)
 {
 	struct io_sched_entity *group_entity = &cfqg->group_se;
-	struct io_sched_entity *queue_entity = &cfqg->queue_se;
+	struct io_sched_entity *queue_entity;
 	unsigned int used_sl, charge;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
@@ -1039,25 +1074,33 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
 		charge = cfqq->allocated_slice;
 
-	/*
-	 *  update queue se's vdisktime.
-	 *  Can't update vdisktime while group is on service tree.
-	 */
-
-	cfq_rb_erase(&queue_entity->rb_node, queue_entity->st);
-	queue_entity->vdisktime += cfq_scale_slice(charge, queue_entity);
-	__io_sched_entity_add(queue_entity->st, queue_entity);
-	if (&queue_entity->rb_node == queue_entity->st->active)
-		queue_entity->st->active = NULL;
-
-	while (group_entity && group_entity->parent) {
+	if (cfqd->use_hierarchy) {
+		/*
+		 *  update queue se's vdisktime.
+		 *  Can't update vdisktime while group is on service tree.
+		 */
+		queue_entity = &cfqg->queue_se;
+		cfq_rb_erase(&queue_entity->rb_node, queue_entity->st);
+		queue_entity->vdisktime += cfq_scale_slice(charge,
+							   queue_entity);
+		__io_sched_entity_add(queue_entity->st, queue_entity);
+		if (&queue_entity->rb_node == queue_entity->st->active)
+			queue_entity->st->active = NULL;
+
+		while (group_entity && group_entity->parent) {
+			cfq_rb_erase(&group_entity->rb_node, group_entity->st);
+			group_entity->vdisktime += cfq_scale_slice(charge,
+								 group_entity);
+			__io_sched_entity_add(group_entity->st, group_entity);
+			if (&group_entity->rb_node == group_entity->st->active)
+				group_entity->st->active = NULL;
+			group_entity = group_entity->parent;
+		}
+	} else {
 		cfq_rb_erase(&group_entity->rb_node, group_entity->st);
 		group_entity->vdisktime += cfq_scale_slice(charge,
 							   group_entity);
 		__io_sched_entity_add(group_entity->st, group_entity);
-		if (&group_entity->rb_node == group_entity->st->active)
-			group_entity->st->active = NULL;
-		group_entity = group_entity->parent;
 	}
 
 	/* This group is being expired. Save the context */
@@ -1125,13 +1168,35 @@ static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	cfq_put_cfqg(cfqg);
 }
 
-void
-cfq_update_blkio_use_hierarchy(struct blkio_group *blkg, bool val)
+static int cfq_forced_dispatch(struct cfq_data *cfqd);
+
+void cfq_update_blkio_use_hierarchy(struct blkio_group *blkg, bool val)
 {
+	unsigned long flags;
 	struct cfq_group *cfqg;
+	struct cfq_data *cfqd;
+	struct io_sched_entity *group_entity;
+	int nr;
 
+	/* Get root group here */
 	cfqg = cfqg_of_blkg(blkg);
-	cfqg->cfqd->use_hierarchy = val;
+	cfqd = cfqg->cfqd;
+
+	spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+	/* Drain all requests */
+	nr = cfq_forced_dispatch(cfqd);
+
+	group_entity = &cfqg->group_se;
+
+	if (!val)
+		group_entity->st = &cfqd->grp_service_tree;
+	else
+		group_entity->st = NULL;
+
+	cfqd->use_hierarchy = val;
+
+	spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
 }
 
 static void init_group_queue_entity(struct blkio_cgroup *blkcg,
@@ -1202,11 +1267,21 @@ static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
 		cfq_destroy_cfqg(cfqd, cfqg);
 }
 
-static void cfqg_set_parent(struct cfq_group *cfqg, struct cfq_group *p_cfqg)
+static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
+			    struct cfq_group *p_cfqg)
 {
-	struct io_sched_entity *group_entity = &cfqg->group_se;
-	struct io_sched_entity *queue_entity = &cfqg->queue_se;
-	struct io_sched_entity *p_group_entity = &p_cfqg->group_se;
+	struct io_sched_entity *group_entity, *queue_entity, *p_group_entity;
+
+	group_entity = &cfqg->group_se;
+
+	if (!p_cfqg) {
+		group_entity->st = &cfqd->grp_service_tree;
+		group_entity->parent = NULL;
+		return;
+	}
+
+	queue_entity = &cfqg->queue_se;
+	p_group_entity = &p_cfqg->group_se;
 
 	group_entity->parent = p_group_entity;
 	group_entity->st = &p_cfqg->grp_service_tree;
@@ -1258,10 +1333,39 @@ int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
 	p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
 	BUG_ON(p_cfqg == NULL);
 
-	cfqg_set_parent(cfqg, p_cfqg);
+	cfqg_set_parent(cfqd, cfqg, p_cfqg);
 	return 0;
 }
 
+static struct cfq_group *cfqg_alloc(struct cfq_data *cfqd,
+				    struct cgroup *cgroup)
+{
+	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	unsigned int major, minor;
+	struct cfq_group *cfqg;
+	void *key = cfqd;
+
+	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+	if (cfqg) {
+		if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+			sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+			cfqg->blkg.dev = MKDEV(major, minor);
+		}
+		return cfqg;
+	}
+
+	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+	if (!cfqg)
+		return NULL;
+
+	init_cfqg(cfqd, blkcg, cfqg);
+
+	cfqg_set_parent(cfqd, cfqg, NULL);
+
+	return cfqg;
+}
+
 static struct cfq_group *
 cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 {
@@ -1281,11 +1385,26 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
 	if (cfqg || !create)
 		goto done;
 
-	ret = cfqg_chain_alloc(cfqd, cgroup);
-	if (!ret) {
-		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
-		BUG_ON(cfqg == NULL);
-		goto done;
+	if (!cfqd->use_hierarchy) {
+		/*
+		 * For flat cfq group scheduling, we just need to allocate a
+		 * single cfq group.
+		 */
+		cfqg = cfqg_alloc(cfqd, cgroup);
+		if (!cfqg)
+			goto done;
+		return cfqg;
+	} else {
+		/*
+		 * For hierarchical cfq group scheduling, we need to allocate
+		 * the whole cfq group chain.
+		 */
+		ret = cfqg_chain_alloc(cfqd, cgroup);
+		if (!ret) {
+			cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+			BUG_ON(cfqg == NULL);
+			goto done;
+		}
 	}
 done:
 	return cfqg;
@@ -2404,23 +2523,37 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
 {
-	struct cfq_group *root_group = &cfqd->root_group;
-	struct cfq_rb_root *st = &root_group->grp_service_tree;
+	struct cfq_rb_root *st;
 	struct cfq_group *cfqg;
 	struct io_sched_entity *se;
 
-	do {
+	if (cfqd->use_hierarchy) {
+		struct cfq_group *root_group = &cfqd->root_group;
+		st = &root_group->grp_service_tree;
+
+		do {
+			se = cfq_rb_first_se(st);
+			if (!se)
+				return NULL;
+			st->active = &se->rb_node;
+			update_min_vdisktime(st);
+			cfqg = cfqg_of_queue_entity(se);
+			if (cfqg)
+				return cfqg;
+			cfqg = cfqg_of_group_entity(se);
+			st = &cfqg->grp_service_tree;
+		} while (1);
+	} else {
+		st = &cfqd->grp_service_tree;
 		se = cfq_rb_first_se(st);
 		if (!se)
 			return NULL;
 		st->active = &se->rb_node;
 		update_min_vdisktime(st);
-		cfqg = cfqg_of_queue_entity(se);
-		if (cfqg)
-			return cfqg;
 		cfqg = cfqg_of_group_entity(se);
-		st = &cfqg->grp_service_tree;
-	} while (1);
+		BUG_ON(!cfqg);
+		return cfqg;
+	}
 }
 
 static void cfq_choose_cfqg(struct cfq_data *cfqd)
@@ -4089,6 +4222,9 @@ static void *cfq_init_queue(struct request_queue *q)
 
 	cfqd->cic_index = i;
 
+	/* Init flat service tree */
+	cfqd->grp_service_tree = CFQ_RB_ROOT;
+
 	/* Init root group */
 	cfqg = &cfqd->root_group;
 	cfqg->cfqd = cfqd;
-- 1.6.5.2 


* [PATCH 4/4] blkio-cgroup: Documents for use_hierarchy interface
  2010-10-21  2:32 [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface Gui Jianfeng
                   ` (2 preceding siblings ...)
  2010-10-21  2:36 ` [PATCH 3/4] cfq-iosched: Enable both hierarchical mode and flat mode for cfq group scheduling Gui Jianfeng
@ 2010-10-21  2:37 ` Gui Jianfeng
  3 siblings, 0 replies; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-21  2:37 UTC
  To: Vivek Goyal, Jens Axboe
  Cc: Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list, Gui Jianfeng

Add documentation for the use_hierarchy interface.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
 Documentation/cgroups/blkio-controller.txt |   21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index d6da611..df6c938 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -15,6 +15,21 @@ one is throttling policy which can be used to specify upper IO rate limits
 on devices. This policy is implemented in generic block layer and can be
 used on leaf nodes as well as higher level logical devices like device mapper.
 
+Currently, both hierarchical bandwidth division and flat bandwidth division are
+supported. Consider the following hierarchy:
+
+                    grp1
+                    /  \
+                  grp2 grp3
+                  /  \
+               grp4 grp5
+
+All groups have the same weight of 500, and only grp3, grp4 and grp5 are
+contending for IO bandwidth. If flat bandwidth division is in use, grp3,
+grp4 and grp5 will share the bandwidth equally, that is 33.3% each. If
+hierarchical bandwidth division is in use, grp4 and grp5 will each get 25%
+of the bandwidth, and grp3 will get the remaining 50%.
+
 HOWTO
 =====
 Proportional Weight division of bandwidth
@@ -142,6 +157,12 @@ Proportional weight policy files
 	  dev     weight
 	  8:16    300
 
+- blkio.use_hierarchy
+	- If this interface is set, hierarchical bandwidth division is enabled.
+	  Otherwise, flat bandwidth division is enabled. Currently this
+	  interface only shows up in the root cgroup, and can only be changed
+	  while there are no child cgroups.
+
 - blkio.time
 	- disk time allocated to cgroup per device in milliseconds. First
 	  two fields specify the major and minor number of the device and
-- 1.6.5.2 


* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-21  2:34 ` [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support Gui Jianfeng
@ 2010-10-22 20:54   ` Vivek Goyal
  2010-10-22 21:11     ` Vivek Goyal
  2010-10-25  2:48     ` Gui Jianfeng
  0 siblings, 2 replies; 12+ messages in thread
From: Vivek Goyal @ 2010-10-22 20:54 UTC
  To: Gui Jianfeng
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

On Thu, Oct 21, 2010 at 10:34:49AM +0800, Gui Jianfeng wrote:
> This patch enables cfq group hierarchical scheduling.
> 
> With this patch, you can create cgroup directories deeper than level 1.
> I/O bandwidth is now distributed in a hierarchical way. For example,
> we create cgroup directories as follows (the numbers represent weights):
> 
>             Root grp
>            /       \
>        grp_1(100) grp_2(400)
>        /    \ 
>   grp_3(200) grp_4(300)
> 
> If grp_2, grp_3 and grp_4 are contending for I/O bandwidth,
> grp_2 will get 80% of the total bandwidth.
> For the sub-groups, grp_3 gets 8% (20% * 40%) and grp_4 gets 12% (20% * 60%).
> 
> Design:
>   o Each cfq group has its own group service tree.
>   o Each cfq group contains a "group schedule entity" (gse) that
>     schedules on its parent cfq group's service tree.
>   o Each cfq group contains a "queue schedule entity" (qse) that
>     represents all cfqqs located in this cfq group. It schedules
>     on this group's service tree. For the time being, the root group
>     qse's weight is 1000, and a subgroup qse's weight is 500.
>   o All gses and the qse that belong to the same cfq group are
>     scheduled on the same group service tree.
>   o cfq groups are allocated recursively: when a cfq group needs to
>     be allocated, its upper-level cfq groups are allocated as well.
>   o When a cfq group is served, not only this cfq group but also its
>     ancestors are charged.

Gui,

I have not been able to convince myself yet that not treating queue at
same level as group is a better idea than treating queue at the same
level as group. 

I am again trying to put my thoughts together about why I am not convinced.

- I really don't like the idea of a hidden group and assumptions about the
  weight of this group, which the user does not know and can't control.
 
- Secondly I think that both the following use cases are valid use cases.


  case 1:
  -------
			  root
			 / | \
			q1 q2 G1
			      / \
			     q3  q4	 

 In this case queues and groups are treated at the same level, and group G1's
 share changes dynamically based on the number of competing queues. Assume
 the system admin has put all of one user's tasks in G1, and the default
 weight of G1 is 500; then the admin might really want to keep G1's share
 dynamic, so that if root is not doing lots of IO (not many threads), G1
 gets more IO done, but if IO activity in the root threads increases, G1
 gets less share.

 case 2:
 -------  
 The second case is where one wants a more deterministic share of a
 group and does not want that share to change based on the number of
 processes. In that case one can simply create a child group and move
 all root threads inside that group.

			  root
			   |  \
		  root-threads G1
			/ \    /\
		       q1 q2  q3 q4

 So if we design in such a way that we treat queues at the same level as
 groups, then we are not binding the user to a specific case. Case 1 will
 be the default in hierarchical mode and the user can easily achieve case 2,
 instead of the kernel implementation locking the user into case 2 by
 default and assuming nobody is going to use case 1.

 IOW, treating queues at group level provides more flexibility.

- Treating queues at the same level as groups will also help us better handle
  the case of RT threads. Think of the following.

			  root
			  |   \
			q1(RT) G1
			      / \
			     q3  q4	 

 In this case q1 is in the real time prio class. Now if we treat queues at
 the same level as groups, then we can try to give 100% of disk time to q1.
 But with a hardcoded hidden group, covering such cases will be hard.

- Other examples in the kernel (the CFS scheduler) already treat queues at
  the same level as groups. So unless we have a good reason, we should
  remain consistent.

- If we try to draw an analogy from other subsystems like virtual machines:
  the cpu weight of a KVM machine is decided by the native threads
  created on the host (logical cpus), not by how many threads are running
  inside the guest. And the share of these logical cpu threads varies
  dynamically based on how many other threads are running on the system.

  In the simple case of 1 logical cpu, we will create 1 thread; say there
  are 10 processes running inside the guest, then effectively the shares of
  these 10 processes change dynamically based on how many threads are running.

So I am not yet convinced that we should take the hidden group approach.

Now coming to the question of how to resolve the conflict with the cfq queue
scheduling algorithm, can we do the following?

- Give some kind of boost to queue entities based on their weight. So when
  queue and group entities are hanging on a service tree, they are
  scheduled according to their vdisktime, and vdisktime is calculated
  based on the entity's weight and how much time the entity just spent
  on the disk.

  Group entities can continue to follow the existing method and we can try
  to reduce the vdisktime of queue entities a bit based on their priority.

  That way, one might see some service differentiation between the ioprios
  of queues, and the relative share between groups does not change.
  The only problematic part is that when queues and groups are at the same
  level, it is not very predictable how much share a group gets and how
  much the queues get. But I guess this is less of a problem than the
  hidden group approach.
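
  To make the boost idea above concrete, a rough standalone sketch (the
  constants and the exact shape are arbitrary assumptions, not a worked-out
  design):

#include <stdio.h>

static unsigned long long
queue_vdisktime(unsigned long long vtime, unsigned long served,
		unsigned int weight, int ioprio)
{
	/* Weight-scaled charge, as for group entities... */
	unsigned long long charge = (unsigned long long)served * 500 / weight;
	/* ...minus an ioprio-based boost for queue entities. */
	unsigned long long boost = charge * (7 - ioprio) / 16;

	return vtime + charge - boost;
}

int main(void)
{
	/* Same weight and service, different ioprio: the lower ioprio
	 * value ends up with the lower vdisktime, so it runs sooner. */
	printf("ioprio 0: %llu\n", queue_vdisktime(0, 8, 500, 0));
	printf("ioprio 7: %llu\n", queue_vdisktime(0, 8, 500, 7));
	return 0;
}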

Thoughts?

Thanks
Vivek


* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-22 20:54   ` Vivek Goyal
@ 2010-10-22 21:11     ` Vivek Goyal
  2010-10-25  2:48     ` Gui Jianfeng
  1 sibling, 0 replies; 12+ messages in thread
From: Vivek Goyal @ 2010-10-22 21:11 UTC
  To: Gui Jianfeng
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

On Fri, Oct 22, 2010 at 04:54:37PM -0400, Vivek Goyal wrote:
> On Thu, Oct 21, 2010 at 10:34:49AM +0800, Gui Jianfeng wrote:
> > This patch enables cfq group hierarchical scheduling.
> > 
> > With this patch, you can create cgroup directories deeper than level 1.
> > I/O bandwidth is now distributed in a hierarchical way. For example,
> > we create cgroup directories as follows (the numbers represent weights):
> > 
> >             Root grp
> >            /       \
> >        grp_1(100) grp_2(400)
> >        /    \ 
> >   grp_3(200) grp_4(300)
> > 
> > If grp_2, grp_3 and grp_4 are contending for I/O bandwidth,
> > grp_2 will get 80% of the total bandwidth.
> > For the sub-groups, grp_3 gets 8% (20% * 40%) and grp_4 gets 12% (20% * 60%).
> > 
> > Design:
> >   o Each cfq group has its own group service tree.
> >   o Each cfq group contains a "group schedule entity" (gse) that
> >     schedules on its parent cfq group's service tree.
> >   o Each cfq group contains a "queue schedule entity" (qse) that
> >     represents all cfqqs located in this cfq group. It schedules
> >     on this group's service tree. For the time being, the root group
> >     qse's weight is 1000, and a subgroup qse's weight is 500.
> >   o All gses and the qse that belong to the same cfq group are
> >     scheduled on the same group service tree.
> >   o cfq groups are allocated recursively: when a cfq group needs to
> >     be allocated, its upper-level cfq groups are allocated as well.
> >   o When a cfq group is served, not only this cfq group but also its
> >     ancestors are charged.
> 
> Gui,
> 
> I have not been able to convince myself yet that not treating queue at
> same level as group is a better idea than treating queue at the same
> level as group. 
> 
> I am again trying to put my thoughts together that why I am not convinced.
> 
> - I really don't like the idea of hidden group and assumptions about the
>   weight of this group which user does not know or user can't control.
>  
> - Secondly I think that both the following use cases are valid use cases.
> 
> 
>   case 1:
>   -------
> 			  root
> 			 / | \
> 			q1 q2 G1
> 			      / \
> 			     q3  q4	 
> 
>  In this case queues and group are treated at same level, and group G1's
>  share changes dynamically based on number of competiting queues. Assume
>  system admin has put one user's all tasks in G1, and default weight of G1
>  is 500, then admin might really want to keep G1's share dyanmic, so that
>  if root is not doing lots of IO (not many thread), then G1 gets more IO
>  done but if IO activity in root threads increases then G1 gets less
>  share. 
> 
>  case 2:
>  -------  
>  The second case is where one wants a more deterministic share of a
>  group and does not want that share to change based on number of
>  processes. In that case one can simply create a child group and move
>  all root threads inside that group.
> 
> 			  root
> 			   |  \
> 		  root-threads G1
> 			/ \    /\
> 		       q1 q2  q3 q4
> 
>  So if we design in such a way so that we treat queues at same level as
>  group, then we are not bounding user to a specific case. case 1, will
>  be default in hierarchical mode and user can easily achieve case 2. Instead
>  of locking down user to case 2 by default from kernel implementation and
>  assume nobody is going to use case 1.
> 
>  IOW, treating queues at group level provides more flexibility.
> 
> - Treating queues at same level as groups will also help us better handle
>   the case of RT threads. Think of following.
> 
> 			  root
> 			  |   \
> 			q1(RT) G1
> 			      / \
> 			     q3  q4	 
> 
>  In this case q1 is real time prio class. Now if we treat queue at same
>  level group, then we can try to give 100% IO disk time to q1. But with
>  hardcoding of hidden group, covering such cases will be hard.
> 
> - Other examples in kernel (CFS scheduler) already treat queue at same
>   level at group. So until and unless we have a good reason, we should
>   remain consistent. 
> 
> - If we try to draw analogy from other subsystems like virtual machine,
>   where weight of a KVM machine on cpu is decided by native threads
>   created on host (logical cpus) and not by how many threads are running
>   inside the guest. And share of these logical cpu threads varies
>   dynamically based on how many other threads are running on system.
> 
>   In a simple case of 1 logical cpu, we will create 1 thread and say there
>   are 10 processes running inside guest, then effectively shares of these
>   10 processes changes dynamically based on how many threads are running. 
> 
> So I am not yet convinced that we should take the hidden group approach.
> 
> Now coming to the question of how to resolve conflict with the cfqq queue
> scheduling algorithm. Can we do following.
> 
> - Give some kind of boost to queue entities based on their weight. So when
>   queue and group entities are hanging on a service tree, they are
>   scheduled according to their vdisktime, and vdisktime is calculated
>   based on entitie's weight and how much time entity spent on disk just
>   now.
> 
>   Group entities can continue to follow existing method and we can try
>   to reduce the vdisktime of queue entities a bit based on their priority.
> 
>   That way, one might see some service differentiation between ioprio
>   of queues and also the relative share between groups does not change.
>   The only problematic part is that when queue and groups are at same
>   level then it is not very predictable that group gets how much share
>   and queues get how much share. But I guess this is lesser of a problem
>   as compared to hidden group approach.
> 

Thinking more about it, I guess we can give a boost to vdisktime for newly
queued entities, and not for entities which have just consumed their slice
and are being put back on the service tree. That way I think we should be
able to come close to the logic of CFQ, where service differentiation
between cfqqs of different ioprio can be created in select cases even if
idling is disabled.
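As a sketch of that insertion path (untested; the helper names and the
boost fraction are assumptions made up for illustration):

typedef unsigned long long u64;

struct entity {
	u64 vdisktime;
	int ioprio;		/* 0 (highest) .. 7 (lowest) */
};

/* A fraction of one base slice; a bigger boost for higher priority. */
static u64 prio_boost(u64 base_slice, int ioprio)
{
	return base_slice * (7 - ioprio) / 8;
}

/*
 * Place an entity on the service tree. A requeued entity, i.e. one that
 * just consumed its slice, goes to the end of the tree as before, so the
 * fairness logic is untouched; only a newly queued entity jumps ahead a
 * little, according to its ioprio.
 */
static void service_tree_add(struct entity *e, u64 st_max_vdisktime,
			     u64 base_slice, int newly_queued)
{
	u64 key = st_max_vdisktime;

	if (newly_queued) {
		u64 boost = prio_boost(base_slice, e->ioprio);

		if (key > boost)	/* do not wrap below zero */
			key -= boost;
	}
	e->vdisktime = key;
	/* ... then insert into the rbtree keyed by vdisktime ... */
}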

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-22 20:54   ` Vivek Goyal
  2010-10-22 21:11     ` Vivek Goyal
@ 2010-10-25  2:48     ` Gui Jianfeng
  2010-10-25 20:20       ` Vivek Goyal
  1 sibling, 1 reply; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-25  2:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

Vivek Goyal wrote:
> On Thu, Oct 21, 2010 at 10:34:49AM +0800, Gui Jianfeng wrote:
[..]
> So I am not yet convinced that we should take the hidden group approach.

Hi Vivek,

In short, all of these problems arise because of the fixed-weight "hidden
group". So how about making the hidden group's weight dynamic, according
to the number and priority of the cfqqs in it? Alternatively, we could
export a new user interface so that the user can configure the hidden
group.
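Purely to illustrate the dynamic-weight idea, a sketch could look like the
following; the fields and the formula are invented for discussion and are
not taken from the patches.

struct cfq_group_sketch {
	unsigned int nr_active_cfqqs;	/* busy cfqqs in this group */
	unsigned int sum_cfqq_weight;	/* sum of ioprio-derived weights */
};

/*
 * Dynamic weight for the hidden queue scheduling entity: track the
 * aggregate weight of the cfqqs it represents instead of a fixed 500,
 * clamped so that one group's queues cannot dominate arbitrarily.
 */
static unsigned int qse_weight(const struct cfq_group_sketch *cfqg)
{
	unsigned int w;

	if (!cfqg->nr_active_cfqqs)
		return 500;		/* fixed default when idle */

	w = cfqg->sum_cfqq_weight;
	if (w > 2000)
		w = 2000;
	return w;
}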

> 
> Now coming to the question of how to resolve the conflict with the cfqq
> scheduling algorithm: can we do the following?
> 
> - Give some kind of boost to queue entities based on their weight. So when
>   queue and group entities are hanging on a service tree, they are
>   scheduled according to their vdisktime, and vdisktime is calculated
>   based on an entity's weight and how much time the entity just spent on
>   the disk.

[..]

> Thoughts?

Do you mean letting cfqqs and cfq groups schedule on the same service
tree? If we choose a cfq queue, we let it run; if we choose a cfq group,
we continue by choosing a cfq queue inside that group.
If that's the case, I think the original CFQ logic is broken.
Am I missing something?

Thanks
Gui

> 
> Thanks
> Vivek
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-25  2:48     ` Gui Jianfeng
@ 2010-10-25 20:20       ` Vivek Goyal
  2010-10-26  2:15         ` Gui Jianfeng
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-10-25 20:20 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

On Mon, Oct 25, 2010 at 10:48:30AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
[..]
> 
> Hi Vivek,
> 
> In short, all of these problems arise because of the fixed-weight "hidden
> group". So how about making the hidden group's weight dynamic, according
> to the number and priority of the cfqqs in it? Alternatively, we could
> export a new user interface so that the user can configure the hidden
> group.

Gui,

Even if you do that, it will still not solve the problem of an RT thread
in the root group getting all of the disk time.

Secondly, somehow the idea of a hidden group is just not appealing to me,
and trying to expose it to the user would make it even uglier.

I guess that, without going into implementation details, we need to first
figure out what the right thing to do is from a design perspective, and
only later dive into the complexities involved in doing it.

[..]

> Do you mean letting cfqqs and cfq groups schedule on the same service
> tree? If we choose a cfq queue, we let it run; if we choose a cfq group,
> we continue by choosing a cfq queue inside that group.
> If that's the case, I think the original CFQ logic is broken.
> Am I missing something?
> 

Can you give more details about what's broken in running CFQ queues and
CFQ groups on the same service tree?

To me, the only thing that is broken is how to give a higher disk share
to a higher-priority queue when idling is disabled. In that case we don't
idle on the queue; after a request is dispatched the queue is deleted
from the service tree, and when a new request comes in the queue is put
at the end of the service tree (like the other entities). This happens
with queues of all priorities, so the priority difference between queues
is lost.

Currently we put all new queues at the end of the service tree. If we add
some logic to give a vdisktime boost based on priority to new queues, we
should be able to achieve a similar effect to current CFQ, shouldn't we?

Thanks
Vivek 

> Thanks
> Gui
> 
> > 
> > Thanks
> > Vivek
> > 
> > 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-25 20:20       ` Vivek Goyal
@ 2010-10-26  2:15         ` Gui Jianfeng
  2010-10-26 15:57           ` Vivek Goyal
  0 siblings, 1 reply; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-26  2:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

Vivek Goyal wrote:
> On Mon, Oct 25, 2010 at 10:48:30AM +0800, Gui Jianfeng wrote:
[..]

>> Hi Vivek,
>>
>> In short, all of these problems arise because of the fixed-weight "hidden
>> group". So how about making the hidden group's weight dynamic, according
>> to the number and priority of the cfqqs in it? Alternatively, we could
>> export a new user interface so that the user can configure the hidden
>> group.
> 
> Gui,
> 
> Even if you do that, it will still not solve the problem of an RT thread
> in the root group getting all of the disk time.

Hi Vivek,

Next step: could we enable an I/O class for cfq groups and export it
through the cgroup interface? That way, the hidden group could be boosted
to RT if there are RT cfqqs in it.

[..]

> Can you give more details about what's broken in running CFQ queues and
> CFQ groups on the same service tree?

[..]

> Currently we put all new queues at the end of the service tree. If we add
> some logic to give a vdisktime boost based on priority to new queues, we
> should be able to achieve a similar effect to current CFQ, shouldn't we?

I'm wondering: if CFQ queues and CFQ groups schedule on the same service
tree, how do we deal with workload type (SYNC, SYNC_NOIDLE, ASYNC) and
I/O class? Currently, cfqqs of different workload types and I/O classes
are put on different trees. If they all schedule on the same tree, we
can't differentiate them.

Gui

> 
> Thanks
> Vivek 
> 
>> Thanks
>> Gui
>>
>>> Thanks
>>> Vivek
>>>
>>>
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-26  2:15         ` Gui Jianfeng
@ 2010-10-26 15:57           ` Vivek Goyal
  2010-10-27  1:29             ` Gui Jianfeng
  0 siblings, 1 reply; 12+ messages in thread
From: Vivek Goyal @ 2010-10-26 15:57 UTC (permalink / raw)
  To: Gui Jianfeng
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

On Tue, Oct 26, 2010 at 10:15:09AM +0800, Gui Jianfeng wrote:

[..]
> >>> So I am not yet convinced that we should take the hidden group approach.
> >> Hi Vivek,
> >>
> >> In short, all of these problems arise because of the fixed-weight "hidden
> >> group". So how about making the hidden group's weight dynamic, according
> >> to the number and priority of the cfqqs in it? Alternatively, we could
> >> export a new user interface so that the user can configure the hidden
> >> group.
> > 
> > Gui,
> > 
> > Even if you do that, it will still not solve the problem of an RT thread
> > in the root group getting all of the disk time.
> 
> Hi Vivek,
> 
> Next step: could we enable an I/O class for cfq groups and export it
> through the cgroup interface? That way, the hidden group could be boosted
> to RT if there are RT cfqqs in it.

We can/should export the notion of an RT group at some point, but again,
doing this for a hidden group does not appeal to me.

Another analogy I was thinking about is files and directories in a
filesystem directory. There, files and directories exist at the same
level. It is not the case that files live inside some hidden group which
enforces additional policies on them and whose policy one can control.
That kind of sounds odd to me...

[..]
> I'm wondering: if CFQ queues and CFQ groups schedule on the same service
> tree, how do we deal with workload type (SYNC, SYNC_NOIDLE, ASYNC) and
> I/O class?

For queues, we continue to derive the I/O class and workload type as
usual. For group entities, as a first step, we can assume them to be of
class BE and probably put them on the SYNC tree. Later we can introduce
priority classes for groups as well, so that a user can specify an RT,
BE or IDLE class for a group.

Regarding the workload type of a group, it is a tricky business. Again,
because we idle on the group, and the SYNC tree contains the entities
that we idle on, it might be the right place to put group entities along
with the SYNC cfqqs.

> Currently, cfqqs of different workload types and I/O classes are put on
> different trees. If they all schedule on the same tree, we can't
> differentiate them.

I meant that we will continue to have multiple service trees per group,
and cfq queue entities will continue to go onto their respective service
trees; group entities, I think, we can assume to be of type SYNC. If that
becomes a problem, maybe we can later add some logic to determine the
nature of the overall traffic in a group and classify it as either SYNC
or SYNC-NOIDLE, etc.
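As a sketch of what that selection could look like (the enums mirror the
class/workload split described above, but every name here and the group
handling itself are assumptions):

enum wl_class { IDLE_WL, BE_WL, RT_WL };
enum wl_type { ASYNC_WL, SYNC_NOIDLE_WL, SYNC_WL };

struct service_tree;			/* opaque rbtree of entities */

struct entity {
	int is_group;
	enum wl_class cls;		/* derived as usual for queues */
	enum wl_type type;		/* derived as usual for queues */
};

/* Each group keeps one service tree per (class, workload type) pair. */
struct group {
	struct service_tree *st[3][3];
};

/*
 * Queue entities keep going to the tree matching their own class and
 * workload type; group entities are assumed BE/SYNC for now, since we
 * idle on groups just as we do on SYNC queues.
 */
static struct service_tree *pick_tree(struct group *parent,
				      struct entity *e)
{
	if (e->is_group)
		return parent->st[BE_WL][SYNC_WL];

	return parent->st[e->cls][e->type];
}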

Thanks
Vivek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support
  2010-10-26 15:57           ` Vivek Goyal
@ 2010-10-27  1:29             ` Gui Jianfeng
  0 siblings, 0 replies; 12+ messages in thread
From: Gui Jianfeng @ 2010-10-27  1:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Nauman Rafique, Chad Talbott, Divyesh Shah,
	linux kernel mailing list

Vivek Goyal wrote:
> On Tue, Oct 26, 2010 at 10:15:09AM +0800, Gui Jianfeng wrote:
> 
[..]

>> Next step: could we enable an I/O class for cfq groups and export it
>> through the cgroup interface? That way, the hidden group could be boosted
>> to RT if there are RT cfqqs in it.
> 
> We can/should export the notion of an RT group at some point, but again,
> doing this for a hidden group does not appeal to me.

OK...

> 
> Another analogy I was thinking about is files and directories in a
> filesystem directory. There, files and directories exist at the same
> level. It is not the case that files live inside some hidden group which
> enforces additional policies on them and whose policy one can control.
> That kind of sounds odd to me...
> 
[..]

>> Currently, cfqqs of different workload types and I/O classes are put on
>> different trees. If they all schedule on the same tree, we can't
>> differentiate them.
> 
> I meant that we will continue to have multiple service trees per group,
> and cfq queue entities will continue to go onto their respective service
> trees; group entities, I think, we can assume to be of type SYNC. If that
> becomes a problem, maybe we can later add some logic to determine the
> nature of the overall traffic in a group and classify it as either SYNC
> or SYNC-NOIDLE, etc.

OK, I will think about it.

Thanks
Gui

> 
> Thanks
> Vivek
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-10-27  1:29 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-21  2:32 [RFC] [PATCH 0/4] cfq-iosched: Enable hierarchical cfq group scheduling and add use_hierarchy interface Gui Jianfeng
2010-10-21  2:34 ` [PATCH 1/4 v2] cfq-iosched: add cfq group hierarchical scheduling support Gui Jianfeng
2010-10-22 20:54   ` Vivek Goyal
2010-10-22 21:11     ` Vivek Goyal
2010-10-25  2:48     ` Gui Jianfeng
2010-10-25 20:20       ` Vivek Goyal
2010-10-26  2:15         ` Gui Jianfeng
2010-10-26 15:57           ` Vivek Goyal
2010-10-27  1:29             ` Gui Jianfeng
2010-10-21  2:36 ` [PATCH 2/4] blkio-cgroup: Add a new interface use_hierarchy Gui Jianfeng
2010-10-21  2:36 ` [PATCH 3/4] cfq-iosched: Enable both hierarchical mode and flat mode for cfq group scheduling Gui Jianfeng
2010-10-21  2:37 ` [PATCH 4/4] blkio-cgroup: Documents for use_hierarchy interface Gui Jianfeng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).