* Re: [RFC] [PATCH 3/8] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
From: Gui Jianfeng @ 2010-11-29 2:32 UTC
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Sun, Nov 14, 2010 at 04:24:56PM +0800, Gui Jianfeng wrote:
>> Introduce vdisktime and io weight for CFQ queue scheduling. Currently, io priority
>> maps to an io weight in the range [100,1000]. It also gets rid of the cfq_slice_offset()
>> logic and uses the same scheduling algorithm as CFQ groups do. This helps with
>> scheduling CFQ queues and groups on the same service tree.
>>
>
> Gui,
>
> I think we can't get rid of the cfq_slice_offset() logic altogether, because
> I believe this piece can help provide some service differentiation between
> queues on SSDs, or when idling is not enabled. Though that service
> differentiation is highly unpredictable and becomes even less visible
> when NCQ is enabled.
>
> So we shall have to replace it with some similar logic. When a new queue
> entity gets backlogged on a service tree, give it some jump in vdisktime
> based on ioprio. Lower ioprio gets a higher vdisktime jump, etc.
Vivek,
Ok, I'll consider your suggestion.
>
> To test this, I would say take an SSD, set the queue depth to 1, and
> then run a bunch of threads with different ioprios. First see whether you
> get any service differentiation without the patch, and then run it again
> with your patch applied for comparison.
Ok
>
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 194 ++++++++++++++++++++++++++++++++++----------------
>> 1 files changed, 132 insertions(+), 62 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 5cce1e8..ef88931 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -102,10 +102,7 @@ struct io_sched_entity {
>> struct cfq_rb_root *service_tree;
>> /* service_tree member */
>> struct rb_node rb_node;
>> - /* service_tree key, represent the position on the tree */
>> - unsigned long rb_key;
>> -
>> - /* group service_tree key */
>> + /* service_tree key */
>> u64 vdisktime;
>> bool on_st;
>> bool is_group_entity;
>> @@ -118,6 +115,8 @@ struct io_sched_entity {
>> struct cfq_queue {
>> /* The schedule entity */
>> struct io_sched_entity queue_entity;
>> + /* Reposition time */
>> + unsigned long reposition_time;
>> /* reference count */
>> atomic_t ref;
>> /* various state flags, see below */
>> @@ -306,6 +305,22 @@ struct cfq_data {
>> struct rcu_head rcu;
>> };
>>
>> +/*
>> + * Map io priority(7 ~ 0) to io weight(100 ~ 1000)
>> + */
>> +static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
>> +{
>> + unsigned int step;
>> +
>> + BUG_ON(ioprio >= IOPRIO_BE_NR);
>> +
>> + step = (BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN) / (IOPRIO_BE_NR - 1);
>> + if (ioprio == 0)
>> + return BLKIO_WEIGHT_MAX;
>> +
>> + return BLKIO_WEIGHT_MIN + (IOPRIO_BE_NR - ioprio - 1) * step;
>> +}
>> +
>
> What's the rationale behind the above formula? How does it map prio to
> weights?
This formula maps ioprio to weight as follows:
prio 0 1 2 3 4 5 6 7
weight 1000 868 740 612 484 356 228 100 (new prio to weight mapping)
>
> Could we just do following.
>
> step = BLKIO_WEIGHT_MAX/IOPRIO_BE_NR
>
> return BLKIO_WEIGHT_MAX - (ioprio * step)
>
> above should map prio to weights as follows.
>
> slice 180 160 140 120 100 80 60 40 (old prio to slice mapping)
> prio 0 1 2 3 4 5 6 7
> weight 1000 875 750 625 500 375 250 125 (new prio to weight mapping)
>
>> static inline struct cfq_queue *
>> cfqq_of_entity(struct io_sched_entity *io_entity)
>> {
>> @@ -551,12 +566,13 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
>> }
>>
>> -static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
>> +static inline u64
>> +cfq_scale_slice(unsigned long delta, struct io_sched_entity *entity)
>> {
>> u64 d = delta << CFQ_SERVICE_SHIFT;
>>
>> d = d * BLKIO_WEIGHT_DEFAULT;
>> - do_div(d, cfqg->group_entity.weight);
>> + do_div(d, entity->weight);
>
> This can go in previous patch?
Sure.
>
>> return d;
>
>> }
>>
>> @@ -581,16 +597,16 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> static void update_min_vdisktime(struct cfq_rb_root *st)
>> {
>> u64 vdisktime = st->min_vdisktime;
>> - struct io_sched_entity *group_entity;
>> + struct io_sched_entity *entity;
>>
>> if (st->active) {
>> - group_entity = rb_entry_entity(st->active);
>> - vdisktime = group_entity->vdisktime;
>> + entity = rb_entry_entity(st->active);
>> + vdisktime = entity->vdisktime;
>> }
>>
>> if (st->left) {
>> - group_entity = rb_entry_entity(st->left);
>> - vdisktime = min_vdisktime(vdisktime, group_entity->vdisktime);
>> + entity = rb_entry_entity(st->left);
>> + vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
>> }
>>
>
> Can go in previous patch?
Sure.
>
>> st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
>> @@ -838,16 +854,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> return cfq_choose_req(cfqd, next, prev, blk_rq_pos(last));
>> }
>>
>> -static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
>> - struct cfq_queue *cfqq)
>> -{
>> - /*
>> - * just an approximation, should be ok.
>> - */
>> - return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
>> - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
>> -}
>> -
>> static inline s64
>> entity_key(struct cfq_rb_root *st, struct io_sched_entity *entity)
>> {
>> @@ -983,7 +989,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>
>> /* Can't update vdisktime while group is on service tree */
>> cfq_rb_erase(&group_entity->rb_node, st);
>> - group_entity->vdisktime += cfq_scale_slice(charge, cfqg);
>> + group_entity->vdisktime += cfq_scale_slice(charge, group_entity);
>
> can go in previous patch?
Sure.
>
> [..]
>>
>> +/*
>> + * The time when a CFQ queue is put onto a service tree is recoreded in
>> + * cfqq->reposition_time. Currently, we check the first priority CFQ queues
>> + * on each service tree, and select the workload type that contain the lowest
>> + * reposition_time CFQ queue among them.
>> + */
>
> What is the rational behind reposition_time. Can you explain it a bit more
> that why do we need it. I can't figure it out yet.
In the original CFQ, rb_key is comparable across trees, so we select the one
with the lowest rb_key among the three workload trees. Now that vdisktime is
introduced, each tree maintains its own vdisktime, but we still want to choose
the highest-priority queue across trees. So, for the time being, I record a
reposition_time for each cfqq and, when selecting a new workload, pick the
one with the smallest reposition_time as the next workload to serve.
Gui
>
> Vivek
>
>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio)
>> {
>> struct io_sched_entity *queue_entity;
>> + struct cfq_queue *cfqq;
>> + unsigned long lowest_start_time;
>> int i;
>> - bool key_valid = false;
>> - unsigned long lowest_key = 0;
>> + bool time_valid = false;
>> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>>
>> + /*
>> + * TODO: We may take io priority into account when choosing a workload
>> + * type. But for the time being just make use of reposition_time only.
>> + */
>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>> - /* select the one with lowest rb_key */
>> queue_entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> + cfqq = cfqq_of_entity(queue_entity);
>> if (queue_entity &&
>> - (!key_valid ||
>> - time_before(queue_entity->rb_key, lowest_key))) {
>> - lowest_key = queue_entity->rb_key;
>> + (!time_valid ||
>> + cfqq->reposition_time < lowest_start_time)) {
>> + lowest_start_time = cfqq->reposition_time;
>> cur_best = i;
>> - key_valid = true;
>> + time_valid = true;
>> }
>
* Re: [RFC] [PATCH 4/8] cfq-iosched: Get rid of st->active
From: Gui Jianfeng @ 2010-11-29 2:34 UTC
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Sun, Nov 14, 2010 at 04:25:04PM +0800, Gui Jianfeng wrote:
>> When a cfq group is running, it won't be dequeued from the service tree, so
>> there's no need to store the active one in st->active. Just get rid of it.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>
> Hmm..., once I was also wondering if the ->active pointer is superfluous. Looks
> like it is, and st->left will always represent the element being served. I
> think this is more of a cleanup patch and can either be the first patch in the
> series, or you can post it independently.
Ok.
Gui
>
> Vivek
>
>> ---
>> block/cfq-iosched.c | 15 +--------------
>> 1 files changed, 1 insertions(+), 14 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index ef88931..ad577b5 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -88,7 +88,6 @@ struct cfq_rb_root {
>> unsigned count;
>> unsigned total_weight;
>> u64 min_vdisktime;
>> - struct rb_node *active;
>> };
>> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
>> .count = 0, .min_vdisktime = 0, }
>> @@ -599,11 +598,6 @@ static void update_min_vdisktime(struct cfq_rb_root *st)
>> u64 vdisktime = st->min_vdisktime;
>> struct io_sched_entity *entity;
>>
>> - if (st->active) {
>> - entity = rb_entry_entity(st->active);
>> - vdisktime = entity->vdisktime;
>> - }
>> -
>> if (st->left) {
>> entity = rb_entry_entity(st->left);
>> vdisktime = min_vdisktime(vdisktime, entity->vdisktime);
>> @@ -925,9 +919,6 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>>
>> - if (st->active == &group_entity->rb_node)
>> - st->active = NULL;
>> -
>> BUG_ON(cfqg->nr_cfqq < 1);
>> cfqg->nr_cfqq--;
>>
>> @@ -1130,7 +1121,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>> if (!atomic_dec_and_test(&cfqg->ref))
>> return;
>> for_each_cfqg_st(cfqg, i, j, st)
>> - BUG_ON(!RB_EMPTY_ROOT(&st->rb) || st->active != NULL);
>> + BUG_ON(!RB_EMPTY_ROOT(&st->rb));
>> kfree(cfqg);
>> }
>>
>> @@ -1773,9 +1764,6 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> if (cfqq == cfqd->active_queue)
>> cfqd->active_queue = NULL;
>>
>> - if (&cfqq->cfqg->group_entity.rb_node == cfqd->grp_service_tree.active)
>> - cfqd->grp_service_tree.active = NULL;
>> -
>> if (cfqd->active_cic) {
>> put_io_context(cfqd->active_cic->ioc);
>> cfqd->active_cic = NULL;
>> @@ -2305,7 +2293,6 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>> group_entity = cfq_rb_first_entity(st);
>> cfqg = cfqg_of_entity(group_entity);
>> BUG_ON(!cfqg);
>> - st->active = &group_entity->rb_node;
>> update_min_vdisktime(st);
>> return cfqg;
>> }
>> --
>> 1.6.5.2
>>
>>
>
* Re: [RFC] [PATCH 7/8] cfq-iosched: Enable deep hierarchy in CGgroup
From: Gui Jianfeng @ 2010-11-29 2:35 UTC
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Sun, Nov 14, 2010 at 04:25:42PM +0800, Gui Jianfeng wrote:
>> Currently, only level 0 and level 1 cgroup directories are allowed to be
>> created. This patch enables deep hierarchies in cgroups.
>>
>
> I had also posted this patch. Today Jens applied it to
> origin/for-2.6.38/rc2-holder, and after -rc2 it will move to 2.6.38/core. At
> that point you can drop this patch from the series.
>
Ok.
Gui
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/blk-cgroup.c | 4 ----
>> 1 files changed, 0 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index b1febd0..455768a 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -1452,10 +1452,6 @@ blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
>> goto done;
>> }
>>
>> - /* Currently we do not support hierarchy deeper than two level (0,1) */
>> - if (parent != cgroup->top_cgroup)
>> - return ERR_PTR(-EPERM);
>> -
>> blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
>> if (!blkcg)
>> return ERR_PTR(-ENOMEM);
>> --
>> 1.6.5.2
>>
>>
>>
>
* Re: [RFC] [PATCH 8/8] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
From: Gui Jianfeng @ 2010-11-29 2:42 UTC
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Sun, Nov 14, 2010 at 04:25:49PM +0800, Gui Jianfeng wrote:
>> This patch makes CFQ queue and CFQ group schedules at the same level.
>> Consider the following hierarchy:
>>
>> Root
>> / | \
>> q1 q2 G1
>> / \
>> q3 G2
>>
>> q1, q2 and q3 are CFQ queues; G1 and G2 are CFQ groups. Currently, q1, q2
>> and G1 are scheduled on the same service tree in the root CFQ group. q3 and G2
>> are scheduled under G1. Note that, for the time being, a CFQ group is treated
>> as a "BE and SYNC" workload and is put on the "BE and SYNC" service tree. That
>> means service differentiation only happens on the "BE and SYNC" service tree.
>> Later, we may introduce an "IO Class" for CFQ groups.
>>
>
> Have you got rid of flat mode (the existing default mode)? IOW, I don't see
> the introduction of "use_hierarchy", which would control whether to treat
> a hierarchy as flat or not.
Vivek,
As I said in [PATCH 0/8], yes, for the time being I got rid of flat mode.
Hierarchical cgroup creation has just been merged into the block tree and is
not in mainline yet, so I think it's ok to post the "use_hierarchy" patchset
separately once this patchset gets merged. What do you say, Vivek?
Gui
>
> Vivek
>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 483 ++++++++++++++++++++++++++++++++++-----------------
>> 1 files changed, 324 insertions(+), 159 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 1df0928..9def3a2 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -105,6 +105,9 @@ struct io_sched_entity {
>> u64 vdisktime;
>> bool is_group_entity;
>> unsigned int weight;
>> + struct io_sched_entity *parent;
>> + /* Reposition time */
>> + unsigned long reposition_time;
>> };
>>
>> /*
>> @@ -113,8 +116,6 @@ struct io_sched_entity {
>> struct cfq_queue {
>> /* The schedule entity */
>> struct io_sched_entity queue_entity;
>> - /* Reposition time */
>> - unsigned long reposition_time;
>> /* reference count */
>> atomic_t ref;
>> /* various state flags, see below */
>> @@ -193,6 +194,9 @@ struct cfq_group {
>> /* number of cfqq currently on this group */
>> int nr_cfqq;
>>
>> + /* number of sub cfq groups */
>> + int nr_subgp;
>> +
>> /* Per group busy queus average. Useful for workload slice calc. */
>> unsigned int busy_queues_avg[2];
>> /*
>> @@ -219,8 +223,6 @@ struct cfq_group {
>> */
>> struct cfq_data {
>> struct request_queue *queue;
>> - /* Root service tree for cfq_groups */
>> - struct cfq_rb_root grp_service_tree;
>> struct cfq_group root_group;
>>
>> /*
>> @@ -337,8 +339,6 @@ cfqg_of_entity(struct io_sched_entity *io_entity)
>> return NULL;
>> }
>>
>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>> -
>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>> enum wl_prio_t prio,
>> enum wl_type_t type)
>> @@ -629,10 +629,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
>> static inline unsigned
>> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>> + struct cfq_rb_root *st = group_entity->service_tree;
>>
>> - return cfq_target_latency * group_entity->weight / st->total_weight;
>> + if (st)
>> + return cfq_target_latency * group_entity->weight
>> + / st->total_weight;
>> + else
>> + /* If this is the root group, give it a full slice. */
>> + return cfq_target_latency;
>> }
>>
>> static inline void
>> @@ -795,17 +800,6 @@ static struct io_sched_entity *cfq_rb_first(struct cfq_rb_root *root)
>> return NULL;
>> }
>>
>> -static struct io_sched_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
>> -{
>> - if (!root->left)
>> - root->left = rb_first(&root->rb);
>> -
>> - if (root->left)
>> - return rb_entry_entity(root->left);
>> -
>> - return NULL;
>> -}
>> -
>> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
>> {
>> rb_erase(n, root);
>> @@ -887,6 +881,7 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
>> struct io_sched_entity *io_entity)
>> {
>> __io_entity_service_tree_add(st, io_entity);
>> + io_entity->reposition_time = jiffies;
>> st->count++;
>> st->total_weight += io_entity->weight;
>> }
>> @@ -894,29 +889,49 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
>> static void
>> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct rb_node *n;
>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>> - struct io_sched_entity *__group_entity;
>> + struct io_sched_entity *entity;
>> + struct cfq_rb_root *st;
>> + struct cfq_group *__cfqg;
>>
>> cfqg->nr_cfqq++;
>> +
>> + /*
>> + * Root group doesn't belongs to any service
>> + */
>> + if (cfqg == &cfqd->root_group)
>> + return;
>> +
>> if (!RB_EMPTY_NODE(&group_entity->rb_node))
>> return;
>>
>> - /*
>> - * Currently put the group at the end. Later implement something
>> - * so that groups get lesser vtime based on their weights, so that
>> - * if group does not loose all if it was not continously backlogged.
>> + /*
>> + * Enqueue this group and its ancestors onto their service tree.
>> */
>> - n = rb_last(&st->rb);
>> - if (n) {
>> - __group_entity = rb_entry_entity(n);
>> - group_entity->vdisktime = __group_entity->vdisktime +
>> - CFQ_IDLE_DELAY;
>> - } else
>> - group_entity->vdisktime = st->min_vdisktime;
>> + while (group_entity && group_entity->parent) {
>> + if (!RB_EMPTY_NODE(&group_entity->rb_node))
>> + return;
>> + /*
>> + * Currently put the group at the end. Later implement
>> + * something so that groups get lesser vtime based on their
>> + * weights, so that if group does not loose all if it was not
>> + * continously backlogged.
>> + */
>> + st = group_entity->service_tree;
>> + n = rb_last(&st->rb);
>> + if (n) {
>> + entity = rb_entry_entity(n);
>> + group_entity->vdisktime = entity->vdisktime +
>> + CFQ_IDLE_DELAY;
>> + } else
>> + group_entity->vdisktime = st->min_vdisktime;
>>
>> - io_entity_service_tree_add(st, group_entity);
>> + io_entity_service_tree_add(st, group_entity);
>> + group_entity = group_entity->parent;
>> + __cfqg = cfqg_of_entity(group_entity);
>> + __cfqg->nr_subgp++;
>> + }
>> }
>>
>> static void
>> @@ -933,27 +948,47 @@ io_entity_service_tree_del(struct cfq_rb_root *st,
>> if (!RB_EMPTY_NODE(&io_entity->rb_node)) {
>> __io_entity_service_tree_del(st, io_entity);
>> st->total_weight -= io_entity->weight;
>> - io_entity->service_tree = NULL;
>> }
>> }
>>
>> static void
>> cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>> + struct cfq_group *__cfqg, *p_cfqg;
>>
>> BUG_ON(cfqg->nr_cfqq < 1);
>> cfqg->nr_cfqq--;
>>
>> + /*
>> + * Root group doesn't belongs to any service
>> + */
>> + if (cfqg == &cfqd->root_group)
>> + return;
>> +
>> /* If there are other cfq queues under this group, don't delete it */
>> if (cfqg->nr_cfqq)
>> return;
>> -
>> - cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
>> - io_entity_service_tree_del(st, group_entity);
>> - cfqg->saved_workload_slice = 0;
>> - cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
>> + /* If child group exists, don't dequeue it */
>> + if (cfqg->nr_subgp)
>> + return;
>> +
>> + /*
>> + * Dequeue this group and its ancestors from their service tree.
>> + */
>> + while (group_entity && group_entity->parent) {
>> + __cfqg = cfqg_of_entity(group_entity);
>> + p_cfqg = cfqg_of_entity(group_entity->parent);
>> + io_entity_service_tree_del(group_entity->service_tree,
>> + group_entity);
>> + cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
>> + cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
>> + __cfqg->saved_workload_slice = 0;
>> + group_entity = group_entity->parent;
>> + p_cfqg->nr_subgp--;
>> + if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
>> + return;
>> + }
>> }
>>
>> static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>> @@ -985,7 +1020,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>> static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> struct cfq_queue *cfqq)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> unsigned int used_sl, charge;
>> int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
>> - cfqg->service_tree_idle.count;
>> @@ -999,10 +1033,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
>> charge = cfqq->allocated_slice;
>>
>> - /* Can't update vdisktime while group is on service tree */
>> - __io_entity_service_tree_del(st, group_entity);
>> - group_entity->vdisktime += cfq_scale_slice(charge, group_entity);
>> - __io_entity_service_tree_add(st, group_entity);
>> + /*
>> + * Update the vdisktime on the whole chain.
>> + */
>> + while (group_entity && group_entity->parent) {
>> + struct cfq_rb_root *st = group_entity->service_tree;
>> +
>> + /* Can't update vdisktime while group is on service tree */
>> + __io_entity_service_tree_del(st, group_entity);
>> + group_entity->vdisktime += cfq_scale_slice(charge,
>> + group_entity);
>> + __io_entity_service_tree_add(st, group_entity);
>> + st->count++;
>> + group_entity->reposition_time = jiffies;
>> + group_entity = group_entity->parent;
>> + }
>>
>> /* This group is being expired. Save the context */
>> if (time_after(cfqd->workload_expires, jiffies)) {
>> @@ -1014,7 +1059,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> cfqg->saved_workload_slice = 0;
>>
>> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
>> - group_entity->vdisktime, st->min_vdisktime);
>> + cfqg->group_entity.vdisktime,
>> + cfqg->group_entity.service_tree->min_vdisktime);
>> cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
>> " sect=%u", used_sl, cfqq->slice_dispatch, charge,
>> iops_mode(cfqd), cfqq->nr_sectors);
>> @@ -1036,35 +1082,27 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
>> cfqg_of_blkg(blkg)->group_entity.weight = weight;
>> }
>>
>> -static struct cfq_group *
>> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> +static void init_group_entity(struct blkio_cgroup *blkcg,
>> + struct cfq_group *cfqg)
>> +{
>> + struct io_sched_entity *group_entity = &cfqg->group_entity;
>> +
>> + group_entity->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>> + RB_CLEAR_NODE(&group_entity->rb_node);
>> + group_entity->is_group_entity = true;
>> + group_entity->parent = NULL;
>> +}
>> +
>> +static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
>> + struct cfq_group *cfqg)
>> {
>> - struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>> - struct cfq_group *cfqg = NULL;
>> - void *key = cfqd;
>> int i, j;
>> struct cfq_rb_root *st;
>> - struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>> unsigned int major, minor;
>> -
>> - cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> - if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> - cfqg->blkg.dev = MKDEV(major, minor);
>> - goto done;
>> - }
>> - if (cfqg || !create)
>> - goto done;
>> -
>> - cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>> - if (!cfqg)
>> - goto done;
>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>>
>> for_each_cfqg_st(cfqg, i, j, st)
>> *st = CFQ_RB_ROOT;
>> - RB_CLEAR_NODE(&cfqg->group_entity.rb_node);
>> -
>> - cfqg->group_entity.is_group_entity = true;
>>
>> /*
>> * Take the initial reference that will be released on destroy
>> @@ -1074,24 +1112,119 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> */
>> atomic_set(&cfqg->ref, 1);
>>
>> + /* Add group onto cgroup list */
>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> + cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>> + MKDEV(major, minor));
>> + /* Initiate group entity */
>> + init_group_entity(blkcg, cfqg);
>> + /* Add group on cfqd list */
>> + hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>> +}
>> +
>> +static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg);
>> +
>> +static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> +{
>> + if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
>> + cfq_destroy_cfqg(cfqd, cfqg);
>> +}
>> +
>> +static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
>> + struct cfq_group *p_cfqg)
>> +{
>> + struct io_sched_entity *group_entity, *p_group_entity;
>> +
>> + group_entity = &cfqg->group_entity;
>> +
>> + p_group_entity = &p_cfqg->group_entity;
>> +
>> + group_entity->parent = p_group_entity;
>> +
>> /*
>> - * Add group onto cgroup list. It might happen that bdi->dev is
>> - * not initiliazed yet. Initialize this new group without major
>> - * and minor info and this info will be filled in once a new thread
>> - * comes for IO. See code above.
>> + * Currently, just put cfq group entity on "BE:SYNC" workload
>> + * service tree.
>> */
>> - if (bdi->dev) {
>> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>> - MKDEV(major, minor));
>> - } else
>> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>> - 0);
>> + group_entity->service_tree = service_tree_for(p_cfqg, BE_WORKLOAD,
>> + SYNC_WORKLOAD);
>> + /* child reference */
>> + atomic_inc(&p_cfqg->ref);
>> +}
>>
>> - cfqg->group_entity.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>> +int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
>> +{
>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + struct blkio_cgroup *p_blkcg;
>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>> + unsigned int major, minor;
>> + struct cfq_group *cfqg, *p_cfqg;
>> + void *key = cfqd;
>> + int ret;
>>
>> - /* Add group on cfqd list */
>> - hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + if (cfqg) {
>> + if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> + cfqg->blkg.dev = MKDEV(major, minor);
>> + }
>> + /* chain has already been built */
>> + return 0;
>> + }
>> +
>> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>> + if (!cfqg)
>> + return -1;
>> +
>> + init_cfqg(cfqd, blkcg, cfqg);
>> +
>> + /* Already to the top group */
>> + if (!cgroup->parent)
>> + return 0;
>> +
>> + /* Allocate CFQ groups on the chain */
>> + ret = cfqg_chain_alloc(cfqd, cgroup->parent);
>> + if (ret == -1) {
>> + uninit_cfqg(cfqd, cfqg);
>> + return -1;
>> + }
>> +
>> + p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
>> + p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
>> + BUG_ON(p_cfqg == NULL);
>> +
>> + cfqg_set_parent(cfqd, cfqg, p_cfqg);
>> + return 0;
>> +}
>> +
>> +static struct cfq_group *
>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> +{
>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + struct cfq_group *cfqg = NULL;
>> + void *key = cfqd;
>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>> + unsigned int major, minor;
>> + int ret;
>> +
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> + cfqg->blkg.dev = MKDEV(major, minor);
>> + goto done;
>> + }
>> + if (cfqg || !create)
>> + goto done;
>> +
>> + /*
>> + * For hierarchical cfq group scheduling, we need to allocate
>> + * the whole cfq group chain.
>> + */
>> + ret = cfqg_chain_alloc(cfqd, cgroup);
>> + if (!ret) {
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + BUG_ON(cfqg == NULL);
>> + goto done;
>> + }
>>
>> done:
>> return cfqg;
>> @@ -1136,12 +1269,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>> {
>> struct cfq_rb_root *st;
>> int i, j;
>> + struct io_sched_entity *group_entity;
>> + struct cfq_group *p_cfqg;
>>
>> BUG_ON(atomic_read(&cfqg->ref) <= 0);
>> if (!atomic_dec_and_test(&cfqg->ref))
>> return;
>> for_each_cfqg_st(cfqg, i, j, st)
>> BUG_ON(!RB_EMPTY_ROOT(&st->rb));
>> +
>> + group_entity = &cfqg->group_entity;
>> + if (group_entity->parent) {
>> + p_cfqg = cfqg_of_entity(group_entity->parent);
>> + /* Drop the reference taken by children */
>> + atomic_dec(&p_cfqg->ref);
>> + }
>> +
>> kfree(cfqg);
>> }
>>
>> @@ -1336,7 +1479,6 @@ insert:
>> io_entity_service_tree_add(service_tree, queue_entity);
>>
>> update_min_vdisktime(service_tree);
>> - cfqq->reposition_time = jiffies;
>> if ((add_front || !new_cfqq) && !group_changed)
>> return;
>> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
>> @@ -1779,28 +1921,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>> return cfqq_of_entity(cfq_rb_first(service_tree));
>> }
>>
>> -static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
>> +static struct io_sched_entity *
>> +cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_group *cfqg;
>> - struct io_sched_entity *queue_entity;
>> + struct io_sched_entity *entity;
>> int i, j;
>> struct cfq_rb_root *st;
>>
>> if (!cfqd->rq_queued)
>> return NULL;
>>
>> - cfqg = cfq_get_next_cfqg(cfqd);
>> - if (!cfqg)
>> - return NULL;
>> -
>> for_each_cfqg_st(cfqg, i, j, st) {
>> - queue_entity = cfq_rb_first(st);
>> - if (queue_entity != NULL)
>> - return cfqq_of_entity(queue_entity);
>> + entity = cfq_rb_first(st);
>> +
>> + if (entity && !entity->is_group_entity)
>> + return entity;
>> + else if (entity && entity->is_group_entity) {
>> + cfqg = cfqg_of_entity(entity);
>> + return cfq_get_next_entity_forced(cfqd, cfqg);
>> + }
>> }
>> return NULL;
>> }
>>
>> +
>> /*
>> * Get and set a new active queue for service.
>> */
>> @@ -2155,8 +2299,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio)
>> {
>> - struct io_sched_entity *queue_entity;
>> - struct cfq_queue *cfqq;
>> + struct io_sched_entity *entity;
>> unsigned long lowest_start_time;
>> int i;
>> bool time_valid = false;
>> @@ -2167,12 +2310,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> * type. But for the time being just make use of reposition_time only.
>> */
>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>> - queue_entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> - cfqq = cfqq_of_entity(queue_entity);
>> - if (queue_entity &&
>> + entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> + if (entity &&
>> (!time_valid ||
>> - cfqq->reposition_time < lowest_start_time)) {
>> - lowest_start_time = cfqq->reposition_time;
>> + entity->reposition_time < lowest_start_time)) {
>> + lowest_start_time = entity->reposition_time;
>> cur_best = i;
>> time_valid = true;
>> }
>> @@ -2181,47 +2323,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> return cur_best;
>> }
>>
>> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> +static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> unsigned slice;
>> unsigned count;
>> struct cfq_rb_root *st;
>> unsigned group_slice;
>>
>> - if (!cfqg) {
>> - cfqd->serving_prio = IDLE_WORKLOAD;
>> - cfqd->workload_expires = jiffies + 1;
>> - return;
>> - }
>> -
>> - /* Choose next priority. RT > BE > IDLE */
>> - if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>> - cfqd->serving_prio = RT_WORKLOAD;
>> - else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>> - cfqd->serving_prio = BE_WORKLOAD;
>> - else {
>> - cfqd->serving_prio = IDLE_WORKLOAD;
>> - cfqd->workload_expires = jiffies + 1;
>> - return;
>> - }
>> -
>> - /*
>> - * For RT and BE, we have to choose also the type
>> - * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>> - * expiration time
>> - */
>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> - count = st->count;
>> -
>> - /*
>> - * check workload expiration, and that we still have other queues ready
>> - */
>> - if (count && !time_after(jiffies, cfqd->workload_expires))
>> - return;
>> -
>> - /* otherwise select new workload type */
>> - cfqd->serving_type =
>> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> count = st->count;
>>
>> @@ -2262,26 +2370,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> cfqd->workload_expires = jiffies + slice;
>> }
>>
>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> - struct cfq_group *cfqg;
>> - struct io_sched_entity *group_entity;
>> + struct cfq_rb_root *st;
>> + unsigned count;
>>
>> - if (RB_EMPTY_ROOT(&st->rb))
>> - return NULL;
>> - group_entity = cfq_rb_first_entity(st);
>> - cfqg = cfqg_of_entity(group_entity);
>> - BUG_ON(!cfqg);
>> - update_min_vdisktime(st);
>> - return cfqg;
>> + if (!cfqg) {
>> + cfqd->serving_prio = IDLE_WORKLOAD;
>> + cfqd->workload_expires = jiffies + 1;
>> + return;
>> + }
>> +
>> + /* Choose next priority. RT > BE > IDLE */
>> + if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>> + cfqd->serving_prio = RT_WORKLOAD;
>> + else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>> + cfqd->serving_prio = BE_WORKLOAD;
>> + else {
>> + cfqd->serving_prio = IDLE_WORKLOAD;
>> + cfqd->workload_expires = jiffies + 1;
>> + return;
>> + }
>> +
>> + /*
>> + * For RT and BE, we have to choose also the type
>> + * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>> + * expiration time
>> + */
>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> + count = st->count;
>> +
>> + /*
>> + * check workload expiration, and that we still have other queues ready
>> + */
>> + if (count && !time_after(jiffies, cfqd->workload_expires))
>> + return;
>> +
>> + /* otherwise select new workload type */
>> + cfqd->serving_type =
>> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>> }
>>
>> -static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> +struct io_sched_entity *choose_serving_entity(struct cfq_data *cfqd,
>> + struct cfq_group *cfqg)
>> {
>> - struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
>> -
>> - cfqd->serving_group = cfqg;
>> + struct cfq_rb_root *service_tree;
>>
>> /* Restore the workload type data */
>> if (cfqg->saved_workload_slice) {
>> @@ -2292,8 +2425,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> cfqd->workload_expires = jiffies - 1;
>>
>> choose_service_tree(cfqd, cfqg);
>> -}
>>
>> + service_tree = service_tree_for(cfqg, cfqd->serving_prio,
>> + cfqd->serving_type);
>> +
>> + if (!cfqd->rq_queued)
>> + return NULL;
>> +
>> + /* There is nothing to dispatch */
>> + if (!service_tree)
>> + return NULL;
>> + if (RB_EMPTY_ROOT(&service_tree->rb))
>> + return NULL;
>> +
>> + return cfq_rb_first(service_tree);
>> +}
>> /*
>> * Select a queue for service. If we have a current active queue,
>> * check whether to continue servicing it, or retrieve and set a new one.
>> @@ -2301,6 +2447,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>> {
>> struct cfq_queue *cfqq, *new_cfqq = NULL;
>> + struct cfq_group *cfqg;
>> + struct io_sched_entity *entity;
>>
>> cfqq = cfqd->active_queue;
>> if (!cfqq)
>> @@ -2389,8 +2537,23 @@ new_queue:
>> * Current queue expired. Check if we have to switch to a new
>> * service tree
>> */
>> - if (!new_cfqq)
>> - cfq_choose_cfqg(cfqd);
>> + cfqg = &cfqd->root_group;
>> +
>> + if (!new_cfqq) {
>> + do {
>> + entity = choose_serving_entity(cfqd, cfqg);
>> + if (entity && !entity->is_group_entity) {
>> + /* This is the CFQ queue that should run */
>> + new_cfqq = cfqq_of_entity(entity);
>> + cfqd->serving_group = cfqg;
>> + set_workload_expire(cfqd, cfqg);
>> + break;
>> + } else if (entity && entity->is_group_entity) {
>> + /* Continue to lookup in this CFQ group */
>> + cfqg = cfqg_of_entity(entity);
>> + }
>> + } while (entity && entity->is_group_entity);
>> + }
>>
>> cfqq = cfq_set_active_queue(cfqd, new_cfqq);
>> keep_queue:
>> @@ -2421,10 +2584,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>> {
>> struct cfq_queue *cfqq;
>> int dispatched = 0;
>> + struct io_sched_entity *entity;
>> + struct cfq_group *root = &cfqd->root_group;
>>
>> /* Expire the timeslice of the current active queue first */
>> cfq_slice_expired(cfqd, 0);
>> - while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
>> + while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
>> + BUG_ON(entity->is_group_entity);
>> + cfqq = cfqq_of_entity(entity);
>> __cfq_set_active_queue(cfqd, cfqq);
>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>> }
>> @@ -3954,9 +4121,6 @@ static void *cfq_init_queue(struct request_queue *q)
>>
>> cfqd->cic_index = i;
>>
>> - /* Init root service tree */
>> - cfqd->grp_service_tree = CFQ_RB_ROOT;
>> -
>> /* Init root group */
>> cfqg = &cfqd->root_group;
>> for_each_cfqg_st(cfqg, i, j, st)
>> @@ -3966,6 +4130,7 @@ static void *cfq_init_queue(struct request_queue *q)
>> /* Give preference to root group over other groups */
>> cfqg->group_entity.weight = 2*BLKIO_WEIGHT_DEFAULT;
>> cfqg->group_entity.is_group_entity = true;
>> + cfqg->group_entity.parent = NULL;
>>
>> #ifdef CONFIG_CFQ_GROUP_IOSCHED
>> /*
>> --
>> 1.6.5.2
>>
>>
>
--
Regards
Gui Jianfeng
* Re: [RFC] [PATCH 8/8] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-11-29 2:42 ` [RFC] [PATCH 8/8] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
@ 2010-11-29 14:31 ` Vivek Goyal
2010-11-30 1:15 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-11-29 14:31 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Nov 29, 2010 at 10:42:15AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Sun, Nov 14, 2010 at 04:25:49PM +0800, Gui Jianfeng wrote:
> >> This patch makes CFQ queue and CFQ group schedules at the same level.
> >> Consider the following hierarchy:
> >>
> >> Root
> >> / | \
> >> q1 q2 G1
> >> / \
> >> q3 G2
> >>
> >> q1, q2 and q3 are CFQ queues; G1 and G2 are CFQ groups. Currently, q1,
> >> q2 and G1 are scheduled on the same service tree in the Root CFQ group,
> >> and q3 and G2 are scheduled under G1. Note that, for the time being, a
> >> CFQ group is treated as a "BE and SYNC" workload and is put on the
> >> "BE and SYNC" service tree. That means service differentiation only
> >> happens on the "BE and SYNC" service tree. Later, we may introduce an
> >> "IO Class" for CFQ groups.
> >>
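To make the selection walk described above concrete, here is a minimal, self-contained C sketch (toy types and names of my own invention, not the kernel's structures) of the idea behind the patchset's `cfq_select_queue` loop and `cfq_get_next_entity_forced`: at each level pick the backlogged entity with the smallest vdisktime (the rbtree leftmost in the real code), and keep descending while that entity is a group:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for io_sched_entity; arrays replace the rbtree. */
struct entity {
    unsigned long long vdisktime;
    int is_group;
    struct entity **children;   /* valid only for groups */
    int nr_children;
    const char *name;
};

/* Pick the child with the smallest vdisktime, then descend while
 * we keep landing on group entities, until a queue is found. */
static struct entity *select_queue(struct entity *grp)
{
    while (grp) {
        struct entity *best = NULL;
        for (int i = 0; i < grp->nr_children; i++) {
            struct entity *e = grp->children[i];
            if (!best || e->vdisktime < best->vdisktime)
                best = e;
        }
        if (!best || !best->is_group)
            return best;        /* a queue, or NULL if tree is empty */
        grp = best;             /* recurse into the group */
    }
    return NULL;
}
```

With the Root/q1/q2/G1/q3 hierarchy above, if G1 holds the smallest vdisktime at the top level, the walk descends into G1 and returns q3 rather than q1 or q2.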
> >
> > Have you gotten rid of flat mode (the existing default mode)? IOW, I don't
> > see the introduction of "use_hierarchy", which would control whether to
> > treat a hierarchy as flat or not.
>
> Vivek,
>
> As I said in [PATCH 0/8], yes, for the time being I got rid of flat mode.
> Hierarchical cgroup creation has only just been merged into block-tree and
> is not in Mainline yet, so I think it's OK to post a "use_hierarchy"
> patchset separately once this patchset gets merged. What do you say, Vivek?
But even a single level of group creation can be either flat or
hierarchical. Think of two cgroups, test1 and test2, created under root: in
the flat and hierarchical schemes they will look different.
So I don't think we can drop flat mode for the time being. We need to
include use_hierarchy support along with this patch series.
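Vivek's point can be put in rough numbers. The arithmetic below is my own illustration, assuming the default blkio weight of 500, the 2*BLKIO_WEIGHT_DEFAULT root-group preference this patchset sets, and a single root-cgroup queue whose ioprio happens to map to weight 500; in flat mode the root group competes with test1/test2 on the top-level tree, while in this patchset's hierarchical mode the root queue itself sits on the same service tree as the two groups:

```c
#include <assert.h>

/* Share of disk time, in permille, for an entity of weight w on a
 * service tree whose weights sum to total. */
static int share_permille(int w, int total)
{
    return w * 1000 / total;
}
```

Flat gives the root queue 1000/2000 = 50.0% of the disk (it owns the whole root-group share), while the hierarchical scheme gives it 500/1500 = 33.3% — a different result even with one level of cgroups, which is why flat mode can't simply be dropped.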
Thanks
Vivek
>
> Gui
>
> >
> > Vivek
> >
> >> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> >> ---
> >> block/cfq-iosched.c | 483 ++++++++++++++++++++++++++++++++++-----------------
> >> 1 files changed, 324 insertions(+), 159 deletions(-)
> >>
> >> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> >> index 1df0928..9def3a2 100644
> >> --- a/block/cfq-iosched.c
> >> +++ b/block/cfq-iosched.c
> >> @@ -105,6 +105,9 @@ struct io_sched_entity {
> >> u64 vdisktime;
> >> bool is_group_entity;
> >> unsigned int weight;
> >> + struct io_sched_entity *parent;
> >> + /* Reposition time */
> >> + unsigned long reposition_time;
> >> };
> >>
> >> /*
> >> @@ -113,8 +116,6 @@ struct io_sched_entity {
> >> struct cfq_queue {
> >> /* The schedule entity */
> >> struct io_sched_entity queue_entity;
> >> - /* Reposition time */
> >> - unsigned long reposition_time;
> >> /* reference count */
> >> atomic_t ref;
> >> /* various state flags, see below */
> >> @@ -193,6 +194,9 @@ struct cfq_group {
> >> /* number of cfqq currently on this group */
> >> int nr_cfqq;
> >>
> >> + /* number of sub cfq groups */
> >> + int nr_subgp;
> >> +
> >> /* Per group busy queus average. Useful for workload slice calc. */
> >> unsigned int busy_queues_avg[2];
> >> /*
> >> @@ -219,8 +223,6 @@ struct cfq_group {
> >> */
> >> struct cfq_data {
> >> struct request_queue *queue;
> >> - /* Root service tree for cfq_groups */
> >> - struct cfq_rb_root grp_service_tree;
> >> struct cfq_group root_group;
> >>
> >> /*
> >> @@ -337,8 +339,6 @@ cfqg_of_entity(struct io_sched_entity *io_entity)
> >> return NULL;
> >> }
> >>
> >> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> >> -
> >> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> >> enum wl_prio_t prio,
> >> enum wl_type_t type)
> >> @@ -629,10 +629,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
> >> static inline unsigned
> >> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> struct io_sched_entity *group_entity = &cfqg->group_entity;
> >> + struct cfq_rb_root *st = group_entity->service_tree;
> >>
> >> - return cfq_target_latency * group_entity->weight / st->total_weight;
> >> + if (st)
> >> + return cfq_target_latency * group_entity->weight
> >> + / st->total_weight;
> >> + else
> >> + /* If this is the root group, give it a full slice. */
> >> + return cfq_target_latency;
> >> }
> >>
> >> static inline void
> >> @@ -795,17 +800,6 @@ static struct io_sched_entity *cfq_rb_first(struct cfq_rb_root *root)
> >> return NULL;
> >> }
> >>
> >> -static struct io_sched_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
> >> -{
> >> - if (!root->left)
> >> - root->left = rb_first(&root->rb);
> >> -
> >> - if (root->left)
> >> - return rb_entry_entity(root->left);
> >> -
> >> - return NULL;
> >> -}
> >> -
> >> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> >> {
> >> rb_erase(n, root);
> >> @@ -887,6 +881,7 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
> >> struct io_sched_entity *io_entity)
> >> {
> >> __io_entity_service_tree_add(st, io_entity);
> >> + io_entity->reposition_time = jiffies;
> >> st->count++;
> >> st->total_weight += io_entity->weight;
> >> }
> >> @@ -894,29 +889,49 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
> >> static void
> >> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> struct rb_node *n;
> >> struct io_sched_entity *group_entity = &cfqg->group_entity;
> >> - struct io_sched_entity *__group_entity;
> >> + struct io_sched_entity *entity;
> >> + struct cfq_rb_root *st;
> >> + struct cfq_group *__cfqg;
> >>
> >> cfqg->nr_cfqq++;
> >> +
> >> + /*
> >> + * The root group doesn't belong to any service tree.
> >> + */
> >> + if (cfqg == &cfqd->root_group)
> >> + return;
> >> +
> >> if (!RB_EMPTY_NODE(&group_entity->rb_node))
> >> return;
> >>
> >> - /*
> >> - * Currently put the group at the end. Later implement something
> >> - * so that groups get lesser vtime based on their weights, so that
> >> - * if group does not loose all if it was not continously backlogged.
> >> + /*
> >> + * Enqueue this group and its ancestors onto their service tree.
> >> */
> >> - n = rb_last(&st->rb);
> >> - if (n) {
> >> - __group_entity = rb_entry_entity(n);
> >> - group_entity->vdisktime = __group_entity->vdisktime +
> >> - CFQ_IDLE_DELAY;
> >> - } else
> >> - group_entity->vdisktime = st->min_vdisktime;
> >> + while (group_entity && group_entity->parent) {
> >> + if (!RB_EMPTY_NODE(&group_entity->rb_node))
> >> + return;
> >> + /*
> >> + * Currently put the group at the end. Later implement
> >> + * something so that groups get lesser vtime based on their
> >> + * weights, so that a group does not lose everything if it was not
> >> + * continuously backlogged.
> >> + */
> >> + st = group_entity->service_tree;
> >> + n = rb_last(&st->rb);
> >> + if (n) {
> >> + entity = rb_entry_entity(n);
> >> + group_entity->vdisktime = entity->vdisktime +
> >> + CFQ_IDLE_DELAY;
> >> + } else
> >> + group_entity->vdisktime = st->min_vdisktime;
> >>
> >> - io_entity_service_tree_add(st, group_entity);
> >> + io_entity_service_tree_add(st, group_entity);
> >> + group_entity = group_entity->parent;
> >> + __cfqg = cfqg_of_entity(group_entity);
> >> + __cfqg->nr_subgp++;
> >> + }
> >> }
> >>
> >> static void
> >> @@ -933,27 +948,47 @@ io_entity_service_tree_del(struct cfq_rb_root *st,
> >> if (!RB_EMPTY_NODE(&io_entity->rb_node)) {
> >> __io_entity_service_tree_del(st, io_entity);
> >> st->total_weight -= io_entity->weight;
> >> - io_entity->service_tree = NULL;
> >> }
> >> }
> >>
> >> static void
> >> cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> struct io_sched_entity *group_entity = &cfqg->group_entity;
> >> + struct cfq_group *__cfqg, *p_cfqg;
> >>
> >> BUG_ON(cfqg->nr_cfqq < 1);
> >> cfqg->nr_cfqq--;
> >>
> >> + /*
> >> + * The root group doesn't belong to any service tree.
> >> + */
> >> + if (cfqg == &cfqd->root_group)
> >> + return;
> >> +
> >> /* If there are other cfq queues under this group, don't delete it */
> >> if (cfqg->nr_cfqq)
> >> return;
> >> -
> >> - cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
> >> - io_entity_service_tree_del(st, group_entity);
> >> - cfqg->saved_workload_slice = 0;
> >> - cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
> >> + /* If child group exists, don't dequeue it */
> >> + if (cfqg->nr_subgp)
> >> + return;
> >> +
> >> + /*
> >> + * Dequeue this group and its ancestors from their service tree.
> >> + */
> >> + while (group_entity && group_entity->parent) {
> >> + __cfqg = cfqg_of_entity(group_entity);
> >> + p_cfqg = cfqg_of_entity(group_entity->parent);
> >> + io_entity_service_tree_del(group_entity->service_tree,
> >> + group_entity);
> >> + cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
> >> + cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
> >> + __cfqg->saved_workload_slice = 0;
> >> + group_entity = group_entity->parent;
> >> + p_cfqg->nr_subgp--;
> >> + if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
> >> + return;
> >> + }
> >> }
> >>
> >> static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> >> @@ -985,7 +1020,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
> >> static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >> struct cfq_queue *cfqq)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> unsigned int used_sl, charge;
> >> int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
> >> - cfqg->service_tree_idle.count;
> >> @@ -999,10 +1033,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >> else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> >> charge = cfqq->allocated_slice;
> >>
> >> - /* Can't update vdisktime while group is on service tree */
> >> - __io_entity_service_tree_del(st, group_entity);
> >> - group_entity->vdisktime += cfq_scale_slice(charge, group_entity);
> >> - __io_entity_service_tree_add(st, group_entity);
> >> + /*
> >> + * Update the vdisktime on the whole chain.
> >> + */
> >> + while (group_entity && group_entity->parent) {
> >> + struct cfq_rb_root *st = group_entity->service_tree;
> >> +
> >> + /* Can't update vdisktime while group is on service tree */
> >> + __io_entity_service_tree_del(st, group_entity);
> >> + group_entity->vdisktime += cfq_scale_slice(charge,
> >> + group_entity);
> >> + __io_entity_service_tree_add(st, group_entity);
> >> + st->count++;
> >> + group_entity->reposition_time = jiffies;
> >> + group_entity = group_entity->parent;
> >> + }
> >>
> >> /* This group is being expired. Save the context */
> >> if (time_after(cfqd->workload_expires, jiffies)) {
> >> @@ -1014,7 +1059,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >> cfqg->saved_workload_slice = 0;
> >>
> >> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
> >> - group_entity->vdisktime, st->min_vdisktime);
> >> + cfqg->group_entity.vdisktime,
> >> + cfqg->group_entity.service_tree->min_vdisktime);
> >> cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
> >> " sect=%u", used_sl, cfqq->slice_dispatch, charge,
> >> iops_mode(cfqd), cfqq->nr_sectors);
> >> @@ -1036,35 +1082,27 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
> >> cfqg_of_blkg(blkg)->group_entity.weight = weight;
> >> }
> >>
> >> -static struct cfq_group *
> >> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> >> +static void init_group_entity(struct blkio_cgroup *blkcg,
> >> + struct cfq_group *cfqg)
> >> +{
> >> + struct io_sched_entity *group_entity = &cfqg->group_entity;
> >> +
> >> + group_entity->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
> >> + RB_CLEAR_NODE(&group_entity->rb_node);
> >> + group_entity->is_group_entity = true;
> >> + group_entity->parent = NULL;
> >> +}
> >> +
> >> +static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
> >> + struct cfq_group *cfqg)
> >> {
> >> - struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> >> - struct cfq_group *cfqg = NULL;
> >> - void *key = cfqd;
> >> int i, j;
> >> struct cfq_rb_root *st;
> >> - struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> >> unsigned int major, minor;
> >> -
> >> - cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> >> - if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
> >> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> >> - cfqg->blkg.dev = MKDEV(major, minor);
> >> - goto done;
> >> - }
> >> - if (cfqg || !create)
> >> - goto done;
> >> -
> >> - cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
> >> - if (!cfqg)
> >> - goto done;
> >> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> >>
> >> for_each_cfqg_st(cfqg, i, j, st)
> >> *st = CFQ_RB_ROOT;
> >> - RB_CLEAR_NODE(&cfqg->group_entity.rb_node);
> >> -
> >> - cfqg->group_entity.is_group_entity = true;
> >>
> >> /*
> >> * Take the initial reference that will be released on destroy
> >> @@ -1074,24 +1112,119 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> >> */
> >> atomic_set(&cfqg->ref, 1);
> >>
> >> + /* Add group onto cgroup list */
> >> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> >> + cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
> >> + MKDEV(major, minor));
> >> + /* Initialize the group entity */
> >> + init_group_entity(blkcg, cfqg);
> >> + /* Add group on cfqd list */
> >> + hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
> >> +}
> >> +
> >> +static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg);
> >> +
> >> +static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> +{
> >> + if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
> >> + cfq_destroy_cfqg(cfqd, cfqg);
> >> +}
> >> +
> >> +static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
> >> + struct cfq_group *p_cfqg)
> >> +{
> >> + struct io_sched_entity *group_entity, *p_group_entity;
> >> +
> >> + group_entity = &cfqg->group_entity;
> >> +
> >> + p_group_entity = &p_cfqg->group_entity;
> >> +
> >> + group_entity->parent = p_group_entity;
> >> +
> >> /*
> >> - * Add group onto cgroup list. It might happen that bdi->dev is
> >> - * not initiliazed yet. Initialize this new group without major
> >> - * and minor info and this info will be filled in once a new thread
> >> - * comes for IO. See code above.
> >> + * Currently, just put cfq group entity on "BE:SYNC" workload
> >> + * service tree.
> >> */
> >> - if (bdi->dev) {
> >> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> >> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
> >> - MKDEV(major, minor));
> >> - } else
> >> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
> >> - 0);
> >> + group_entity->service_tree = service_tree_for(p_cfqg, BE_WORKLOAD,
> >> + SYNC_WORKLOAD);
> >> + /* child reference */
> >> + atomic_inc(&p_cfqg->ref);
> >> +}
> >>
> >> - cfqg->group_entity.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
> >> +int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
> >> +{
> >> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> >> + struct blkio_cgroup *p_blkcg;
> >> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> >> + unsigned int major, minor;
> >> + struct cfq_group *cfqg, *p_cfqg;
> >> + void *key = cfqd;
> >> + int ret;
> >>
> >> - /* Add group on cfqd list */
> >> - hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
> >> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> >> + if (cfqg) {
> >> + if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
> >> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> >> + cfqg->blkg.dev = MKDEV(major, minor);
> >> + }
> >> + /* chain has already been built */
> >> + return 0;
> >> + }
> >> +
> >> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
> >> + if (!cfqg)
> >> + return -1;
> >> +
> >> + init_cfqg(cfqd, blkcg, cfqg);
> >> +
> >> + /* Already at the top group */
> >> + if (!cgroup->parent)
> >> + return 0;
> >> +
> >> + /* Allocate CFQ groups on the chain */
> >> + ret = cfqg_chain_alloc(cfqd, cgroup->parent);
> >> + if (ret == -1) {
> >> + uninit_cfqg(cfqd, cfqg);
> >> + return -1;
> >> + }
> >> +
> >> + p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
> >> + p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
> >> + BUG_ON(p_cfqg == NULL);
> >> +
> >> + cfqg_set_parent(cfqd, cfqg, p_cfqg);
> >> + return 0;
> >> +}
> >> +
> >> +static struct cfq_group *
> >> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> >> +{
> >> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> >> + struct cfq_group *cfqg = NULL;
> >> + void *key = cfqd;
> >> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> >> + unsigned int major, minor;
> >> + int ret;
> >> +
> >> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> >> + if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
> >> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> >> + cfqg->blkg.dev = MKDEV(major, minor);
> >> + goto done;
> >> + }
> >> + if (cfqg || !create)
> >> + goto done;
> >> +
> >> + /*
> >> + * For hierarchical cfq group scheduling, we need to allocate
> >> + * the whole cfq group chain.
> >> + */
> >> + ret = cfqg_chain_alloc(cfqd, cgroup);
> >> + if (!ret) {
> >> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> >> + BUG_ON(cfqg == NULL);
> >> + goto done;
> >> + }
> >>
> >> done:
> >> return cfqg;
> >> @@ -1136,12 +1269,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
> >> {
> >> struct cfq_rb_root *st;
> >> int i, j;
> >> + struct io_sched_entity *group_entity;
> >> + struct cfq_group *p_cfqg;
> >>
> >> BUG_ON(atomic_read(&cfqg->ref) <= 0);
> >> if (!atomic_dec_and_test(&cfqg->ref))
> >> return;
> >> for_each_cfqg_st(cfqg, i, j, st)
> >> BUG_ON(!RB_EMPTY_ROOT(&st->rb));
> >> +
> >> + group_entity = &cfqg->group_entity;
> >> + if (group_entity->parent) {
> >> + p_cfqg = cfqg_of_entity(group_entity->parent);
> >> + /* Drop the reference taken by children */
> >> + atomic_dec(&p_cfqg->ref);
> >> + }
> >> +
> >> kfree(cfqg);
> >> }
> >>
> >> @@ -1336,7 +1479,6 @@ insert:
> >> io_entity_service_tree_add(service_tree, queue_entity);
> >>
> >> update_min_vdisktime(service_tree);
> >> - cfqq->reposition_time = jiffies;
> >> if ((add_front || !new_cfqq) && !group_changed)
> >> return;
> >> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
> >> @@ -1779,28 +1921,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> >> return cfqq_of_entity(cfq_rb_first(service_tree));
> >> }
> >>
> >> -static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
> >> +static struct io_sched_entity *
> >> +cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_group *cfqg;
> >> - struct io_sched_entity *queue_entity;
> >> + struct io_sched_entity *entity;
> >> int i, j;
> >> struct cfq_rb_root *st;
> >>
> >> if (!cfqd->rq_queued)
> >> return NULL;
> >>
> >> - cfqg = cfq_get_next_cfqg(cfqd);
> >> - if (!cfqg)
> >> - return NULL;
> >> -
> >> for_each_cfqg_st(cfqg, i, j, st) {
> >> - queue_entity = cfq_rb_first(st);
> >> - if (queue_entity != NULL)
> >> - return cfqq_of_entity(queue_entity);
> >> + entity = cfq_rb_first(st);
> >> +
> >> + if (entity && !entity->is_group_entity)
> >> + return entity;
> >> + else if (entity && entity->is_group_entity) {
> >> + cfqg = cfqg_of_entity(entity);
> >> + return cfq_get_next_entity_forced(cfqd, cfqg);
> >> + }
> >> }
> >> return NULL;
> >> }
> >>
> >> +
> >> /*
> >> * Get and set a new active queue for service.
> >> */
> >> @@ -2155,8 +2299,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> >> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> struct cfq_group *cfqg, enum wl_prio_t prio)
> >> {
> >> - struct io_sched_entity *queue_entity;
> >> - struct cfq_queue *cfqq;
> >> + struct io_sched_entity *entity;
> >> unsigned long lowest_start_time;
> >> int i;
> >> bool time_valid = false;
> >> @@ -2167,12 +2310,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> * type. But for the time being just make use of reposition_time only.
> >> */
> >> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> >> - queue_entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
> >> - cfqq = cfqq_of_entity(queue_entity);
> >> - if (queue_entity &&
> >> + entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
> >> + if (entity &&
> >> (!time_valid ||
> >> - cfqq->reposition_time < lowest_start_time)) {
> >> - lowest_start_time = cfqq->reposition_time;
> >> + entity->reposition_time < lowest_start_time)) {
> >> + lowest_start_time = entity->reposition_time;
> >> cur_best = i;
> >> time_valid = true;
> >> }
> >> @@ -2181,47 +2323,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> >> return cur_best;
> >> }
> >>
> >> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> +static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> unsigned slice;
> >> unsigned count;
> >> struct cfq_rb_root *st;
> >> unsigned group_slice;
> >>
> >> - if (!cfqg) {
> >> - cfqd->serving_prio = IDLE_WORKLOAD;
> >> - cfqd->workload_expires = jiffies + 1;
> >> - return;
> >> - }
> >> -
> >> - /* Choose next priority. RT > BE > IDLE */
> >> - if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
> >> - cfqd->serving_prio = RT_WORKLOAD;
> >> - else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
> >> - cfqd->serving_prio = BE_WORKLOAD;
> >> - else {
> >> - cfqd->serving_prio = IDLE_WORKLOAD;
> >> - cfqd->workload_expires = jiffies + 1;
> >> - return;
> >> - }
> >> -
> >> - /*
> >> - * For RT and BE, we have to choose also the type
> >> - * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> >> - * expiration time
> >> - */
> >> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> >> - count = st->count;
> >> -
> >> - /*
> >> - * check workload expiration, and that we still have other queues ready
> >> - */
> >> - if (count && !time_after(jiffies, cfqd->workload_expires))
> >> - return;
> >> -
> >> - /* otherwise select new workload type */
> >> - cfqd->serving_type =
> >> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> >> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> >> count = st->count;
> >>
> >> @@ -2262,26 +2370,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> cfqd->workload_expires = jiffies + slice;
> >> }
> >>
> >> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
> >> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> - struct cfq_group *cfqg;
> >> - struct io_sched_entity *group_entity;
> >> + struct cfq_rb_root *st;
> >> + unsigned count;
> >>
> >> - if (RB_EMPTY_ROOT(&st->rb))
> >> - return NULL;
> >> - group_entity = cfq_rb_first_entity(st);
> >> - cfqg = cfqg_of_entity(group_entity);
> >> - BUG_ON(!cfqg);
> >> - update_min_vdisktime(st);
> >> - return cfqg;
> >> + if (!cfqg) {
> >> + cfqd->serving_prio = IDLE_WORKLOAD;
> >> + cfqd->workload_expires = jiffies + 1;
> >> + return;
> >> + }
> >> +
> >> + /* Choose next priority. RT > BE > IDLE */
> >> + if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
> >> + cfqd->serving_prio = RT_WORKLOAD;
> >> + else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
> >> + cfqd->serving_prio = BE_WORKLOAD;
> >> + else {
> >> + cfqd->serving_prio = IDLE_WORKLOAD;
> >> + cfqd->workload_expires = jiffies + 1;
> >> + return;
> >> + }
> >> +
> >> + /*
> >> + * For RT and BE, we have to choose also the type
> >> + * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> >> + * expiration time
> >> + */
> >> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> >> + count = st->count;
> >> +
> >> + /*
> >> + * check workload expiration, and that we still have other queues ready
> >> + */
> >> + if (count && !time_after(jiffies, cfqd->workload_expires))
> >> + return;
> >> +
> >> + /* otherwise select new workload type */
> >> + cfqd->serving_type =
> >> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> >> }
> >>
> >> -static void cfq_choose_cfqg(struct cfq_data *cfqd)
> >> +struct io_sched_entity *choose_serving_entity(struct cfq_data *cfqd,
> >> + struct cfq_group *cfqg)
> >> {
> >> - struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
> >> -
> >> - cfqd->serving_group = cfqg;
> >> + struct cfq_rb_root *service_tree;
> >>
> >> /* Restore the workload type data */
> >> if (cfqg->saved_workload_slice) {
> >> @@ -2292,8 +2425,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> >> cfqd->workload_expires = jiffies - 1;
> >>
> >> choose_service_tree(cfqd, cfqg);
> >> -}
> >>
> >> + service_tree = service_tree_for(cfqg, cfqd->serving_prio,
> >> + cfqd->serving_type);
> >> +
> >> + if (!cfqd->rq_queued)
> >> + return NULL;
> >> +
> >> + /* There is nothing to dispatch */
> >> + if (!service_tree)
> >> + return NULL;
> >> + if (RB_EMPTY_ROOT(&service_tree->rb))
> >> + return NULL;
> >> +
> >> + return cfq_rb_first(service_tree);
> >> +}
> >> /*
> >> * Select a queue for service. If we have a current active queue,
> >> * check whether to continue servicing it, or retrieve and set a new one.
> >> @@ -2301,6 +2447,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> >> static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> >> {
> >> struct cfq_queue *cfqq, *new_cfqq = NULL;
> >> + struct cfq_group *cfqg;
> >> + struct io_sched_entity *entity;
> >>
> >> cfqq = cfqd->active_queue;
> >> if (!cfqq)
> >> @@ -2389,8 +2537,23 @@ new_queue:
> >> * Current queue expired. Check if we have to switch to a new
> >> * service tree
> >> */
> >> - if (!new_cfqq)
> >> - cfq_choose_cfqg(cfqd);
> >> + cfqg = &cfqd->root_group;
> >> +
> >> + if (!new_cfqq) {
> >> + do {
> >> + entity = choose_serving_entity(cfqd, cfqg);
> >> + if (entity && !entity->is_group_entity) {
> >> + /* This is the CFQ queue that should run */
> >> + new_cfqq = cfqq_of_entity(entity);
> >> + cfqd->serving_group = cfqg;
> >> + set_workload_expire(cfqd, cfqg);
> >> + break;
> >> + } else if (entity && entity->is_group_entity) {
> >> + /* Continue to lookup in this CFQ group */
> >> + cfqg = cfqg_of_entity(entity);
> >> + }
> >> + } while (entity && entity->is_group_entity);
> >> + }
> >>
> >> cfqq = cfq_set_active_queue(cfqd, new_cfqq);
> >> keep_queue:
> >> @@ -2421,10 +2584,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> >> {
> >> struct cfq_queue *cfqq;
> >> int dispatched = 0;
> >> + struct io_sched_entity *entity;
> >> + struct cfq_group *root = &cfqd->root_group;
> >>
> >> /* Expire the timeslice of the current active queue first */
> >> cfq_slice_expired(cfqd, 0);
> >> - while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
> >> + while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
> >> + BUG_ON(entity->is_group_entity);
> >> + cfqq = cfqq_of_entity(entity);
> >> __cfq_set_active_queue(cfqd, cfqq);
> >> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> >> }
> >> @@ -3954,9 +4121,6 @@ static void *cfq_init_queue(struct request_queue *q)
> >>
> >> cfqd->cic_index = i;
> >>
> >> - /* Init root service tree */
> >> - cfqd->grp_service_tree = CFQ_RB_ROOT;
> >> -
> >> /* Init root group */
> >> cfqg = &cfqd->root_group;
> >> for_each_cfqg_st(cfqg, i, j, st)
> >> @@ -3966,6 +4130,7 @@ static void *cfq_init_queue(struct request_queue *q)
> >> /* Give preference to root group over other groups */
> >> cfqg->group_entity.weight = 2*BLKIO_WEIGHT_DEFAULT;
> >> cfqg->group_entity.is_group_entity = true;
> >> + cfqg->group_entity.parent = NULL;
> >>
> >> #ifdef CONFIG_CFQ_GROUP_IOSCHED
> >> /*
> >> --
> >> 1.6.5.2
> >>
> >>
> >
>
> --
> Regards
> Gui Jianfeng
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [RFC] [PATCH 8/8] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-11-29 14:31 ` Vivek Goyal
@ 2010-11-30 1:15 ` Gui Jianfeng
0 siblings, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-11-30 1:15 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Nov 29, 2010 at 10:42:15AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Sun, Nov 14, 2010 at 04:25:49PM +0800, Gui Jianfeng wrote:
>>>> This patch makes CFQ queue and CFQ group schedules at the same level.
>>>> Consider the following hierarchy:
>>>>
>>>> Root
>>>> / | \
>>>> q1 q2 G1
>>>> / \
>>>> q3 G2
>>>>
>>>> q1, q2 and q3 are CFQ queues; G1 and G2 are CFQ groups. Currently, q1, q2
>>>> and G1 are scheduled on the same service tree in the root CFQ group, while
>>>> q3 and G2 are scheduled under G1. Note that, for the time being, a CFQ
>>>> group is treated as "BE and SYNC" workload and is put on the "BE and SYNC"
>>>> service tree. That means service differentiation only happens within the
>>>> "BE and SYNC" service tree. Later, we may introduce an "IO Class" for CFQ
>>>> groups.
>>>>
>>> Have you got rid of flat mode (existing default mode). IOW, I don't see
>>> the introduction of "use_hierarchy" which will differentiate between
>>> whether to treat an hierarchy as flat or not?
>> Vivek,
>>
>> As I said in [PATCH 0/8], yes, for the time being I got rid of flat mode.
>> Hierarchical cgroup creation has only just been merged into block-tree and
>> isn't in mainline yet, so I think it's OK to post the "use_hierarchy"
>> patchset separately once this patchset gets merged. What do you think, Vivek?
>
> But even a single level of group creation can be either flat or
> hierarchical. Think of two cgroups, test1 and test2, created under root. In
> the flat and the hierarchical scheme they will look different.
>
> So I don't think we can drop flat mode for the time being. We need to include
> use_hierarchy support along with this patch series.
Hmm, yes, I'll add use_hierarchy to this series.
Gui
>
> Thanks
> Vivek
>
>> Gui
>>
>>> Vivek
>>>
>>>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>>>> ---
>>>> block/cfq-iosched.c | 483 ++++++++++++++++++++++++++++++++++-----------------
>>>> 1 files changed, 324 insertions(+), 159 deletions(-)
>>>>
>>>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>>>> index 1df0928..9def3a2 100644
>>>> --- a/block/cfq-iosched.c
>>>> +++ b/block/cfq-iosched.c
>>>> @@ -105,6 +105,9 @@ struct io_sched_entity {
>>>> u64 vdisktime;
>>>> bool is_group_entity;
>>>> unsigned int weight;
>>>> + struct io_sched_entity *parent;
>>>> + /* Reposition time */
>>>> + unsigned long reposition_time;
>>>> };
>>>>
>>>> /*
>>>> @@ -113,8 +116,6 @@ struct io_sched_entity {
>>>> struct cfq_queue {
>>>> /* The schedule entity */
>>>> struct io_sched_entity queue_entity;
>>>> - /* Reposition time */
>>>> - unsigned long reposition_time;
>>>> /* reference count */
>>>> atomic_t ref;
>>>> /* various state flags, see below */
>>>> @@ -193,6 +194,9 @@ struct cfq_group {
>>>> /* number of cfqq currently on this group */
>>>> int nr_cfqq;
>>>>
>>>> + /* number of sub cfq groups */
>>>> + int nr_subgp;
>>>> +
>>>> /* Per group busy queus average. Useful for workload slice calc. */
>>>> unsigned int busy_queues_avg[2];
>>>> /*
>>>> @@ -219,8 +223,6 @@ struct cfq_group {
>>>> */
>>>> struct cfq_data {
>>>> struct request_queue *queue;
>>>> - /* Root service tree for cfq_groups */
>>>> - struct cfq_rb_root grp_service_tree;
>>>> struct cfq_group root_group;
>>>>
>>>> /*
>>>> @@ -337,8 +339,6 @@ cfqg_of_entity(struct io_sched_entity *io_entity)
>>>> return NULL;
>>>> }
>>>>
>>>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>>>> -
>>>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>>>> enum wl_prio_t prio,
>>>> enum wl_type_t type)
>>>> @@ -629,10 +629,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
>>>> static inline unsigned
>>>> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>>>> + struct cfq_rb_root *st = group_entity->service_tree;
>>>>
>>>> - return cfq_target_latency * group_entity->weight / st->total_weight;
>>>> + if (st)
>>>> + return cfq_target_latency * group_entity->weight
>>>> + / st->total_weight;
>>>> + else
>>>> + /* If this is the root group, give it a full slice. */
>>>> + return cfq_target_latency;
>>>> }
>>>>
>>>> static inline void
>>>> @@ -795,17 +800,6 @@ static struct io_sched_entity *cfq_rb_first(struct cfq_rb_root *root)
>>>> return NULL;
>>>> }
>>>>
>>>> -static struct io_sched_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
>>>> -{
>>>> - if (!root->left)
>>>> - root->left = rb_first(&root->rb);
>>>> -
>>>> - if (root->left)
>>>> - return rb_entry_entity(root->left);
>>>> -
>>>> - return NULL;
>>>> -}
>>>> -
>>>> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
>>>> {
>>>> rb_erase(n, root);
>>>> @@ -887,6 +881,7 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
>>>> struct io_sched_entity *io_entity)
>>>> {
>>>> __io_entity_service_tree_add(st, io_entity);
>>>> + io_entity->reposition_time = jiffies;
>>>> st->count++;
>>>> st->total_weight += io_entity->weight;
>>>> }
>>>> @@ -894,29 +889,49 @@ io_entity_service_tree_add(struct cfq_rb_root *st,
>>>> static void
>>>> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>>> struct rb_node *n;
>>>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>>>> - struct io_sched_entity *__group_entity;
>>>> + struct io_sched_entity *entity;
>>>> + struct cfq_rb_root *st;
>>>> + struct cfq_group *__cfqg;
>>>>
>>>> cfqg->nr_cfqq++;
>>>> +
>>>> + /*
>>>> + * Root group doesn't belong to any service tree
>>>> + */
>>>> + if (cfqg == &cfqd->root_group)
>>>> + return;
>>>> +
>>>> if (!RB_EMPTY_NODE(&group_entity->rb_node))
>>>> return;
>>>>
>>>> - /*
>>>> - * Currently put the group at the end. Later implement something
>>>> - * so that groups get lesser vtime based on their weights, so that
>>>> - * if group does not loose all if it was not continously backlogged.
>>>> + /*
>>>> + * Enqueue this group and its ancestors onto their service tree.
>>>> */
>>>> - n = rb_last(&st->rb);
>>>> - if (n) {
>>>> - __group_entity = rb_entry_entity(n);
>>>> - group_entity->vdisktime = __group_entity->vdisktime +
>>>> - CFQ_IDLE_DELAY;
>>>> - } else
>>>> - group_entity->vdisktime = st->min_vdisktime;
>>>> + while (group_entity && group_entity->parent) {
>>>> + if (!RB_EMPTY_NODE(&group_entity->rb_node))
>>>> + return;
>>>> + /*
>>>> + * Currently put the group at the end. Later implement
>>>> + * something so that groups get lesser vtime based on their
>>>> + * weights, so that a group does not lose everything if it was not
>>>> + * continuously backlogged.
>>>> + */
>>>> + st = group_entity->service_tree;
>>>> + n = rb_last(&st->rb);
>>>> + if (n) {
>>>> + entity = rb_entry_entity(n);
>>>> + group_entity->vdisktime = entity->vdisktime +
>>>> + CFQ_IDLE_DELAY;
>>>> + } else
>>>> + group_entity->vdisktime = st->min_vdisktime;
>>>>
>>>> - io_entity_service_tree_add(st, group_entity);
>>>> + io_entity_service_tree_add(st, group_entity);
>>>> + group_entity = group_entity->parent;
>>>> + __cfqg = cfqg_of_entity(group_entity);
>>>> + __cfqg->nr_subgp++;
>>>> + }
>>>> }
>>>>
>>>> static void
>>>> @@ -933,27 +948,47 @@ io_entity_service_tree_del(struct cfq_rb_root *st,
>>>> if (!RB_EMPTY_NODE(&io_entity->rb_node)) {
>>>> __io_entity_service_tree_del(st, io_entity);
>>>> st->total_weight -= io_entity->weight;
>>>> - io_entity->service_tree = NULL;
>>>> }
>>>> }
>>>>
>>>> static void
>>>> cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>>> struct io_sched_entity *group_entity = &cfqg->group_entity;
>>>> + struct cfq_group *__cfqg, *p_cfqg;
>>>>
>>>> BUG_ON(cfqg->nr_cfqq < 1);
>>>> cfqg->nr_cfqq--;
>>>>
>>>> + /*
>>>> + * Root group doesn't belong to any service tree
>>>> + */
>>>> + if (cfqg == &cfqd->root_group)
>>>> + return;
>>>> +
>>>> /* If there are other cfq queues under this group, don't delete it */
>>>> if (cfqg->nr_cfqq)
>>>> return;
>>>> -
>>>> - cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
>>>> - io_entity_service_tree_del(st, group_entity);
>>>> - cfqg->saved_workload_slice = 0;
>>>> - cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
>>>> + /* If child group exists, don't dequeue it */
>>>> + if (cfqg->nr_subgp)
>>>> + return;
>>>> +
>>>> + /*
>>>> + * Dequeue this group and its ancestors from their service tree.
>>>> + */
>>>> + while (group_entity && group_entity->parent) {
>>>> + __cfqg = cfqg_of_entity(group_entity);
>>>> + p_cfqg = cfqg_of_entity(group_entity->parent);
>>>> + io_entity_service_tree_del(group_entity->service_tree,
>>>> + group_entity);
>>>> + cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
>>>> + cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
>>>> + __cfqg->saved_workload_slice = 0;
>>>> + group_entity = group_entity->parent;
>>>> + p_cfqg->nr_subgp--;
>>>> + if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
>>>> + return;
>>>> + }
>>>> }
>>>>
>>>> static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>>>> @@ -985,7 +1020,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>>>> static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> struct cfq_queue *cfqq)
>>>> {
>>>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>>> unsigned int used_sl, charge;
>>>> int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
>>>> - cfqg->service_tree_idle.count;
>>>> @@ -999,10 +1033,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
>>>> charge = cfqq->allocated_slice;
>>>>
>>>> - /* Can't update vdisktime while group is on service tree */
>>>> - __io_entity_service_tree_del(st, group_entity);
>>>> - group_entity->vdisktime += cfq_scale_slice(charge, group_entity);
>>>> - __io_entity_service_tree_add(st, group_entity);
>>>> + /*
>>>> + * Update the vdisktime on the whole chain.
>>>> + */
>>>> + while (group_entity && group_entity->parent) {
>>>> + struct cfq_rb_root *st = group_entity->service_tree;
>>>> +
>>>> + /* Can't update vdisktime while group is on service tree */
>>>> + __io_entity_service_tree_del(st, group_entity);
>>>> + group_entity->vdisktime += cfq_scale_slice(charge,
>>>> + group_entity);
>>>> + __io_entity_service_tree_add(st, group_entity);
>>>> + st->count++;
>>>> + group_entity->reposition_time = jiffies;
>>>> + group_entity = group_entity->parent;
>>>> + }
>>>>
>>>> /* This group is being expired. Save the context */
>>>> if (time_after(cfqd->workload_expires, jiffies)) {
>>>> @@ -1014,7 +1059,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> cfqg->saved_workload_slice = 0;
>>>>
>>>> cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
>>>> - group_entity->vdisktime, st->min_vdisktime);
>>>> + cfqg->group_entity.vdisktime,
>>>> + cfqg->group_entity.service_tree->min_vdisktime);
>>>> cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
>>>> " sect=%u", used_sl, cfqq->slice_dispatch, charge,
>>>> iops_mode(cfqd), cfqq->nr_sectors);
>>>> @@ -1036,35 +1082,27 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
>>>> cfqg_of_blkg(blkg)->group_entity.weight = weight;
>>>> }
>>>>
>>>> -static struct cfq_group *
>>>> -cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>>>> +static void init_group_entity(struct blkio_cgroup *blkcg,
>>>> + struct cfq_group *cfqg)
>>>> +{
>>>> + struct io_sched_entity *group_entity = &cfqg->group_entity;
>>>> +
>>>> + group_entity->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>>>> + RB_CLEAR_NODE(&group_entity->rb_node);
>>>> + group_entity->is_group_entity = true;
>>>> + group_entity->parent = NULL;
>>>> +}
>>>> +
>>>> +static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
>>>> + struct cfq_group *cfqg)
>>>> {
>>>> - struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>> - struct cfq_group *cfqg = NULL;
>>>> - void *key = cfqd;
>>>> int i, j;
>>>> struct cfq_rb_root *st;
>>>> - struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>>>> unsigned int major, minor;
>>>> -
>>>> - cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>>>> - if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>>>> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>>>> - cfqg->blkg.dev = MKDEV(major, minor);
>>>> - goto done;
>>>> - }
>>>> - if (cfqg || !create)
>>>> - goto done;
>>>> -
>>>> - cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>>>> - if (!cfqg)
>>>> - goto done;
>>>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>>>>
>>>> for_each_cfqg_st(cfqg, i, j, st)
>>>> *st = CFQ_RB_ROOT;
>>>> - RB_CLEAR_NODE(&cfqg->group_entity.rb_node);
>>>> -
>>>> - cfqg->group_entity.is_group_entity = true;
>>>>
>>>> /*
>>>> * Take the initial reference that will be released on destroy
>>>> @@ -1074,24 +1112,119 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>>>> */
>>>> atomic_set(&cfqg->ref, 1);
>>>>
>>>> + /* Add group onto cgroup list */
>>>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>>>> + cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>>>> + MKDEV(major, minor));
>>>> + /* Initiate group entity */
>>>> + init_group_entity(blkcg, cfqg);
>>>> + /* Add group on cfqd list */
>>>> + hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>>>> +}
>>>> +
>>>> +static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg);
>>>> +
>>>> +static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> +{
>>>> + if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
>>>> + cfq_destroy_cfqg(cfqd, cfqg);
>>>> +}
>>>> +
>>>> +static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
>>>> + struct cfq_group *p_cfqg)
>>>> +{
>>>> + struct io_sched_entity *group_entity, *p_group_entity;
>>>> +
>>>> + group_entity = &cfqg->group_entity;
>>>> +
>>>> + p_group_entity = &p_cfqg->group_entity;
>>>> +
>>>> + group_entity->parent = p_group_entity;
>>>> +
>>>> /*
>>>> - * Add group onto cgroup list. It might happen that bdi->dev is
>>>> - * not initiliazed yet. Initialize this new group without major
>>>> - * and minor info and this info will be filled in once a new thread
>>>> - * comes for IO. See code above.
>>>> + * Currently, just put cfq group entity on "BE:SYNC" workload
>>>> + * service tree.
>>>> */
>>>> - if (bdi->dev) {
>>>> - sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>>>> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>>>> - MKDEV(major, minor));
>>>> - } else
>>>> - cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
>>>> - 0);
>>>> + group_entity->service_tree = service_tree_for(p_cfqg, BE_WORKLOAD,
>>>> + SYNC_WORKLOAD);
>>>> + /* child reference */
>>>> + atomic_inc(&p_cfqg->ref);
>>>> +}
>>>>
>>>> - cfqg->group_entity.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>>>> +int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
>>>> +{
>>>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>> + struct blkio_cgroup *p_blkcg;
>>>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>>>> + unsigned int major, minor;
>>>> + struct cfq_group *cfqg, *p_cfqg;
>>>> + void *key = cfqd;
>>>> + int ret;
>>>>
>>>> - /* Add group on cfqd list */
>>>> - hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>>>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>>>> + if (cfqg) {
>>>> + if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>>>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>>>> + cfqg->blkg.dev = MKDEV(major, minor);
>>>> + }
>>>> + /* chain has already been built */
>>>> + return 0;
>>>> + }
>>>> +
>>>> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>>>> + if (!cfqg)
>>>> + return -1;
>>>> +
>>>> + init_cfqg(cfqd, blkcg, cfqg);
>>>> +
>>>> + /* Already to the top group */
>>>> + if (!cgroup->parent)
>>>> + return 0;
>>>> +
>>>> + /* Allocate CFQ groups on the chain */
>>>> + ret = cfqg_chain_alloc(cfqd, cgroup->parent);
>>>> + if (ret == -1) {
>>>> + uninit_cfqg(cfqd, cfqg);
>>>> + return -1;
>>>> + }
>>>> +
>>>> + p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
>>>> + p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
>>>> + BUG_ON(p_cfqg == NULL);
>>>> +
>>>> + cfqg_set_parent(cfqd, cfqg, p_cfqg);
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static struct cfq_group *
>>>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>>>> +{
>>>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>> + struct cfq_group *cfqg = NULL;
>>>> + void *key = cfqd;
>>>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>>>> + unsigned int major, minor;
>>>> + int ret;
>>>> +
>>>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>>>> + if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>>>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>>>> + cfqg->blkg.dev = MKDEV(major, minor);
>>>> + goto done;
>>>> + }
>>>> + if (cfqg || !create)
>>>> + goto done;
>>>> +
>>>> + /*
>>>> + * For hierarchical cfq group scheduling, we need to allocate
>>>> + * the whole cfq group chain.
>>>> + */
>>>> + ret = cfqg_chain_alloc(cfqd, cgroup);
>>>> + if (!ret) {
>>>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>>>> + BUG_ON(cfqg == NULL);
>>>> + goto done;
>>>> + }
>>>>
>>>> done:
>>>> return cfqg;
>>>> @@ -1136,12 +1269,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>>>> {
>>>> struct cfq_rb_root *st;
>>>> int i, j;
>>>> + struct io_sched_entity *group_entity;
>>>> + struct cfq_group *p_cfqg;
>>>>
>>>> BUG_ON(atomic_read(&cfqg->ref) <= 0);
>>>> if (!atomic_dec_and_test(&cfqg->ref))
>>>> return;
>>>> for_each_cfqg_st(cfqg, i, j, st)
>>>> BUG_ON(!RB_EMPTY_ROOT(&st->rb));
>>>> +
>>>> + group_entity = &cfqg->group_entity;
>>>> + if (group_entity->parent) {
>>>> + p_cfqg = cfqg_of_entity(group_entity->parent);
>>>> + /* Drop the reference taken by children */
>>>> + atomic_dec(&p_cfqg->ref);
>>>> + }
>>>> +
>>>> kfree(cfqg);
>>>> }
>>>>
>>>> @@ -1336,7 +1479,6 @@ insert:
>>>> io_entity_service_tree_add(service_tree, queue_entity);
>>>>
>>>> update_min_vdisktime(service_tree);
>>>> - cfqq->reposition_time = jiffies;
>>>> if ((add_front || !new_cfqq) && !group_changed)
>>>> return;
>>>> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
>>>> @@ -1779,28 +1921,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>>>> return cfqq_of_entity(cfq_rb_first(service_tree));
>>>> }
>>>>
>>>> -static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
>>>> +static struct io_sched_entity *
>>>> +cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_group *cfqg;
>>>> - struct io_sched_entity *queue_entity;
>>>> + struct io_sched_entity *entity;
>>>> int i, j;
>>>> struct cfq_rb_root *st;
>>>>
>>>> if (!cfqd->rq_queued)
>>>> return NULL;
>>>>
>>>> - cfqg = cfq_get_next_cfqg(cfqd);
>>>> - if (!cfqg)
>>>> - return NULL;
>>>> -
>>>> for_each_cfqg_st(cfqg, i, j, st) {
>>>> - queue_entity = cfq_rb_first(st);
>>>> - if (queue_entity != NULL)
>>>> - return cfqq_of_entity(queue_entity);
>>>> + entity = cfq_rb_first(st);
>>>> +
>>>> + if (entity && !entity->is_group_entity)
>>>> + return entity;
>>>> + else if (entity && entity->is_group_entity) {
>>>> + cfqg = cfqg_of_entity(entity);
>>>> + return cfq_get_next_entity_forced(cfqd, cfqg);
>>>> + }
>>>> }
>>>> return NULL;
>>>> }
>>>>
>>>> +
>>>> /*
>>>> * Get and set a new active queue for service.
>>>> */
>>>> @@ -2155,8 +2299,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>>>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>>> struct cfq_group *cfqg, enum wl_prio_t prio)
>>>> {
>>>> - struct io_sched_entity *queue_entity;
>>>> - struct cfq_queue *cfqq;
>>>> + struct io_sched_entity *entity;
>>>> unsigned long lowest_start_time;
>>>> int i;
>>>> bool time_valid = false;
>>>> @@ -2167,12 +2310,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>>> * type. But for the time being just make use of reposition_time only.
>>>> */
>>>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>>>> - queue_entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
>>>> - cfqq = cfqq_of_entity(queue_entity);
>>>> - if (queue_entity &&
>>>> + entity = cfq_rb_first(service_tree_for(cfqg, prio, i));
>>>> + if (entity &&
>>>> (!time_valid ||
>>>> - cfqq->reposition_time < lowest_start_time)) {
>>>> - lowest_start_time = cfqq->reposition_time;
>>>> + entity->reposition_time < lowest_start_time)) {
>>>> + lowest_start_time = entity->reposition_time;
>>>> cur_best = i;
>>>> time_valid = true;
>>>> }
>>>> @@ -2181,47 +2323,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>>> return cur_best;
>>>> }
>>>>
>>>> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> +static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> unsigned slice;
>>>> unsigned count;
>>>> struct cfq_rb_root *st;
>>>> unsigned group_slice;
>>>>
>>>> - if (!cfqg) {
>>>> - cfqd->serving_prio = IDLE_WORKLOAD;
>>>> - cfqd->workload_expires = jiffies + 1;
>>>> - return;
>>>> - }
>>>> -
>>>> - /* Choose next priority. RT > BE > IDLE */
>>>> - if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>>>> - cfqd->serving_prio = RT_WORKLOAD;
>>>> - else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>>>> - cfqd->serving_prio = BE_WORKLOAD;
>>>> - else {
>>>> - cfqd->serving_prio = IDLE_WORKLOAD;
>>>> - cfqd->workload_expires = jiffies + 1;
>>>> - return;
>>>> - }
>>>> -
>>>> - /*
>>>> - * For RT and BE, we have to choose also the type
>>>> - * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>>>> - * expiration time
>>>> - */
>>>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>>>> - count = st->count;
>>>> -
>>>> - /*
>>>> - * check workload expiration, and that we still have other queues ready
>>>> - */
>>>> - if (count && !time_after(jiffies, cfqd->workload_expires))
>>>> - return;
>>>> -
>>>> - /* otherwise select new workload type */
>>>> - cfqd->serving_type =
>>>> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>>>> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>>>> count = st->count;
>>>>
>>>> @@ -2262,26 +2370,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> cfqd->workload_expires = jiffies + slice;
>>>> }
>>>>
>>>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>>>> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>>>> - struct cfq_group *cfqg;
>>>> - struct io_sched_entity *group_entity;
>>>> + struct cfq_rb_root *st;
>>>> + unsigned count;
>>>>
>>>> - if (RB_EMPTY_ROOT(&st->rb))
>>>> - return NULL;
>>>> - group_entity = cfq_rb_first_entity(st);
>>>> - cfqg = cfqg_of_entity(group_entity);
>>>> - BUG_ON(!cfqg);
>>>> - update_min_vdisktime(st);
>>>> - return cfqg;
>>>> + if (!cfqg) {
>>>> + cfqd->serving_prio = IDLE_WORKLOAD;
>>>> + cfqd->workload_expires = jiffies + 1;
>>>> + return;
>>>> + }
>>>> +
>>>> + /* Choose next priority. RT > BE > IDLE */
>>>> + if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>>>> + cfqd->serving_prio = RT_WORKLOAD;
>>>> + else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>>>> + cfqd->serving_prio = BE_WORKLOAD;
>>>> + else {
>>>> + cfqd->serving_prio = IDLE_WORKLOAD;
>>>> + cfqd->workload_expires = jiffies + 1;
>>>> + return;
>>>> + }
>>>> +
>>>> + /*
>>>> + * For RT and BE, we have to choose also the type
>>>> + * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>>>> + * expiration time
>>>> + */
>>>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>>>> + count = st->count;
>>>> +
>>>> + /*
>>>> + * check workload expiration, and that we still have other queues ready
>>>> + */
>>>> + if (count && !time_after(jiffies, cfqd->workload_expires))
>>>> + return;
>>>> +
>>>> + /* otherwise select new workload type */
>>>> + cfqd->serving_type =
>>>> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>>>> }
>>>>
>>>> -static void cfq_choose_cfqg(struct cfq_data *cfqd)
>>>> +struct io_sched_entity *choose_serving_entity(struct cfq_data *cfqd,
>>>> + struct cfq_group *cfqg)
>>>> {
>>>> - struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
>>>> -
>>>> - cfqd->serving_group = cfqg;
>>>> + struct cfq_rb_root *service_tree;
>>>>
>>>> /* Restore the workload type data */
>>>> if (cfqg->saved_workload_slice) {
>>>> @@ -2292,8 +2425,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>>>> cfqd->workload_expires = jiffies - 1;
>>>>
>>>> choose_service_tree(cfqd, cfqg);
>>>> -}
>>>>
>>>> + service_tree = service_tree_for(cfqg, cfqd->serving_prio,
>>>> + cfqd->serving_type);
>>>> +
>>>> + if (!cfqd->rq_queued)
>>>> + return NULL;
>>>> +
>>>> + /* There is nothing to dispatch */
>>>> + if (!service_tree)
>>>> + return NULL;
>>>> + if (RB_EMPTY_ROOT(&service_tree->rb))
>>>> + return NULL;
>>>> +
>>>> + return cfq_rb_first(service_tree);
>>>> +}
>>>> /*
>>>> * Select a queue for service. If we have a current active queue,
>>>> * check whether to continue servicing it, or retrieve and set a new one.
>>>> @@ -2301,6 +2447,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>>>> static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>>>> {
>>>> struct cfq_queue *cfqq, *new_cfqq = NULL;
>>>> + struct cfq_group *cfqg;
>>>> + struct io_sched_entity *entity;
>>>>
>>>> cfqq = cfqd->active_queue;
>>>> if (!cfqq)
>>>> @@ -2389,8 +2537,23 @@ new_queue:
>>>> * Current queue expired. Check if we have to switch to a new
>>>> * service tree
>>>> */
>>>> - if (!new_cfqq)
>>>> - cfq_choose_cfqg(cfqd);
>>>> + cfqg = &cfqd->root_group;
>>>> +
>>>> + if (!new_cfqq) {
>>>> + do {
>>>> + entity = choose_serving_entity(cfqd, cfqg);
>>>> + if (entity && !entity->is_group_entity) {
>>>> + /* This is the CFQ queue that should run */
>>>> + new_cfqq = cfqq_of_entity(entity);
>>>> + cfqd->serving_group = cfqg;
>>>> + set_workload_expire(cfqd, cfqg);
>>>> + break;
>>>> + } else if (entity && entity->is_group_entity) {
>>>> + /* Continue to lookup in this CFQ group */
>>>> + cfqg = cfqg_of_entity(entity);
>>>> + }
>>>> + } while (entity && entity->is_group_entity);
>>>> + }
>>>>
>>>> cfqq = cfq_set_active_queue(cfqd, new_cfqq);
>>>> keep_queue:
>>>> @@ -2421,10 +2584,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>>>> {
>>>> struct cfq_queue *cfqq;
>>>> int dispatched = 0;
>>>> + struct io_sched_entity *entity;
>>>> + struct cfq_group *root = &cfqd->root_group;
>>>>
>>>> /* Expire the timeslice of the current active queue first */
>>>> cfq_slice_expired(cfqd, 0);
>>>> - while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
>>>> + while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
>>>> + BUG_ON(entity->is_group_entity);
>>>> + cfqq = cfqq_of_entity(entity);
>>>> __cfq_set_active_queue(cfqd, cfqq);
>>>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>>>> }
>>>> @@ -3954,9 +4121,6 @@ static void *cfq_init_queue(struct request_queue *q)
>>>>
>>>> cfqd->cic_index = i;
>>>>
>>>> - /* Init root service tree */
>>>> - cfqd->grp_service_tree = CFQ_RB_ROOT;
>>>> -
>>>> /* Init root group */
>>>> cfqg = &cfqd->root_group;
>>>> for_each_cfqg_st(cfqg, i, j, st)
>>>> @@ -3966,6 +4130,7 @@ static void *cfq_init_queue(struct request_queue *q)
>>>> /* Give preference to root group over other groups */
>>>> cfqg->group_entity.weight = 2*BLKIO_WEIGHT_DEFAULT;
>>>> cfqg->group_entity.is_group_entity = true;
>>>> + cfqg->group_entity.parent = NULL;
>>>>
>>>> #ifdef CONFIG_CFQ_GROUP_IOSCHED
>>>> /*
>>>> --
>>>> 1.6.5.2
>>>>
>>>>
>> --
>> Regards
>> Gui Jianfeng
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
[not found] ` <4CE2718C.6010406@kernel.dk>
@ 2010-12-13 1:44 ` Gui Jianfeng
2010-12-13 13:36 ` Jens Axboe
2010-12-13 14:29 ` Vivek Goyal
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
1 sibling, 2 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:44 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Corrado Zoccolo, Chad Talbott, Nauman Rafique, Divyesh Shah,
linux kernel mailing list
[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]
Hi
Previously, I posted a patchset that adds CFQ group hierarchical scheduling
by putting all CFQ queues into a hidden group and scheduling it along with
the other CFQ groups under their parent. That patchset is available here:
http://lkml.org/lkml/2010/8/30/30
Vivek thought this approach wasn't very intuitive, and that we should treat CFQ queues
and groups at the same level. Here is the new approach to hierarchical
scheduling, based on Vivek's suggestion. The biggest change in CFQ is that
it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
queue scheduling, just as CFQ group scheduling does. But I still give a cfqq a small
jump in vdisktime based on its ioprio; thanks to Vivek for pointing this out. CFQ
queues and CFQ groups now use the same scheduling algorithm.
"use_hierarchy" interface is now added to switch between hierarchical mode
and flat mode. For this time being, this interface only appears in Root Cgroup.
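A rough sketch of how the proposed switch would be used. The file names follow the patch titles, and a scratch directory stands in for the mounted blkio cgroup root (on a real system this would be the blkio mount point, e.g. /cgroup/blkio); this is illustrative, not a transcript of the actual interface.

```shell
# Scratch directory standing in for the mounted blkio cgroup root.
CGROOT=$(mktemp -d)
mkdir -p "$CGROOT/grp1/child"

# Flat mode is the default; hierarchical mode is switched on from the
# root cgroup only, per the description above.
echo 1   > "$CGROOT/blkio.use_hierarchy"
echo 800 > "$CGROOT/grp1/blkio.weight"
echo 400 > "$CGROOT/grp1/child/blkio.weight"

cat "$CGROOT/blkio.use_hierarchy"    # prints 1
```

In flat mode (use_hierarchy = 0), grp1 and child would compete as siblings; with the switch set, child is scheduled under grp1's share.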
V1 -> V2 Changes:
- Raname "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
queue_entity and group_entity, just use cfqe instead.
- Give a newly added cfqq a small vdisktime jump according to its ioprio.
- Make flat mode the default CFQ group scheduling mode.
- Introduce "use_hierarchy" interface.
- Update the blkio cgroup documentation.
[PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
[PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
[PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
[PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group
[PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
[PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality
[PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
[PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
Benchmarks:
I used Vivek's iostest to run some benchmarks on my box. I tested different workloads and
didn't see any performance drop compared to the vanilla kernel. The attached files contain
performance numbers for the vanilla kernel, the patched kernel in flat mode, and the patched
kernel in hierarchical mode.
[-- Attachment #2: patched-flat.log --]
[-- Type: text/plain, Size: 5424 bytes --]
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 294 692 1101 1526 3613
brr 1 2 176 420 755 1281 2632
brr 1 4 160 323 583 1253 2319
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 380 738 1092 1439 3649
brr 1 2 171 413 733 1242 2559
brr 1 4 188 350 665 1193 2396
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6856 11480 17644 22647 58627
bsr 1 2 2592 5409 8464 13300 29765
bsr 1 4 2502 4635 7640 12909 27686
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6913 11643 17843 22909 59308
bsr 1 2 6682 11234 15527 19410 52853
bsr 1 4 5209 10882 15002 18167 49260
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 298 701 1117 1538 3654
drr 1 2 190 372 731 1244 2537
drr 1 4 147 322 563 1143 2175
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 370 713 1050 1416 3549
drr 1 2 192 434 755 1265 2646
drr 1 4 157 333 677 1159 2326
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 595 1272 2007 2737 6611
drw 1 2 269 690 1407 1953 4319
drw 1 4 145 396 978 1752 3271
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 604 1310 1827 2778 6519
drw 1 2 287 723 1368 1887 4265
drw 1 4 170 407 979 1575 3131
[-- Attachment #3: patched-hier.log --]
[-- Type: text/plain, Size: 5424 bytes --]
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 287 690 1096 1506 3579
brr 1 2 189 404 800 1283 2676
brr 1 4 141 317 557 1106 2121
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 386 715 1071 1437 3609
brr 1 2 187 401 717 1258 2563
brr 1 4 296 767 1553 32 2648
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6971 11459 17789 23158 59377
bsr 1 2 2592 5536 8679 13389 30196
bsr 1 4 2194 4635 7820 13984 28633
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6851 11588 17459 22297 58195
bsr 1 2 6814 11534 16141 20426 54915
bsr 1 4 5118 10741 13994 17661 47514
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 297 689 1097 1522 3605
drr 1 2 175 426 757 1277 2635
drr 1 4 150 330 604 1100 2184
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 379 735 1077 1436 3627
drr 1 2 190 404 760 1247 2601
drr 1 4 155 333 692 1044 2224
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 566 1293 2001 2686 6546
drw 1 2 225 662 1233 1902 4022
drw 1 4 147 379 922 1761 3209
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 579 1226 2020 2823 6648
drw 1 2 276 689 1288 2068 4321
drw 1 4 183 399 798 2113 3493
[-- Attachment #4: vanilla.log --]
[-- Type: text/plain, Size: 5424 bytes --]
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 289 684 1098 1508 3579
brr 1 2 178 421 765 1228 2592
brr 1 4 144 301 585 1094 2124
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=brr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[brr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
brr 1 1 375 734 1081 1434 3624
brr 1 2 172 397 700 1201 2470
brr 1 4 154 316 573 1087 2130
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6818 11510 17820 23239 59387
bsr 1 2 2643 5502 8728 13329 30202
bsr 1 4 2166 4785 7344 12954 27249
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=bsr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[bsr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
bsr 1 1 6979 11629 17782 23064 59454
bsr 1 2 6803 11274 15865 20024 53966
bsr 1 4 5292 10674 14504 17674 48144
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 298 694 1116 1540 3648
drr 1 2 183 405 721 1197 2506
drr 1 4 151 296 553 1119 2119
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drr iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drr] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drr 1 1 367 724 1078 1433 3602
drr 1 2 184 418 744 1245 2591
drr 1 4 157 295 562 1122 2136
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=0 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 582 1271 1948 2754 6555
drw 1 2 277 700 1294 1970 4241
drw 1 4 172 345 928 1585 3030
Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
GROUPMODE=1 NRGRP=4
DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
Workload=drw iosched=cfq Filesz=512M bs=32k
group_isolation=1 slice_idle=8 group_idle=8 quantum=8
=========================================================================
AVERAGE[drw] [bw in KB/s]
-------
job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
--- --- -- -----------------------------------
drw 1 1 586 1296 1888 2739 6509
drw 1 2 294 749 1360 1931 4334
drw 1 4 156 337 814 1806 3113
* [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
@ 2010-12-13 1:44 ` Gui Jianfeng
2010-12-13 15:44 ` Vivek Goyal
2010-12-13 1:44 ` [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group Gui Jianfeng
` (6 subsequent siblings)
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:44 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Introduce cfq_entity for CFQ queue
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/cfq-iosched.c | 125 +++++++++++++++++++++++++++++++++-----------------
1 files changed, 82 insertions(+), 43 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5d0349d..9b07a24 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -91,20 +91,31 @@ struct cfq_rb_root {
#define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
.count = 0, .min_vdisktime = 0, }
+
+/*
+ * This is the CFQ queue scheduling entity, which is scheduled on a service tree.
+ */
+struct cfq_entity {
+ /* service tree */
+ struct cfq_rb_root *service_tree;
+ /* service_tree member */
+ struct rb_node rb_node;
+ /* service_tree key, represents the position on the tree */
+ unsigned long rb_key;
+};
+
/*
* Per process-grouping structure
*/
struct cfq_queue {
+ /* The schedule entity */
+ struct cfq_entity cfqe;
/* reference count */
atomic_t ref;
/* various state flags, see below */
unsigned int flags;
/* parent cfq_data */
struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
/* prio tree member */
struct rb_node p_node;
/* prio tree root we belong to, if any */
@@ -143,7 +154,6 @@ struct cfq_queue {
u32 seek_history;
sector_t last_request_pos;
- struct cfq_rb_root *service_tree;
struct cfq_queue *new_cfqq;
struct cfq_group *cfqg;
struct cfq_group *orig_cfqg;
@@ -302,6 +312,15 @@ struct cfq_data {
struct rcu_head rcu;
};
+static inline struct cfq_queue *
+cfqq_of_entity(struct cfq_entity *cfqe)
+{
+ if (cfqe)
+ return container_of(cfqe, struct cfq_queue,
+ cfqe);
+ return NULL;
+}
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -743,7 +762,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
/*
* The below is leftmost cache rbtree addon
*/
-static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
+static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
{
/* Service tree is empty */
if (!root->count)
@@ -753,7 +772,7 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
root->left = rb_first(&root->rb);
if (root->left)
- return rb_entry(root->left, struct cfq_queue, rb_node);
+ return rb_entry(root->left, struct cfq_entity, rb_node);
return NULL;
}
@@ -1170,21 +1189,24 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool add_front)
{
+ struct cfq_entity *cfqe;
struct rb_node **p, *parent;
- struct cfq_queue *__cfqq;
+ struct cfq_entity *__cfqe;
unsigned long rb_key;
struct cfq_rb_root *service_tree;
int left;
int new_cfqq = 1;
int group_changed = 0;
+ cfqe = &cfqq->cfqe;
+
#ifdef CONFIG_CFQ_GROUP_IOSCHED
if (!cfqd->cfq_group_isolation
&& cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
&& cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
/* Move this cfq to root group */
cfq_log_cfqq(cfqd, cfqq, "moving to root group");
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
@@ -1194,7 +1216,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
/* cfqq is sequential now needs to go to its original group */
BUG_ON(cfqq->cfqg != &cfqd->root_group);
- if (!RB_EMPTY_NODE(&cfqq->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
@@ -1209,9 +1231,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (cfq_class_idle(cfqq)) {
rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
- if (parent && parent != &cfqq->rb_node) {
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
- rb_key += __cfqq->rb_key;
+ if (parent && parent != &cfqe->rb_node) {
+ __cfqe = rb_entry(parent,
+ struct cfq_entity,
+ rb_node);
+ rb_key += __cfqe->rb_key;
} else
rb_key += jiffies;
} else if (!add_front) {
@@ -1226,37 +1250,39 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
cfqq->slice_resid = 0;
} else {
rb_key = -HZ;
- __cfqq = cfq_rb_first(service_tree);
- rb_key += __cfqq ? __cfqq->rb_key : jiffies;
+ __cfqe = cfq_rb_first(service_tree);
+ rb_key += __cfqe ? __cfqe->rb_key : jiffies;
}
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
new_cfqq = 0;
/*
* same position, nothing more to do
*/
- if (rb_key == cfqq->rb_key &&
- cfqq->service_tree == service_tree)
+ if (rb_key == cfqe->rb_key &&
+ cfqe->service_tree == service_tree)
return;
- cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
- cfqq->service_tree = NULL;
+ cfq_rb_erase(&cfqe->rb_node,
+ cfqe->service_tree);
+ cfqe->service_tree = NULL;
}
left = 1;
parent = NULL;
- cfqq->service_tree = service_tree;
+ cfqe->service_tree = service_tree;
p = &service_tree->rb.rb_node;
while (*p) {
struct rb_node **n;
parent = *p;
- __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
+ __cfqe = rb_entry(parent, struct cfq_entity,
+ rb_node);
/*
* sort by key, that represents service time.
*/
- if (time_before(rb_key, __cfqq->rb_key))
+ if (time_before(rb_key, __cfqe->rb_key))
n = &(*p)->rb_left;
else {
n = &(*p)->rb_right;
@@ -1267,11 +1293,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
if (left)
- service_tree->left = &cfqq->rb_node;
+ service_tree->left = &cfqe->rb_node;
- cfqq->rb_key = rb_key;
- rb_link_node(&cfqq->rb_node, parent, p);
- rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+ cfqe->rb_key = rb_key;
+ rb_link_node(&cfqe->rb_node, parent, p);
+ rb_insert_color(&cfqe->rb_node, &service_tree->rb);
service_tree->count++;
if ((add_front || !new_cfqq) && !group_changed)
return;
@@ -1373,13 +1399,17 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
*/
static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);
- if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
- cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
- cfqq->service_tree = NULL;
+ cfqe = &cfqq->cfqe;
+
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ cfq_rb_erase(&cfqe->rb_node,
+ cfqe->service_tree);
+ cfqe->service_tree = NULL;
}
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
@@ -1707,13 +1737,13 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
return NULL;
if (RB_EMPTY_ROOT(&service_tree->rb))
return NULL;
- return cfq_rb_first(service_tree);
+ return cfqq_of_entity(cfq_rb_first(service_tree));
}
static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
{
struct cfq_group *cfqg;
- struct cfq_queue *cfqq;
+ struct cfq_entity *cfqe;
int i, j;
struct cfq_rb_root *st;
@@ -1724,9 +1754,11 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
if (!cfqg)
return NULL;
- for_each_cfqg_st(cfqg, i, j, st)
- if ((cfqq = cfq_rb_first(st)) != NULL)
- return cfqq;
+ for_each_cfqg_st(cfqg, i, j, st) {
+ cfqe = cfq_rb_first(st);
+ if (cfqe != NULL)
+ return cfqq_of_entity(cfqe);
+ }
return NULL;
}
@@ -1863,9 +1895,12 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
enum wl_prio_t prio = cfqq_prio(cfqq);
- struct cfq_rb_root *service_tree = cfqq->service_tree;
+ struct cfq_rb_root *service_tree;
+ cfqe = &cfqq->cfqe;
+ service_tree = cfqe->service_tree;
BUG_ON(!service_tree);
BUG_ON(!service_tree->count);
@@ -2075,7 +2110,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
- struct cfq_queue *queue;
+ struct cfq_entity *cfqe;
int i;
bool key_valid = false;
unsigned long lowest_key = 0;
@@ -2083,10 +2118,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
for (i = 0; i <= SYNC_WORKLOAD; ++i) {
/* select the one with lowest rb_key */
- queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
- if (queue &&
- (!key_valid || time_before(queue->rb_key, lowest_key))) {
- lowest_key = queue->rb_key;
+ cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
+ if (cfqe &&
+ (!key_valid ||
+ time_before(cfqe->rb_key, lowest_key))) {
+ lowest_key = cfqe->rb_key;
cur_best = i;
key_valid = true;
}
@@ -2834,7 +2870,10 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
pid_t pid, bool is_sync)
{
- RB_CLEAR_NODE(&cfqq->rb_node);
+ struct cfq_entity *cfqe;
+
+ cfqe = &cfqq->cfqe;
+ RB_CLEAR_NODE(&cfqe->rb_node);
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
@@ -3243,7 +3282,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
/* Allow preemption only if we are idling on sync-noidle tree */
if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
- new_cfqq->service_tree->count == 2 &&
+ new_cfqq->cfqe.service_tree->count == 2 &&
RB_EMPTY_ROOT(&cfqq->sort_list))
return true;
--
1.6.5.2
* [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
2010-12-13 1:44 ` [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue Gui Jianfeng
@ 2010-12-13 1:44 ` Gui Jianfeng
2010-12-13 16:59 ` Vivek Goyal
2010-12-13 1:44 ` [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue Gui Jianfeng
` (5 subsequent siblings)
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:44 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Introduce cfq_entity for CFQ group
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/cfq-iosched.c | 113 ++++++++++++++++++++++++++++++--------------------
1 files changed, 68 insertions(+), 45 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9b07a24..91e9833 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1,5 +1,5 @@
/*
- * CFQ, or complete fairness queueing, disk scheduler.
+ * Cfq, or complete fairness queueing, disk scheduler.
*
* Based on ideas from a previously unfinished io
* scheduler (round robin per-process disk scheduling) and Andrea Arcangeli.
@@ -73,7 +73,8 @@ static DEFINE_IDA(cic_index_ida);
#define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
#define sample_valid(samples) ((samples) > 80)
-#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)
+#define rb_entry_entity(node) rb_entry((node), struct cfq_entity,\
+ rb_node)
/*
* Most of our rbtree usage is for sorting with min extraction, so
@@ -102,6 +103,11 @@ struct cfq_entity {
struct rb_node rb_node;
/* service_tree key, represents the position on the tree */
unsigned long rb_key;
+
+ /* group service_tree key */
+ u64 vdisktime;
+ bool is_group_entity;
+ unsigned int weight;
};
/*
@@ -183,12 +189,8 @@ enum wl_type_t {
/* This is per cgroup per device grouping structure */
struct cfq_group {
- /* group service_tree member */
- struct rb_node rb_node;
-
- /* group service_tree key */
- u64 vdisktime;
- unsigned int weight;
+ /* cfq group sched entity */
+ struct cfq_entity cfqe;
/* number of cfqq currently on this group */
int nr_cfqq;
@@ -315,12 +317,21 @@ struct cfq_data {
static inline struct cfq_queue *
cfqq_of_entity(struct cfq_entity *cfqe)
{
- if (cfqe)
+ if (cfqe && !cfqe->is_group_entity)
return container_of(cfqe, struct cfq_queue,
cfqe);
return NULL;
}
+static inline struct cfq_group *
+cfqg_of_entity(struct cfq_entity *cfqe)
+{
+ if (cfqe && cfqe->is_group_entity)
+ return container_of(cfqe, struct cfq_group,
+ cfqe);
+ return NULL;
+}
+
static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
@@ -548,12 +559,12 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}
-static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
+static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_entity *cfqe)
{
u64 d = delta << CFQ_SERVICE_SHIFT;
d = d * BLKIO_WEIGHT_DEFAULT;
- do_div(d, cfqg->weight);
+ do_div(d, cfqe->weight);
return d;
}
@@ -578,11 +589,11 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
static void update_min_vdisktime(struct cfq_rb_root *st)
{
u64 vdisktime = st->min_vdisktime;
- struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;
if (st->left) {
- cfqg = rb_entry_cfqg(st->left);
- vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
+ cfqe = rb_entry_entity(st->left);
+ vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
}
st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
@@ -613,8 +624,9 @@ static inline unsigned
cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
- return cfq_target_latency * cfqg->weight / st->total_weight;
+ return cfq_target_latency * cfqe->weight / st->total_weight;
}
static inline void
@@ -777,13 +789,13 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}
-static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
+static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
{
if (!root->left)
root->left = rb_first(&root->rb);
if (root->left)
- return rb_entry_cfqg(root->left);
+ return rb_entry_entity(root->left);
return NULL;
}
@@ -840,9 +852,9 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
}
static inline s64
-cfqg_key(struct cfq_rb_root *st, struct cfq_group *cfqg)
+entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
{
- return cfqg->vdisktime - st->min_vdisktime;
+ return entity->vdisktime - st->min_vdisktime;
}
static void
@@ -850,15 +862,16 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
{
struct rb_node **node = &st->rb.rb_node;
struct rb_node *parent = NULL;
- struct cfq_group *__cfqg;
- s64 key = cfqg_key(st, cfqg);
+ struct cfq_entity *__cfqe;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+ s64 key = entity_key(st, cfqe);
int left = 1;
while (*node != NULL) {
parent = *node;
- __cfqg = rb_entry_cfqg(parent);
+ __cfqe = rb_entry_entity(parent);
- if (key < cfqg_key(st, __cfqg))
+ if (key < entity_key(st, __cfqe))
node = &parent->rb_left;
else {
node = &parent->rb_right;
@@ -867,21 +880,22 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
}
if (left)
- st->left = &cfqg->rb_node;
+ st->left = &cfqe->rb_node;
- rb_link_node(&cfqg->rb_node, parent, node);
- rb_insert_color(&cfqg->rb_node, &st->rb);
+ rb_link_node(&cfqe->rb_node, parent, node);
+ rb_insert_color(&cfqe->rb_node, &st->rb);
}
static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
- struct cfq_group *__cfqg;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_entity *__cfqe;
struct rb_node *n;
cfqg->nr_cfqq++;
- if (!RB_EMPTY_NODE(&cfqg->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;
/*
@@ -891,19 +905,20 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
*/
n = rb_last(&st->rb);
if (n) {
- __cfqg = rb_entry_cfqg(n);
- cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+ __cfqe = rb_entry_entity(n);
+ cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
} else
- cfqg->vdisktime = st->min_vdisktime;
+ cfqe->vdisktime = st->min_vdisktime;
__cfq_group_service_tree_add(st, cfqg);
- st->total_weight += cfqg->weight;
+ st->total_weight += cfqe->weight;
}
static void
cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
@@ -913,9 +928,9 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;
cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- st->total_weight -= cfqg->weight;
- if (!RB_EMPTY_NODE(&cfqg->rb_node))
- cfq_rb_erase(&cfqg->rb_node, st);
+ st->total_weight -= cfqe->weight;
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ cfq_rb_erase(&cfqe->rb_node, st);
cfqg->saved_workload_slice = 0;
cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}
@@ -953,6 +968,7 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;
+ struct cfq_entity *cfqe = &cfqg->cfqe;
BUG_ON(nr_sync < 0);
used_sl = charge = cfq_cfqq_slice_usage(cfqq);
@@ -963,8 +979,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
charge = cfqq->allocated_slice;
/* Can't update vdisktime while group is on service tree */
- cfq_rb_erase(&cfqg->rb_node, st);
- cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
+ cfq_rb_erase(&cfqe->rb_node, st);
+ cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
__cfq_group_service_tree_add(st, cfqg);
/* This group is being expired. Save the context */
@@ -976,8 +992,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
} else
cfqg->saved_workload_slice = 0;
- cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
- st->min_vdisktime);
+ cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
+ cfqe->vdisktime, st->min_vdisktime);
cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
" sect=%u", used_sl, cfqq->slice_dispatch, charge,
iops_mode(cfqd), cfqq->nr_sectors);
@@ -996,7 +1012,7 @@ static inline struct cfq_group *cfqg_of_blkg(struct blkio_group *blkg)
void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
unsigned int weight)
{
- cfqg_of_blkg(blkg)->weight = weight;
+ cfqg_of_blkg(blkg)->cfqe.weight = weight;
}
static struct cfq_group *
@@ -1025,7 +1041,9 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->rb_node);
+ RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
+
+ cfqg->cfqe.is_group_entity = true;
/*
* Take the initial reference that will be released on destroy
@@ -1049,7 +1067,7 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
0);
- cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
/* Add group on cfqd list */
hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
@@ -2216,10 +2234,13 @@ static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;
if (RB_EMPTY_ROOT(&st->rb))
return NULL;
- cfqg = cfq_rb_first_group(st);
+ cfqe = cfq_rb_first_entity(st);
+ cfqg = cfqg_of_entity(cfqe);
+ BUG_ON(!cfqg);
update_min_vdisktime(st);
return cfqg;
}
@@ -2877,6 +2898,7 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
RB_CLEAR_NODE(&cfqq->p_node);
INIT_LIST_HEAD(&cfqq->fifo);
+ cfqe->is_group_entity = false;
atomic_set(&cfqq->ref, 0);
cfqq->cfqd = cfqd;
@@ -3909,10 +3931,11 @@ static void *cfq_init_queue(struct request_queue *q)
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->rb_node);
+ RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
/* Give preference to root group over other groups */
- cfqg->weight = 2*BLKIO_WEIGHT_DEFAULT;
+ cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
+ cfqg->cfqe.is_group_entity = true;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
--
1.6.5.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
2010-12-13 1:44 ` [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue Gui Jianfeng
2010-12-13 1:44 ` [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group Gui Jianfeng
@ 2010-12-13 1:44 ` Gui Jianfeng
2010-12-13 16:59 ` Vivek Goyal
2010-12-13 1:44 ` [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group Gui Jianfeng
` (4 subsequent siblings)
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:44 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Introduce vdisktime and io weight for CFQ queue scheduling. io priority now
maps to the weight range [100, 1000]. This patch also gets rid of the
cfq_slice_offset() logic and uses the same scheduling algorithm that CFQ
groups use, while still giving a newly added cfqq a small vdisktime boost
according to its ioprio. This prepares for scheduling CFQ queues and groups
on the same service tree.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/cfq-iosched.c | 196 ++++++++++++++++++++++++++++++++++++---------------
1 files changed, 139 insertions(+), 57 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 91e9833..30d19c0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -101,10 +101,7 @@ struct cfq_entity {
struct cfq_rb_root *service_tree;
/* service_tree member */
struct rb_node rb_node;
- /* service_tree key, represent the position on the tree */
- unsigned long rb_key;
-
- /* group service_tree key */
+ /* service_tree key */
u64 vdisktime;
bool is_group_entity;
unsigned int weight;
@@ -116,6 +113,8 @@ struct cfq_entity {
struct cfq_queue {
/* The schedule entity */
struct cfq_entity cfqe;
+ /* Reposition time */
+ unsigned long reposition_time;
/* reference count */
atomic_t ref;
/* various state flags, see below */
@@ -314,6 +313,22 @@ struct cfq_data {
struct rcu_head rcu;
};
+/*
+ * Map io priority(7 ~ 0) to io weight(100 ~ 1000)
+ */
+static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
+{
+ unsigned int step;
+
+ BUG_ON(ioprio >= IOPRIO_BE_NR);
+
+ step = (BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN) / (IOPRIO_BE_NR - 1);
+ if (ioprio == 0)
+ return BLKIO_WEIGHT_MAX;
+
+ return BLKIO_WEIGHT_MIN + (IOPRIO_BE_NR - ioprio - 1) * step;
+}
+
static inline struct cfq_queue *
cfqq_of_entity(struct cfq_entity *cfqe)
{
@@ -841,16 +856,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
return cfq_choose_req(cfqd, next, prev, blk_rq_pos(last));
}
-static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
- struct cfq_queue *cfqq)
-{
- /*
- * just an approximation, should be ok.
- */
- return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
- cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
-}
-
static inline s64
entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
{
@@ -1199,6 +1204,16 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
#endif /* GROUP_IOSCHED */
+static inline u64 cfq_get_boost(struct cfq_data *cfqd,
+ struct cfq_entity *cfqe)
+{
+ u64 d = cfqd->cfq_slice[1] << CFQ_SERVICE_SHIFT;
+
+ d = d * BLKIO_WEIGHT_DEFAULT;
+ do_div(d, BLKIO_WEIGHT_MAX - cfqe->weight + BLKIO_WEIGHT_MIN);
+ return d;
+}
+
/*
* The cfqd->service_trees holds all pending cfq_queue's that have
* requests waiting to be processed. It is sorted in the order that
@@ -1210,13 +1225,14 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
struct cfq_entity *cfqe;
struct rb_node **p, *parent;
struct cfq_entity *__cfqe;
- unsigned long rb_key;
- struct cfq_rb_root *service_tree;
+ struct cfq_rb_root *service_tree, *orig_st;
int left;
int new_cfqq = 1;
int group_changed = 0;
+ s64 key;
cfqe = &cfqq->cfqe;
+ orig_st = cfqe->service_tree;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
if (!cfqd->cfq_group_isolation
@@ -1224,8 +1240,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
/* Move this cfq to root group */
cfq_log_cfqq(cfqd, cfqq, "moving to root group");
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ /*
+ * Group changed, dequeue this CFQ queue from the
+ * original service tree.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+ }
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
atomic_inc(&cfqd->root_group.ref);
@@ -1234,8 +1257,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
&& cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
/* cfqq is sequential now needs to go to its original group */
BUG_ON(cfqq->cfqg != &cfqd->root_group);
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_group_service_tree_del(cfqd, cfqq->cfqg);
+ /*
+ * Group changed, dequeue this CFQ queue from the
+ * original service tree.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+ }
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
cfqq->orig_cfqg = NULL;
@@ -1246,50 +1276,73 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
cfqq_type(cfqq));
+ /*
+ * For the time being, put the newly added CFQ queue at the end of the
+ * service tree.
+ */
+ if (RB_EMPTY_NODE(&cfqe->rb_node)) {
+ /*
+ * If this CFQ queue moved to another group, the original
+ * vdisktime no longer makes sense, so reset it here.
+ */
+ parent = rb_last(&service_tree->rb);
+ if (parent) {
+ u64 boost;
+ s64 __vdisktime;
+
+ __cfqe = rb_entry_entity(parent);
+ cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
+
+ /* Give some vdisktime boost according to its weight */
+ boost = cfq_get_boost(cfqd, cfqe);
+ __vdisktime = cfqe->vdisktime - boost;
+ if (__vdisktime > 0)
+ cfqe->vdisktime = __vdisktime;
+ else
+ cfqe->vdisktime = 0;
+ } else
+ cfqe->vdisktime = service_tree->min_vdisktime;
+
+ goto insert;
+ }
+ /*
+ * Ok, we get here, this CFQ queue is on the service tree, dequeue it
+ * firstly.
+ */
+ cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
+ orig_st->total_weight -= cfqe->weight;
+
+ new_cfqq = 0;
+
if (cfq_class_idle(cfqq)) {
- rb_key = CFQ_IDLE_DELAY;
parent = rb_last(&service_tree->rb);
if (parent && parent != &cfqe->rb_node) {
__cfqe = rb_entry(parent,
- struct cfq_entity,
- rb_node);
- rb_key += __cfqe->rb_key;
+ struct cfq_entity,
+ rb_node);
+ cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
} else
- rb_key += jiffies;
+ cfqe->vdisktime = service_tree->min_vdisktime;
} else if (!add_front) {
/*
- * Get our rb key offset. Subtract any residual slice
- * value carried from last service. A negative resid
- * count indicates slice overrun, and this should position
- * the next service time further away in the tree.
+ * We charge the CFQ queue for the time it has run, and
+ * reposition it on the service tree.
*/
- rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
- rb_key -= cfqq->slice_resid;
- cfqq->slice_resid = 0;
- } else {
- rb_key = -HZ;
- __cfqe = cfq_rb_first(service_tree);
- rb_key += __cfqe ? __cfqe->rb_key : jiffies;
- }
+ unsigned int used_sl;
- if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
- new_cfqq = 0;
- /*
- * same position, nothing more to do
- */
- if (rb_key == cfqe->rb_key &&
- cfqe->service_tree == service_tree)
- return;
-
- cfq_rb_erase(&cfqe->rb_node,
- cfqe->service_tree);
- cfqe->service_tree = NULL;
+ used_sl = cfq_cfqq_slice_usage(cfqq);
+ cfqe->vdisktime += cfq_scale_slice(used_sl, cfqe);
+ } else {
+ cfqe->vdisktime = service_tree->min_vdisktime;
}
+insert:
left = 1;
parent = NULL;
cfqe->service_tree = service_tree;
p = &service_tree->rb.rb_node;
+ key = entity_key(service_tree, cfqe);
while (*p) {
struct rb_node **n;
@@ -1300,7 +1353,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
/*
* sort by key, that represents service time.
*/
- if (time_before(rb_key, __cfqe->rb_key))
+ if (key < entity_key(service_tree, __cfqe))
n = &(*p)->rb_left;
else {
n = &(*p)->rb_right;
@@ -1313,10 +1366,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
if (left)
service_tree->left = &cfqe->rb_node;
- cfqe->rb_key = rb_key;
rb_link_node(&cfqe->rb_node, parent, p);
rb_insert_color(&cfqe->rb_node, &service_tree->rb);
+ update_min_vdisktime(service_tree);
service_tree->count++;
+ service_tree->total_weight += cfqe->weight;
+ cfqq->reposition_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
@@ -1418,15 +1473,19 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
{
struct cfq_entity *cfqe;
+ struct cfq_rb_root *service_tree;
+
cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
BUG_ON(!cfq_cfqq_on_rr(cfqq));
cfq_clear_cfqq_on_rr(cfqq);
cfqe = &cfqq->cfqe;
+ service_tree = cfqe->service_tree;
if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
cfq_rb_erase(&cfqe->rb_node,
cfqe->service_tree);
+ service_tree->total_weight -= cfqe->weight;
cfqe->service_tree = NULL;
}
if (cfqq->p_root) {
@@ -2125,24 +2184,34 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
}
}
+/*
+ * The time when a CFQ queue is put onto a service tree is recorded in
+ * cfqq->reposition_time. Currently, we check the first priority CFQ queues
+ * on each service tree, and select the workload type that contains the
+ * lowest reposition_time CFQ queue among them.
+ */
static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
struct cfq_entity *cfqe;
+ struct cfq_queue *cfqq;
+ unsigned long lowest_start_time;
int i;
- bool key_valid = false;
- unsigned long lowest_key = 0;
+ bool time_valid = false;
enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
+ /*
+ * TODO: We may take io priority into account when choosing a workload
+ * type. But for the time being just make use of reposition_time only.
+ */
for (i = 0; i <= SYNC_WORKLOAD; ++i) {
- /* select the one with lowest rb_key */
cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
- if (cfqe &&
- (!key_valid ||
- time_before(cfqe->rb_key, lowest_key))) {
- lowest_key = cfqe->rb_key;
+ cfqq = cfqq_of_entity(cfqe);
+ if (cfqe && (!time_valid ||
+ cfqq->reposition_time < lowest_start_time)) {
+ lowest_start_time = cfqq->reposition_time;
cur_best = i;
- key_valid = true;
+ time_valid = true;
}
}
@@ -2814,10 +2883,13 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
{
struct task_struct *tsk = current;
int ioprio_class;
+ struct cfq_entity *cfqe;
if (!cfq_cfqq_prio_changed(cfqq))
return;
+ cfqe = &cfqq->cfqe;
+
ioprio_class = IOPRIO_PRIO_CLASS(ioc->ioprio);
switch (ioprio_class) {
default:
@@ -2844,6 +2916,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
break;
}
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
+
/*
* keep track of original prio settings in case we have to temporarily
* elevate the priority of this queue
@@ -3578,6 +3652,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
*/
static void cfq_prio_boost(struct cfq_queue *cfqq)
{
+ struct cfq_entity *cfqe;
+
+ cfqe = &cfqq->cfqe;
if (has_fs_excl()) {
/*
* boost idle prio on transactions that would lock out other
@@ -3594,6 +3671,11 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
cfqq->ioprio_class = cfqq->org_ioprio_class;
cfqq->ioprio = cfqq->org_ioprio;
}
+
+ /*
+ * update the io weight if io priority gets changed.
+ */
+ cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
}
static inline int __cfq_may_queue(struct cfq_queue *cfqq)
--
1.6.5.2
* [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group.
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
` (2 preceding siblings ...)
2010-12-13 1:44 ` [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue Gui Jianfeng
@ 2010-12-13 1:44 ` Gui Jianfeng
2010-12-13 22:11 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
` (3 subsequent siblings)
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:44 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Extract the common service tree handling code shared by CFQ queues and
CFQ groups. This helps when CFQ queues and CFQ groups are scheduled
together.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/cfq-iosched.c | 87 +++++++++++++++++++++------------------------------
1 files changed, 36 insertions(+), 51 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 30d19c0..6486956 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -863,12 +863,11 @@ entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
}
static void
-__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
+__cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
struct rb_node **node = &st->rb.rb_node;
struct rb_node *parent = NULL;
struct cfq_entity *__cfqe;
- struct cfq_entity *cfqe = &cfqg->cfqe;
s64 key = entity_key(st, cfqe);
int left = 1;
@@ -892,6 +891,14 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
}
static void
+cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ __cfq_entity_service_tree_add(st, cfqe);
+ st->count++;
+ st->total_weight += cfqe->weight;
+}
+
+static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
struct cfq_rb_root *st = &cfqd->grp_service_tree;
@@ -915,8 +922,23 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
} else
cfqe->vdisktime = st->min_vdisktime;
- __cfq_group_service_tree_add(st, cfqg);
- st->total_weight += cfqe->weight;
+ cfq_entity_service_tree_add(st, cfqe);
+}
+
+static void
+__cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ cfq_rb_erase(&cfqe->rb_node, st);
+}
+
+static void
+cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
+{
+ if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
+ __cfq_entity_service_tree_del(st, cfqe);
+ st->total_weight -= cfqe->weight;
+ cfqe->service_tree = NULL;
+ }
}
static void
@@ -933,9 +955,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
return;
cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- st->total_weight -= cfqe->weight;
- if (!RB_EMPTY_NODE(&cfqe->rb_node))
- cfq_rb_erase(&cfqe->rb_node, st);
+ cfq_entity_service_tree_del(st, cfqe);
cfqg->saved_workload_slice = 0;
cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}
@@ -984,9 +1004,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
charge = cfqq->allocated_slice;
/* Can't update vdisktime while group is on service tree */
- cfq_rb_erase(&cfqe->rb_node, st);
+ __cfq_entity_service_tree_del(st, cfqe);
cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
- __cfq_group_service_tree_add(st, cfqg);
+ __cfq_entity_service_tree_add(st, cfqe);
/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1223,13 +1243,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
bool add_front)
{
struct cfq_entity *cfqe;
- struct rb_node **p, *parent;
+ struct rb_node *parent;
struct cfq_entity *__cfqe;
struct cfq_rb_root *service_tree, *orig_st;
- int left;
int new_cfqq = 1;
int group_changed = 0;
- s64 key;
cfqe = &cfqq->cfqe;
orig_st = cfqe->service_tree;
@@ -1246,8 +1264,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Group changed, dequeue this CFQ queue from the
* original service tree.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);
}
cfqq->orig_cfqg = cfqq->cfqg;
cfqq->cfqg = &cfqd->root_group;
@@ -1263,8 +1280,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Group changed, dequeue this CFQ queue from the
* original service tree.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);
}
cfq_put_cfqg(cfqq->cfqg);
cfqq->cfqg = cfqq->orig_cfqg;
@@ -1310,8 +1326,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
* Ok, we get here, this CFQ queue is on the service tree, dequeue it
* firstly.
*/
- cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
- orig_st->total_weight -= cfqe->weight;
+ cfq_entity_service_tree_del(orig_st, cfqe);
new_cfqq = 0;
@@ -1338,39 +1353,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
}
insert:
- left = 1;
- parent = NULL;
cfqe->service_tree = service_tree;
- p = &service_tree->rb.rb_node;
- key = entity_key(service_tree, cfqe);
- while (*p) {
- struct rb_node **n;
-
- parent = *p;
- __cfqe = rb_entry(parent, struct cfq_entity,
- rb_node);
-
- /*
- * sort by key, that represents service time.
- */
- if (key < entity_key(service_tree, __cfqe))
- n = &(*p)->rb_left;
- else {
- n = &(*p)->rb_right;
- left = 0;
- }
- p = n;
- }
+ /* Add cfqq onto service tree. */
+ cfq_entity_service_tree_add(service_tree, cfqe);
- if (left)
- service_tree->left = &cfqe->rb_node;
-
- rb_link_node(&cfqe->rb_node, parent, p);
- rb_insert_color(&cfqe->rb_node, &service_tree->rb);
update_min_vdisktime(service_tree);
- service_tree->count++;
- service_tree->total_weight += cfqe->weight;
cfqq->reposition_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
@@ -1483,10 +1471,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
service_tree = cfqe->service_tree;
if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
- cfq_rb_erase(&cfqe->rb_node,
- cfqe->service_tree);
- service_tree->total_weight -= cfqe->weight;
- cfqe->service_tree = NULL;
+ cfq_entity_service_tree_del(service_tree, cfqe);
}
if (cfqq->p_root) {
rb_erase(&cfqq->p_node, cfqq->p_root);
--
1.6.5.2
* [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
` (3 preceding siblings ...)
2010-12-13 1:44 ` [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group Gui Jianfeng
@ 2010-12-13 1:45 ` Gui Jianfeng
2010-12-14 3:49 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality Gui Jianfeng
` (2 subsequent siblings)
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:45 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
This patch makes CFQ queue and CFQ group schedules at the same level.
Consider the following hierarchy:
        Root
       /  |  \
     q1   q2  G1
             /  \
            q3  G2
q1, q2 and q3 are CFQ queues; G1 and G2 are CFQ groups. With this patch, q1,
q2 and G1 are scheduled on the same service tree in the root CFQ group, while
q3 and G2 are scheduled under G1. Note that, for the time being, a CFQ group
is treated as a "BE and SYNC" workload and is put on the "BE and SYNC" service
tree, so service differentiation only happens within that tree. Later, we may
introduce an "IO Class" for CFQ groups.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++++++----------------
1 files changed, 321 insertions(+), 152 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 6486956..d90627e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -105,6 +105,9 @@ struct cfq_entity {
u64 vdisktime;
bool is_group_entity;
unsigned int weight;
+ struct cfq_entity *parent;
+ /* Reposition time */
+ unsigned long reposition_time;
};
/*
@@ -113,8 +116,6 @@ struct cfq_entity {
struct cfq_queue {
/* The schedule entity */
struct cfq_entity cfqe;
- /* Reposition time */
- unsigned long reposition_time;
/* reference count */
atomic_t ref;
/* various state flags, see below */
@@ -194,6 +195,9 @@ struct cfq_group {
/* number of cfqq currently on this group */
int nr_cfqq;
+ /* number of sub cfq groups */
+ int nr_subgp;
+
/*
* Per group busy queus average. Useful for workload slice calc. We
* create the array for each prio class but at run time it is used
@@ -229,8 +233,6 @@ struct cfq_group {
*/
struct cfq_data {
struct request_queue *queue;
- /* Root service tree for cfq_groups */
- struct cfq_rb_root grp_service_tree;
struct cfq_group root_group;
/*
@@ -347,8 +349,6 @@ cfqg_of_entity(struct cfq_entity *cfqe)
return NULL;
}
-static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
-
static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
enum wl_prio_t prio,
enum wl_type_t type)
@@ -638,10 +638,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
static inline unsigned
cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_rb_root *st = cfqe->service_tree;
- return cfq_target_latency * cfqe->weight / st->total_weight;
+ if (st)
+ return cfq_target_latency * cfqe->weight
+ / st->total_weight;
+ else
+ /* If this is the root group, give it a full slice. */
+ return cfq_target_latency;
}
static inline void
@@ -804,17 +809,6 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
return NULL;
}
-static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
-{
- if (!root->left)
- root->left = rb_first(&root->rb);
-
- if (root->left)
- return rb_entry_entity(root->left);
-
- return NULL;
-}
-
static void rb_erase_init(struct rb_node *n, struct rb_root *root)
{
rb_erase(n, root);
@@ -888,12 +882,15 @@ __cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
rb_link_node(&cfqe->rb_node, parent, node);
rb_insert_color(&cfqe->rb_node, &st->rb);
+
+ update_min_vdisktime(st);
}
static void
cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
__cfq_entity_service_tree_add(st, cfqe);
+ cfqe->reposition_time = jiffies;
st->count++;
st->total_weight += cfqe->weight;
}
@@ -901,34 +898,57 @@ cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
static void
cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
- struct cfq_entity *__cfqe;
struct rb_node *n;
+ struct cfq_entity *entity;
+ struct cfq_rb_root *st;
+ struct cfq_group *__cfqg;
cfqg->nr_cfqq++;
+
+ /*
+ * Root group doesn't belong to any service tree
+ */
+ if (cfqg == &cfqd->root_group)
+ return;
+
if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;
/*
- * Currently put the group at the end. Later implement something
- * so that groups get lesser vtime based on their weights, so that
- * if group does not loose all if it was not continously backlogged.
+ * Enqueue this group and its ancestors onto their service tree.
*/
- n = rb_last(&st->rb);
- if (n) {
- __cfqe = rb_entry_entity(n);
- cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
- } else
- cfqe->vdisktime = st->min_vdisktime;
+ while (cfqe && cfqe->parent) {
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ return;
+
+ /*
+ * Currently put the group at the end. Later, implement
+ * something so that groups get a smaller vtime based on their
+ * weights, so that a group does not lose everything if it was
+ * not continuously backlogged.
+ */
+ st = cfqe->service_tree;
+ n = rb_last(&st->rb);
+ if (n) {
+ entity = rb_entry_entity(n);
+ cfqe->vdisktime = entity->vdisktime +
+ CFQ_IDLE_DELAY;
+ } else
+ cfqe->vdisktime = st->min_vdisktime;
- cfq_entity_service_tree_add(st, cfqe);
+ cfq_entity_service_tree_add(st, cfqe);
+ cfqe = cfqe->parent;
+ __cfqg = cfqg_of_entity(cfqe);
+ __cfqg->nr_subgp++;
+ }
}
static void
__cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
{
cfq_rb_erase(&cfqe->rb_node, st);
+ update_min_vdisktime(st);
}
static void
@@ -937,27 +957,47 @@ cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
__cfq_entity_service_tree_del(st, cfqe);
st->total_weight -= cfqe->weight;
- cfqe->service_tree = NULL;
}
}
static void
cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
struct cfq_entity *cfqe = &cfqg->cfqe;
+ struct cfq_group *__cfqg, *p_cfqg;
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
+ /*
+ * Root group doesn't belong to any service tree
+ */
+ if (cfqg == &cfqd->root_group)
+ return;
+
/* If there are other cfq queues under this group, don't delete it */
if (cfqg->nr_cfqq)
return;
- cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
- cfq_entity_service_tree_del(st, cfqe);
- cfqg->saved_workload_slice = 0;
- cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
+ /* If child group exists, don't dequeue it */
+ if (cfqg->nr_subgp)
+ return;
+
+ /*
+ * Dequeue this group and its ancestors from their service tree.
+ */
+ while (cfqe && cfqe->parent) {
+ __cfqg = cfqg_of_entity(cfqe);
+ p_cfqg = cfqg_of_entity(cfqe->parent);
+ cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
+ cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
+ cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
+ __cfqg->saved_workload_slice = 0;
+ cfqe = cfqe->parent;
+ p_cfqg->nr_subgp--;
+ if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+ return;
+ }
}
static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -989,7 +1029,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
struct cfq_queue *cfqq)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
unsigned int used_sl, charge;
int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
- cfqg->service_tree_idle.count;
@@ -1003,10 +1042,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
charge = cfqq->allocated_slice;
- /* Can't update vdisktime while group is on service tree */
- __cfq_entity_service_tree_del(st, cfqe);
- cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
- __cfq_entity_service_tree_add(st, cfqe);
+ /*
+ * Update the vdisktime on the whole chain.
+ */
+ while (cfqe && cfqe->parent) {
+ struct cfq_rb_root *st = cfqe->service_tree;
+
+ /* Can't update vdisktime while group is on service tree */
+ __cfq_entity_service_tree_del(st, cfqe);
+ cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
+ __cfq_entity_service_tree_add(st, cfqe);
+ st->count++;
+ cfqe->reposition_time = jiffies;
+ cfqe = cfqe->parent;
+ }
+
/* This group is being expired. Save the context */
if (time_after(cfqd->workload_expires, jiffies)) {
@@ -1018,7 +1068,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqg->saved_workload_slice = 0;
cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu",
- cfqe->vdisktime, st->min_vdisktime);
+ cfqg->cfqe.vdisktime,
+ cfqg->cfqe.service_tree->min_vdisktime);
cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
" sect=%u", used_sl, cfqq->slice_dispatch, charge,
iops_mode(cfqd), cfqq->nr_sectors);
@@ -1040,35 +1091,27 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg_of_blkg(blkg)->cfqe.weight = weight;
}
-static struct cfq_group *
-cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+static void init_cfqe(struct blkio_cgroup *blkcg,
+ struct cfq_group *cfqg)
+{
+ struct cfq_entity *cfqe = &cfqg->cfqe;
+
+ cfqe->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+ RB_CLEAR_NODE(&cfqe->rb_node);
+ cfqe->is_group_entity = true;
+ cfqe->parent = NULL;
+}
+
+static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
+ struct cfq_group *cfqg)
{
- struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
- struct cfq_group *cfqg = NULL;
- void *key = cfqd;
int i, j;
struct cfq_rb_root *st;
- struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
unsigned int major, minor;
-
- cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfqg->blkg.dev = MKDEV(major, minor);
- goto done;
- }
- if (cfqg || !create)
- goto done;
-
- cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
- if (!cfqg)
- goto done;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
- RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
-
- cfqg->cfqe.is_group_entity = true;
/*
* Take the initial reference that will be released on destroy
@@ -1078,24 +1121,119 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
*/
atomic_set(&cfqg->ref, 1);
+ /* Add group onto cgroup list */
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
+ MKDEV(major, minor));
+ /* Initialize group entity */
+ init_cfqe(blkcg, cfqg);
+ /* Add group on cfqd list */
+ hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+}
+
+static void cfq_destroy_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg);
+
+static void uninit_cfqg(struct cfq_data *cfqd, struct cfq_group *cfqg)
+{
+ if (!cfq_blkiocg_del_blkio_group(&cfqg->blkg))
+ cfq_destroy_cfqg(cfqd, cfqg);
+}
+
+static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
+ struct cfq_group *p_cfqg)
+{
+ struct cfq_entity *cfqe, *p_cfqe;
+
+ cfqe = &cfqg->cfqe;
+
+ p_cfqe = &p_cfqg->cfqe;
+
+ cfqe->parent = p_cfqe;
+
/*
- * Add group onto cgroup list. It might happen that bdi->dev is
- * not initiliazed yet. Initialize this new group without major
- * and minor info and this info will be filled in once a new thread
- * comes for IO. See code above.
+ * Currently, just put cfq group entity on "BE:SYNC" workload
+ * service tree.
*/
- if (bdi->dev) {
- sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- MKDEV(major, minor));
- } else
- cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
- 0);
+ cfqe->service_tree = service_tree_for(p_cfqg, BE_WORKLOAD,
+ SYNC_WORKLOAD);
+ /* child reference */
+ atomic_inc(&p_cfqg->ref);
+}
- cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct blkio_cgroup *p_blkcg;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;
+ struct cfq_group *cfqg, *p_cfqg;
+ void *key = cfqd;
+ int ret;
- /* Add group on cfqd list */
- hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg) {
+ if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ }
+ /* chain has already been built */
+ return 0;
+ }
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+ if (!cfqg)
+ return -1;
+
+ init_cfqg(cfqd, blkcg, cfqg);
+
+ /* Already at the top group */
+ if (!cgroup->parent)
+ return 0;
+
+ /* Allocate CFQ groups on the chain */
+ ret = cfqg_chain_alloc(cfqd, cgroup->parent);
+ if (ret == -1) {
+ uninit_cfqg(cfqd, cfqg);
+ return -1;
+ }
+
+ p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
+ p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
+ BUG_ON(p_cfqg == NULL);
+
+ cfqg_set_parent(cfqd, cfqg, p_cfqg);
+ return 0;
+}
+
+static struct cfq_group *
+cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct cfq_group *cfqg = NULL;
+ void *key = cfqd;
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;
+ int ret;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ goto done;
+ }
+ if (cfqg || !create)
+ goto done;
+
+ /*
+ * For hierarchical cfq group scheduling, we need to allocate
+ * the whole cfq group chain.
+ */
+ ret = cfqg_chain_alloc(cfqd, cgroup);
+ if (!ret) {
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ BUG_ON(cfqg == NULL);
+ goto done;
+ }
done:
return cfqg;
@@ -1140,12 +1278,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
{
struct cfq_rb_root *st;
int i, j;
+ struct cfq_entity *cfqe;
+ struct cfq_group *p_cfqg;
BUG_ON(atomic_read(&cfqg->ref) <= 0);
if (!atomic_dec_and_test(&cfqg->ref))
return;
for_each_cfqg_st(cfqg, i, j, st)
BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+
+ cfqe = &cfqg->cfqe;
+ if (cfqe->parent) {
+ p_cfqg = cfqg_of_entity(cfqe->parent);
+ /* Drop the reference taken by children */
+ atomic_dec(&p_cfqg->ref);
+ }
+
kfree(cfqg);
}
@@ -1358,8 +1506,6 @@ insert:
/* Add cfqq onto service tree. */
cfq_entity_service_tree_add(service_tree, cfqe);
- update_min_vdisktime(service_tree);
- cfqq->reposition_time = jiffies;
if ((add_front || !new_cfqq) && !group_changed)
return;
cfq_group_service_tree_add(cfqd, cfqq->cfqg);
@@ -1802,28 +1948,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
return cfqq_of_entity(cfq_rb_first(service_tree));
}
-static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
+static struct cfq_entity *
+cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_group *cfqg;
- struct cfq_entity *cfqe;
+ struct cfq_entity *entity;
int i, j;
struct cfq_rb_root *st;
if (!cfqd->rq_queued)
return NULL;
- cfqg = cfq_get_next_cfqg(cfqd);
- if (!cfqg)
- return NULL;
-
for_each_cfqg_st(cfqg, i, j, st) {
- cfqe = cfq_rb_first(st);
- if (cfqe != NULL)
- return cfqq_of_entity(cfqe);
+ entity = cfq_rb_first(st);
+
+ if (entity && !entity->is_group_entity)
+ return entity;
+ else if (entity && entity->is_group_entity) {
+ cfqg = cfqg_of_entity(entity);
+ return cfq_get_next_entity_forced(cfqd, cfqg);
+ }
}
return NULL;
}
+
/*
* Get and set a new active queue for service.
*/
@@ -2179,7 +2327,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
struct cfq_group *cfqg, enum wl_prio_t prio)
{
struct cfq_entity *cfqe;
- struct cfq_queue *cfqq;
unsigned long lowest_start_time;
int i;
bool time_valid = false;
@@ -2191,10 +2338,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
*/
for (i = 0; i <= SYNC_WORKLOAD; ++i) {
cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
- cfqq = cfqq_of_entity(cfqe);
if (cfqe && (!time_valid ||
- cfqq->reposition_time < lowest_start_time)) {
- lowest_start_time = cfqq->reposition_time;
+ cfqe->reposition_time < lowest_start_time)) {
+ lowest_start_time = cfqe->reposition_time;
cur_best = i;
time_valid = true;
}
@@ -2203,47 +2349,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
return cur_best;
}
-static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
+static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
unsigned slice;
unsigned count;
struct cfq_rb_root *st;
unsigned group_slice;
- if (!cfqg) {
- cfqd->serving_prio = IDLE_WORKLOAD;
- cfqd->workload_expires = jiffies + 1;
- return;
- }
-
- /* Choose next priority. RT > BE > IDLE */
- if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = RT_WORKLOAD;
- else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
- cfqd->serving_prio = BE_WORKLOAD;
- else {
- cfqd->serving_prio = IDLE_WORKLOAD;
- cfqd->workload_expires = jiffies + 1;
- return;
- }
-
- /*
- * For RT and BE, we have to choose also the type
- * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
- * expiration time
- */
- st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
- count = st->count;
-
- /*
- * check workload expiration, and that we still have other queues ready
- */
- if (count && !time_after(jiffies, cfqd->workload_expires))
- return;
-
- /* otherwise select new workload type */
- cfqd->serving_type =
- cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
count = st->count;
@@ -2284,26 +2396,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
cfqd->workload_expires = jiffies + slice;
}
-static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
{
- struct cfq_rb_root *st = &cfqd->grp_service_tree;
- struct cfq_group *cfqg;
- struct cfq_entity *cfqe;
+ struct cfq_rb_root *st;
+ unsigned count;
- if (RB_EMPTY_ROOT(&st->rb))
- return NULL;
- cfqe = cfq_rb_first_entity(st);
- cfqg = cfqg_of_entity(cfqe);
- BUG_ON(!cfqg);
- update_min_vdisktime(st);
- return cfqg;
+ if (!cfqg) {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return;
+ }
+
+ /* Choose next priority. RT > BE > IDLE */
+ if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
+ cfqd->serving_prio = RT_WORKLOAD;
+ else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
+ cfqd->serving_prio = BE_WORKLOAD;
+ else {
+ cfqd->serving_prio = IDLE_WORKLOAD;
+ cfqd->workload_expires = jiffies + 1;
+ return;
+ }
+
+ /*
+ * For RT and BE, we have to choose also the type
+ * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
+ * expiration time
+ */
+ st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
+ count = st->count;
+
+ /*
+ * check workload expiration, and that we still have other queues ready
+ */
+ if (count && !time_after(jiffies, cfqd->workload_expires))
+ return;
+
+ /* otherwise select new workload type */
+ cfqd->serving_type =
+ cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
}
-static void cfq_choose_cfqg(struct cfq_data *cfqd)
+struct cfq_entity *choose_serving_entity(struct cfq_data *cfqd,
+ struct cfq_group *cfqg)
{
- struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
-
- cfqd->serving_group = cfqg;
+ struct cfq_rb_root *service_tree;
/* Restore the workload type data */
if (cfqg->saved_workload_slice) {
@@ -2314,8 +2451,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
cfqd->workload_expires = jiffies - 1;
choose_service_tree(cfqd, cfqg);
-}
+ service_tree = service_tree_for(cfqg, cfqd->serving_prio,
+ cfqd->serving_type);
+
+ if (!cfqd->rq_queued)
+ return NULL;
+
+ /* There is nothing to dispatch */
+ if (!service_tree)
+ return NULL;
+ if (RB_EMPTY_ROOT(&service_tree->rb))
+ return NULL;
+
+ return cfq_rb_first(service_tree);
+}
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -2323,6 +2473,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq, *new_cfqq = NULL;
+ struct cfq_group *cfqg;
+ struct cfq_entity *entity;
cfqq = cfqd->active_queue;
if (!cfqq)
@@ -2422,8 +2574,23 @@ new_queue:
* Current queue expired. Check if we have to switch to a new
* service tree
*/
- if (!new_cfqq)
- cfq_choose_cfqg(cfqd);
+ cfqg = &cfqd->root_group;
+
+ if (!new_cfqq) {
+ do {
+ entity = choose_serving_entity(cfqd, cfqg);
+ if (entity && !entity->is_group_entity) {
+ /* This is the CFQ queue that should run */
+ new_cfqq = cfqq_of_entity(entity);
+ cfqd->serving_group = cfqg;
+ set_workload_expire(cfqd, cfqg);
+ break;
+ } else if (entity && entity->is_group_entity) {
+ /* Continue to lookup in this CFQ group */
+ cfqg = cfqg_of_entity(entity);
+ }
+ } while (entity && entity->is_group_entity);
+ }
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
keep_queue:
@@ -2454,10 +2621,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
{
struct cfq_queue *cfqq;
int dispatched = 0;
+ struct cfq_entity *entity;
+ struct cfq_group *root = &cfqd->root_group;
/* Expire the timeslice of the current active queue first */
cfq_slice_expired(cfqd, 0);
- while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
+ while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
+ BUG_ON(entity->is_group_entity);
+ cfqq = cfqq_of_entity(entity);
__cfq_set_active_queue(cfqd, cfqq);
dispatched += __cfq_forced_dispatch_cfqq(cfqq);
}
@@ -3991,9 +4162,6 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cic_index = i;
- /* Init root service tree */
- cfqd->grp_service_tree = CFQ_RB_ROOT;
-
/* Init root group */
cfqg = &cfqd->root_group;
for_each_cfqg_st(cfqg, i, j, st)
@@ -4003,6 +4171,7 @@ static void *cfq_init_queue(struct request_queue *q)
/* Give preference to root group over other groups */
cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
cfqg->cfqe.is_group_entity = true;
+ cfqg->cfqe.parent = NULL;
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
--
1.6.5.2
* [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
` (4 preceding siblings ...)
2010-12-13 1:45 ` [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
@ 2010-12-13 1:45 ` Gui Jianfeng
2010-12-15 21:26 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy" Gui Jianfeng
2010-12-13 1:45 ` [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy Gui Jianfeng
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:45 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
This patch adds a "use_hierarchy" interface in the root cgroup, without any functionality yet.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/blk-cgroup.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--
block/blk-cgroup.h | 5 +++-
block/cfq-iosched.c | 24 +++++++++++++++++
3 files changed, 97 insertions(+), 4 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 455768a..9747ebb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -25,7 +25,10 @@
static DEFINE_SPINLOCK(blkio_list_lock);
static LIST_HEAD(blkio_list);
-struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
+struct blkio_cgroup blkio_root_cgroup = {
+ .weight = 2*BLKIO_WEIGHT_DEFAULT,
+ .use_hierarchy = 1,
+ };
EXPORT_SYMBOL_GPL(blkio_root_cgroup);
static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
@@ -1385,10 +1388,73 @@ struct cftype blkio_files[] = {
#endif
};
+static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct blkio_cgroup *blkcg;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ return (u64)blkcg->use_hierarchy;
+}
+
+static int
+blkiocg_use_hierarchy_write(struct cgroup *cgroup,
+ struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ struct blkio_policy_type *blkiop;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+
+ if (val > 1 || !list_empty(&cgroup->children))
+ return -EINVAL;
+
+ if (blkcg->use_hierarchy == val)
+ return 0;
+
+ spin_lock(&blkio_list_lock);
+ blkcg->use_hierarchy = val;
+
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ list_for_each_entry(blkiop, &blkio_list, list) {
+ /*
+ * If this policy does not own the blkg, do not change
+ * cfq group scheduling mode.
+ */
+ if (blkiop->plid != blkg->plid)
+ continue;
+
+ if (blkiop->ops.blkio_update_use_hierarchy_fn)
+ blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
+ val);
+ }
+ }
+ spin_unlock(&blkio_list_lock);
+ return 0;
+}
+
+static struct cftype blkio_use_hierarchy = {
+ .name = "use_hierarchy",
+ .read_u64 = blkiocg_use_hierarchy_read,
+ .write_u64 = blkiocg_use_hierarchy_write,
+};
+
static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- return cgroup_add_files(cgroup, subsys, blkio_files,
- ARRAY_SIZE(blkio_files));
+ int ret;
+
+ ret = cgroup_add_files(cgroup, subsys, blkio_files,
+ ARRAY_SIZE(blkio_files));
+ if (ret)
+ return ret;
+
+ /* use_hierarchy is in root cgroup only. */
+ if (!cgroup->parent)
+ ret = cgroup_add_file(cgroup, subsys, &blkio_use_hierarchy);
+
+ return ret;
}
static void blkiocg_destroy(struct cgroup_subsys *subsys, struct cgroup *cgroup)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index ea4861b..c8caf4e 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -105,6 +105,7 @@ enum blkcg_file_name_throtl {
struct blkio_cgroup {
struct cgroup_subsys_state css;
unsigned int weight;
+ bool use_hierarchy;
spinlock_t lock;
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
@@ -200,7 +201,8 @@ typedef void (blkio_update_group_read_iops_fn) (void *key,
struct blkio_group *blkg, unsigned int read_iops);
typedef void (blkio_update_group_write_iops_fn) (void *key,
struct blkio_group *blkg, unsigned int write_iops);
-
+typedef void (blkio_update_use_hierarchy_fn) (struct blkio_group *blkg,
+ bool val);
struct blkio_policy_ops {
blkio_unlink_group_fn *blkio_unlink_group_fn;
blkio_update_group_weight_fn *blkio_update_group_weight_fn;
@@ -208,6 +210,7 @@ struct blkio_policy_ops {
blkio_update_group_write_bps_fn *blkio_update_group_write_bps_fn;
blkio_update_group_read_iops_fn *blkio_update_group_read_iops_fn;
blkio_update_group_write_iops_fn *blkio_update_group_write_iops_fn;
+ blkio_update_use_hierarchy_fn *blkio_update_use_hierarchy_fn;
};
struct blkio_policy_type {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d90627e..08323f5 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -192,6 +192,9 @@ struct cfq_group {
/* cfq group sched entity */
struct cfq_entity cfqe;
+ /* parent cfq_data */
+ struct cfq_data *cfqd;
+
/* number of cfqq currently on this group */
int nr_cfqq;
@@ -235,6 +238,9 @@ struct cfq_data {
struct request_queue *queue;
struct cfq_group root_group;
+ /* cfq group schedule in flat or hierarchy manner. */
+ bool use_hierarchy;
+
/*
* The priority currently being served
*/
@@ -1091,6 +1097,15 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg_of_blkg(blkg)->cfqe.weight = weight;
}
+void
+cfq_update_blkio_use_hierarchy(struct blkio_group *blkg, bool val)
+{
+ struct cfq_group *cfqg;
+
+ cfqg = cfqg_of_blkg(blkg);
+ cfqg->cfqd->use_hierarchy = val;
+}
+
static void init_cfqe(struct blkio_cgroup *blkcg,
struct cfq_group *cfqg)
{
@@ -1121,6 +1136,9 @@ static void init_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg,
*/
atomic_set(&cfqg->ref, 1);
+ /* Setup cfq data for cfq group */
+ cfqg->cfqd = cfqd;
+
/* Add group onto cgroup list */
sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
@@ -4164,6 +4182,7 @@ static void *cfq_init_queue(struct request_queue *q)
/* Init root group */
cfqg = &cfqd->root_group;
+ cfqg->cfqd = cfqd;
for_each_cfqg_st(cfqg, i, j, st)
*st = CFQ_RB_ROOT;
RB_CLEAR_NODE(&cfqg->cfqe.rb_node);
@@ -4224,6 +4243,10 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_latency = 1;
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
+
+ /* hierarchical scheduling for cfq group by default */
+ cfqd->use_hierarchy = 1;
+
/*
* we optimistically start assuming sync ops weren't delayed in last
* second, in order to have larger depth for async operations.
@@ -4386,6 +4409,7 @@ static struct blkio_policy_type blkio_policy_cfq = {
.ops = {
.blkio_unlink_group_fn = cfq_unlink_blkio_group,
.blkio_update_group_weight_fn = cfq_update_blkio_group_weight,
+ .blkio_update_use_hierarchy_fn = cfq_update_blkio_use_hierarchy,
},
.plid = BLKIO_POLICY_PROP,
};
--
1.6.5.2
* [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
` (5 preceding siblings ...)
2010-12-13 1:45 ` [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality Gui Jianfeng
@ 2010-12-13 1:45 ` Gui Jianfeng
2010-12-20 19:43 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy Gui Jianfeng
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:45 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Add a flat CFQ group scheduling mode and allow switching between the
two modes via "use_hierarchy". Currently, it only works when the root
group alone is available.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
block/blk-cgroup.c | 2 +-
block/cfq-iosched.c | 357 +++++++++++++++++++++++++++++++++++++++------------
2 files changed, 276 insertions(+), 83 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9747ebb..baa286b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,7 +27,7 @@ static LIST_HEAD(blkio_list);
struct blkio_cgroup blkio_root_cgroup = {
.weight = 2*BLKIO_WEIGHT_DEFAULT,
- .use_hierarchy = 1,
+ .use_hierarchy = 0,
};
EXPORT_SYMBOL_GPL(blkio_root_cgroup);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 08323f5..cbd23f6 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -241,6 +241,9 @@ struct cfq_data {
/* cfq group schedule in flat or hierarchy manner. */
bool use_hierarchy;
+ /* Service tree for cfq group flat scheduling mode. */
+ struct cfq_rb_root flat_service_tree;
+
/*
* The priority currently being served
*/
@@ -647,12 +650,16 @@ cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
struct cfq_entity *cfqe = &cfqg->cfqe;
struct cfq_rb_root *st = cfqe->service_tree;
- if (st)
- return cfq_target_latency * cfqe->weight
- / st->total_weight;
- else
- /* If this is the root group, give it a full slice. */
- return cfq_target_latency;
+ if (cfqd->use_hierarchy) {
+ if (st)
+ return cfq_target_latency * cfqe->weight
+ / st->total_weight;
+ else
+ /* If this is the root group, give it a full slice. */
+ return cfq_target_latency;
+ } else {
+ return cfq_target_latency * cfqe->weight / st->total_weight;
+ }
}
static inline void
@@ -915,24 +922,50 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
/*
* Root group doesn't belongs to any service
*/
- if (cfqg == &cfqd->root_group)
+ if (cfqd->use_hierarchy && cfqg == &cfqd->root_group)
return;
if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;
- /*
- * Enqueue this group and its ancestors onto their service tree.
- */
- while (cfqe && cfqe->parent) {
+ if (cfqd->use_hierarchy) {
+ /*
+ * Enqueue this group and its ancestors onto their service
+ * tree.
+ */
+ while (cfqe && cfqe->parent) {
+ if (!RB_EMPTY_NODE(&cfqe->rb_node))
+ return;
+
+ /*
+ * Currently put the group at the end. Later implement
+ * something so that groups get lesser vtime based on
+ * their weights, so that if group does not loose all
+ * if it was not continously backlogged.
+ */
+ st = cfqe->service_tree;
+ n = rb_last(&st->rb);
+ if (n) {
+ entity = rb_entry_entity(n);
+ cfqe->vdisktime = entity->vdisktime +
+ CFQ_IDLE_DELAY;
+ } else
+ cfqe->vdisktime = st->min_vdisktime;
+
+ cfq_entity_service_tree_add(st, cfqe);
+ cfqe = cfqe->parent;
+ __cfqg = cfqg_of_entity(cfqe);
+ __cfqg->nr_subgp++;
+ }
+ } else {
if (!RB_EMPTY_NODE(&cfqe->rb_node))
return;
/*
* Currently put the group at the end. Later implement
- * something so that groups get lesser vtime based on their
- * weights, so that if group does not loose all if it was not
- * continously backlogged.
+ * something so that groups get lesser vtime based on
+ * their weights, so that if group does not loose all
+ * if it was not continously backlogged.
*/
st = cfqe->service_tree;
n = rb_last(&st->rb);
@@ -943,10 +976,11 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
} else
cfqe->vdisktime = st->min_vdisktime;
- cfq_entity_service_tree_add(st, cfqe);
- cfqe = cfqe->parent;
- __cfqg = cfqg_of_entity(cfqe);
- __cfqg->nr_subgp++;
+ /*
+ * For flat mode, all cfq groups schedule on the global service
+ * tree(cfqd->flat_service_tree).
+ */
+ cfq_entity_service_tree_add(cfqe->service_tree, cfqe);
}
}
@@ -975,35 +1009,46 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
BUG_ON(cfqg->nr_cfqq < 1);
cfqg->nr_cfqq--;
- /*
- * Root group doesn't belongs to any service
- */
- if (cfqg == &cfqd->root_group)
- return;
-
/* If there are other cfq queues under this group, don't delete it */
if (cfqg->nr_cfqq)
return;
- /* If child group exists, don't dequeue it */
- if (cfqg->nr_subgp)
- return;
- /*
- * Dequeue this group and its ancestors from their service tree.
- */
- while (cfqe && cfqe->parent) {
- __cfqg = cfqg_of_entity(cfqe);
- p_cfqg = cfqg_of_entity(cfqe->parent);
- cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
- cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
- cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
- __cfqg->saved_workload_slice = 0;
- cfqe = cfqe->parent;
- p_cfqg->nr_subgp--;
- if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+ if (cfqd->use_hierarchy) {
+ /*
+ * Root group doesn't belongs to any service
+ */
+ if (cfqg == &cfqd->root_group)
+ return;
+
+ /* If child group exists, don't dequeue it */
+ if (cfqg->nr_subgp)
return;
+
+ /*
+ * Dequeue this group and its ancestors from their service
+ * tree.
+ */
+ while (cfqe && cfqe->parent) {
+ __cfqg = cfqg_of_entity(cfqe);
+ p_cfqg = cfqg_of_entity(cfqe->parent);
+ cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
+ cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
+ cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
+ __cfqg->saved_workload_slice = 0;
+ cfqe = cfqe->parent;
+ p_cfqg->nr_subgp--;
+ if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
+ return;
+ }
+ } else {
+ /* Dequeued from flat service tree. */
+ cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
+ cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
+ cfqg->saved_workload_slice = 0;
+ cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
}
+
}
static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
@@ -1048,19 +1093,31 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
charge = cfqq->allocated_slice;
- /*
- * Update the vdisktime on the whole chain.
- */
- while (cfqe && cfqe->parent) {
- struct cfq_rb_root *st = cfqe->service_tree;
+ if (cfqd->use_hierarchy) {
+ /*
+ * Update the vdisktime on the whole chain.
+ */
+ while (cfqe && cfqe->parent) {
+ struct cfq_rb_root *st = cfqe->service_tree;
- /* Can't update vdisktime while group is on service tree */
- __cfq_entity_service_tree_del(st, cfqe);
+ /*
+ * Can't update vdisktime while group is on service
+ * tree.
+ */
+ __cfq_entity_service_tree_del(st, cfqe);
+ cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
+ __cfq_entity_service_tree_add(st, cfqe);
+ st->count++;
+ cfqe->reposition_time = jiffies;
+ cfqe = cfqe->parent;
+ }
+ } else {
+ /* For flat mode, just charge itself */
+ __cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
- __cfq_entity_service_tree_add(st, cfqe);
- st->count++;
+ __cfq_entity_service_tree_add(cfqe->service_tree, cfqe);
+ cfqe->service_tree->count++;
cfqe->reposition_time = jiffies;
- cfqe = cfqe->parent;
}
@@ -1097,13 +1154,36 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
cfqg_of_blkg(blkg)->cfqe.weight = weight;
}
+static int cfq_forced_dispatch(struct cfq_data *cfqd);
+
void
cfq_update_blkio_use_hierarchy(struct blkio_group *blkg, bool val)
{
+ unsigned long flags;
struct cfq_group *cfqg;
+ struct cfq_data *cfqd;
+ struct cfq_entity *cfqe;
+ int nr;
+ /* Get root group here */
cfqg = cfqg_of_blkg(blkg);
+ cfqd = cfqg->cfqd;
+
+ spin_lock_irqsave(cfqd->queue->queue_lock, flags);
+
+ /* Drain all requests */
+ nr = cfq_forced_dispatch(cfqd);
+
+ cfqe = &cfqg->cfqe;
+
+ if (!val)
+ cfqe->service_tree = &cfqd->flat_service_tree;
+ else
+ cfqe->service_tree = NULL;
+
cfqg->cfqd->use_hierarchy = val;
+
+ spin_unlock_irqrestore(cfqd->queue->queue_lock, flags);
}
static void init_cfqe(struct blkio_cgroup *blkcg,
@@ -1164,6 +1244,12 @@ static void cfqg_set_parent(struct cfq_data *cfqd, struct cfq_group *cfqg,
cfqe = &cfqg->cfqe;
+ if (!p_cfqg) {
+ cfqe->service_tree = &cfqd->flat_service_tree;
+ cfqe->parent = NULL;
+ return;
+ }
+
p_cfqe = &p_cfqg->cfqe;
cfqe->parent = p_cfqe;
@@ -1223,6 +1309,36 @@ int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
return 0;
}
+static struct cfq_group *cfqg_alloc(struct cfq_data *cfqd,
+ struct cgroup *cgroup)
+{
+ struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
+ struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+ unsigned int major, minor;
+ struct cfq_group *cfqg;
+ void *key = cfqd;
+
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ if (cfqg) {
+ if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+ sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+ cfqg->blkg.dev = MKDEV(major, minor);
+ }
+ return cfqg;
+ }
+
+ cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
+ if (!cfqg)
+ return NULL;
+
+ init_cfqg(cfqd, blkcg, cfqg);
+
+ cfqg_set_parent(cfqd, cfqg, NULL);
+
+ return cfqg;
+}
+
+
static struct cfq_group *
cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
{
@@ -1242,15 +1358,23 @@ cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
if (cfqg || !create)
goto done;
- /*
- * For hierarchical cfq group scheduling, we need to allocate
- * the whole cfq group chain.
- */
- ret = cfqg_chain_alloc(cfqd, cgroup);
- if (!ret) {
- cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
- BUG_ON(cfqg == NULL);
- goto done;
+ if (cfqd->use_hierarchy) {
+ /*
+ * For hierarchical cfq group scheduling, we need to allocate
+ * the whole cfq group chain.
+ */
+ ret = cfqg_chain_alloc(cfqd, cgroup);
+ if (!ret) {
+ cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
+ BUG_ON(cfqg == NULL);
+ goto done;
+ }
+ } else {
+ /*
+ * For flat cfq group scheduling, we just need to allocate a
+ * single cfq group.
+ */
+ cfqg = cfqg_alloc(cfqd, cgroup);
}
done:
@@ -2484,6 +2608,24 @@ struct cfq_entity *choose_serving_entity(struct cfq_data *cfqd,
return cfq_rb_first(service_tree);
}
+
+
+static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
+{
+ struct cfq_rb_root *st = &cfqd->flat_service_tree;
+ struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;
+
+ if (RB_EMPTY_ROOT(&st->rb))
+ return NULL;
+
+ cfqe = cfq_rb_first(st);
+ cfqg = cfqg_of_entity(cfqe);
+ BUG_ON(!cfqg);
+ return cfqg;
+}
+
+
/*
* Select a queue for service. If we have a current active queue,
* check whether to continue servicing it, or retrieve and set a new one.
@@ -2592,22 +2734,41 @@ new_queue:
* Current queue expired. Check if we have to switch to a new
* service tree
*/
- cfqg = &cfqd->root_group;
- if (!new_cfqq) {
- do {
- entity = choose_serving_entity(cfqd, cfqg);
- if (entity && !entity->is_group_entity) {
- /* This is the CFQ queue that should run */
- new_cfqq = cfqq_of_entity(entity);
- cfqd->serving_group = cfqg;
- set_workload_expire(cfqd, cfqg);
- break;
- } else if (entity && entity->is_group_entity) {
- /* Continue to lookup in this CFQ group */
- cfqg = cfqg_of_entity(entity);
- }
- } while (entity && entity->is_group_entity);
+ if (cfqd->use_hierarchy) {
+ cfqg = &cfqd->root_group;
+
+ if (!new_cfqq) {
+ do {
+ entity = choose_serving_entity(cfqd, cfqg);
+ if (entity && !entity->is_group_entity) {
+ /*
+ * This is the CFQ queue that should
+ * run.
+ */
+ new_cfqq = cfqq_of_entity(entity);
+ cfqd->serving_group = cfqg;
+ set_workload_expire(cfqd, cfqg);
+ break;
+ } else if (entity && entity->is_group_entity) {
+ /*
+ * Continue to lookup in this CFQ
+ * group.
+ */
+ cfqg = cfqg_of_entity(entity);
+ }
+ } while (entity && entity->is_group_entity);
+ }
+ } else {
+ /* Select a CFQ group from flat service tree. */
+ cfqg = cfq_get_next_cfqg(cfqd);
+ cfqd->serving_group = cfqg;
+ entity = choose_serving_entity(cfqd, cfqg);
+ if (entity) {
+ BUG_ON(entity->is_group_entity);
+ new_cfqq = cfqq_of_entity(entity);
+ set_workload_expire(cfqd, cfqg);
+ }
}
cfqq = cfq_set_active_queue(cfqd, new_cfqq);
@@ -2631,6 +2792,28 @@ static int __cfq_forced_dispatch_cfqq(struct cfq_queue *cfqq)
return dispatched;
}
+static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
+{
+ struct cfq_group *cfqg;
+ struct cfq_entity *cfqe;
+ int i, j;
+ struct cfq_rb_root *st;
+
+ if (!cfqd->rq_queued)
+ return NULL;
+
+ cfqg = cfq_get_next_cfqg(cfqd);
+ if (!cfqg)
+ return NULL;
+
+ for_each_cfqg_st(cfqg, i, j, st) {
+ cfqe = cfq_rb_first(st);
+ if (cfqe != NULL)
+ return cfqq_of_entity(cfqe);
+ }
+ return NULL;
+}
+
/*
* Drain our current requests. Used for barriers and when switching
* io schedulers on-the-fly.
@@ -2644,16 +2827,26 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
/* Expire the timeslice of the current active queue first */
cfq_slice_expired(cfqd, 0);
- while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
- BUG_ON(entity->is_group_entity);
- cfqq = cfqq_of_entity(entity);
- __cfq_set_active_queue(cfqd, cfqq);
- dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+
+ if (cfqd->use_hierarchy) {
+ while ((entity =
+ cfq_get_next_entity_forced(cfqd, root)) != NULL) {
+ BUG_ON(entity->is_group_entity);
+ cfqq = cfqq_of_entity(entity);
+ __cfq_set_active_queue(cfqd, cfqq);
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ }
+ } else {
+ while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
+ __cfq_set_active_queue(cfqd, cfqq);
+ dispatched += __cfq_forced_dispatch_cfqq(cfqq);
+ }
}
BUG_ON(cfqd->busy_queues);
cfq_log(cfqd, "forced_dispatch=%d", dispatched);
+
return dispatched;
}
@@ -4190,7 +4383,7 @@ static void *cfq_init_queue(struct request_queue *q)
/* Give preference to root group over other groups */
cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
cfqg->cfqe.is_group_entity = true;
- cfqg->cfqe.parent = NULL;
+ cfqg_set_parent(cfqd, cfqg, NULL);
#ifdef CONFIG_CFQ_GROUP_IOSCHED
/*
@@ -4244,8 +4437,8 @@ static void *cfq_init_queue(struct request_queue *q)
cfqd->cfq_group_isolation = 0;
cfqd->hw_tag = -1;
- /* hierarchical scheduling for cfq group by default */
- cfqd->use_hierarchy = 1;
+ /* flat scheduling for cfq group by default */
+ cfqd->use_hierarchy = 0;
/*
* we optimistically start assuming sync ops weren't delayed in last
--
1.6.5.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
` (6 preceding siblings ...)
2010-12-13 1:45 ` [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy" Gui Jianfeng
@ 2010-12-13 1:45 ` Gui Jianfeng
2010-12-13 15:10 ` Vivek Goyal
7 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-13 1:45 UTC (permalink / raw)
To: Jens Axboe, Vivek Goyal
Cc: Gui Jianfeng, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Document for blkio.use_hierarchy.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
---
Documentation/cgroups/blkio-controller.txt | 58 +++++++++++++++++++---------
1 files changed, 39 insertions(+), 19 deletions(-)
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 4ed7b5c..9c6dc9e 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -91,30 +91,44 @@ Throttling/Upper Limit policy
Hierarchical Cgroups
====================
-- Currently none of the IO control policy supports hierarhical groups. But
- cgroup interface does allow creation of hierarhical cgroups and internally
- IO policies treat them as flat hierarchy.
+- Cgroup interface allows creation of hierarchical cgroups. Currently,
+ internally IO policies are able to treat them as flat hierarchy or
+ hierarchical hierarchy. Both hierarchical bandwidth division and flat
+ bandwidth division are supported. "blkio.use_hierarchy" can be used to
+ switch between flat mode and hierarchical mode.
- So this patch will allow creation of cgroup hierarhcy but at the backend
- everything will be treated as flat. So if somebody created a hierarchy like
- as follows.
+ Consider the following CGroup hierarchy:
- root
- / \
- test1 test2
- |
- test3
+ Root
+ / | \
+ Grp1 Grp2 tsk1
+ / \
+ Grp3 tsk2
- CFQ and throttling will practically treat all groups at same level.
+ If flat mode is enabled, CFQ and throttling will practically treat all
+ groups at the same level.
- pivot
- / | \ \
- root test1 test2 test3
+ Pivot tree
+ / | | \
+ Root Grp1 Grp2 Grp3
+ / |
+ tsk1 tsk2
- Down the line we can implement hierarchical accounting/control support
- and also introduce a new cgroup file "use_hierarchy" which will control
- whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
- This is how memory controller also has implemented the things.
+ If hierarchical mode is enabled, CFQ will treat groups and tasks as the same
+ view in CGroup hierarchy.
+
+ Root
+ / | \
+ Grp1 Grp2 tsk1
+ / \
+ Grp3 tsk2
+
+ Grp1, Grp2 and tsk1 are treated at the same level under Root group. Grp3 and
+ tsk2 are treated at the same level under Grp1. Below is the mapping between
+ task io priority and io weight:
+
+ prio 0 1 2 3 4 5 6 7
+ weight 1000 868 740 612 484 356 228 100
Various user visible config options
===================================
@@ -169,6 +183,12 @@ Proportional weight policy files
dev weight
8:16 300
+- blkio.use_hierarchy
+ - Switch between hierarchical mode and flat mode as stated above.
+ blkio.use_hierarchy == 1 means hierarchical mode is enabled.
+ blkio.use_hierarchy == 0 means flat mode is enabled.
+ The default mode is flat mode.
+
- blkio.time
- disk time allocated to cgroup per device in milliseconds. First
two fields specify the major and minor number of the device and
--
1.6.5.2
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
2010-12-13 1:44 ` [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface Gui Jianfeng
@ 2010-12-13 13:36 ` Jens Axboe
2010-12-14 3:30 ` Gui Jianfeng
2010-12-13 14:29 ` Vivek Goyal
1 sibling, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2010-12-13 13:36 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Vivek Goyal, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On 2010-12-13 02:44, Gui Jianfeng wrote:
> Hi
>
> Previously, I posted a patchset adding support for CFQ group hierarchical scheduling
> by putting all CFQ queues into a hidden group and scheduling it alongside the other
> CFQ groups under their parent. The patchset is available here:
> http://lkml.org/lkml/2010/8/30/30
>
> Vivek thought this approach wasn't intuitive, and that we should treat CFQ queues
> and groups at the same level. Here is the new approach for hierarchical
> scheduling based on Vivek's suggestion. The biggest change in CFQ is that
> it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
> queue scheduling just as CFQ groups do. But I still give a cfqq some jump
> in vdisktime based on its ioprio; thanks to Vivek for pointing this out. Now CFQ
> queues and CFQ groups use the same scheduling algorithm.
>
> A "use_hierarchy" interface is now added to switch between hierarchical mode
> and flat mode. For the time being, this interface only appears in the root cgroup.
>
> V1 -> V2 Changes:
> - Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
> queue_entity and group_entity; just use cfqe instead.
> - Give a newly added cfqq a small vdisktime jump according to its ioprio.
> - Make flat mode the default CFQ group scheduling mode.
> - Introduce "use_hierarchy" interface.
> - Update blkio cgroup documents
>
> [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
> [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
> [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
> [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group
> [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
> [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality
> [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
> [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
>
> Benchmarks:
> I made use of Vivek's iostest to perform some benchmarks on my box. I tested different workloads and
> didn't see any performance drop compared to the vanilla kernel. The attached files contain performance
> numbers for the vanilla kernel, the patched kernel in flat mode, and the patched kernel in hierarchical mode.
I was going to apply this patchset now, but it doesn't apply to current
for-2.6.38/core (#5 fails, git am stops there). Care to resend it
against that branch? Thanks.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
2010-12-13 1:44 ` [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface Gui Jianfeng
2010-12-13 13:36 ` Jens Axboe
@ 2010-12-13 14:29 ` Vivek Goyal
2010-12-14 3:06 ` Gui Jianfeng
1 sibling, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 14:29 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:44:10AM +0800, Gui Jianfeng wrote:
> Hi
>
> Previously, I posted a patchset adding support for CFQ group hierarchical scheduling
> by putting all CFQ queues into a hidden group and scheduling it alongside the other
> CFQ groups under their parent. The patchset is available here:
> http://lkml.org/lkml/2010/8/30/30
>
> Vivek thought this approach wasn't intuitive, and that we should treat CFQ queues
> and groups at the same level. Here is the new approach for hierarchical
> scheduling based on Vivek's suggestion. The biggest change in CFQ is that
> it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
> queue scheduling just as CFQ groups do. But I still give a cfqq some jump
> in vdisktime based on its ioprio; thanks to Vivek for pointing this out. Now CFQ
> queues and CFQ groups use the same scheduling algorithm.
Hi Gui,
Thanks for the patches. Few thoughts.
- I think we can implement the vdisktime jump logic for both cfq queues and
cfq groups. Any entity (queue/group) that is freshly backlogged will get
the vdisktime jump, while anything that has been using its slice will get
queued at the end of the tree.
- Have you done testing in true hierarchical mode? That is, create at least
two levels of hierarchy and see whether bandwidth division happens
properly. Something like the following:
root
/ \
test1 test2
/ \ / \
G1 G2 G3 G4
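A setup like the hierarchy above can be sketched with the cgroup v1 blkio interface; the mount path, the weight values, and the presence of blkio.use_hierarchy in the root group are assumptions based on this patch series, not an established upstream interface:

```shell
#!/bin/sh
# Sketch: build the two-level hierarchy above and assign equal weights,
# so bandwidth division between G1..G4 can be measured per group.
set -e

CG=/sys/fs/cgroup/blkio                 # assumed cgroup v1 mount point

echo 1 > "$CG/blkio.use_hierarchy"      # hierarchical mode (per this series)

for grp in test1 test2 test1/G1 test1/G2 test2/G3 test2/G4; do
	mkdir -p "$CG/$grp"
	echo 500 > "$CG/$grp/blkio.weight"
done

# Then run one IO job per leaf group, e.g. with fio, and compare each
# group's blkio.time / observed bandwidth:
#   echo $$ > "$CG/test1/G1/tasks" && fio --name=job --rw=randread ...
```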
- On what kind of storage have you been doing your testing? I have noticed
that IO controllers work well only with idling on, and with idling on,
performance is bad on high-end storage. The simple reason is that a
storage array can service multiple IOs at the same time, so if we
are idling on a queue or group in an attempt to provide fairness, it hurts.
It hurts especially with random IO (I am assuming this
is the more typical workload).
So we need to come up with proper logic so that we can provide some
kind of fairness even with idling disabled. I think that's where this
vdisktime jump logic comes into the picture, and it is important to get it
right.
So can you also do some testing with idling disabled (both queue
and group) and see whether the vdisktime logic helps provide
some kind of service differentiation? I think the results will vary
based on the storage and the queue depth you are driving. You
can even try this testing on an SSD.
Thanks
Vivek
>
> A "use_hierarchy" interface is now added to switch between hierarchical mode
> and flat mode. For the time being, this interface only appears in the root cgroup.
>
> V1 -> V2 Changes:
> - Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
> queue_entity and group_entity; just use cfqe instead.
> - Give a newly added cfqq a small vdisktime jump according to its ioprio.
> - Make flat mode the default CFQ group scheduling mode.
> - Introduce "use_hierarchy" interface.
> - Update blkio cgroup documents
>
> [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
> [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
> [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
> [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group
> [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
> [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality
> [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
> [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
>
> Benchmarks:
> I made use of Vivek's iostest to perform some benchmarks on my box. I tested different workloads and
> didn't see any performance drop compared to the vanilla kernel. The attached files contain performance
> numbers for the vanilla kernel, the patched kernel in flat mode, and the patched kernel in hierarchical mode.
>
>
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 294 692 1101 1526 3613
> brr 1 2 176 420 755 1281 2632
> brr 1 4 160 323 583 1253 2319
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 380 738 1092 1439 3649
> brr 1 2 171 413 733 1242 2559
> brr 1 4 188 350 665 1193 2396
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6856 11480 17644 22647 58627
> bsr 1 2 2592 5409 8464 13300 29765
> bsr 1 4 2502 4635 7640 12909 27686
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6913 11643 17843 22909 59308
> bsr 1 2 6682 11234 15527 19410 52853
> bsr 1 4 5209 10882 15002 18167 49260
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 298 701 1117 1538 3654
> drr 1 2 190 372 731 1244 2537
> drr 1 4 147 322 563 1143 2175
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 370 713 1050 1416 3549
> drr 1 2 192 434 755 1265 2646
> drr 1 4 157 333 677 1159 2326
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 595 1272 2007 2737 6611
> drw 1 2 269 690 1407 1953 4319
> drw 1 4 145 396 978 1752 3271
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 604 1310 1827 2778 6519
> drw 1 2 287 723 1368 1887 4265
> drw 1 4 170 407 979 1575 3131
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 287 690 1096 1506 3579
> brr 1 2 189 404 800 1283 2676
> brr 1 4 141 317 557 1106 2121
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 386 715 1071 1437 3609
> brr 1 2 187 401 717 1258 2563
> brr 1 4 296 767 1553 32 2648
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6971 11459 17789 23158 59377
> bsr 1 2 2592 5536 8679 13389 30196
> bsr 1 4 2194 4635 7820 13984 28633
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6851 11588 17459 22297 58195
> bsr 1 2 6814 11534 16141 20426 54915
> bsr 1 4 5118 10741 13994 17661 47514
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 297 689 1097 1522 3605
> drr 1 2 175 426 757 1277 2635
> drr 1 4 150 330 604 1100 2184
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 379 735 1077 1436 3627
> drr 1 2 190 404 760 1247 2601
> drr 1 4 155 333 692 1044 2224
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 566 1293 2001 2686 6546
> drw 1 2 225 662 1233 1902 4022
> drw 1 4 147 379 922 1761 3209
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 579 1226 2020 2823 6648
> drw 1 2 276 689 1288 2068 4321
> drw 1 4 183 399 798 2113 3493
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 289 684 1098 1508 3579
> brr 1 2 178 421 765 1228 2592
> brr 1 4 144 301 585 1094 2124
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=brr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[brr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> brr 1 1 375 734 1081 1434 3624
> brr 1 2 172 397 700 1201 2470
> brr 1 4 154 316 573 1087 2130
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6818 11510 17820 23239 59387
> bsr 1 2 2643 5502 8728 13329 30202
> bsr 1 4 2166 4785 7344 12954 27249
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=bsr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[bsr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> bsr 1 1 6979 11629 17782 23064 59454
> bsr 1 2 6803 11274 15865 20024 53966
> bsr 1 4 5292 10674 14504 17674 48144
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 298 694 1116 1540 3648
> drr 1 2 183 405 721 1197 2506
> drr 1 4 151 296 553 1119 2119
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drr iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drr] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drr 1 1 367 724 1078 1433 3602
> drr 1 2 184 418 744 1245 2591
> drr 1 4 157 295 562 1122 2136
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 582 1271 1948 2754 6555
> drw 1 2 277 700 1294 1970 4241
> drw 1 4 172 345 928 1585 3030
>
>
> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
> GROUPMODE=1 NRGRP=4
> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
> Workload=drw iosched=cfq Filesz=512M bs=32k
> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
> =========================================================================
> AVERAGE[drw] [bw in KB/s]
> -------
> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
> --- --- -- -----------------------------------
> drw 1 1 586 1296 1888 2739 6509
> drw 1 2 294 749 1360 1931 4334
> drw 1 4 156 337 814 1806 3113
>
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
2010-12-13 1:45 ` [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy Gui Jianfeng
@ 2010-12-13 15:10 ` Vivek Goyal
2010-12-14 2:52 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 15:10 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:45:22AM +0800, Gui Jianfeng wrote:
> Document for blkio.use_hierarchy.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> Documentation/cgroups/blkio-controller.txt | 58 +++++++++++++++++++---------
> 1 files changed, 39 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 4ed7b5c..9c6dc9e 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -91,30 +91,44 @@ Throttling/Upper Limit policy
>
> Hierarchical Cgroups
> ====================
> -- Currently none of the IO control policy supports hierarhical groups. But
> - cgroup interface does allow creation of hierarhical cgroups and internally
> - IO policies treat them as flat hierarchy.
> +- Cgroup interface allows creation of hierarchical cgroups. Currently,
> + internally IO policies are able to treat them as flat hierarchy or
> + hierarchical hierarchy. Both hierarchical bandwidth division and flat
> + bandwidth division are supported. "blkio.use_hierarchy" can be used to
> + switch between flat mode and hierarchical mode.
>
> - So this patch will allow creation of cgroup hierarhcy but at the backend
> - everything will be treated as flat. So if somebody created a hierarchy like
> - as follows.
> + Consider the following CGroup hierarchy:
>
> - root
> - / \
> - test1 test2
> - |
> - test3
> + Root
> + / | \
> + Grp1 Grp2 tsk1
> + / \
> + Grp3 tsk2
>
> - CFQ and throttling will practically treat all groups at same level.
> + If flat mode is enabled, CFQ and throttling will practically treat all
> + groups at the same level.
>
> - pivot
> - / | \ \
> - root test1 test2 test3
> + Pivot tree
> + / | | \
> + Root Grp1 Grp2 Grp3
> + / |
> + tsk1 tsk2
>
> - Down the line we can implement hierarchical accounting/control support
> - and also introduce a new cgroup file "use_hierarchy" which will control
> - whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
> - This is how memory controller also has implemented the things.
> + If hierarchical mode is enabled, CFQ will treat groups and tasks as the same
> + view in CGroup hierarchy.
> +
> + Root
> + / | \
> + Grp1 Grp2 tsk1
> + / \
> + Grp3 tsk2
> +
> + Grp1, Grp2 and tsk1 are treated at the same level under Root group. Grp3 and
> + tsk2 are treated at the same level under Grp1. Below is the mapping between
> + task io priority and io weight:
> +
> + prio 0 1 2 3 4 5 6 7
> + weight 1000 868 740 612 484 356 228 100
I am curious why you chose the above mapping. The current prio-to-slice
mapping seems to be:
prio 0 1 2 3 4 5 6 7
slice 180 160 140 120 100 80 60 40
With the above weights, the difference between prio 0 and prio 7 will be 10x,
compared to 4.5x in the old scheme. Then there is also the slice offset
logic, which tries to introduce more service differentiation.
Anyway, I am not particular about it, just curious. If it works well, then
it is fine.
>
> Various user visible config options
> ===================================
> @@ -169,6 +183,12 @@ Proportional weight policy files
> dev weight
> 8:16 300
>
> +- blkio.use_hierarchy
> + - Switch between hierarchical mode and flat mode as stated above.
> + blkio.use_hierarchy == 1 means hierarchical mode is enabled.
> + blkio.use_hierarchy == 0 means flat mode is enabled.
> + The default mode is flat mode.
> +
Can you please explicitly mention that blkio.use_hierarchy only affects
CFQ and has no impact on the "throttling" logic as of today? Throttling will
still continue to treat everything as flat.
I am working on making the throttling logic hierarchical. It has been going
kind of slowly; I expect it to be ready for 2.6.39.
Vivek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
2010-12-13 1:44 ` [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue Gui Jianfeng
@ 2010-12-13 15:44 ` Vivek Goyal
2010-12-14 1:30 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 15:44 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:44:24AM +0800, Gui Jianfeng wrote:
> Introduce cfq_entity for CFQ queue
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/cfq-iosched.c | 125 +++++++++++++++++++++++++++++++++-----------------
> 1 files changed, 82 insertions(+), 43 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 5d0349d..9b07a24 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -91,20 +91,31 @@ struct cfq_rb_root {
> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
> .count = 0, .min_vdisktime = 0, }
>
> +
> +/*
> + * This's the CFQ queue schedule entity which is scheduled on service tree.
> + */
> +struct cfq_entity {
> + /* service tree */
> + struct cfq_rb_root *service_tree;
> + /* service_tree member */
> + struct rb_node rb_node;
> + /* service_tree key, represent the position on the tree */
> + unsigned long rb_key;
> +};
> +
> /*
> * Per process-grouping structure
> */
> struct cfq_queue {
> + /* The schedule entity */
> + struct cfq_entity cfqe;
> /* reference count */
> atomic_t ref;
> /* various state flags, see below */
> unsigned int flags;
> /* parent cfq_data */
> struct cfq_data *cfqd;
> - /* service_tree member */
> - struct rb_node rb_node;
> - /* service_tree key */
> - unsigned long rb_key;
> /* prio tree member */
> struct rb_node p_node;
> /* prio tree root we belong to, if any */
> @@ -143,7 +154,6 @@ struct cfq_queue {
> u32 seek_history;
> sector_t last_request_pos;
>
> - struct cfq_rb_root *service_tree;
> struct cfq_queue *new_cfqq;
> struct cfq_group *cfqg;
> struct cfq_group *orig_cfqg;
> @@ -302,6 +312,15 @@ struct cfq_data {
> struct rcu_head rcu;
> };
>
> +static inline struct cfq_queue *
> +cfqq_of_entity(struct cfq_entity *cfqe)
> +{
> + if (cfqe)
> + return container_of(cfqe, struct cfq_queue,
> + cfqe);
> + return NULL;
> +}
> +
> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>
> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> @@ -743,7 +762,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
> /*
> * The below is leftmost cache rbtree addon
> */
> -static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
> +static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
> {
> /* Service tree is empty */
> if (!root->count)
> @@ -753,7 +772,7 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
> root->left = rb_first(&root->rb);
>
> if (root->left)
> - return rb_entry(root->left, struct cfq_queue, rb_node);
> + return rb_entry(root->left, struct cfq_entity, rb_node);
>
> return NULL;
> }
> @@ -1170,21 +1189,24 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
> static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> bool add_front)
> {
> + struct cfq_entity *cfqe;
> struct rb_node **p, *parent;
> - struct cfq_queue *__cfqq;
> + struct cfq_entity *__cfqe;
> unsigned long rb_key;
> struct cfq_rb_root *service_tree;
> int left;
> int new_cfqq = 1;
> int group_changed = 0;
>
> + cfqe = &cfqq->cfqe;
> +
> #ifdef CONFIG_CFQ_GROUP_IOSCHED
> if (!cfqd->cfq_group_isolation
> && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
> && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
> /* Move this cfq to root group */
> cfq_log_cfqq(cfqd, cfqq, "moving to root group");
> - if (!RB_EMPTY_NODE(&cfqq->rb_node))
> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
> cfqq->orig_cfqg = cfqq->cfqg;
> cfqq->cfqg = &cfqd->root_group;
> @@ -1194,7 +1216,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
> /* cfqq is sequential now needs to go to its original group */
> BUG_ON(cfqq->cfqg != &cfqd->root_group);
> - if (!RB_EMPTY_NODE(&cfqq->rb_node))
> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
> cfq_put_cfqg(cfqq->cfqg);
> cfqq->cfqg = cfqq->orig_cfqg;
> @@ -1209,9 +1231,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> if (cfq_class_idle(cfqq)) {
> rb_key = CFQ_IDLE_DELAY;
> parent = rb_last(&service_tree->rb);
> - if (parent && parent != &cfqq->rb_node) {
> - __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
> - rb_key += __cfqq->rb_key;
> + if (parent && parent != &cfqe->rb_node) {
> + __cfqe = rb_entry(parent,
> + struct cfq_entity,
> + rb_node);
The above can fit into a single line, or at most two lines?
> + rb_key += __cfqe->rb_key;
> } else
> rb_key += jiffies;
> } else if (!add_front) {
> @@ -1226,37 +1250,39 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> cfqq->slice_resid = 0;
> } else {
> rb_key = -HZ;
> - __cfqq = cfq_rb_first(service_tree);
> - rb_key += __cfqq ? __cfqq->rb_key : jiffies;
> + __cfqe = cfq_rb_first(service_tree);
> + rb_key += __cfqe ? __cfqe->rb_key : jiffies;
> }
>
> - if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> new_cfqq = 0;
> /*
> * same position, nothing more to do
> */
> - if (rb_key == cfqq->rb_key &&
> - cfqq->service_tree == service_tree)
> + if (rb_key == cfqe->rb_key &&
> + cfqe->service_tree == service_tree)
> return;
>
> - cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
> - cfqq->service_tree = NULL;
> + cfq_rb_erase(&cfqe->rb_node,
> + cfqe->service_tree);
The above can fit on a single line?
> + cfqe->service_tree = NULL;
> }
>
> left = 1;
> parent = NULL;
> - cfqq->service_tree = service_tree;
> + cfqe->service_tree = service_tree;
> p = &service_tree->rb.rb_node;
> while (*p) {
> struct rb_node **n;
>
> parent = *p;
> - __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
> + __cfqe = rb_entry(parent, struct cfq_entity,
> + rb_node);
Single line.
>
> /*
> * sort by key, that represents service time.
> */
> - if (time_before(rb_key, __cfqq->rb_key))
> + if (time_before(rb_key, __cfqe->rb_key))
> n = &(*p)->rb_left;
> else {
> n = &(*p)->rb_right;
> @@ -1267,11 +1293,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> }
>
> if (left)
> - service_tree->left = &cfqq->rb_node;
> + service_tree->left = &cfqe->rb_node;
>
> - cfqq->rb_key = rb_key;
> - rb_link_node(&cfqq->rb_node, parent, p);
> - rb_insert_color(&cfqq->rb_node, &service_tree->rb);
> + cfqe->rb_key = rb_key;
> + rb_link_node(&cfqe->rb_node, parent, p);
> + rb_insert_color(&cfqe->rb_node, &service_tree->rb);
> service_tree->count++;
> if ((add_front || !new_cfqq) && !group_changed)
> return;
> @@ -1373,13 +1399,17 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> */
> static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> {
> + struct cfq_entity *cfqe;
> cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
> BUG_ON(!cfq_cfqq_on_rr(cfqq));
> cfq_clear_cfqq_on_rr(cfqq);
>
> - if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
> - cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
> - cfqq->service_tree = NULL;
> + cfqe = &cfqq->cfqe;
> +
> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> + cfq_rb_erase(&cfqe->rb_node,
> + cfqe->service_tree);
Single line above.
> + cfqe->service_tree = NULL;
> }
> if (cfqq->p_root) {
> rb_erase(&cfqq->p_node, cfqq->p_root);
> @@ -1707,13 +1737,13 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> return NULL;
> if (RB_EMPTY_ROOT(&service_tree->rb))
> return NULL;
> - return cfq_rb_first(service_tree);
> + return cfqq_of_entity(cfq_rb_first(service_tree));
> }
>
> static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
> {
> struct cfq_group *cfqg;
> - struct cfq_queue *cfqq;
> + struct cfq_entity *cfqe;
> int i, j;
> struct cfq_rb_root *st;
>
> @@ -1724,9 +1754,11 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
> if (!cfqg)
> return NULL;
>
> - for_each_cfqg_st(cfqg, i, j, st)
> - if ((cfqq = cfq_rb_first(st)) != NULL)
> - return cfqq;
> + for_each_cfqg_st(cfqg, i, j, st) {
> + cfqe = cfq_rb_first(st);
> + if (cfqe != NULL)
> + return cfqq_of_entity(cfqe);
> + }
> return NULL;
> }
>
> @@ -1863,9 +1895,12 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
>
> static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> {
> + struct cfq_entity *cfqe;
> enum wl_prio_t prio = cfqq_prio(cfqq);
> - struct cfq_rb_root *service_tree = cfqq->service_tree;
> + struct cfq_rb_root *service_tree;
>
> + cfqe = &cfqq->cfqe;
> + service_tree = cfqe->service_tree;
> BUG_ON(!service_tree);
> BUG_ON(!service_tree->count);
>
> @@ -2075,7 +2110,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> struct cfq_group *cfqg, enum wl_prio_t prio)
> {
> - struct cfq_queue *queue;
> + struct cfq_entity *cfqe;
> int i;
> bool key_valid = false;
> unsigned long lowest_key = 0;
> @@ -2083,10 +2118,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>
> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> /* select the one with lowest rb_key */
> - queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
> - if (queue &&
> - (!key_valid || time_before(queue->rb_key, lowest_key))) {
> - lowest_key = queue->rb_key;
> + cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
> + if (cfqe &&
> + (!key_valid ||
> + time_before(cfqe->rb_key, lowest_key))) {
Merge two lines into one above.
> + lowest_key = cfqe->rb_key;
> cur_best = i;
> key_valid = true;
> }
> @@ -2834,7 +2870,10 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
> static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> pid_t pid, bool is_sync)
> {
> - RB_CLEAR_NODE(&cfqq->rb_node);
> + struct cfq_entity *cfqe;
> +
> + cfqe = &cfqq->cfqe;
> + RB_CLEAR_NODE(&cfqe->rb_node);
> RB_CLEAR_NODE(&cfqq->p_node);
> INIT_LIST_HEAD(&cfqq->fifo);
>
> @@ -3243,7 +3282,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
> /* Allow preemption only if we are idling on sync-noidle tree */
> if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
> cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
> - new_cfqq->service_tree->count == 2 &&
> + new_cfqq->cfqe.service_tree->count == 2 &&
> RB_EMPTY_ROOT(&cfqq->sort_list))
> return true;
>
Apart from the above minor nits, this patch looks good to me.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Vivek
* Re: [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
2010-12-13 1:44 ` [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue Gui Jianfeng
@ 2010-12-13 16:59 ` Vivek Goyal
2010-12-14 2:41 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 16:59 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:44:45AM +0800, Gui Jianfeng wrote:
> Introduce vdisktime and io weight for CFQ queue scheduling. Currently, io priority
> maps to a range [100,1000]. It also gets rid of cfq_slice_offset() logic and makes
> use the same scheduling algorithm as CFQ group does. But it still give newly added
> cfqq a small vdisktime jump according to its ioprio. This patch will help to make
> CFQ queue and group schedule on a same service tree.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/cfq-iosched.c | 196 ++++++++++++++++++++++++++++++++++++---------------
> 1 files changed, 139 insertions(+), 57 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 91e9833..30d19c0 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -101,10 +101,7 @@ struct cfq_entity {
> struct cfq_rb_root *service_tree;
> /* service_tree member */
> struct rb_node rb_node;
> - /* service_tree key, represent the position on the tree */
> - unsigned long rb_key;
> -
> - /* group service_tree key */
> + /* service_tree key */
> u64 vdisktime;
> bool is_group_entity;
> unsigned int weight;
> @@ -116,6 +113,8 @@ struct cfq_entity {
> struct cfq_queue {
> /* The schedule entity */
> struct cfq_entity cfqe;
> + /* Reposition time */
> + unsigned long reposition_time;
> /* reference count */
> atomic_t ref;
> /* various state flags, see below */
> @@ -314,6 +313,22 @@ struct cfq_data {
> struct rcu_head rcu;
> };
>
> +/*
> + * Map io priority(7 ~ 0) to io weight(100 ~ 1000)
> + */
> +static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
> +{
> + unsigned int step;
> +
> + BUG_ON(ioprio >= IOPRIO_BE_NR);
> +
> + step = (BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN) / (IOPRIO_BE_NR - 1);
> + if (ioprio == 0)
> + return BLKIO_WEIGHT_MAX;
> +
> + return BLKIO_WEIGHT_MIN + (IOPRIO_BE_NR - ioprio - 1) * step;
> +}
> +
> static inline struct cfq_queue *
> cfqq_of_entity(struct cfq_entity *cfqe)
> {
> @@ -841,16 +856,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> return cfq_choose_req(cfqd, next, prev, blk_rq_pos(last));
> }
>
> -static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
> - struct cfq_queue *cfqq)
> -{
> - /*
> - * just an approximation, should be ok.
> - */
> - return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
> - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> -}
> -
> static inline s64
> entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
> {
> @@ -1199,6 +1204,16 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
>
> #endif /* GROUP_IOSCHED */
>
> +static inline u64 cfq_get_boost(struct cfq_data *cfqd,
> + struct cfq_entity *cfqe)
> +{
> + u64 d = cfqd->cfq_slice[1] << CFQ_SERVICE_SHIFT;
> +
> + d = d * BLKIO_WEIGHT_DEFAULT;
> + do_div(d, BLKIO_WEIGHT_MAX - cfqe->weight + BLKIO_WEIGHT_MIN);
> + return d;
> +}
> +
> /*
> * The cfqd->service_trees holds all pending cfq_queue's that have
> * requests waiting to be processed. It is sorted in the order that
> @@ -1210,13 +1225,14 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> struct cfq_entity *cfqe;
> struct rb_node **p, *parent;
> struct cfq_entity *__cfqe;
> - unsigned long rb_key;
> - struct cfq_rb_root *service_tree;
> + struct cfq_rb_root *service_tree, *orig_st;
> int left;
> int new_cfqq = 1;
> int group_changed = 0;
> + s64 key;
>
> cfqe = &cfqq->cfqe;
> + orig_st = cfqe->service_tree;
>
> #ifdef CONFIG_CFQ_GROUP_IOSCHED
> if (!cfqd->cfq_group_isolation
> @@ -1224,8 +1240,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
> /* Move this cfq to root group */
> cfq_log_cfqq(cfqd, cfqq, "moving to root group");
> - if (!RB_EMPTY_NODE(&cfqe->rb_node))
> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
> + /*
> + * Group changed, dequeue this CFQ queue from the
> + * original service tree.
> + */
> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> + orig_st->total_weight -= cfqe->weight;
> + }
> cfqq->orig_cfqg = cfqq->cfqg;
> cfqq->cfqg = &cfqd->root_group;
> atomic_inc(&cfqd->root_group.ref);
> @@ -1234,8 +1257,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
> /* cfqq is sequential now needs to go to its original group */
> BUG_ON(cfqq->cfqg != &cfqd->root_group);
> - if (!RB_EMPTY_NODE(&cfqe->rb_node))
> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
> + /*
> + * Group changed, dequeue this CFQ queue from the
> + * original service tree.
> + */
> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> + orig_st->total_weight -= cfqe->weight;
> + }
> cfq_put_cfqg(cfqq->cfqg);
> cfqq->cfqg = cfqq->orig_cfqg;
> cfqq->orig_cfqg = NULL;
> @@ -1246,50 +1276,73 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>
> service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
> cfqq_type(cfqq));
> + /*
> + * For the time being, put the newly added CFQ queue at the end of the
> + * service tree.
> + */
> + if (RB_EMPTY_NODE(&cfqe->rb_node)) {
> + /*
> + * If this CFQ queue moves to another group, the original
> + * vdisktime makes no sense any more, reset the vdisktime
> + * here.
> + */
> + parent = rb_last(&service_tree->rb);
> + if (parent) {
> + u64 boost;
> + s64 __vdisktime;
> +
> + __cfqe = rb_entry_entity(parent);
> + cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
> +
> + /* Give some vdisktime boost according to its weight */
> + boost = cfq_get_boost(cfqd, cfqe);
> + __vdisktime = cfqe->vdisktime - boost;
> + if (__vdisktime)
> + cfqe->vdisktime = __vdisktime;
> + else
> + cfqe->vdisktime = 0;
After subtracting the boost, __vdisktime can be negative? How do we make
sure it does not go below min_vdisktime? Remember, min_vdisktime is an
increasing number.
> + } else
> + cfqe->vdisktime = service_tree->min_vdisktime;
> +
> + goto insert;
> + }
> + /*
> + * Ok, we get here, this CFQ queue is on the service tree, dequeue it
> + * firstly.
> + */
> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> + orig_st->total_weight -= cfqe->weight;
> +
> + new_cfqq = 0;
> +
> if (cfq_class_idle(cfqq)) {
> - rb_key = CFQ_IDLE_DELAY;
> parent = rb_last(&service_tree->rb);
> if (parent && parent != &cfqe->rb_node) {
> __cfqe = rb_entry(parent,
> - struct cfq_entity,
> - rb_node);
> - rb_key += __cfqe->rb_key;
> + struct cfq_entity,
> + rb_node);
> + cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
> } else
> - rb_key += jiffies;
> + cfqe->vdisktime = service_tree->min_vdisktime;
> } else if (!add_front) {
> /*
> - * Get our rb key offset. Subtract any residual slice
> - * value carried from last service. A negative resid
> - * count indicates slice overrun, and this should position
> - * the next service time further away in the tree.
> + * We charge the CFQ queue by the time this queue runs, and
> + * repsition it on the service tree.
> */
> - rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> - rb_key -= cfqq->slice_resid;
> - cfqq->slice_resid = 0;
> - } else {
> - rb_key = -HZ;
> - __cfqe = cfq_rb_first(service_tree);
> - rb_key += __cfqe ? __cfqe->rb_key : jiffies;
> - }
> + unsigned int used_sl;
>
> - if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> - new_cfqq = 0;
> - /*
> - * same position, nothing more to do
> - */
> - if (rb_key == cfqe->rb_key &&
> - cfqe->service_tree == service_tree)
> - return;
> -
> - cfq_rb_erase(&cfqe->rb_node,
> - cfqe->service_tree);
> - cfqe->service_tree = NULL;
> + used_sl = cfq_cfqq_slice_usage(cfqq);
> + cfqe->vdisktime += cfq_scale_slice(used_sl, cfqe);
> + } else {
> + cfqe->vdisktime = service_tree->min_vdisktime;
> }
>
> +insert:
> left = 1;
> parent = NULL;
> cfqe->service_tree = service_tree;
> p = &service_tree->rb.rb_node;
> + key = entity_key(service_tree, cfqe);
> while (*p) {
> struct rb_node **n;
>
> @@ -1300,7 +1353,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> /*
> * sort by key, that represents service time.
> */
> - if (time_before(rb_key, __cfqe->rb_key))
> + if (key < entity_key(service_tree, __cfqe))
> n = &(*p)->rb_left;
> else {
> n = &(*p)->rb_right;
> @@ -1313,10 +1366,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> if (left)
> service_tree->left = &cfqe->rb_node;
>
> - cfqe->rb_key = rb_key;
> rb_link_node(&cfqe->rb_node, parent, p);
> rb_insert_color(&cfqe->rb_node, &service_tree->rb);
> + update_min_vdisktime(service_tree);
> service_tree->count++;
> + service_tree->total_weight += cfqe->weight;
> + cfqq->reposition_time = jiffies;
> if ((add_front || !new_cfqq) && !group_changed)
> return;
> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
> @@ -1418,15 +1473,19 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> {
> struct cfq_entity *cfqe;
> + struct cfq_rb_root *service_tree;
> +
> cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
> BUG_ON(!cfq_cfqq_on_rr(cfqq));
> cfq_clear_cfqq_on_rr(cfqq);
>
> cfqe = &cfqq->cfqe;
> + service_tree = cfqe->service_tree;
>
> if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> cfq_rb_erase(&cfqe->rb_node,
> cfqe->service_tree);
> + service_tree->total_weight -= cfqe->weight;
> cfqe->service_tree = NULL;
> }
> if (cfqq->p_root) {
> @@ -2125,24 +2184,34 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
> }
> }
>
> +/*
> + * The time when a CFQ queue is put onto a service tree is recoreded in
> + * cfqq->reposition_time. Currently, we check the first priority CFQ queues
> + * on each service tree, and select the workload type that contain the lowest
> + * reposition_time CFQ queue among them.
> + */
> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> struct cfq_group *cfqg, enum wl_prio_t prio)
> {
> struct cfq_entity *cfqe;
> + struct cfq_queue *cfqq;
> + unsigned long lowest_start_time;
> int i;
> - bool key_valid = false;
> - unsigned long lowest_key = 0;
> + bool time_valid = false;
> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>
> + /*
> + * TODO: We may take io priority into account when choosing a workload
> + * type. But for the time being just make use of reposition_time only.
> + */
> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> - /* select the one with lowest rb_key */
> cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
> - if (cfqe &&
> - (!key_valid ||
> - time_before(cfqe->rb_key, lowest_key))) {
> - lowest_key = cfqe->rb_key;
> + cfqq = cfqq_of_entity(cfqe);
> + if (cfqe && (!time_valid ||
> + cfqq->reposition_time < lowest_start_time)) {
Do you need to use the time_before() etc. macros here to take care of
jiffies/reposition_time wrapping?
> + lowest_start_time = cfqq->reposition_time;
> cur_best = i;
> - key_valid = true;
> + time_valid = true;
> }
> }
>
> @@ -2814,10 +2883,13 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
> {
> struct task_struct *tsk = current;
> int ioprio_class;
> + struct cfq_entity *cfqe;
>
> if (!cfq_cfqq_prio_changed(cfqq))
> return;
>
> + cfqe = &cfqq->cfqe;
> +
> ioprio_class = IOPRIO_PRIO_CLASS(ioc->ioprio);
> switch (ioprio_class) {
> default:
> @@ -2844,6 +2916,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
> break;
> }
>
> + cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
> +
Same here: can you update cfqe->weight while the entity is on a service
tree? In the past I could not, and we had to maintain a separate variable
where we stored the new weight; once we requeued the entity, we applied
the new weight.
> /*
> * keep track of original prio settings in case we have to temporarily
> * elevate the priority of this queue
> @@ -3578,6 +3652,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> */
> static void cfq_prio_boost(struct cfq_queue *cfqq)
> {
> + struct cfq_entity *cfqe;
> +
> + cfqe = &cfqq->cfqe;
> if (has_fs_excl()) {
> /*
> * boost idle prio on transactions that would lock out other
> @@ -3594,6 +3671,11 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
> cfqq->ioprio_class = cfqq->org_ioprio_class;
> cfqq->ioprio = cfqq->org_ioprio;
> }
> +
> + /*
> + * update the io weight if io priority gets changed.
> + */
> + cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
How do you know that this cfqe/cfqq is not already on the service tree? I
don't think you can update weights while you are enqueued on the tree.
> }
>
> static inline int __cfq_may_queue(struct cfq_queue *cfqq)
> --
> 1.6.5.2
>
>
>
* Re: [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
2010-12-13 1:44 ` [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group Gui Jianfeng
@ 2010-12-13 16:59 ` Vivek Goyal
2010-12-14 1:33 ` Gui Jianfeng
2010-12-14 1:47 ` Gui Jianfeng
0 siblings, 2 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 16:59 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:44:33AM +0800, Gui Jianfeng wrote:
> Introduce cfq_entity for CFQ group
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/cfq-iosched.c | 113 ++++++++++++++++++++++++++++++--------------------
> 1 files changed, 68 insertions(+), 45 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 9b07a24..91e9833 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1,5 +1,5 @@
> /*
> - * CFQ, or complete fairness queueing, disk scheduler.
> + * Cfq, or complete fairness queueing, disk scheduler.
Is this really required?
> *
> * Based on ideas from a previously unfinished io
> * scheduler (round robin per-process disk scheduling) and Andrea Arcangeli.
> @@ -73,7 +73,8 @@ static DEFINE_IDA(cic_index_ida);
> #define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
>
> #define sample_valid(samples) ((samples) > 80)
> -#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)
> +#define rb_entry_entity(node) rb_entry((node), struct cfq_entity,\
> + rb_node)
>
> /*
> * Most of our rbtree usage is for sorting with min extraction, so
> @@ -102,6 +103,11 @@ struct cfq_entity {
> struct rb_node rb_node;
> /* service_tree key, represent the position on the tree */
> unsigned long rb_key;
> +
> + /* group service_tree key */
> + u64 vdisktime;
> + bool is_group_entity;
> + unsigned int weight;
> };
>
> /*
> @@ -183,12 +189,8 @@ enum wl_type_t {
>
> /* This is per cgroup per device grouping structure */
> struct cfq_group {
> - /* group service_tree member */
> - struct rb_node rb_node;
> -
> - /* group service_tree key */
> - u64 vdisktime;
> - unsigned int weight;
> + /* cfq group sched entity */
> + struct cfq_entity cfqe;
>
> /* number of cfqq currently on this group */
> int nr_cfqq;
> @@ -315,12 +317,21 @@ struct cfq_data {
> static inline struct cfq_queue *
> cfqq_of_entity(struct cfq_entity *cfqe)
> {
> - if (cfqe)
> + if (cfqe && !cfqe->is_group_entity)
> return container_of(cfqe, struct cfq_queue,
> cfqe);
Can be a single line above. I think this came from the previous patch.
> return NULL;
> }
>
> +static inline struct cfq_group *
> +cfqg_of_entity(struct cfq_entity *cfqe)
> +{
> + if (cfqe && cfqe->is_group_entity)
> + return container_of(cfqe, struct cfq_group,
> + cfqe);
No need to split the line.
> + return NULL;
> +}
> +
> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>
> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> @@ -548,12 +559,12 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
> }
>
> -static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
> +static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_entity *cfqe)
> {
> u64 d = delta << CFQ_SERVICE_SHIFT;
>
> d = d * BLKIO_WEIGHT_DEFAULT;
> - do_div(d, cfqg->weight);
> + do_div(d, cfqe->weight);
> return d;
> }
>
> @@ -578,11 +589,11 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
> static void update_min_vdisktime(struct cfq_rb_root *st)
> {
> u64 vdisktime = st->min_vdisktime;
> - struct cfq_group *cfqg;
> + struct cfq_entity *cfqe;
>
> if (st->left) {
> - cfqg = rb_entry_cfqg(st->left);
> - vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
> + cfqe = rb_entry_entity(st->left);
> + vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
> }
>
> st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
> @@ -613,8 +624,9 @@ static inline unsigned
> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> struct cfq_rb_root *st = &cfqd->grp_service_tree;
> + struct cfq_entity *cfqe = &cfqg->cfqe;
>
> - return cfq_target_latency * cfqg->weight / st->total_weight;
> + return cfq_target_latency * cfqe->weight / st->total_weight;
> }
>
> static inline void
> @@ -777,13 +789,13 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
> return NULL;
> }
>
> -static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
> +static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
So now we have two functions, cfq_rb_first() and cfq_rb_first_entity(),
both returning a cfq_entity *? This is confusing, unless you are getting
rid of one in later patches. Why not make use of the existing cfq_rb_first()?
Thanks
Vivek
* Re: [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group.
2010-12-13 1:44 ` [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group Gui Jianfeng
@ 2010-12-13 22:11 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-13 22:11 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:44:54AM +0800, Gui Jianfeng wrote:
> Extract some common code of service tree handling for CFQ queue and
> CFQ group. This helps when CFQ queue and CFQ group are scheduling
> together.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/cfq-iosched.c | 87 +++++++++++++++++++++------------------------------
> 1 files changed, 36 insertions(+), 51 deletions(-)
This looks good to me.
Acked-by: Vivek Goyal <vgoyal@redhat.com>
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 30d19c0..6486956 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -863,12 +863,11 @@ entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
> }
>
> static void
> -__cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
> +__cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> {
> struct rb_node **node = &st->rb.rb_node;
> struct rb_node *parent = NULL;
> struct cfq_entity *__cfqe;
> - struct cfq_entity *cfqe = &cfqg->cfqe;
> s64 key = entity_key(st, cfqe);
> int left = 1;
>
> @@ -892,6 +891,14 @@ __cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
> }
>
> static void
> +cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> +{
> + __cfq_entity_service_tree_add(st, cfqe);
> + st->count++;
> + st->total_weight += cfqe->weight;
> +}
> +
> +static void
> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> struct cfq_rb_root *st = &cfqd->grp_service_tree;
> @@ -915,8 +922,23 @@ cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> } else
> cfqe->vdisktime = st->min_vdisktime;
>
> - __cfq_group_service_tree_add(st, cfqg);
> - st->total_weight += cfqe->weight;
> + cfq_entity_service_tree_add(st, cfqe);
> +}
> +
> +static void
> +__cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> +{
> + cfq_rb_erase(&cfqe->rb_node, st);
> +}
> +
> +static void
> +cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> +{
> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> + __cfq_entity_service_tree_del(st, cfqe);
> + st->total_weight -= cfqe->weight;
> + cfqe->service_tree = NULL;
> + }
> }
>
> static void
> @@ -933,9 +955,7 @@ cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> return;
>
> cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
> - st->total_weight -= cfqe->weight;
> - if (!RB_EMPTY_NODE(&cfqe->rb_node))
> - cfq_rb_erase(&cfqe->rb_node, st);
> + cfq_entity_service_tree_del(st, cfqe);
> cfqg->saved_workload_slice = 0;
> cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
> }
> @@ -984,9 +1004,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
> charge = cfqq->allocated_slice;
>
> /* Can't update vdisktime while group is on service tree */
> - cfq_rb_erase(&cfqe->rb_node, st);
> + __cfq_entity_service_tree_del(st, cfqe);
> cfqe->vdisktime += cfq_scale_slice(charge, cfqe);
> - __cfq_group_service_tree_add(st, cfqg);
> + __cfq_entity_service_tree_add(st, cfqe);
>
> /* This group is being expired. Save the context */
> if (time_after(cfqd->workload_expires, jiffies)) {
> @@ -1223,13 +1243,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> bool add_front)
> {
> struct cfq_entity *cfqe;
> - struct rb_node **p, *parent;
> + struct rb_node *parent;
> struct cfq_entity *__cfqe;
> struct cfq_rb_root *service_tree, *orig_st;
> - int left;
> int new_cfqq = 1;
> int group_changed = 0;
> - s64 key;
>
> cfqe = &cfqq->cfqe;
> orig_st = cfqe->service_tree;
> @@ -1246,8 +1264,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> * Group changed, dequeue this CFQ queue from the
> * original service tree.
> */
> - cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> - orig_st->total_weight -= cfqe->weight;
> + cfq_entity_service_tree_del(orig_st, cfqe);
> }
> cfqq->orig_cfqg = cfqq->cfqg;
> cfqq->cfqg = &cfqd->root_group;
> @@ -1263,8 +1280,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> * Group changed, dequeue this CFQ queue from the
> * original service tree.
> */
> - cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> - orig_st->total_weight -= cfqe->weight;
> + cfq_entity_service_tree_del(orig_st, cfqe);
> }
> cfq_put_cfqg(cfqq->cfqg);
> cfqq->cfqg = cfqq->orig_cfqg;
> @@ -1310,8 +1326,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> * Ok, we get here, this CFQ queue is on the service tree, dequeue it
> * first.
> */
> - cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
> - orig_st->total_weight -= cfqe->weight;
> + cfq_entity_service_tree_del(orig_st, cfqe);
>
> new_cfqq = 0;
>
> @@ -1338,39 +1353,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
> }
>
> insert:
> - left = 1;
> - parent = NULL;
> cfqe->service_tree = service_tree;
> - p = &service_tree->rb.rb_node;
> - key = entity_key(service_tree, cfqe);
> - while (*p) {
> - struct rb_node **n;
> -
> - parent = *p;
> - __cfqe = rb_entry(parent, struct cfq_entity,
> - rb_node);
> -
> - /*
> - * sort by key, that represents service time.
> - */
> - if (key < entity_key(service_tree, __cfqe))
> - n = &(*p)->rb_left;
> - else {
> - n = &(*p)->rb_right;
> - left = 0;
> - }
>
> - p = n;
> - }
> + /* Add cfqq onto service tree. */
> + cfq_entity_service_tree_add(service_tree, cfqe);
>
> - if (left)
> - service_tree->left = &cfqe->rb_node;
> -
> - rb_link_node(&cfqe->rb_node, parent, p);
> - rb_insert_color(&cfqe->rb_node, &service_tree->rb);
> update_min_vdisktime(service_tree);
> - service_tree->count++;
> - service_tree->total_weight += cfqe->weight;
> cfqq->reposition_time = jiffies;
> if ((add_front || !new_cfqq) && !group_changed)
> return;
> @@ -1483,10 +1471,7 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> service_tree = cfqe->service_tree;
>
> if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> - cfq_rb_erase(&cfqe->rb_node,
> - cfqe->service_tree);
> - service_tree->total_weight -= cfqe->weight;
> - cfqe->service_tree = NULL;
> + cfq_entity_service_tree_del(service_tree, cfqe);
> }
> if (cfqq->p_root) {
> rb_erase(&cfqq->p_node, cfqq->p_root);
> --
> 1.6.5.2
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
2010-12-13 15:44 ` Vivek Goyal
@ 2010-12-14 1:30 ` Gui Jianfeng
0 siblings, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 1:30 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:44:24AM +0800, Gui Jianfeng wrote:
>> Introduce cfq_entity for CFQ queue
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 125 +++++++++++++++++++++++++++++++++-----------------
>> 1 files changed, 82 insertions(+), 43 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 5d0349d..9b07a24 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -91,20 +91,31 @@ struct cfq_rb_root {
>> #define CFQ_RB_ROOT (struct cfq_rb_root) { .rb = RB_ROOT, .left = NULL, \
>> .count = 0, .min_vdisktime = 0, }
>>
>> +
>> +/*
>> + * This is the CFQ queue scheduling entity, which is scheduled on a service tree.
>> + */
>> +struct cfq_entity {
>> + /* service tree */
>> + struct cfq_rb_root *service_tree;
>> + /* service_tree member */
>> + struct rb_node rb_node;
>> + /* service_tree key, represents the position on the tree */
>> + unsigned long rb_key;
>> +};
>> +
>> /*
>> * Per process-grouping structure
>> */
>> struct cfq_queue {
>> + /* The schedule entity */
>> + struct cfq_entity cfqe;
>> /* reference count */
>> atomic_t ref;
>> /* various state flags, see below */
>> unsigned int flags;
>> /* parent cfq_data */
>> struct cfq_data *cfqd;
>> - /* service_tree member */
>> - struct rb_node rb_node;
>> - /* service_tree key */
>> - unsigned long rb_key;
>> /* prio tree member */
>> struct rb_node p_node;
>> /* prio tree root we belong to, if any */
>> @@ -143,7 +154,6 @@ struct cfq_queue {
>> u32 seek_history;
>> sector_t last_request_pos;
>>
>> - struct cfq_rb_root *service_tree;
>> struct cfq_queue *new_cfqq;
>> struct cfq_group *cfqg;
>> struct cfq_group *orig_cfqg;
>> @@ -302,6 +312,15 @@ struct cfq_data {
>> struct rcu_head rcu;
>> };
>>
>> +static inline struct cfq_queue *
>> +cfqq_of_entity(struct cfq_entity *cfqe)
>> +{
>> + if (cfqe)
>> + return container_of(cfqe, struct cfq_queue,
>> + cfqe);
>> + return NULL;
>> +}
>> +
>> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>>
>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>> @@ -743,7 +762,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2,
>> /*
>> * The below is leftmost cache rbtree addon
>> */
>> -static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
>> +static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
>> {
>> /* Service tree is empty */
>> if (!root->count)
>> @@ -753,7 +772,7 @@ static struct cfq_queue *cfq_rb_first(struct cfq_rb_root *root)
>> root->left = rb_first(&root->rb);
>>
>> if (root->left)
>> - return rb_entry(root->left, struct cfq_queue, rb_node);
>> + return rb_entry(root->left, struct cfq_entity, rb_node);
>>
>> return NULL;
>> }
>> @@ -1170,21 +1189,24 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
>> static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> bool add_front)
>> {
>> + struct cfq_entity *cfqe;
>> struct rb_node **p, *parent;
>> - struct cfq_queue *__cfqq;
>> + struct cfq_entity *__cfqe;
>> unsigned long rb_key;
>> struct cfq_rb_root *service_tree;
>> int left;
>> int new_cfqq = 1;
>> int group_changed = 0;
>>
>> + cfqe = &cfqq->cfqe;
>> +
>> #ifdef CONFIG_CFQ_GROUP_IOSCHED
>> if (!cfqd->cfq_group_isolation
>> && cfqq_type(cfqq) == SYNC_NOIDLE_WORKLOAD
>> && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
>> /* Move this cfq to root group */
>> cfq_log_cfqq(cfqd, cfqq, "moving to root group");
>> - if (!RB_EMPTY_NODE(&cfqq->rb_node))
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
>> cfqq->orig_cfqg = cfqq->cfqg;
>> cfqq->cfqg = &cfqd->root_group;
>> @@ -1194,7 +1216,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
>> /* cfqq is sequential now needs to go to its original group */
>> BUG_ON(cfqq->cfqg != &cfqd->root_group);
>> - if (!RB_EMPTY_NODE(&cfqq->rb_node))
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
>> cfq_put_cfqg(cfqq->cfqg);
>> cfqq->cfqg = cfqq->orig_cfqg;
>> @@ -1209,9 +1231,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> if (cfq_class_idle(cfqq)) {
>> rb_key = CFQ_IDLE_DELAY;
>> parent = rb_last(&service_tree->rb);
>> - if (parent && parent != &cfqq->rb_node) {
>> - __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
>> - rb_key += __cfqq->rb_key;
>> + if (parent && parent != &cfqe->rb_node) {
>> + __cfqe = rb_entry(parent,
>> + struct cfq_entity,
>> + rb_node);
>
> Above can fit into a single line or at max two lines?
I replaced io_sched_entity with cfq_entity mechanically, but forgot to
check the resulting line lengths. :(
Will fix them all.
Thanks,
Gui
>
>> + rb_key += __cfqe->rb_key;
>> } else
>> rb_key += jiffies;
>> } else if (!add_front) {
>> @@ -1226,37 +1250,39 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> cfqq->slice_resid = 0;
>> } else {
>> rb_key = -HZ;
>> - __cfqq = cfq_rb_first(service_tree);
>> - rb_key += __cfqq ? __cfqq->rb_key : jiffies;
>> + __cfqe = cfq_rb_first(service_tree);
>> + rb_key += __cfqe ? __cfqe->rb_key : jiffies;
>> }
>>
>> - if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> new_cfqq = 0;
>> /*
>> * same position, nothing more to do
>> */
>> - if (rb_key == cfqq->rb_key &&
>> - cfqq->service_tree == service_tree)
>> + if (rb_key == cfqe->rb_key &&
>> + cfqe->service_tree == service_tree)
>> return;
>>
>> - cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
>> - cfqq->service_tree = NULL;
>> + cfq_rb_erase(&cfqe->rb_node,
>> + cfqe->service_tree);
>
> Above can fit on single line?
>
>> + cfqe->service_tree = NULL;
>> }
>>
>> left = 1;
>> parent = NULL;
>> - cfqq->service_tree = service_tree;
>> + cfqe->service_tree = service_tree;
>> p = &service_tree->rb.rb_node;
>> while (*p) {
>> struct rb_node **n;
>>
>> parent = *p;
>> - __cfqq = rb_entry(parent, struct cfq_queue, rb_node);
>> + __cfqe = rb_entry(parent, struct cfq_entity,
>> + rb_node);
>
> Single line.
>
>>
>> /*
>> * sort by key, that represents service time.
>> */
>> - if (time_before(rb_key, __cfqq->rb_key))
>> + if (time_before(rb_key, __cfqe->rb_key))
>> n = &(*p)->rb_left;
>> else {
>> n = &(*p)->rb_right;
>> @@ -1267,11 +1293,11 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> }
>>
>> if (left)
>> - service_tree->left = &cfqq->rb_node;
>> + service_tree->left = &cfqe->rb_node;
>>
>> - cfqq->rb_key = rb_key;
>> - rb_link_node(&cfqq->rb_node, parent, p);
>> - rb_insert_color(&cfqq->rb_node, &service_tree->rb);
>> + cfqe->rb_key = rb_key;
>> + rb_link_node(&cfqe->rb_node, parent, p);
>> + rb_insert_color(&cfqe->rb_node, &service_tree->rb);
>> service_tree->count++;
>> if ((add_front || !new_cfqq) && !group_changed)
>> return;
>> @@ -1373,13 +1399,17 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> */
>> static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> {
>> + struct cfq_entity *cfqe;
>> cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
>> BUG_ON(!cfq_cfqq_on_rr(cfqq));
>> cfq_clear_cfqq_on_rr(cfqq);
>>
>> - if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
>> - cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
>> - cfqq->service_tree = NULL;
>> + cfqe = &cfqq->cfqe;
>> +
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> + cfq_rb_erase(&cfqe->rb_node,
>> + cfqe->service_tree);
>
> Single line above.
>
>
>> + cfqe->service_tree = NULL;
>> }
>> if (cfqq->p_root) {
>> rb_erase(&cfqq->p_node, cfqq->p_root);
>> @@ -1707,13 +1737,13 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>> return NULL;
>> if (RB_EMPTY_ROOT(&service_tree->rb))
>> return NULL;
>> - return cfq_rb_first(service_tree);
>> + return cfqq_of_entity(cfq_rb_first(service_tree));
>> }
>>
>> static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
>> {
>> struct cfq_group *cfqg;
>> - struct cfq_queue *cfqq;
>> + struct cfq_entity *cfqe;
>> int i, j;
>> struct cfq_rb_root *st;
>>
>> @@ -1724,9 +1754,11 @@ static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
>> if (!cfqg)
>> return NULL;
>>
>> - for_each_cfqg_st(cfqg, i, j, st)
>> - if ((cfqq = cfq_rb_first(st)) != NULL)
>> - return cfqq;
>> + for_each_cfqg_st(cfqg, i, j, st) {
>> + cfqe = cfq_rb_first(st);
>> + if (cfqe != NULL)
>> + return cfqq_of_entity(cfqe);
>> + }
>> return NULL;
>> }
>>
>> @@ -1863,9 +1895,12 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
>>
>> static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> {
>> + struct cfq_entity *cfqe;
>> enum wl_prio_t prio = cfqq_prio(cfqq);
>> - struct cfq_rb_root *service_tree = cfqq->service_tree;
>> + struct cfq_rb_root *service_tree;
>>
>> + cfqe = &cfqq->cfqe;
>> + service_tree = cfqe->service_tree;
>> BUG_ON(!service_tree);
>> BUG_ON(!service_tree->count);
>>
>> @@ -2075,7 +2110,7 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio)
>> {
>> - struct cfq_queue *queue;
>> + struct cfq_entity *cfqe;
>> int i;
>> bool key_valid = false;
>> unsigned long lowest_key = 0;
>> @@ -2083,10 +2118,11 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>>
>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>> /* select the one with lowest rb_key */
>> - queue = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> - if (queue &&
>> - (!key_valid || time_before(queue->rb_key, lowest_key))) {
>> - lowest_key = queue->rb_key;
>> + cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> + if (cfqe &&
>> + (!key_valid ||
>> + time_before(cfqe->rb_key, lowest_key))) {
>
> Merge two lines into one above.
>
>> + lowest_key = cfqe->rb_key;
>> cur_best = i;
>> key_valid = true;
>> }
>> @@ -2834,7 +2870,10 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
>> static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> pid_t pid, bool is_sync)
>> {
>> - RB_CLEAR_NODE(&cfqq->rb_node);
>> + struct cfq_entity *cfqe;
>> +
>> + cfqe = &cfqq->cfqe;
>> + RB_CLEAR_NODE(&cfqe->rb_node);
>> RB_CLEAR_NODE(&cfqq->p_node);
>> INIT_LIST_HEAD(&cfqq->fifo);
>>
>> @@ -3243,7 +3282,7 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>> /* Allow preemption only if we are idling on sync-noidle tree */
>> if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
>> cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
>> - new_cfqq->service_tree->count == 2 &&
>> + new_cfqq->cfqe.service_tree->count == 2 &&
>> RB_EMPTY_ROOT(&cfqq->sort_list))
>> return true;
>>
>
> Apart from above minor nits, this patch looks good to me.
>
> Acked-by: Vivek Goyal <vgoyal@redhat.com>
>
> Vivek
>
--
Regards
Gui Jianfeng
* Re: [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
2010-12-13 16:59 ` Vivek Goyal
@ 2010-12-14 1:33 ` Gui Jianfeng
2010-12-14 1:47 ` Gui Jianfeng
1 sibling, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 1:33 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:44:33AM +0800, Gui Jianfeng wrote:
>> Introduce cfq_entity for CFQ group
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 113 ++++++++++++++++++++++++++++++--------------------
>> 1 files changed, 68 insertions(+), 45 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 9b07a24..91e9833 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1,5 +1,5 @@
>> /*
>> - * CFQ, or complete fairness queueing, disk scheduler.
>> + * Cfq, or complete fairness queueing, disk scheduler.
>
> Is this really required?
>
>> *
>> * Based on ideas from a previously unfinished io
>> * scheduler (round robin per-process disk scheduling) and Andrea Arcangeli.
>> @@ -73,7 +73,8 @@ static DEFINE_IDA(cic_index_ida);
>> #define cfq_class_rt(cfqq) ((cfqq)->ioprio_class == IOPRIO_CLASS_RT)
>>
>> #define sample_valid(samples) ((samples) > 80)
>> -#define rb_entry_cfqg(node) rb_entry((node), struct cfq_group, rb_node)
>> +#define rb_entry_entity(node) rb_entry((node), struct cfq_entity,\
>> + rb_node)
>>
>> /*
>> * Most of our rbtree usage is for sorting with min extraction, so
>> @@ -102,6 +103,11 @@ struct cfq_entity {
>> struct rb_node rb_node;
>> /* service_tree key, represent the position on the tree */
>> unsigned long rb_key;
>> +
>> + /* group service_tree key */
>> + u64 vdisktime;
>> + bool is_group_entity;
>> + unsigned int weight;
>> };
>>
>> /*
>> @@ -183,12 +189,8 @@ enum wl_type_t {
>>
>> /* This is per cgroup per device grouping structure */
>> struct cfq_group {
>> - /* group service_tree member */
>> - struct rb_node rb_node;
>> -
>> - /* group service_tree key */
>> - u64 vdisktime;
>> - unsigned int weight;
>> + /* cfq group sched entity */
>> + struct cfq_entity cfqe;
>>
>> /* number of cfqq currently on this group */
>> int nr_cfqq;
>> @@ -315,12 +317,21 @@ struct cfq_data {
>> static inline struct cfq_queue *
>> cfqq_of_entity(struct cfq_entity *cfqe)
>> {
>> - if (cfqe)
>> + if (cfqe && !cfqe->is_group_entity)
>> return container_of(cfqe, struct cfq_queue,
>> cfqe);
>
> can be single line above. I think came from previous patch.
>
>> return NULL;
>> }
>>
>> +static inline struct cfq_group *
>> +cfqg_of_entity(struct cfq_entity *cfqe)
>> +{
>> + if (cfqe && cfqe->is_group_entity)
>> + return container_of(cfqe, struct cfq_group,
>> + cfqe);
>
> No need to split line.
>
>> + return NULL;
>> +}
>> +
>> static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>>
>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>> @@ -548,12 +559,12 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
>> }
>>
>> -static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
>> +static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_entity *cfqe)
>> {
>> u64 d = delta << CFQ_SERVICE_SHIFT;
>>
>> d = d * BLKIO_WEIGHT_DEFAULT;
>> - do_div(d, cfqg->weight);
>> + do_div(d, cfqe->weight);
>> return d;
>> }
>>
>> @@ -578,11 +589,11 @@ static inline u64 min_vdisktime(u64 min_vdisktime, u64 vdisktime)
>> static void update_min_vdisktime(struct cfq_rb_root *st)
>> {
>> u64 vdisktime = st->min_vdisktime;
>> - struct cfq_group *cfqg;
>> + struct cfq_entity *cfqe;
>>
>> if (st->left) {
>> - cfqg = rb_entry_cfqg(st->left);
>> - vdisktime = min_vdisktime(vdisktime, cfqg->vdisktime);
>> + cfqe = rb_entry_entity(st->left);
>> + vdisktime = min_vdisktime(vdisktime, cfqe->vdisktime);
>> }
>>
>> st->min_vdisktime = max_vdisktime(st->min_vdisktime, vdisktime);
>> @@ -613,8 +624,9 @@ static inline unsigned
>> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> + struct cfq_entity *cfqe = &cfqg->cfqe;
>>
>> - return cfq_target_latency * cfqg->weight / st->total_weight;
>> + return cfq_target_latency * cfqe->weight / st->total_weight;
>> }
>>
>> static inline void
>> @@ -777,13 +789,13 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
>> return NULL;
>> }
>>
>> -static struct cfq_group *cfq_rb_first_group(struct cfq_rb_root *root)
>> +static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
>
> So now we have two functions. One cfq_rb_first() and one cfq_rb_first_entity()
> both returning cfq_entity*? This is confusing. Or you are getting rid of
> one in later patches. Why not make use of existing cfq_rb_first()?
Yes, I get rid of cfq_rb_first_entity() in a later patch.
Thanks,
Gui
>
> Thanks
> Vivek
>
--
Regards
Gui Jianfeng
* Re: [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
2010-12-13 16:59 ` Vivek Goyal
2010-12-14 1:33 ` Gui Jianfeng
@ 2010-12-14 1:47 ` Gui Jianfeng
1 sibling, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 1:47 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:44:33AM +0800, Gui Jianfeng wrote:
>> Introduce cfq_entity for CFQ group
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 113 ++++++++++++++++++++++++++++++--------------------
>> 1 files changed, 68 insertions(+), 45 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 9b07a24..91e9833 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -1,5 +1,5 @@
>> /*
>> - * CFQ, or complete fairness queueing, disk scheduler.
>> + * Cfq, or complete fairness queueing, disk scheduler.
>
> Is this really required?
Strange... I don't remember making this change. It must have been an accidental edit.
Gui
* Re: [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
2010-12-13 16:59 ` Vivek Goyal
@ 2010-12-14 2:41 ` Gui Jianfeng
2010-12-14 2:47 ` Vivek Goyal
0 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 2:41 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:44:45AM +0800, Gui Jianfeng wrote:
>> Introduce vdisktime and io weight for CFQ queue scheduling. Currently, io priority
>> maps to the range [100,1000]. It also gets rid of the cfq_slice_offset() logic and
>> uses the same scheduling algorithm as CFQ groups do. But it still gives a newly added
>> cfqq a small vdisktime jump according to its ioprio. This patch helps CFQ queues and
>> groups schedule on the same service tree.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 196 ++++++++++++++++++++++++++++++++++++---------------
>> 1 files changed, 139 insertions(+), 57 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 91e9833..30d19c0 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -101,10 +101,7 @@ struct cfq_entity {
>> struct cfq_rb_root *service_tree;
>> /* service_tree member */
>> struct rb_node rb_node;
>> - /* service_tree key, represent the position on the tree */
>> - unsigned long rb_key;
>> -
>> - /* group service_tree key */
>> + /* service_tree key */
>> u64 vdisktime;
>> bool is_group_entity;
>> unsigned int weight;
>> @@ -116,6 +113,8 @@ struct cfq_entity {
>> struct cfq_queue {
>> /* The schedule entity */
>> struct cfq_entity cfqe;
>> + /* Reposition time */
>> + unsigned long reposition_time;
>> /* reference count */
>> atomic_t ref;
>> /* various state flags, see below */
>> @@ -314,6 +313,22 @@ struct cfq_data {
>> struct rcu_head rcu;
>> };
>>
>> +/*
>> + * Map io priority(7 ~ 0) to io weight(100 ~ 1000)
>> + */
>> +static inline unsigned int cfq_prio_to_weight(unsigned short ioprio)
>> +{
>> + unsigned int step;
>> +
>> + BUG_ON(ioprio >= IOPRIO_BE_NR);
>> +
>> + step = (BLKIO_WEIGHT_MAX - BLKIO_WEIGHT_MIN) / (IOPRIO_BE_NR - 1);
>> + if (ioprio == 0)
>> + return BLKIO_WEIGHT_MAX;
>> +
>> + return BLKIO_WEIGHT_MIN + (IOPRIO_BE_NR - ioprio - 1) * step;
>> +}
>> +
>> static inline struct cfq_queue *
>> cfqq_of_entity(struct cfq_entity *cfqe)
>> {
>> @@ -841,16 +856,6 @@ cfq_find_next_rq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> return cfq_choose_req(cfqd, next, prev, blk_rq_pos(last));
>> }
>>
>> -static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
>> - struct cfq_queue *cfqq)
>> -{
>> - /*
>> - * just an approximation, should be ok.
>> - */
>> - return (cfqq->cfqg->nr_cfqq - 1) * (cfq_prio_slice(cfqd, 1, 0) -
>> - cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
>> -}
>> -
>> static inline s64
>> entity_key(struct cfq_rb_root *st, struct cfq_entity *entity)
>> {
>> @@ -1199,6 +1204,16 @@ static inline void cfq_put_cfqg(struct cfq_group *cfqg) {}
>>
>> #endif /* GROUP_IOSCHED */
>>
>> +static inline u64 cfq_get_boost(struct cfq_data *cfqd,
>> + struct cfq_entity *cfqe)
>> +{
>> + u64 d = cfqd->cfq_slice[1] << CFQ_SERVICE_SHIFT;
>> +
>> + d = d * BLKIO_WEIGHT_DEFAULT;
>> + do_div(d, BLKIO_WEIGHT_MAX - cfqe->weight + BLKIO_WEIGHT_MIN);
>> + return d;
>> +}
>> +
>> /*
>> * The cfqd->service_trees holds all pending cfq_queue's that have
>> * requests waiting to be processed. It is sorted in the order that
>> @@ -1210,13 +1225,14 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> struct cfq_entity *cfqe;
>> struct rb_node **p, *parent;
>> struct cfq_entity *__cfqe;
>> - unsigned long rb_key;
>> - struct cfq_rb_root *service_tree;
>> + struct cfq_rb_root *service_tree, *orig_st;
>> int left;
>> int new_cfqq = 1;
>> int group_changed = 0;
>> + s64 key;
>>
>> cfqe = &cfqq->cfqe;
>> + orig_st = cfqe->service_tree;
>>
>> #ifdef CONFIG_CFQ_GROUP_IOSCHED
>> if (!cfqd->cfq_group_isolation
>> @@ -1224,8 +1240,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> && cfqq->cfqg && cfqq->cfqg != &cfqd->root_group) {
>> /* Move this cfq to root group */
>> cfq_log_cfqq(cfqd, cfqq, "moving to root group");
>> - if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
>> + /*
>> + * Group changed, dequeue this CFQ queue from the
>> + * original service tree.
>> + */
>> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
>> + orig_st->total_weight -= cfqe->weight;
>> + }
>> cfqq->orig_cfqg = cfqq->cfqg;
>> cfqq->cfqg = &cfqd->root_group;
>> atomic_inc(&cfqd->root_group.ref);
>> @@ -1234,8 +1257,15 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> && cfqq_type(cfqq) == SYNC_WORKLOAD && cfqq->orig_cfqg) {
>> /* cfqq is sequential now needs to go to its original group */
>> BUG_ON(cfqq->cfqg != &cfqd->root_group);
>> - if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> cfq_group_service_tree_del(cfqd, cfqq->cfqg);
>> + /*
>> + * Group changed, dequeue this CFQ queue from the
>> + * original service tree.
>> + */
>> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
>> + orig_st->total_weight -= cfqe->weight;
>> + }
>> cfq_put_cfqg(cfqq->cfqg);
>> cfqq->cfqg = cfqq->orig_cfqg;
>> cfqq->orig_cfqg = NULL;
>> @@ -1246,50 +1276,73 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>
>> service_tree = service_tree_for(cfqq->cfqg, cfqq_prio(cfqq),
>> cfqq_type(cfqq));
>> + /*
>> + * For the time being, put the newly added CFQ queue at the end of the
>> + * service tree.
>> + */
>> + if (RB_EMPTY_NODE(&cfqe->rb_node)) {
>> + /*
>> + * If this CFQ queue moves to another group, the original
>> + * vdisktime makes no sense any more, reset the vdisktime
>> + * here.
>> + */
>> + parent = rb_last(&service_tree->rb);
>> + if (parent) {
>> + u64 boost;
>> + s64 __vdisktime;
>> +
>> + __cfqe = rb_entry_entity(parent);
>> + cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
>> +
>> + /* Give some vdisktime boost according to its weight */
>> + boost = cfq_get_boost(cfqd, cfqe);
>> + __vdisktime = cfqe->vdisktime - boost;
>> + if (__vdisktime)
>> + cfqe->vdisktime = __vdisktime;
>> + else
>> + cfqe->vdisktime = 0;
>
> After subtraction (boost), __vdisktime can be negative? How do we take
> care that it does not go below min_vdisktime. Remember min_vdisktime is
> an increasing number.
Will take min_vdisktime into account.
>
>> + } else
>> + cfqe->vdisktime = service_tree->min_vdisktime;
>> +
>> + goto insert;
>> + }
>> + /*
>> + * Ok, we get here, this CFQ queue is on the service tree, dequeue it
>> + * first.
>> + */
>> + cfq_rb_erase(&cfqe->rb_node, cfqe->service_tree);
>> + orig_st->total_weight -= cfqe->weight;
>> +
>> + new_cfqq = 0;
>> +
>> if (cfq_class_idle(cfqq)) {
>> - rb_key = CFQ_IDLE_DELAY;
>> parent = rb_last(&service_tree->rb);
>> if (parent && parent != &cfqe->rb_node) {
>> __cfqe = rb_entry(parent,
>> - struct cfq_entity,
>> - rb_node);
>> - rb_key += __cfqe->rb_key;
>> + struct cfq_entity,
>> + rb_node);
>> + cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
>> } else
>> - rb_key += jiffies;
>> + cfqe->vdisktime = service_tree->min_vdisktime;
>> } else if (!add_front) {
>> /*
>> - * Get our rb key offset. Subtract any residual slice
>> - * value carried from last service. A negative resid
>> - * count indicates slice overrun, and this should position
>> - * the next service time further away in the tree.
>> + * We charge the CFQ queue for the time it has run, and
>> + * reposition it on the service tree.
>> */
>> - rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
>> - rb_key -= cfqq->slice_resid;
>> - cfqq->slice_resid = 0;
>> - } else {
>> - rb_key = -HZ;
>> - __cfqe = cfq_rb_first(service_tree);
>> - rb_key += __cfqe ? __cfqe->rb_key : jiffies;
>> - }
>> + unsigned int used_sl;
>>
>> - if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> - new_cfqq = 0;
>> - /*
>> - * same position, nothing more to do
>> - */
>> - if (rb_key == cfqe->rb_key &&
>> - cfqe->service_tree == service_tree)
>> - return;
>> -
>> - cfq_rb_erase(&cfqe->rb_node,
>> - cfqe->service_tree);
>> - cfqe->service_tree = NULL;
>> + used_sl = cfq_cfqq_slice_usage(cfqq);
>> + cfqe->vdisktime += cfq_scale_slice(used_sl, cfqe);
>> + } else {
>> + cfqe->vdisktime = service_tree->min_vdisktime;
>> }
>>
>> +insert:
>> left = 1;
>> parent = NULL;
>> cfqe->service_tree = service_tree;
>> p = &service_tree->rb.rb_node;
>> + key = entity_key(service_tree, cfqe);
>> while (*p) {
>> struct rb_node **n;
>>
>> @@ -1300,7 +1353,7 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> /*
>> * sort by key, that represents service time.
>> */
>> - if (time_before(rb_key, __cfqe->rb_key))
>> + if (key < entity_key(service_tree, __cfqe))
>> n = &(*p)->rb_left;
>> else {
>> n = &(*p)->rb_right;
>> @@ -1313,10 +1366,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>> if (left)
>> service_tree->left = &cfqe->rb_node;
>>
>> - cfqe->rb_key = rb_key;
>> rb_link_node(&cfqe->rb_node, parent, p);
>> rb_insert_color(&cfqe->rb_node, &service_tree->rb);
>> + update_min_vdisktime(service_tree);
>> service_tree->count++;
>> + service_tree->total_weight += cfqe->weight;
>> + cfqq->reposition_time = jiffies;
>> if ((add_front || !new_cfqq) && !group_changed)
>> return;
>> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
>> @@ -1418,15 +1473,19 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> {
>> struct cfq_entity *cfqe;
>> + struct cfq_rb_root *service_tree;
>> +
>> cfq_log_cfqq(cfqd, cfqq, "del_from_rr");
>> BUG_ON(!cfq_cfqq_on_rr(cfqq));
>> cfq_clear_cfqq_on_rr(cfqq);
>>
>> cfqe = &cfqq->cfqe;
>> + service_tree = cfqe->service_tree;
>>
>> if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> cfq_rb_erase(&cfqe->rb_node,
>> cfqe->service_tree);
>> + service_tree->total_weight -= cfqe->weight;
>> cfqe->service_tree = NULL;
>> }
>> if (cfqq->p_root) {
>> @@ -2125,24 +2184,34 @@ static void cfq_setup_merge(struct cfq_queue *cfqq, struct cfq_queue *new_cfqq)
>> }
>> }
>>
>> +/*
>> + * The time when a CFQ queue is put onto a service tree is recoreded in
>> + * cfqq->reposition_time. Currently, we check the first priority CFQ queues
>> + * on each service tree, and select the workload type that contain the lowest
>> + * reposition_time CFQ queue among them.
>> + */
>> static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio)
>> {
>> struct cfq_entity *cfqe;
>> + struct cfq_queue *cfqq;
>> + unsigned long lowest_start_time;
>> int i;
>> - bool key_valid = false;
>> - unsigned long lowest_key = 0;
>> + bool time_valid = false;
>> enum wl_type_t cur_best = SYNC_NOIDLE_WORKLOAD;
>>
>> + /*
>> + * TODO: We may take io priority into account when choosing a workload
>> + * type. But for the time being just make use of reposition_time only.
>> + */
>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>> - /* select the one with lowest rb_key */
>> cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> - if (cfqe &&
>> - (!key_valid ||
>> - time_before(cfqe->rb_key, lowest_key))) {
>> - lowest_key = cfqe->rb_key;
>> + cfqq = cfqq_of_entity(cfqe);
>> + if (cfqe && (!time_valid ||
>> + cfqq->reposition_time < lowest_start_time)) {
>
> do you need to use here time_before() etc marco to take care of
> jifies/reposition_time wrapping?
Ok
>
>> + lowest_start_time = cfqq->reposition_time;
>> cur_best = i;
>> - key_valid = true;
>> + time_valid = true;
>> }
>> }
>>
>> @@ -2814,10 +2883,13 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
>> {
>> struct task_struct *tsk = current;
>> int ioprio_class;
>> + struct cfq_entity *cfqe;
>>
>> if (!cfq_cfqq_prio_changed(cfqq))
>> return;
>>
>> + cfqe = &cfqq->cfqe;
>> +
>> ioprio_class = IOPRIO_PRIO_CLASS(ioc->ioprio);
>> switch (ioprio_class) {
>> default:
>> @@ -2844,6 +2916,8 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
>> break;
>> }
>>
>> + cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
>> +
>
> Same here: can you update cfqe->weight while the entity is on the service
> tree? In the past I could not, and we had to maintain a separate variable
> where we stored the new weight; once we requeued the entity, we applied
> the new weight.
>
>> /*
>> * keep track of original prio settings in case we have to temporarily
>> * elevate the priority of this queue
>> @@ -3578,6 +3652,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>> */
>> static void cfq_prio_boost(struct cfq_queue *cfqq)
>> {
>> + struct cfq_entity *cfqe;
>> +
>> + cfqe = &cfqq->cfqe;
>> if (has_fs_excl()) {
>> /*
>> * boost idle prio on transactions that would lock out other
>> @@ -3594,6 +3671,11 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
>> cfqq->ioprio_class = cfqq->org_ioprio_class;
>> cfqq->ioprio = cfqq->org_ioprio;
>> }
>> +
>> + /*
>> + * update the io weight if io priority gets changed.
>> + */
>> + cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
>
> How do you know that this cfqe/cfqq is not already on the service tree? I
> don't think you can update weights while you are enqueued on the tree.
Vivek, I'm not sure why we can't update weights while cfqe is enqueued on the tree.
Do you mean that when we update cfqe's weight we need to update the tree's
total_weight accordingly?
>
>> }
>>
>> static inline int __cfq_may_queue(struct cfq_queue *cfqq)
>> --
>> 1.6.5.2
>>
>>
>>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
2010-12-14 2:41 ` Gui Jianfeng
@ 2010-12-14 2:47 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-14 2:47 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Tue, Dec 14, 2010 at 10:41:32AM +0800, Gui Jianfeng wrote:
[..]
> >> +
> >> + /*
> >> + * update the io weight if io priority gets changed.
> >> + */
> >> + cfqe->weight = cfq_prio_to_weight(cfqq->ioprio);
> >
> > How do you know that this cfqe/cfqq is not already on the service tree? I
> > don't think you can update weights while you are enqueued on the tree.
>
> Vivek, I'm not sure why we can't update weights while cfqe is enqueued on the tree.
> Do you mean that when we update cfqe's weight we need to update the tree's
> total_weight accordingly?
So far, the only requirement I can think of is adjusting total_weight
accordingly if you update the queue weight while it is on the service tree.
Thanks
Vivek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
2010-12-13 15:10 ` Vivek Goyal
@ 2010-12-14 2:52 ` Gui Jianfeng
0 siblings, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 2:52 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:45:22AM +0800, Gui Jianfeng wrote:
>> Document for blkio.use_hierarchy.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> Documentation/cgroups/blkio-controller.txt | 58 +++++++++++++++++++---------
>> 1 files changed, 39 insertions(+), 19 deletions(-)
>>
>> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
>> index 4ed7b5c..9c6dc9e 100644
>> --- a/Documentation/cgroups/blkio-controller.txt
>> +++ b/Documentation/cgroups/blkio-controller.txt
>> @@ -91,30 +91,44 @@ Throttling/Upper Limit policy
>>
>> Hierarchical Cgroups
>> ====================
>> -- Currently none of the IO control policy supports hierarhical groups. But
>> - cgroup interface does allow creation of hierarhical cgroups and internally
>> - IO policies treat them as flat hierarchy.
>> +- The cgroup interface allows creation of hierarchical cgroups. Internally,
>> +  IO policies can treat them either as a flat hierarchy or as a true
>> +  hierarchy, so both hierarchical bandwidth division and flat bandwidth
>> +  division are supported. "blkio.use_hierarchy" can be used to switch
>> +  between flat mode and hierarchical mode.
>>
>> - So this patch will allow creation of cgroup hierarhcy but at the backend
>> - everything will be treated as flat. So if somebody created a hierarchy like
>> - as follows.
>> + Consider the following CGroup hierarchy:
>>
>> - root
>> - / \
>> - test1 test2
>> - |
>> - test3
>> + Root
>> + / | \
>> + Grp1 Grp2 tsk1
>> + / \
>> + Grp3 tsk2
>>
>> - CFQ and throttling will practically treat all groups at same level.
>> + If flat mode is enabled, CFQ and throttling will practically treat all
>> + groups at the same level.
>>
>> - pivot
>> - / | \ \
>> - root test1 test2 test3
>> + Pivot tree
>> + / | | \
>> + Root Grp1 Grp2 Grp3
>> + / |
>> + tsk1 tsk2
>>
>> - Down the line we can implement hierarchical accounting/control support
>> - and also introduce a new cgroup file "use_hierarchy" which will control
>> - whether cgroup hierarchy is viewed as flat or hierarchical by the policy..
>> - This is how memory controller also has implemented the things.
>> + If hierarchical mode is enabled, CFQ will schedule groups and tasks
>> + according to their actual positions in the cgroup hierarchy.
>> +
>> + Root
>> + / | \
>> + Grp1 Grp2 tsk1
>> + / \
>> + Grp3 tsk2
>> +
>> + Grp1, Grp2 and tsk1 are treated at the same level under Root group. Grp3 and
>> + tsk2 are treated at the same level under Grp1. Below is the mapping between
>> + task io priority and io weight:
>> +
>> + prio 0 1 2 3 4 5 6 7
>> + weight 1000 868 740 612 484 356 228 100
>
> I am curious to know that why did you choose above mappings. Current prio
> to slice mapping seems to be.
>
> prio 0 1 2 3 4 5 6 7
> slice 180 160 140 120 100 80 60 40
>
> Now with the above weights, the difference between prio 0 and prio 7 will be
> 10 times, compared to 4.5 times in the old scheme. Then again, there is the
> slice offset logic, which tries to introduce more service differentiation.
> Anyway, I am not particular about it, just curious. If it works well, then
> it is fine.
Currently, since CFQ queues and CFQ groups are treated at the same level, I'd
like to map ioprio onto the whole range of io weights, hence this mapping.
>
>>
>> Various user visible config options
>> ===================================
>> @@ -169,6 +183,12 @@ Proportional weight policy files
>> dev weight
>> 8:16 300
>>
>> +- blkio.use_hierarchy
>> + - Switch between hierarchical mode and flat mode as stated above.
>> + blkio.use_hierarchy == 1 means hierarchical mode is enabled.
>> + blkio.use_hierarchy == 0 means flat mode is enabled.
>> + The default mode is flat mode.
>> +
>
> Can you please explicitly mention that blkio.use_hierarchy only affects
> CFQ and has no impact on the "throttling" logic as of today. Throttling will
> still continue to treat everything as flat.
Sure.
Gui
>
> I am working on making throttling logic hierarchical. It has been going
> on kind of slow and expecting it to get ready for 2.6.39.
>
> Vivek
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
2010-12-13 14:29 ` Vivek Goyal
@ 2010-12-14 3:06 ` Gui Jianfeng
2010-12-14 3:29 ` Vivek Goyal
0 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 3:06 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:44:10AM +0800, Gui Jianfeng wrote:
>> Hi
>>
>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>> CFQ group under their parent. The patchset is available here,
>> http://lkml.org/lkml/2010/8/30/30
>>
>> Vivek thought this approach wasn't intuitive and that we should treat CFQ queues
>> and groups at the same level. Here is the new approach for hierarchical
>> scheduling, based on Vivek's suggestion. The biggest change in CFQ is that
>> it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
>> queue scheduling just like CFQ group does. But I still give cfqq a small jump
>> in vdisktime based on ioprio; thanks to Vivek for pointing this out. Now CFQ
>> queues and CFQ groups use the same scheduling algorithm.
>
> Hi Gui,
>
> Thanks for the patches. Few thoughts.
>
> - I think we can implement vdisktime jump logic for both cfq queue and
> cfq groups. So any entity (queue/group) which is being backlogged fresh
> will get the vdisktime jump but anything which has been using its slice
> will get queued at the end of tree.
Vivek,
A vdisktime jump for both CFQ queues and CFQ groups is OK with me.
What do you mean by "anything which has been using its slice will get queued at the
end of tree"?
Currently, if a CFQ entity uses up its time slice, we update its vdisktime, so
why should we put it at the end of the tree?
>
> - Have you done testing in true hierarchical mode. In the sense that
> create atleast two level of hierarchy and see if bandwidth division
> is happening properly. Something like as follows.
>
> root
> / \
> test1 test2
> / \ / \
> G1 G2 G3 G4
Yes, I tested with two levels, and it works fine.
>
> - On what kind of storage you have been doing your testing? I have noticed
> that IO controllers works well only with idling on and with idling on
> performance is bad on high end storage. The simple reason being that
> an storage array can support multiple IOs at the same time and if we
> are idling on queue or group in an attempt to provide fairness it hurts.
> It hurts especially more if we are doing random IO (I am assuming this
> is more typical of workloads).
>
> So we need to come up with a proper logic so that we can provide some
> kind of fairness even with idle disabled. I think that's where this
> vdisktime jump logic comes into picture and is important to get it
> right.
>
> So can you also do some testing with idle disabled (both queue
> and group) and see if the vdisktime logic is helping with providing
> some kind of service differentation. I think results will vary
> based on what is the storage and what queue depth are you driving. You
> can even try to do this testing on an SSD.
I tested on SATA. I will do more tests with idling disabled.
Thanks
Gui
>
> Thanks
> Vivek
>
>
>> A "use_hierarchy" interface is now added to switch between hierarchical mode
>> and flat mode. For the time being, this interface only appears in the root cgroup.
>>
>> V1 -> V2 Changes:
>> - Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
>>   queue_entity and group_entity; just use cfqe instead.
>> - Give a newly added cfqq a small vdisktime jump according to its ioprio.
>> - Make flat mode as default CFQ group scheduling mode.
>> - Introduce "use_hierarchy" interface.
>> - Update blkio cgroup documents
>>
>> [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
>> [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
>> [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
>> [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group
>> [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
>> [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality
>> [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
>> [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
>>
>> Benchmarks:
>> I made use of Vivek's iostest to perform some benchmarks on my box. I tested different workloads, and
>> didn't see any performance drop compared to the vanilla kernel. The attached files are some performance
>> numbers on the vanilla kernel, the patched kernel in flat mode and the patched kernel in hierarchical mode.
>>
>>
>>
>>
>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 294 692 1101 1526 3613
>> brr 1 2 176 420 755 1281 2632
>> brr 1 4 160 323 583 1253 2319
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 380 738 1092 1439 3649
>> brr 1 2 171 413 733 1242 2559
>> brr 1 4 188 350 665 1193 2396
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6856 11480 17644 22647 58627
>> bsr 1 2 2592 5409 8464 13300 29765
>> bsr 1 4 2502 4635 7640 12909 27686
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6913 11643 17843 22909 59308
>> bsr 1 2 6682 11234 15527 19410 52853
>> bsr 1 4 5209 10882 15002 18167 49260
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 298 701 1117 1538 3654
>> drr 1 2 190 372 731 1244 2537
>> drr 1 4 147 322 563 1143 2175
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 370 713 1050 1416 3549
>> drr 1 2 192 434 755 1265 2646
>> drr 1 4 157 333 677 1159 2326
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 595 1272 2007 2737 6611
>> drw 1 2 269 690 1407 1953 4319
>> drw 1 4 145 396 978 1752 3271
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 604 1310 1827 2778 6519
>> drw 1 2 287 723 1368 1887 4265
>> drw 1 4 170 407 979 1575 3131
>>
>>
>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 287 690 1096 1506 3579
>> brr 1 2 189 404 800 1283 2676
>> brr 1 4 141 317 557 1106 2121
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 386 715 1071 1437 3609
>> brr 1 2 187 401 717 1258 2563
>> brr 1 4 296 767 1553 32 2648
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6971 11459 17789 23158 59377
>> bsr 1 2 2592 5536 8679 13389 30196
>> bsr 1 4 2194 4635 7820 13984 28633
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6851 11588 17459 22297 58195
>> bsr 1 2 6814 11534 16141 20426 54915
>> bsr 1 4 5118 10741 13994 17661 47514
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 297 689 1097 1522 3605
>> drr 1 2 175 426 757 1277 2635
>> drr 1 4 150 330 604 1100 2184
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 379 735 1077 1436 3627
>> drr 1 2 190 404 760 1247 2601
>> drr 1 4 155 333 692 1044 2224
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 566 1293 2001 2686 6546
>> drw 1 2 225 662 1233 1902 4022
>> drw 1 4 147 379 922 1761 3209
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 579 1226 2020 2823 6648
>> drw 1 2 276 689 1288 2068 4321
>> drw 1 4 183 399 798 2113 3493
>>
>>
>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 289 684 1098 1508 3579
>> brr 1 2 178 421 765 1228 2592
>> brr 1 4 144 301 585 1094 2124
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=brr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[brr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> brr 1 1 375 734 1081 1434 3624
>> brr 1 2 172 397 700 1201 2470
>> brr 1 4 154 316 573 1087 2130
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6818 11510 17820 23239 59387
>> bsr 1 2 2643 5502 8728 13329 30202
>> bsr 1 4 2166 4785 7344 12954 27249
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=bsr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[bsr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> bsr 1 1 6979 11629 17782 23064 59454
>> bsr 1 2 6803 11274 15865 20024 53966
>> bsr 1 4 5292 10674 14504 17674 48144
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 298 694 1116 1540 3648
>> drr 1 2 183 405 721 1197 2506
>> drr 1 4 151 296 553 1119 2119
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drr iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drr] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drr 1 1 367 724 1078 1433 3602
>> drr 1 2 184 418 744 1245 2591
>> drr 1 4 157 295 562 1122 2136
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=0 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 582 1271 1948 2754 6555
>> drw 1 2 277 700 1294 1970 4241
>> drw 1 4 172 345 928 1585 3030
>>
>>
>> Host=localhost.localdomain Kernel=2.6.37-rc2-Block-+
>> GROUPMODE=1 NRGRP=4
>> DIR=/mnt/iostestmnt/fio DEV=/dev/sdb2
>> Workload=drw iosched=cfq Filesz=512M bs=32k
>> group_isolation=1 slice_idle=8 group_idle=8 quantum=8
>> =========================================================================
>> AVERAGE[drw] [bw in KB/s]
>> -------
>> job Set NR cgrp1 cgrp2 cgrp3 cgrp4 total
>> --- --- -- -----------------------------------
>> drw 1 1 586 1296 1888 2739 6509
>> drw 1 2 294 749 1360 1931 4334
>> drw 1 4 156 337 814 1806 3113
>>
>>
>
>
--
Regards
Gui Jianfeng
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
2010-12-14 3:06 ` Gui Jianfeng
@ 2010-12-14 3:29 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-14 3:29 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Tue, Dec 14, 2010 at 11:06:26AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Mon, Dec 13, 2010 at 09:44:10AM +0800, Gui Jianfeng wrote:
> >> Hi
> >>
> >> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
> >> in the way that it puts all CFQ queues in a hidden group and schedules with other
> >> CFQ group under their parent. The patchset is available here,
> >> http://lkml.org/lkml/2010/8/30/30
> >>
> >> Vivek thought this approach wasn't intuitive and that we should treat CFQ queues
> >> and groups at the same level. Here is the new approach for hierarchical
> >> scheduling, based on Vivek's suggestion. The biggest change in CFQ is that
> >> it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
> >> queue scheduling just like CFQ group does. But I still give cfqq a small jump
> >> in vdisktime based on ioprio; thanks to Vivek for pointing this out. Now CFQ
> >> queues and CFQ groups use the same scheduling algorithm.
> >
> > Hi Gui,
> >
> > Thanks for the patches. Few thoughts.
> >
> > - I think we can implement vdisktime jump logic for both cfq queue and
> > cfq groups. So any entity (queue/group) which is being backlogged fresh
> > will get the vdisktime jump but anything which has been using its slice
> > will get queued at the end of tree.
>
> Vivek,
>
> A vdisktime jump for both CFQ queues and CFQ groups is OK with me.
> What do you mean by "anything which has been using its slice will get queued at the
> end of tree"?
> Currently, if a CFQ entity uses up its time slice, we update its vdisktime, so
> why should we put it at the end of the tree?
Sorry, what I actually meant was that any queue/group which has been using
its slice and is being requeued will be queued at a position based on the
vdisktime calculation, with no boost logic required. Queues/groups which get
queued fresh get a vdisktime boost. That way, once we disable idling
(slice_idle=0 and group_idle=0), we might get good bandwidth utilization and
at the same time some service differentiation for higher-weight queues/groups.
>
>
> >
> > - Have you done testing in true hierarchical mode. In the sense that
> > create atleast two level of hierarchy and see if bandwidth division
> > is happening properly. Something like as follows.
> >
> > root
> > / \
> > test1 test2
> > / \ / \
> > G1 G2 G3 G4
>
> yes, I tested with two level, and works fine.
>
> >
> > - On what kind of storage you have been doing your testing? I have noticed
> > that IO controllers works well only with idling on and with idling on
> > performance is bad on high end storage. The simple reason being that
> > an storage array can support multiple IOs at the same time and if we
> > are idling on queue or group in an attempt to provide fairness it hurts.
> > It hurts especially more if we are doing random IO (I am assuming this
> > is more typical of workloads).
> >
> > So we need to come up with a proper logic so that we can provide some
> > kind of fairness even with idle disabled. I think that's where this
> > vdisktime jump logic comes into picture and is important to get it
> > right.
> >
> > So can you also do some testing with idle disabled (both queue
> > and group) and see if the vdisktime logic is helping with providing
> > some kind of service differentation. I think results will vary
> > based on what is the storage and what queue depth are you driving. You
> > can even try to do this testing on an SSD.
>
> I tested on sata. will do more tests when idle disabled.
Ok, actually SATA with a low queue depth is the case where the block IO
controller works best. I am also keen to make it work well for SSDs and faster
storage, like storage arrays, without losing too much throughput in the process.
Thanks
Vivek
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface
2010-12-13 13:36 ` Jens Axboe
@ 2010-12-14 3:30 ` Gui Jianfeng
0 siblings, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 3:30 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Jens Axboe wrote:
> On 2010-12-13 02:44, Gui Jianfeng wrote:
>> Hi
>>
>> Previously, I posted a patchset to add support of CFQ group hierarchical scheduling
>> in the way that it puts all CFQ queues in a hidden group and schedules with other
>> CFQ group under their parent. The patchset is available here,
>> http://lkml.org/lkml/2010/8/30/30
>>
>> Vivek thought this approach wasn't intuitive and that we should treat CFQ queues
>> and groups at the same level. Here is the new approach for hierarchical
>> scheduling, based on Vivek's suggestion. The biggest change in CFQ is that
>> it gets rid of the cfq_slice_offset logic and uses vdisktime for CFQ
>> queue scheduling just like CFQ group does. But I still give cfqq a small jump
>> in vdisktime based on ioprio; thanks to Vivek for pointing this out. Now CFQ
>> queues and CFQ groups use the same scheduling algorithm.
>>
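The ioprio-based vdisktime jump mentioned above can be modelled in userspace. The sketch below is an illustration only: the cover letter says ioprio maps into the weight range [100,1000], but the exact linear formula and the boost constant here are assumptions, not the patch's code.

```c
#include <assert.h>

#define CFQ_WEIGHT_MIN   100
#define CFQ_WEIGHT_MAX   1000
#define CFQ_IOPRIO_MAX   7      /* best-effort priorities 0..7 */
#define CFQ_BOOST_SCALE  200000 /* hypothetical scaling constant */

/* Map ioprio 0 (highest) .. 7 (lowest) linearly onto [1000,100]. */
static unsigned int cfq_prio_to_weight(unsigned int ioprio)
{
    return CFQ_WEIGHT_MAX -
           ioprio * (CFQ_WEIGHT_MAX - CFQ_WEIGHT_MIN) / CFQ_IOPRIO_MAX;
}

/*
 * When a queue is backlogged again, start it slightly ahead of
 * min_vdisktime; a heavier (higher-priority) queue gets a smaller
 * jump, so it is picked earlier by the service tree.
 */
static unsigned long long boosted_vdisktime(unsigned long long min_vdisktime,
                                            unsigned int ioprio)
{
    return min_vdisktime + CFQ_BOOST_SCALE / cfq_prio_to_weight(ioprio);
}
```

With this mapping, ioprio 0 keeps weight 1000 and receives the smallest vdisktime jump, while ioprio 7 drops to weight 100 and is pushed further back.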
>> A "use_hierarchy" interface is now added to switch between hierarchical mode
>> and flat mode. For the time being, this interface only appears in the root cgroup.
>>
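With the patchset applied, switching modes might look like the sketch below. The mount point is an assumption; the file name follows patch 6/8, and blkio.weight is the existing per-group weight knob.

```shell
# Flat mode is the default; enable hierarchical CFQ group scheduling
# from the root blkio cgroup (mount point is a placeholder).
mount -t cgroup -o blkio none /cgroup/blkio
echo 1 > /cgroup/blkio/blkio.use_hierarchy

# Build a two-level hierarchy: G1 under root, G2 under G1,
# and halve G1's share relative to the default weight.
mkdir -p /cgroup/blkio/G1/G2
echo 500 > /cgroup/blkio/G1/blkio.weight
```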
>> V1 -> V2 Changes:
>> - Rename "struct io_sched_entity" to "struct cfq_entity" and don't differentiate
>> queue_entity and group_entity; just use cfqe instead.
>> - Give a newly added cfqq a small vdisktime jump according to its ioprio.
>> - Make flat mode the default CFQ group scheduling mode.
>> - Introduce the "use_hierarchy" interface.
>> - Update the blkio cgroup documentation.
>>
>> [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue
>> [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group
>> [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue
>> [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group
>> [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
>> [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality
>> [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
>> [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy.
>>
>> Benchmarks:
>> I made use of Vivek's iostest to perform some benchmarks on my box. I tested different workloads and
>> didn't see any performance drop compared to the vanilla kernel. The attached files contain performance
>> numbers for the vanilla kernel, the patched kernel in flat mode, and the patched kernel in hierarchical mode.
>
> I was going to apply this patchset now, but it doesn't apply to current
> for-2.6.38/core (#5 fails, git am stops there). Care to resend it
> against that branch? Thanks.
Sure.
Thanks,
Gui
>
* Re: [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-12-13 1:45 ` [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
@ 2010-12-14 3:49 ` Vivek Goyal
2010-12-14 6:09 ` Gui Jianfeng
2010-12-15 7:02 ` Gui Jianfeng
0 siblings, 2 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-14 3:49 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:45:01AM +0800, Gui Jianfeng wrote:
> This patch makes CFQ queue and CFQ group schedules at the same level.
> Consider the following hierarchy:
>
>          Root
>         /  |  \
>       q1   q2   G1
>                /  \
>              q3    G2
>
> q1, q2 and q3 are CFQ queues; G1 and G2 are CFQ groups. With this patch, q1,
> q2 and G1 are scheduled on the same service tree in the Root CFQ group. q3 and G2
> are scheduled under G1. Note that, for the time being, a CFQ group is treated
> as a "BE and SYNC" workload and is put on the "BE and SYNC" service tree. That
> means service differentiation only happens on the "BE and SYNC" service tree.
> Later, we may introduce an "IO Class" for CFQ groups.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++++++----------------
> 1 files changed, 321 insertions(+), 152 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 6486956..d90627e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -105,6 +105,9 @@ struct cfq_entity {
> u64 vdisktime;
> bool is_group_entity;
> unsigned int weight;
> + struct cfq_entity *parent;
> + /* Reposition time */
> + unsigned long reposition_time;
> };
>
> /*
> @@ -113,8 +116,6 @@ struct cfq_entity {
> struct cfq_queue {
> /* The schedule entity */
> struct cfq_entity cfqe;
> - /* Reposition time */
> - unsigned long reposition_time;
> /* reference count */
> atomic_t ref;
> /* various state flags, see below */
> @@ -194,6 +195,9 @@ struct cfq_group {
> /* number of cfqq currently on this group */
> int nr_cfqq;
>
> + /* number of sub cfq groups */
> + int nr_subgp;
> +
Do you really have to maintain separate counts for child queues and
child groups? Would a common count, something like nr_children, not
be sufficient?
> /*
> * Per group busy queus average. Useful for workload slice calc. We
> * create the array for each prio class but at run time it is used
> @@ -229,8 +233,6 @@ struct cfq_group {
> */
> struct cfq_data {
> struct request_queue *queue;
> - /* Root service tree for cfq_groups */
> - struct cfq_rb_root grp_service_tree;
I see that you are removing this service tree here and then adding it
back in patch 7. I think that is confusing. In fact, the title of patch 7
says "add flat mode", when flat mode is already supported and we
are just adding hierarchical mode on top of it. I think this is
just a matter of better naming and patch organization so that
it is clearer.
> struct cfq_group root_group;
>
> /*
> @@ -347,8 +349,6 @@ cfqg_of_entity(struct cfq_entity *cfqe)
> return NULL;
> }
>
> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
> -
> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
> enum wl_prio_t prio,
> enum wl_type_t type)
> @@ -638,10 +638,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
> static inline unsigned
> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> struct cfq_entity *cfqe = &cfqg->cfqe;
> + struct cfq_rb_root *st = cfqe->service_tree;
>
> - return cfq_target_latency * cfqe->weight / st->total_weight;
> + if (st)
> + return cfq_target_latency * cfqe->weight
> + / st->total_weight;
Is this still true in hierarchical mode? Previously groups used to be
at the top and there was only one service tree for groups, so
st->total_weight represented the total weight in the system.
Now with hierarchy this will not/should not be true. So shouldn't the
group slice calculation be different?
> + else
> + /* If this is the root group, give it a full slice. */
> + return cfq_target_latency;
> }
>
> static inline void
> @@ -804,17 +809,6 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
> return NULL;
> }
>
> -static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
> -{
> - if (!root->left)
> - root->left = rb_first(&root->rb);
> -
> - if (root->left)
> - return rb_entry_entity(root->left);
> -
> - return NULL;
> -}
> -
> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> {
> rb_erase(n, root);
> @@ -888,12 +882,15 @@ __cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>
> rb_link_node(&cfqe->rb_node, parent, node);
> rb_insert_color(&cfqe->rb_node, &st->rb);
> +
> + update_min_vdisktime(st);
> }
>
> static void
> cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> {
> __cfq_entity_service_tree_add(st, cfqe);
> + cfqe->reposition_time = jiffies;
> st->count++;
> st->total_weight += cfqe->weight;
> }
> @@ -901,34 +898,57 @@ cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> static void
> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> struct cfq_entity *cfqe = &cfqg->cfqe;
> - struct cfq_entity *__cfqe;
> struct rb_node *n;
> + struct cfq_entity *entity;
> + struct cfq_rb_root *st;
> + struct cfq_group *__cfqg;
>
> cfqg->nr_cfqq++;
> +
> + /*
> + * Root group doesn't belongs to any service
> + */
> + if (cfqg == &cfqd->root_group)
> + return;
Can we keep the root group on cfqd->grp_service_tree? In hierarchical mode
there will be only one group on the group service tree, and in flat mode
there can be many.
> +
> if (!RB_EMPTY_NODE(&cfqe->rb_node))
> return;
>
> /*
> - * Currently put the group at the end. Later implement something
> - * so that groups get lesser vtime based on their weights, so that
> - * if group does not loose all if it was not continously backlogged.
> + * Enqueue this group and its ancestors onto their service tree.
> */
> - n = rb_last(&st->rb);
> - if (n) {
> - __cfqe = rb_entry_entity(n);
> - cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
> - } else
> - cfqe->vdisktime = st->min_vdisktime;
> + while (cfqe && cfqe->parent) {
> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
> + return;
> +
> + /*
> + * Currently put the group at the end. Later implement
> + * something so that groups get lesser vtime based on their
> + * weights, so that if group does not loose all if it was not
> + * continously backlogged.
> + */
> + st = cfqe->service_tree;
> + n = rb_last(&st->rb);
> + if (n) {
> + entity = rb_entry_entity(n);
> + cfqe->vdisktime = entity->vdisktime +
> + CFQ_IDLE_DELAY;
> + } else
> + cfqe->vdisktime = st->min_vdisktime;
>
> - cfq_entity_service_tree_add(st, cfqe);
> + cfq_entity_service_tree_add(st, cfqe);
> + cfqe = cfqe->parent;
> + __cfqg = cfqg_of_entity(cfqe);
> + __cfqg->nr_subgp++;
> + }
> }
>
> static void
> __cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> {
> cfq_rb_erase(&cfqe->rb_node, st);
> + update_min_vdisktime(st);
> }
>
> static void
> @@ -937,27 +957,47 @@ cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
> __cfq_entity_service_tree_del(st, cfqe);
> st->total_weight -= cfqe->weight;
> - cfqe->service_tree = NULL;
> }
> }
>
> static void
> cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> struct cfq_entity *cfqe = &cfqg->cfqe;
> + struct cfq_group *__cfqg, *p_cfqg;
>
> BUG_ON(cfqg->nr_cfqq < 1);
> cfqg->nr_cfqq--;
>
> + /*
> + * Root group doesn't belongs to any service
> + */
> + if (cfqg == &cfqd->root_group)
> + return;
> +
> /* If there are other cfq queues under this group, don't delete it */
> if (cfqg->nr_cfqq)
> return;
>
> - cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
> - cfq_entity_service_tree_del(st, cfqe);
> - cfqg->saved_workload_slice = 0;
> - cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
> + /* If child group exists, don't dequeue it */
> + if (cfqg->nr_subgp)
> + return;
> +
> + /*
> + * Dequeue this group and its ancestors from their service tree.
> + */
> + while (cfqe && cfqe->parent) {
> + __cfqg = cfqg_of_entity(cfqe);
> + p_cfqg = cfqg_of_entity(cfqe->parent);
> + cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
> + cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
> + cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
> + __cfqg->saved_workload_slice = 0;
> + cfqe = cfqe->parent;
> + p_cfqg->nr_subgp--;
> + if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
> + return;
> + }
> }
I think once you merge the queue/group algorithms, you can use the same
functions for adding/deleting queue and group entities and don't have to
use separate functions for groups.
[..]
> - cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
> +int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
> +{
> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> + struct blkio_cgroup *p_blkcg;
> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> + unsigned int major, minor;
> + struct cfq_group *cfqg, *p_cfqg;
> + void *key = cfqd;
> + int ret;
>
> - /* Add group on cfqd list */
> - hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> + if (cfqg) {
> + if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> + cfqg->blkg.dev = MKDEV(major, minor);
> + }
> + /* chain has already been built */
> + return 0;
> + }
> +
> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
> + if (!cfqg)
> + return -1;
> +
> + init_cfqg(cfqd, blkcg, cfqg);
> +
> + /* Already to the top group */
> + if (!cgroup->parent)
> + return 0;
> +
> + /* Allocate CFQ groups on the chain */
> + ret = cfqg_chain_alloc(cfqd, cgroup->parent);
Can you avoid recursion and use for/while loops to initialize the
chain? We don't want to push multiple stack frames in case of a deep hierarchy.
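An iterative version of the chain allocation could first walk up to the deepest ancestor that already has a cfq_group and then allocate downward. The sketch below uses simplified stand-in types; the cgroup/cfq_group fields here are assumptions for illustration, not the kernel structures.

```c
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures. */
struct cfq_group { struct cfq_group *parent; };
struct cgroup    { struct cgroup *parent; struct cfq_group *cfqg; };

/*
 * Allocate cfq_groups for @cg and any missing ancestors without
 * recursion: walk up to the nearest ancestor that already has a
 * group, then fill in the chain top-down on the way back.
 */
static int cfqg_chain_alloc(struct cgroup *cg)
{
    struct cgroup *top = cg;

    /* 1. Find the highest cgroup that still needs a cfq_group. */
    while (top->parent && !top->parent->cfqg)
        top = top->parent;

    /* 2. Allocate top-down so each child can link to its parent. */
    for (;;) {
        struct cgroup *cur = cg;

        if (!top->cfqg) {
            top->cfqg = calloc(1, sizeof(*top->cfqg));
            if (!top->cfqg)
                return -1;
            if (top->parent)
                top->cfqg->parent = top->parent->cfqg;
        }
        if (top == cg)
            return 0;
        /* Step one level down toward @cg. */
        while (cur->parent != top)
            cur = cur->parent;
        top = cur;
    }
}
```

The same two-phase shape (walk up, then build down) would apply in the kernel, with the allocation error path unwinding any partially built chain.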
> + if (ret == -1) {
> + uninit_cfqg(cfqd, cfqg);
> + return -1;
> + }
> +
> + p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
> + p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
> + BUG_ON(p_cfqg == NULL);
> +
> + cfqg_set_parent(cfqd, cfqg, p_cfqg);
> + return 0;
> +}
> +
> +static struct cfq_group *
> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
> +{
> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
> + struct cfq_group *cfqg = NULL;
> + void *key = cfqd;
> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
> + unsigned int major, minor;
> + int ret;
> +
> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> + if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
> + cfqg->blkg.dev = MKDEV(major, minor);
> + goto done;
> + }
> + if (cfqg || !create)
> + goto done;
> +
> + /*
> + * For hierarchical cfq group scheduling, we need to allocate
> + * the whole cfq group chain.
> + */
> + ret = cfqg_chain_alloc(cfqd, cgroup);
> + if (!ret) {
> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
> + BUG_ON(cfqg == NULL);
> + goto done;
> + }
>
> done:
> return cfqg;
> @@ -1140,12 +1278,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
> {
> struct cfq_rb_root *st;
> int i, j;
> + struct cfq_entity *cfqe;
> + struct cfq_group *p_cfqg;
>
> BUG_ON(atomic_read(&cfqg->ref) <= 0);
> if (!atomic_dec_and_test(&cfqg->ref))
> return;
> for_each_cfqg_st(cfqg, i, j, st)
> BUG_ON(!RB_EMPTY_ROOT(&st->rb));
> +
> + cfqe = &cfqg->cfqe;
> + if (cfqe->parent) {
> + p_cfqg = cfqg_of_entity(cfqe->parent);
> + /* Drop the reference taken by children */
> + atomic_dec(&p_cfqg->ref);
> + }
> +
Is this the right way to free up the whole parent chain? Consider a
hierarchy test1->test2->test3 where somebody drops the reference to test3
and test1 and test2 don't have any other children. In that case, after
freeing up test3, we should be freeing up test2 and test1 as well.
I was thinking that we can achieve this by freeing up groups in a
loop:
do {
	cfqe = cfqg->entity.parent;
	if (!atomic_dec_and_test(&cfqg->ref))
		return;
	kfree(cfqg);
	cfqg = cfqg_of_entity(cfqe);
} while (cfqg);
> kfree(cfqg);
> }
>
> @@ -1358,8 +1506,6 @@ insert:
> /* Add cfqq onto service tree. */
> cfq_entity_service_tree_add(service_tree, cfqe);
>
> - update_min_vdisktime(service_tree);
> - cfqq->reposition_time = jiffies;
> if ((add_front || !new_cfqq) && !group_changed)
> return;
> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
> @@ -1802,28 +1948,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
> return cfqq_of_entity(cfq_rb_first(service_tree));
> }
>
> -static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
> +static struct cfq_entity *
> +cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - struct cfq_group *cfqg;
> - struct cfq_entity *cfqe;
> + struct cfq_entity *entity;
> int i, j;
> struct cfq_rb_root *st;
>
> if (!cfqd->rq_queued)
> return NULL;
>
> - cfqg = cfq_get_next_cfqg(cfqd);
> - if (!cfqg)
> - return NULL;
> -
> for_each_cfqg_st(cfqg, i, j, st) {
> - cfqe = cfq_rb_first(st);
> - if (cfqe != NULL)
> - return cfqq_of_entity(cfqe);
> + entity = cfq_rb_first(st);
> +
> + if (entity && !entity->is_group_entity)
> + return entity;
> + else if (entity && entity->is_group_entity) {
> + cfqg = cfqg_of_entity(entity);
> + return cfq_get_next_entity_forced(cfqd, cfqg);
> + }
> }
> return NULL;
> }
Can the above be simplified by just taking cfqd as a parameter? That will
work both for hierarchical and flat mode. I wanted to avoid recursion, as
somebody can create a deep cgroup hierarchy and push lots of frames on the stack.
struct cfq_entity *cfq_get_next_entity_forced(struct cfq_data *cfqd)
{
	struct service_tree *st = cfqd->grp_service_tree;

	do {
		cfqe = cfq_rb_first(st);
		if (is_cfqe_cfqq(cfqe))
			return cfqe;
		st = choose_service_tree_forced(cfqg);
	} while (st);
}
And choose_service_tree_forced() can be something like:

choose_service_tree_forced(cfqg) {
	for_each_cfqg_st() {
		if (st->count != 0)
			return st;
	}
}
>
> +
> /*
> * Get and set a new active queue for service.
> */
> @@ -2179,7 +2327,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> struct cfq_group *cfqg, enum wl_prio_t prio)
> {
> struct cfq_entity *cfqe;
> - struct cfq_queue *cfqq;
> unsigned long lowest_start_time;
> int i;
> bool time_valid = false;
> @@ -2191,10 +2338,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> */
> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
> cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
> - cfqq = cfqq_of_entity(cfqe);
> if (cfqe && (!time_valid ||
> - cfqq->reposition_time < lowest_start_time)) {
> - lowest_start_time = cfqq->reposition_time;
> + cfqe->reposition_time < lowest_start_time)) {
> + lowest_start_time = cfqe->reposition_time;
> cur_best = i;
> time_valid = true;
> }
> @@ -2203,47 +2349,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
> return cur_best;
> }
>
> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> +static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> unsigned slice;
> unsigned count;
> struct cfq_rb_root *st;
> unsigned group_slice;
>
> - if (!cfqg) {
> - cfqd->serving_prio = IDLE_WORKLOAD;
> - cfqd->workload_expires = jiffies + 1;
> - return;
> - }
> -
> - /* Choose next priority. RT > BE > IDLE */
> - if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
> - cfqd->serving_prio = RT_WORKLOAD;
> - else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
> - cfqd->serving_prio = BE_WORKLOAD;
> - else {
> - cfqd->serving_prio = IDLE_WORKLOAD;
> - cfqd->workload_expires = jiffies + 1;
> - return;
> - }
> -
> - /*
> - * For RT and BE, we have to choose also the type
> - * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> - * expiration time
> - */
> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> - count = st->count;
> -
> - /*
> - * check workload expiration, and that we still have other queues ready
> - */
> - if (count && !time_after(jiffies, cfqd->workload_expires))
> - return;
> -
> - /* otherwise select new workload type */
> - cfqd->serving_type =
> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> count = st->count;
>
> @@ -2284,26 +2396,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> cfqd->workload_expires = jiffies + slice;
> }
>
> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
> {
> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> - struct cfq_group *cfqg;
> - struct cfq_entity *cfqe;
> + struct cfq_rb_root *st;
> + unsigned count;
>
> - if (RB_EMPTY_ROOT(&st->rb))
> - return NULL;
> - cfqe = cfq_rb_first_entity(st);
> - cfqg = cfqg_of_entity(cfqe);
> - BUG_ON(!cfqg);
> - update_min_vdisktime(st);
> - return cfqg;
> + if (!cfqg) {
> + cfqd->serving_prio = IDLE_WORKLOAD;
> + cfqd->workload_expires = jiffies + 1;
> + return;
> + }
I am wondering where we use the above code. Do we ever call
choose_service_tree() with cfqg == NULL? I can't seem to figure it out by
looking at the code.
> +
> + /* Choose next priority. RT > BE > IDLE */
> + if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
> + cfqd->serving_prio = RT_WORKLOAD;
> + else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
> + cfqd->serving_prio = BE_WORKLOAD;
> + else {
> + cfqd->serving_prio = IDLE_WORKLOAD;
> + cfqd->workload_expires = jiffies + 1;
> + return;
> + }
> +
> + /*
> + * For RT and BE, we have to choose also the type
> + * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
> + * expiration time
> + */
> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
> + count = st->count;
> +
> + /*
> + * check workload expiration, and that we still have other queues ready
> + */
> + if (count && !time_after(jiffies, cfqd->workload_expires))
> + return;
> +
> + /* otherwise select new workload type */
> + cfqd->serving_type =
> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
> }
>
> -static void cfq_choose_cfqg(struct cfq_data *cfqd)
> +struct cfq_entity *choose_serving_entity(struct cfq_data *cfqd,
> + struct cfq_group *cfqg)
I think you don't have to pass cfqg as a parameter to this function. You
can just use cfqd as the parameter and then take all decisions based on the
service tree.
So at the top we can continue to have grp_service_tree in cfqd. In
hierarchical mode it will only have the root group queued there, and in
flat mode it can have multiple groups queued.
Also, I am looking forward to simplifying and organizing the CFQ code a
little better so that it is easier to read. Can the choose_serving_entity()
function be organized something like the following? This should work both
for flat and hierarchical modes. The following is only pseudo code.
struct cfq_entity *select_entity(struct cfq_data *cfqd)
{
	struct cfq_rb_root *st = cfqd->grp_service_tree;
	struct cfq_entity *cfqe;

	do {
		cfqe = cfq_rb_first(st);
		if (is_cfqe_cfqq(cfqe))
			/* We found the next queue to dispatch from */
			break;
		else
			st = choose_service_tree();
	} while (st);

	return cfqe;
}
> {
> - struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
> -
> - cfqd->serving_group = cfqg;
> + struct cfq_rb_root *service_tree;
>
> /* Restore the workload type data */
> if (cfqg->saved_workload_slice) {
> @@ -2314,8 +2451,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> cfqd->workload_expires = jiffies - 1;
>
> choose_service_tree(cfqd, cfqg);
> -}
>
> + service_tree = service_tree_for(cfqg, cfqd->serving_prio,
> + cfqd->serving_type);
> +
> + if (!cfqd->rq_queued)
> + return NULL;
> +
> + /* There is nothing to dispatch */
> + if (!service_tree)
> + return NULL;
> + if (RB_EMPTY_ROOT(&service_tree->rb))
> + return NULL;
> +
> + return cfq_rb_first(service_tree);
> +}
> /*
> * Select a queue for service. If we have a current active queue,
> * check whether to continue servicing it, or retrieve and set a new one.
> @@ -2323,6 +2473,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
> static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
> {
> struct cfq_queue *cfqq, *new_cfqq = NULL;
> + struct cfq_group *cfqg;
> + struct cfq_entity *entity;
>
> cfqq = cfqd->active_queue;
> if (!cfqq)
> @@ -2422,8 +2574,23 @@ new_queue:
> * Current queue expired. Check if we have to switch to a new
> * service tree
> */
> - if (!new_cfqq)
> - cfq_choose_cfqg(cfqd);
> + cfqg = &cfqd->root_group;
> +
> + if (!new_cfqq) {
> + do {
> + entity = choose_serving_entity(cfqd, cfqg);
> + if (entity && !entity->is_group_entity) {
> + /* This is the CFQ queue that should run */
> + new_cfqq = cfqq_of_entity(entity);
> + cfqd->serving_group = cfqg;
> + set_workload_expire(cfqd, cfqg);
> + break;
> + } else if (entity && entity->is_group_entity) {
> + /* Continue to lookup in this CFQ group */
> + cfqg = cfqg_of_entity(entity);
> + }
> + } while (entity && entity->is_group_entity);
I think you should move the above logic into a separate function; otherwise
select_queue() becomes too complicated.
Secondly, for traversing the hierarchy you can introduce macros like
for_each_entity() or for_each_cfqe().
Thirdly, I would again say that flat mode is already supported. Build on
top of it instead of first getting rid of it and then adding it back
with the help of a separate patch. If that is too complicated, then let
it be a single patch instead of separating it out into two patches.
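The ancestor-walking macro Vivek suggests could look like the sketch below. The macro name and the simplified entity type are assumptions; the real cfq_entity carries more fields.

```c
#include <assert.h>
#include <stddef.h>

struct cfq_entity { struct cfq_entity *parent; };

/* Walk @cfqe and each of its ancestors, root included. */
#define for_each_cfqe(cfqe) \
    for (; (cfqe) != NULL; (cfqe) = (cfqe)->parent)

/* Example user: count how deep an entity sits in the hierarchy. */
static int cfqe_depth(struct cfq_entity *cfqe)
{
    int depth = 0;

    for_each_cfqe(cfqe)
        depth++;
    return depth;
}
```

Loops like the dequeue path in cfq_group_service_tree_del() could then read `for_each_cfqe(cfqe) { ... }` instead of open-coding the parent walk.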
> + }
>
> cfqq = cfq_set_active_queue(cfqd, new_cfqq);
> keep_queue:
> @@ -2454,10 +2621,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
> {
> struct cfq_queue *cfqq;
> int dispatched = 0;
> + struct cfq_entity *entity;
> + struct cfq_group *root = &cfqd->root_group;
>
> /* Expire the timeslice of the current active queue first */
> cfq_slice_expired(cfqd, 0);
> - while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
> + while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
> + BUG_ON(entity->is_group_entity);
> + cfqq = cfqq_of_entity(entity);
> __cfq_set_active_queue(cfqd, cfqq);
> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
> }
> @@ -3991,9 +4162,6 @@ static void *cfq_init_queue(struct request_queue *q)
>
> cfqd->cic_index = i;
>
> - /* Init root service tree */
> - cfqd->grp_service_tree = CFQ_RB_ROOT;
> -
> /* Init root group */
> cfqg = &cfqd->root_group;
> for_each_cfqg_st(cfqg, i, j, st)
> @@ -4003,6 +4171,7 @@ static void *cfq_init_queue(struct request_queue *q)
> /* Give preference to root group over other groups */
> cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
> cfqg->cfqe.is_group_entity = true;
> + cfqg->cfqe.parent = NULL;
>
> #ifdef CONFIG_CFQ_GROUP_IOSCHED
> /*
> --
> 1.6.5.2
>
>
* Re: [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-12-14 3:49 ` Vivek Goyal
@ 2010-12-14 6:09 ` Gui Jianfeng
2010-12-15 7:02 ` Gui Jianfeng
1 sibling, 0 replies; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-14 6:09 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
...
>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> - struct cfq_group *cfqg;
>> - struct cfq_entity *cfqe;
>> + struct cfq_rb_root *st;
>> + unsigned count;
>>
>> - if (RB_EMPTY_ROOT(&st->rb))
>> - return NULL;
>> - cfqe = cfq_rb_first_entity(st);
>> - cfqg = cfqg_of_entity(cfqe);
>> - BUG_ON(!cfqg);
>> - update_min_vdisktime(st);
>> - return cfqg;
>> + if (!cfqg) {
>> + cfqd->serving_prio = IDLE_WORKLOAD;
>> + cfqd->workload_expires = jiffies + 1;
>> + return;
>> + }
>
> I am wondering where do we use above code. Do we ever call
> choose_service_tree() with cfqg == NULL? Can't seem to figure out by
> looking at the code.
>
Vivek,
This piece of code comes from the original CFQ code. Thinking more about it, it
seems redundant. When cfq_choose_cfqg() is called in select_queue(),
there must be at least one backlogged CFQ queue waiting for dispatch, and hence
at least one backlogged CFQ group on the service tree. So we never call
choose_service_tree() with cfqg == NULL.
I'd like to post a separate patch to get rid of this piece.
Thanks,
Gui
* Re: [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-12-14 3:49 ` Vivek Goyal
2010-12-14 6:09 ` Gui Jianfeng
@ 2010-12-15 7:02 ` Gui Jianfeng
2010-12-15 22:04 ` Vivek Goyal
1 sibling, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-15 7:02 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:45:01AM +0800, Gui Jianfeng wrote:
>> This patch makes CFQ queue and CFQ group schedules at the same level.
>> Consider the following hierarchy:
>>
>>         Root
>>        /  |  \
>>      q1   q2   G1
>>               /  \
>>             q3    G2
>>
>> q1 q2 and q3 are CFQ queues G1 and G2 are CFQ groups. With this patch, q1,
>> q2 and G1 are scheduling on a same service tree in Root CFQ group. q3 and G2
>> are schedluing under G1. Note, for the time being, CFQ group is treated
>> as "BE and SYNC" workload, and is put on "BE and SYNC" service tree. That
>> means service differentiate only happens in "BE and SYNC" service tree.
>> Later, we may introduce "IO Class" for CFQ group.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/cfq-iosched.c | 473 ++++++++++++++++++++++++++++++++++----------------
>> 1 files changed, 321 insertions(+), 152 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 6486956..d90627e 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -105,6 +105,9 @@ struct cfq_entity {
>> u64 vdisktime;
>> bool is_group_entity;
>> unsigned int weight;
>> + struct cfq_entity *parent;
>> + /* Reposition time */
>> + unsigned long reposition_time;
>> };
>>
>> /*
>> @@ -113,8 +116,6 @@ struct cfq_entity {
>> struct cfq_queue {
>> /* The schedule entity */
>> struct cfq_entity cfqe;
>> - /* Reposition time */
>> - unsigned long reposition_time;
>> /* reference count */
>> atomic_t ref;
>> /* various state flags, see below */
>> @@ -194,6 +195,9 @@ struct cfq_group {
>> /* number of cfqq currently on this group */
>> int nr_cfqq;
>>
>> + /* number of sub cfq groups */
>> + int nr_subgp;
>> +
>
> Do you really have to maintain separate count for child queue and
> child groups? Will a common count something like nr_children be
> not sufficient.
Currently, nr_subgp is only effective in hierarchical mode, while nr_cfqq works
for both hierarchical and flat mode. So for the time being, we need separate
counts for child queues and child groups.
>
>> /*
>> * Per group busy queus average. Useful for workload slice calc. We
>> * create the array for each prio class but at run time it is used
>> @@ -229,8 +233,6 @@ struct cfq_group {
>> */
>> struct cfq_data {
>> struct request_queue *queue;
>> - /* Root service tree for cfq_groups */
>> - struct cfq_rb_root grp_service_tree;
>
> I see that you are removing this service tree here and then adding it
> back in patch 7. I think it is confusing. In fact title of patch 7 is
> add flat mode. The fact the flat mode is already supported and we
> are just adding hierarchical mode on top of it. I think this is
> just a matter of better naming and patch organization so that
> it is more clear.
Ok, I will merge the hierarchical patch and the flat patch together.
>
>> struct cfq_group root_group;
>>
>> /*
>> @@ -347,8 +349,6 @@ cfqg_of_entity(struct cfq_entity *cfqe)
>> return NULL;
>> }
>>
>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
>> -
>> static struct cfq_rb_root *service_tree_for(struct cfq_group *cfqg,
>> enum wl_prio_t prio,
>> enum wl_type_t type)
>> @@ -638,10 +638,15 @@ static inline unsigned cfq_group_get_avg_queues(struct cfq_data *cfqd,
>> static inline unsigned
>> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct cfq_entity *cfqe = &cfqg->cfqe;
>> + struct cfq_rb_root *st = cfqe->service_tree;
>>
>> - return cfq_target_latency * cfqe->weight / st->total_weight;
>> + if (st)
>> + return cfq_target_latency * cfqe->weight
>> + / st->total_weight;
>
> Is it still true in hierarchical mode. Previously group used to be
> at the top and there used to be only one service tree for groups so
> st->total_weight represented total weight in the system.
>
> Now with hierarhcy this will not/should not be true. So a group slice
> calculation should be different?
I just kept the original group slice calculation here. I was thinking of
calculating the group slice in a hierarchical way, but that might produce a
really small group slice, and I'm not sure how well it would work. So I just
kept the original calculation. Any thoughts?
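The hierarchical variant under discussion would presumably scale the target latency by the group's weight fraction at every level up to the root, which is what makes small slices likely with deep hierarchies. A userspace sketch (the types, field names, and exact formula are assumptions):

```c
#include <assert.h>
#include <stddef.h>

#define CFQ_TARGET_LATENCY 300 /* ms; cfq_target_latency defaults to 300ms */

struct cfq_entity {
    struct cfq_entity *parent;
    unsigned int weight;
    unsigned int st_total_weight; /* total weight on its service tree */
};

/*
 * Scale the target latency by weight/total_weight at each level.
 * With deep hierarchies the product shrinks quickly, which is the
 * "really small group slice" concern raised above.
 */
static unsigned int cfq_group_slice_hier(struct cfq_entity *cfqe)
{
    unsigned int slice = CFQ_TARGET_LATENCY;

    for (; cfqe != NULL; cfqe = cfqe->parent)
        slice = slice * cfqe->weight / cfqe->st_total_weight;
    return slice;
}
```

For example, a group holding half the weight at each of two levels would end up with a quarter of the target latency, and each extra level halves it again.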
>
>> + else
>> + /* If this is the root group, give it a full slice. */
>> + return cfq_target_latency;
>> }
>>
>> static inline void
>> @@ -804,17 +809,6 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
>> return NULL;
>> }
>>
>> -static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
>> -{
>> - if (!root->left)
>> - root->left = rb_first(&root->rb);
>> -
>> - if (root->left)
>> - return rb_entry_entity(root->left);
>> -
>> - return NULL;
>> -}
>> -
>> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
>> {
>> rb_erase(n, root);
>> @@ -888,12 +882,15 @@ __cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>>
>> rb_link_node(&cfqe->rb_node, parent, node);
>> rb_insert_color(&cfqe->rb_node, &st->rb);
>> +
>> + update_min_vdisktime(st);
>> }
>>
>> static void
>> cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>> {
>> __cfq_entity_service_tree_add(st, cfqe);
>> + cfqe->reposition_time = jiffies;
>> st->count++;
>> st->total_weight += cfqe->weight;
>> }
>> @@ -901,34 +898,57 @@ cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>> static void
>> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct cfq_entity *cfqe = &cfqg->cfqe;
>> - struct cfq_entity *__cfqe;
>> struct rb_node *n;
>> + struct cfq_entity *entity;
>> + struct cfq_rb_root *st;
>> + struct cfq_group *__cfqg;
>>
>> cfqg->nr_cfqq++;
>> +
>> + /*
>> + * Root group doesn't belongs to any service
>> + */
>> + if (cfqg == &cfqd->root_group)
>> + return;
>
> Can we keep the root group on cfqd->grp_service_tree? In hierarchical mode
> there will be only one group on the grp service tree, and in flat mode there
> can be many.
Keeping the top service tree different for hierarchical mode and flat mode is
just fine to me. If you don't strongly object, I'd like to keep the current
way. :)
>
>> +
>> if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> return;
>>
>> /*
>> - * Currently put the group at the end. Later implement something
>> - * so that groups get lesser vtime based on their weights, so that
>> - * if group does not loose all if it was not continously backlogged.
>> + * Enqueue this group and its ancestors onto their service tree.
>> */
>> - n = rb_last(&st->rb);
>> - if (n) {
>> - __cfqe = rb_entry_entity(n);
>> - cfqe->vdisktime = __cfqe->vdisktime + CFQ_IDLE_DELAY;
>> - } else
>> - cfqe->vdisktime = st->min_vdisktime;
>> + while (cfqe && cfqe->parent) {
>> + if (!RB_EMPTY_NODE(&cfqe->rb_node))
>> + return;
>> +
>> + /*
>> + * Currently put the group at the end. Later implement
>> + * something so that groups get lesser vtime based on their
>> + * weights, so that if group does not loose all if it was not
>> + * continously backlogged.
>> + */
>> + st = cfqe->service_tree;
>> + n = rb_last(&st->rb);
>> + if (n) {
>> + entity = rb_entry_entity(n);
>> + cfqe->vdisktime = entity->vdisktime +
>> + CFQ_IDLE_DELAY;
>> + } else
>> + cfqe->vdisktime = st->min_vdisktime;
>>
>> - cfq_entity_service_tree_add(st, cfqe);
>> + cfq_entity_service_tree_add(st, cfqe);
>> + cfqe = cfqe->parent;
>> + __cfqg = cfqg_of_entity(cfqe);
>> + __cfqg->nr_subgp++;
>> + }
>> }
>>
>> static void
>> __cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>> {
>> cfq_rb_erase(&cfqe->rb_node, st);
>> + update_min_vdisktime(st);
>> }
>>
>> static void
>> @@ -937,27 +957,47 @@ cfq_entity_service_tree_del(struct cfq_rb_root *st, struct cfq_entity *cfqe)
>> if (!RB_EMPTY_NODE(&cfqe->rb_node)) {
>> __cfq_entity_service_tree_del(st, cfqe);
>> st->total_weight -= cfqe->weight;
>> - cfqe->service_tree = NULL;
>> }
>> }
>>
>> static void
>> cfq_group_service_tree_del(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> struct cfq_entity *cfqe = &cfqg->cfqe;
>> + struct cfq_group *__cfqg, *p_cfqg;
>>
>> BUG_ON(cfqg->nr_cfqq < 1);
>> cfqg->nr_cfqq--;
>>
>> + /*
>> + * Root group doesn't belongs to any service
>> + */
>> + if (cfqg == &cfqd->root_group)
>> + return;
>> +
>> /* If there are other cfq queues under this group, don't delete it */
>> if (cfqg->nr_cfqq)
>> return;
>>
>> - cfq_log_cfqg(cfqd, cfqg, "del_from_rr group");
>> - cfq_entity_service_tree_del(st, cfqe);
>> - cfqg->saved_workload_slice = 0;
>> - cfq_blkiocg_update_dequeue_stats(&cfqg->blkg, 1);
>> + /* If child group exists, don't dequeue it */
>> + if (cfqg->nr_subgp)
>> + return;
>> +
>> + /*
>> + * Dequeue this group and its ancestors from their service tree.
>> + */
>> + while (cfqe && cfqe->parent) {
>> + __cfqg = cfqg_of_entity(cfqe);
>> + p_cfqg = cfqg_of_entity(cfqe->parent);
>> + cfq_entity_service_tree_del(cfqe->service_tree, cfqe);
>> + cfq_blkiocg_update_dequeue_stats(&__cfqg->blkg, 1);
>> + cfq_log_cfqg(cfqd, __cfqg, "del_from_rr group");
>> + __cfqg->saved_workload_slice = 0;
>> + cfqe = cfqe->parent;
>> + p_cfqg->nr_subgp--;
>> + if (p_cfqg->nr_cfqq || p_cfqg->nr_subgp)
>> + return;
>> + }
>> }
>
> I think once you merge the queue/group algorithms, you can use the same
> functions for adding/deleting queue/group entities and don't have to use
> separate functions for groups?
The CFQ entity adding/deleting for queues and groups is almost the same, and
I've already extracted common functions to handle it:
cfq_entity_service_tree_add() and cfq_entity_service_tree_del()
>
> [..]
>> - cfqg->cfqe.weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
>> +int cfqg_chain_alloc(struct cfq_data *cfqd, struct cgroup *cgroup)
>> +{
>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + struct blkio_cgroup *p_blkcg;
>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>> + unsigned int major, minor;
>> + struct cfq_group *cfqg, *p_cfqg;
>> + void *key = cfqd;
>> + int ret;
>>
>> - /* Add group on cfqd list */
>> - hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + if (cfqg) {
>> + if (!cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> + cfqg->blkg.dev = MKDEV(major, minor);
>> + }
>> + /* chain has already been built */
>> + return 0;
>> + }
>> +
>> + cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
>> + if (!cfqg)
>> + return -1;
>> +
>> + init_cfqg(cfqd, blkcg, cfqg);
>> +
>> + /* Already to the top group */
>> + if (!cgroup->parent)
>> + return 0;
>> +
>> + /* Allocate CFQ groups on the chain */
>> + ret = cfqg_chain_alloc(cfqd, cgroup->parent);
>
> Can you avoid recursion and use for/while loops to initialize the chain?
> We don't want to push multiple stack frames in case of a deep hierarchy.
>
OK, will change.
>> + if (ret == -1) {
>> + uninit_cfqg(cfqd, cfqg);
>> + return -1;
>> + }
>> +
>> + p_blkcg = cgroup_to_blkio_cgroup(cgroup->parent);
>> + p_cfqg = cfqg_of_blkg(blkiocg_lookup_group(p_blkcg, key));
>> + BUG_ON(p_cfqg == NULL);
>> +
>> + cfqg_set_parent(cfqd, cfqg, p_cfqg);
>> + return 0;
>> +}
>> +
>> +static struct cfq_group *
>> +cfq_find_alloc_cfqg(struct cfq_data *cfqd, struct cgroup *cgroup, int create)
>> +{
>> + struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + struct cfq_group *cfqg = NULL;
>> + void *key = cfqd;
>> + struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
>> + unsigned int major, minor;
>> + int ret;
>> +
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
>> + sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
>> + cfqg->blkg.dev = MKDEV(major, minor);
>> + goto done;
>> + }
>> + if (cfqg || !create)
>> + goto done;
>> +
>> + /*
>> + * For hierarchical cfq group scheduling, we need to allocate
>> + * the whole cfq group chain.
>> + */
>> + ret = cfqg_chain_alloc(cfqd, cgroup);
>> + if (!ret) {
>> + cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
>> + BUG_ON(cfqg == NULL);
>> + goto done;
>> + }
>>
>> done:
>> return cfqg;
>> @@ -1140,12 +1278,22 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
>> {
>> struct cfq_rb_root *st;
>> int i, j;
>> + struct cfq_entity *cfqe;
>> + struct cfq_group *p_cfqg;
>>
>> BUG_ON(atomic_read(&cfqg->ref) <= 0);
>> if (!atomic_dec_and_test(&cfqg->ref))
>> return;
>> for_each_cfqg_st(cfqg, i, j, st)
>> BUG_ON(!RB_EMPTY_ROOT(&st->rb));
>> +
>> + cfqe = &cfqg->cfqe;
>> + if (cfqe->parent) {
>> + p_cfqg = cfqg_of_entity(cfqe->parent);
>> + /* Drop the reference taken by children */
>> + atomic_dec(&p_cfqg->ref);
>> + }
>> +
>
> Is this the right way to free up the whole parent chain? Consider a
> hierarchy of test1->test2->test3 where somebody drops the reference to test3
> and test1 and test2 don't have any other children. In that case, after
> freeing up test3, we should be freeing up test2 and test1 also.
>
> I was thinking that we can achieve this by freeing up groups in a
> loop.
>
> do {
> cfqe = cfqg->entity.parent;
> if (!atomic_dec_and_test(&cfqg->ref))
> return;
> kfree(cfqg);
> cfqg = cfqg_of_entity(cfqe);
> } while(cfqg)
>
OK
>> kfree(cfqg);
>> }
>>
>> @@ -1358,8 +1506,6 @@ insert:
>> /* Add cfqq onto service tree. */
>> cfq_entity_service_tree_add(service_tree, cfqe);
>>
>> - update_min_vdisktime(service_tree);
>> - cfqq->reposition_time = jiffies;
>> if ((add_front || !new_cfqq) && !group_changed)
>> return;
>> cfq_group_service_tree_add(cfqd, cfqq->cfqg);
>> @@ -1802,28 +1948,30 @@ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
>> return cfqq_of_entity(cfq_rb_first(service_tree));
>> }
>>
>> -static struct cfq_queue *cfq_get_next_queue_forced(struct cfq_data *cfqd)
>> +static struct cfq_entity *
>> +cfq_get_next_entity_forced(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_group *cfqg;
>> - struct cfq_entity *cfqe;
>> + struct cfq_entity *entity;
>> int i, j;
>> struct cfq_rb_root *st;
>>
>> if (!cfqd->rq_queued)
>> return NULL;
>>
>> - cfqg = cfq_get_next_cfqg(cfqd);
>> - if (!cfqg)
>> - return NULL;
>> -
>> for_each_cfqg_st(cfqg, i, j, st) {
>> - cfqe = cfq_rb_first(st);
>> - if (cfqe != NULL)
>> - return cfqq_of_entity(cfqe);
>> + entity = cfq_rb_first(st);
>> +
>> + if (entity && !entity->is_group_entity)
>> + return entity;
>> + else if (entity && entity->is_group_entity) {
>> + cfqg = cfqg_of_entity(entity);
>> + return cfq_get_next_entity_forced(cfqd, cfqg);
>> + }
>> }
>> return NULL;
>> }
>
> Can the above be simplified by just taking cfqd as a parameter? It will work
> both for hierarchical and flat mode. I wanted to avoid recursion, as somebody
> can create a deep cgroup hierarchy and push lots of frames on the stack.
Ok, will change.
>
> struct cfq_entity *cfq_get_next_entity_forced(struct cfq_data *cfqd)
> {
> struct service_tree *st = cfqd->grp_service_tree;
>
> do {
> cfqe = cfq_rb_first(st);
> if (is_cfqe_cfqq(cfqe))
> return cfqe;
> st = choose_service_tree_forced(cfqg);
> } while (st);
> }
>
> And choose_service_tree_forced() can be something like.
>
> choose_service_tree_forced(cfqg) {
> for_each_cfqg_st() {
> if (st->count !=0)
> return st;
> }
> }
>
>>
>> +
>> /*
>> * Get and set a new active queue for service.
>> */
>> @@ -2179,7 +2327,6 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> struct cfq_group *cfqg, enum wl_prio_t prio)
>> {
>> struct cfq_entity *cfqe;
>> - struct cfq_queue *cfqq;
>> unsigned long lowest_start_time;
>> int i;
>> bool time_valid = false;
>> @@ -2191,10 +2338,9 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> */
>> for (i = 0; i <= SYNC_WORKLOAD; ++i) {
>> cfqe = cfq_rb_first(service_tree_for(cfqg, prio, i));
>> - cfqq = cfqq_of_entity(cfqe);
>> if (cfqe && (!time_valid ||
>> - cfqq->reposition_time < lowest_start_time)) {
>> - lowest_start_time = cfqq->reposition_time;
>> + cfqe->reposition_time < lowest_start_time)) {
>> + lowest_start_time = cfqe->reposition_time;
>> cur_best = i;
>> time_valid = true;
>> }
>> @@ -2203,47 +2349,13 @@ static enum wl_type_t cfq_choose_wl(struct cfq_data *cfqd,
>> return cur_best;
>> }
>>
>> -static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> +static void set_workload_expire(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> unsigned slice;
>> unsigned count;
>> struct cfq_rb_root *st;
>> unsigned group_slice;
>>
>> - if (!cfqg) {
>> - cfqd->serving_prio = IDLE_WORKLOAD;
>> - cfqd->workload_expires = jiffies + 1;
>> - return;
>> - }
>> -
>> - /* Choose next priority. RT > BE > IDLE */
>> - if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>> - cfqd->serving_prio = RT_WORKLOAD;
>> - else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>> - cfqd->serving_prio = BE_WORKLOAD;
>> - else {
>> - cfqd->serving_prio = IDLE_WORKLOAD;
>> - cfqd->workload_expires = jiffies + 1;
>> - return;
>> - }
>> -
>> - /*
>> - * For RT and BE, we have to choose also the type
>> - * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>> - * expiration time
>> - */
>> - st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> - count = st->count;
>> -
>> - /*
>> - * check workload expiration, and that we still have other queues ready
>> - */
>> - if (count && !time_after(jiffies, cfqd->workload_expires))
>> - return;
>> -
>> - /* otherwise select new workload type */
>> - cfqd->serving_type =
>> - cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>> st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> count = st->count;
>>
>> @@ -2284,26 +2396,51 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> cfqd->workload_expires = jiffies + slice;
>> }
>>
>> -static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
>> +static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
>> {
>> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
>> - struct cfq_group *cfqg;
>> - struct cfq_entity *cfqe;
>> + struct cfq_rb_root *st;
>> + unsigned count;
>>
>> - if (RB_EMPTY_ROOT(&st->rb))
>> - return NULL;
>> - cfqe = cfq_rb_first_entity(st);
>> - cfqg = cfqg_of_entity(cfqe);
>> - BUG_ON(!cfqg);
>> - update_min_vdisktime(st);
>> - return cfqg;
>> + if (!cfqg) {
>> + cfqd->serving_prio = IDLE_WORKLOAD;
>> + cfqd->workload_expires = jiffies + 1;
>> + return;
>> + }
>
> I am wondering where we use the above code. Do we ever call
> choose_service_tree() with cfqg == NULL? I can't seem to figure it out by
> looking at the code.
Already fixed by a separate patch. ;)
>
>> +
>> + /* Choose next priority. RT > BE > IDLE */
>> + if (cfq_group_busy_queues_wl(RT_WORKLOAD, cfqd, cfqg))
>> + cfqd->serving_prio = RT_WORKLOAD;
>> + else if (cfq_group_busy_queues_wl(BE_WORKLOAD, cfqd, cfqg))
>> + cfqd->serving_prio = BE_WORKLOAD;
>> + else {
>> + cfqd->serving_prio = IDLE_WORKLOAD;
>> + cfqd->workload_expires = jiffies + 1;
>> + return;
>> + }
>> +
>> + /*
>> + * For RT and BE, we have to choose also the type
>> + * (SYNC, SYNC_NOIDLE, ASYNC), and to compute a workload
>> + * expiration time
>> + */
>> + st = service_tree_for(cfqg, cfqd->serving_prio, cfqd->serving_type);
>> + count = st->count;
>> +
>> + /*
>> + * check workload expiration, and that we still have other queues ready
>> + */
>> + if (count && !time_after(jiffies, cfqd->workload_expires))
>> + return;
>> +
>> + /* otherwise select new workload type */
>> + cfqd->serving_type =
>> + cfq_choose_wl(cfqd, cfqg, cfqd->serving_prio);
>> }
>>
>> -static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> +struct cfq_entity *choose_serving_entity(struct cfq_data *cfqd,
>> + struct cfq_group *cfqg)
>
> I think for this function you don't have to pass cfqg as parameter. You
> can just use cfqd as parameter and then take all decisions based on
> service tree.
>
> So at top we can continue to have grp_service_tree in cfqd. In
> hierarchical mode it will only have root group queued there and in
> flat mode it can have multiple groups queued.
>
> Also I am looking forward to simplifying and organizing the CFQ code a
> little better so that it is easy to read. Can the choose_serving_entity()
> function be organized something like the following? This should work for
> both flat and hierarchical modes. The following is only pseudo code.
>
> struct cfq_entity *select_entity(struct cfq_data *cfqd)
> {
> struct cfq_rb_root *st = cfqd->grp_service_tree;
> struct cfq_entity *cfqe;
>
> do {
> cfqe = cfq_rb_first(st);
> if (is_cfqe_cfqq(cfqe))
> /* We found the next queue to dispatch from */
> break;
> else
> st = choose_service_tree();
> } while (st)
>
> return cfqe;
> }
>
Ok, I'll refine the code to make it easier to read.
>> {
>> - struct cfq_group *cfqg = cfq_get_next_cfqg(cfqd);
>> -
>> - cfqd->serving_group = cfqg;
>> + struct cfq_rb_root *service_tree;
>>
>> /* Restore the workload type data */
>> if (cfqg->saved_workload_slice) {
>> @@ -2314,8 +2451,21 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> cfqd->workload_expires = jiffies - 1;
>>
>> choose_service_tree(cfqd, cfqg);
>> -}
>>
>> + service_tree = service_tree_for(cfqg, cfqd->serving_prio,
>> + cfqd->serving_type);
>> +
>> + if (!cfqd->rq_queued)
>> + return NULL;
>> +
>> + /* There is nothing to dispatch */
>> + if (!service_tree)
>> + return NULL;
>> + if (RB_EMPTY_ROOT(&service_tree->rb))
>> + return NULL;
>> +
>> + return cfq_rb_first(service_tree);
>> +}
>> /*
>> * Select a queue for service. If we have a current active queue,
>> * check whether to continue servicing it, or retrieve and set a new one.
>> @@ -2323,6 +2473,8 @@ static void cfq_choose_cfqg(struct cfq_data *cfqd)
>> static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>> {
>> struct cfq_queue *cfqq, *new_cfqq = NULL;
>> + struct cfq_group *cfqg;
>> + struct cfq_entity *entity;
>>
>> cfqq = cfqd->active_queue;
>> if (!cfqq)
>> @@ -2422,8 +2574,23 @@ new_queue:
>> * Current queue expired. Check if we have to switch to a new
>> * service tree
>> */
>> - if (!new_cfqq)
>> - cfq_choose_cfqg(cfqd);
>> + cfqg = &cfqd->root_group;
>> +
>> + if (!new_cfqq) {
>> + do {
>> + entity = choose_serving_entity(cfqd, cfqg);
>> + if (entity && !entity->is_group_entity) {
>> + /* This is the CFQ queue that should run */
>> + new_cfqq = cfqq_of_entity(entity);
>> + cfqd->serving_group = cfqg;
>> + set_workload_expire(cfqd, cfqg);
>> + break;
>> + } else if (entity && entity->is_group_entity) {
>> + /* Continue to lookup in this CFQ group */
>> + cfqg = cfqg_of_entity(entity);
>> + }
>> + } while (entity && entity->is_group_entity);
>
> I think you should move the above logic into a separate function, otherwise
> select_queue() is becoming complicated.
>
> Secondly, for traversing the hierarchy you can introduce macros like
> for_each_entity() or for_each_cfqe() etc.
>
> Thirdly, I would again say that flat mode is already supported. Build on
> top of it instead of first getting rid of it and then adding it back
> with the help of a separate patch. If it is too complicated, then let
> it be a single patch instead of separating it out into two patches.
Ok
Thanks for reviewing, Vivek. Will post an updated patch.
Gui
>
>> + }
>
>>
>> cfqq = cfq_set_active_queue(cfqd, new_cfqq);
>> keep_queue:
>> @@ -2454,10 +2621,14 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
>> {
>> struct cfq_queue *cfqq;
>> int dispatched = 0;
>> + struct cfq_entity *entity;
>> + struct cfq_group *root = &cfqd->root_group;
>>
>> /* Expire the timeslice of the current active queue first */
>> cfq_slice_expired(cfqd, 0);
>> - while ((cfqq = cfq_get_next_queue_forced(cfqd)) != NULL) {
>> + while ((entity = cfq_get_next_entity_forced(cfqd, root)) != NULL) {
>> + BUG_ON(entity->is_group_entity);
>> + cfqq = cfqq_of_entity(entity);
>> __cfq_set_active_queue(cfqd, cfqq);
>> dispatched += __cfq_forced_dispatch_cfqq(cfqq);
>> }
>> @@ -3991,9 +4162,6 @@ static void *cfq_init_queue(struct request_queue *q)
>>
>> cfqd->cic_index = i;
>>
>> - /* Init root service tree */
>> - cfqd->grp_service_tree = CFQ_RB_ROOT;
>> -
>> /* Init root group */
>> cfqg = &cfqd->root_group;
>> for_each_cfqg_st(cfqg, i, j, st)
>> @@ -4003,6 +4171,7 @@ static void *cfq_init_queue(struct request_queue *q)
>> /* Give preference to root group over other groups */
>> cfqg->cfqe.weight = 2*BLKIO_WEIGHT_DEFAULT;
>> cfqg->cfqe.is_group_entity = true;
>> + cfqg->cfqe.parent = NULL;
>>
>> #ifdef CONFIG_CFQ_GROUP_IOSCHED
>> /*
>> --
>> 1.6.5.2
>>
>>
>
* Re: [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
2010-12-13 1:45 ` [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality Gui Jianfeng
@ 2010-12-15 21:26 ` Vivek Goyal
2010-12-16 2:42 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-15 21:26 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:45:07AM +0800, Gui Jianfeng wrote:
> This patch adds "use_hierarchy" in the root cgroup, without any functionality.
>
> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> ---
> block/blk-cgroup.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--
> block/blk-cgroup.h | 5 +++-
> block/cfq-iosched.c | 24 +++++++++++++++++
> 3 files changed, 97 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 455768a..9747ebb 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -25,7 +25,10 @@
> static DEFINE_SPINLOCK(blkio_list_lock);
> static LIST_HEAD(blkio_list);
>
> -struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
> +struct blkio_cgroup blkio_root_cgroup = {
> + .weight = 2*BLKIO_WEIGHT_DEFAULT,
> + .use_hierarchy = 1,
Currently flat mode is the default. Let's not change the default, so let's
start with use_hierarchy = 0.
Secondly, why don't you make it per-cgroup, along the lines of the memory
controller, where one can start the hierarchy lower in the cgroup chain and
not necessarily at the root? This way we can avoid some accounting overhead
for all the groups which are non-hierarchical.
> + };
> EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>
> static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
> @@ -1385,10 +1388,73 @@ struct cftype blkio_files[] = {
> #endif
> };
>
> +static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
> + struct cftype *cftype)
> +{
> + struct blkio_cgroup *blkcg;
> +
> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> + return (u64)blkcg->use_hierarchy;
> +}
> +
> +static int
> +blkiocg_use_hierarchy_write(struct cgroup *cgroup,
> + struct cftype *cftype, u64 val)
> +{
> + struct blkio_cgroup *blkcg;
> + struct blkio_group *blkg;
> + struct hlist_node *n;
> + struct blkio_policy_type *blkiop;
> +
> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> +
> + if (val > 1 || !list_empty(&cgroup->children))
> + return -EINVAL;
> +
> + if (blkcg->use_hierarchy == val)
> + return 0;
> +
> + spin_lock(&blkio_list_lock);
> + blkcg->use_hierarchy = val;
> +
> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
> + list_for_each_entry(blkiop, &blkio_list, list) {
> + /*
> + * If this policy does not own the blkg, do not change
> + * cfq group scheduling mode.
> + */
> + if (blkiop->plid != blkg->plid)
> + continue;
> +
> + if (blkiop->ops.blkio_update_use_hierarchy_fn)
> + blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
> + val);
Should we really allow this? I mean, allow changing the hierarchy of a group
when there are already child groups. I think the memory controller does not
allow this. We can design along the same lines: keep use_hierarchy 0 by
default and allow changing it only if there are no child cgroups. Otherwise
we would have to send notifications to subscribing policies and then change
their structures, etc. Let's keep it simple.
I was playing with a use_hierarchy patch for throttling, parts of which were
copied from the memory controller. I am attaching it to this mail; see if
you can make it work.
---
block/blk-cgroup.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
block/blk-cgroup.h | 2 +
2 files changed, 60 insertions(+), 1 deletion(-)
Index: linux-2.6/block/blk-cgroup.c
===================================================================
--- linux-2.6.orig/block/blk-cgroup.c 2010-11-19 10:30:27.129704770 -0500
+++ linux-2.6/block/blk-cgroup.c 2010-11-19 10:30:29.885671705 -0500
@@ -1214,6 +1214,39 @@ static int blkio_weight_write(struct blk
return 0;
}
+static int blkio_throtl_use_hierarchy_write(struct cgroup *cgrp, u64 val)
+{
+ struct cgroup *parent = cgrp->parent;
+ struct blkio_cgroup *blkcg, *parent_blkcg;
+ int ret = 0;
+
+ if (val != 0 && val != 1)
+ return -EINVAL;
+
+ blkcg = cgroup_to_blkio_cgroup(cgrp);
+ if (parent)
+ parent_blkcg = cgroup_to_blkio_cgroup(parent);
+
+ cgroup_lock();
+ /*
+ * If parent's use_hierarchy is set, we can't make any modifications
+ * in the child subtrees. If it is unset, then the change can
+ * occur, provided the current cgroup has no children.
+ *
+ * For the root cgroup, parent_mem is NULL, we allow value to be
+ * set if there are no children.
+ */
+ if (!parent_blkcg || !parent_blkcg->throtl_use_hier) {
+ if (list_empty(&cgrp->children))
+ blkcg->throtl_use_hier = val;
+ else
+ ret = -EBUSY;
+ } else
+ ret = -EINVAL;
+ cgroup_unlock();
+ return ret;
+}
+
static u64 blkiocg_file_read_u64 (struct cgroup *cgrp, struct cftype *cft) {
struct blkio_cgroup *blkcg;
enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
@@ -1228,6 +1261,12 @@ static u64 blkiocg_file_read_u64 (struct
return (u64)blkcg->weight;
}
break;
+ case BLKIO_POLICY_THROTL:
+ switch(name) {
+ case BLKIO_THROTL_use_hierarchy:
+ return (u64)blkcg->throtl_use_hier;
+ }
+ break;
default:
BUG();
}
@@ -1250,6 +1289,12 @@ blkiocg_file_write_u64(struct cgroup *cg
return blkio_weight_write(blkcg, val);
}
break;
+ case BLKIO_POLICY_THROTL:
+ switch(name) {
+ case BLKIO_THROTL_use_hierarchy:
+ return blkio_throtl_use_hierarchy_write(cgrp, val);
+ }
+ break;
default:
BUG();
}
@@ -1373,6 +1418,13 @@ struct cftype blkio_files[] = {
BLKIO_THROTL_io_serviced),
.read_map = blkiocg_file_read_map,
},
+ {
+ .name = "throttle.use_hierarchy",
+ .private = BLKIOFILE_PRIVATE(BLKIO_POLICY_THROTL,
+ BLKIO_THROTL_use_hierarchy),
+ .read_u64 = blkiocg_file_read_u64,
+ .write_u64 = blkiocg_file_write_u64,
+ },
#endif /* CONFIG_BLK_DEV_THROTTLING */
#ifdef CONFIG_DEBUG_BLK_CGROUP
@@ -1470,7 +1522,7 @@ static void blkiocg_destroy(struct cgrou
static struct cgroup_subsys_state *
blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
{
- struct blkio_cgroup *blkcg;
+ struct blkio_cgroup *blkcg, *parent_blkcg = NULL;
struct cgroup *parent = cgroup->parent;
if (!parent) {
@@ -1483,11 +1535,16 @@ blkiocg_create(struct cgroup_subsys *sub
return ERR_PTR(-ENOMEM);
blkcg->weight = BLKIO_WEIGHT_DEFAULT;
+ parent_blkcg = cgroup_to_blkio_cgroup(parent);
done:
spin_lock_init(&blkcg->lock);
INIT_HLIST_HEAD(&blkcg->blkg_list);
INIT_LIST_HEAD(&blkcg->policy_list);
+ if (parent)
+ blkcg->throtl_use_hier = parent_blkcg->throtl_use_hier;
+ else
+ blkcg->throtl_use_hier = 0;
return &blkcg->css;
}
Index: linux-2.6/block/blk-cgroup.h
===================================================================
--- linux-2.6.orig/block/blk-cgroup.h 2010-11-19 10:15:56.321149940 -0500
+++ linux-2.6/block/blk-cgroup.h 2010-11-19 10:30:29.885671705 -0500
@@ -100,11 +100,13 @@ enum blkcg_file_name_throtl {
BLKIO_THROTL_write_iops_device,
BLKIO_THROTL_io_service_bytes,
BLKIO_THROTL_io_serviced,
+ BLKIO_THROTL_use_hierarchy,
};
struct blkio_cgroup {
struct cgroup_subsys_state css;
unsigned int weight;
+ bool throtl_use_hier;
spinlock_t lock;
struct hlist_head blkg_list;
/*
* Re: [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level
2010-12-15 7:02 ` Gui Jianfeng
@ 2010-12-15 22:04 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-15 22:04 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Wed, Dec 15, 2010 at 03:02:36PM +0800, Gui Jianfeng wrote:
[..]
> >> static inline unsigned
> >> cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> struct cfq_entity *cfqe = &cfqg->cfqe;
> >> + struct cfq_rb_root *st = cfqe->service_tree;
> >>
> >> - return cfq_target_latency * cfqe->weight / st->total_weight;
> >> + if (st)
> >> + return cfq_target_latency * cfqe->weight
> >> + / st->total_weight;
> >
> > Is it still true in hierarchical mode. Previously group used to be
> > at the top and there used to be only one service tree for groups so
> > st->total_weight represented total weight in the system.
> >
> > Now with hierarchy this will not/should not be true. So a group slice
> > calculation should be different?
>
> I just keep the original group slice calculation here. I was thinking that
> calculate group slice in a hierachical way, this might get a really small
> group slice and not sure how it works. So I just keep the original calculation.
> Any thoughts?
Corrado already had minimum per-queue limits (16ms or something), so don't
worry about it getting too small. But we have to do the hierarchical group
share calculation, otherwise what's the point of writing this code and all
the logic of trying to meet the soft latency of 300ms?
>
> >
> >> + else
> >> + /* If this is the root group, give it a full slice. */
> >> + return cfq_target_latency;
> >> }
> >>
> >> static inline void
> >> @@ -804,17 +809,6 @@ static struct cfq_entity *cfq_rb_first(struct cfq_rb_root *root)
> >> return NULL;
> >> }
> >>
> >> -static struct cfq_entity *cfq_rb_first_entity(struct cfq_rb_root *root)
> >> -{
> >> - if (!root->left)
> >> - root->left = rb_first(&root->rb);
> >> -
> >> - if (root->left)
> >> - return rb_entry_entity(root->left);
> >> -
> >> - return NULL;
> >> -}
> >> -
> >> static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> >> {
> >> rb_erase(n, root);
> >> @@ -888,12 +882,15 @@ __cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> >>
> >> rb_link_node(&cfqe->rb_node, parent, node);
> >> rb_insert_color(&cfqe->rb_node, &st->rb);
> >> +
> >> + update_min_vdisktime(st);
> >> }
> >>
> >> static void
> >> cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> >> {
> >> __cfq_entity_service_tree_add(st, cfqe);
> >> + cfqe->reposition_time = jiffies;
> >> st->count++;
> >> st->total_weight += cfqe->weight;
> >> }
> >> @@ -901,34 +898,57 @@ cfq_entity_service_tree_add(struct cfq_rb_root *st, struct cfq_entity *cfqe)
> >> static void
> >> cfq_group_service_tree_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
> >> {
> >> - struct cfq_rb_root *st = &cfqd->grp_service_tree;
> >> struct cfq_entity *cfqe = &cfqg->cfqe;
> >> - struct cfq_entity *__cfqe;
> >> struct rb_node *n;
> >> + struct cfq_entity *entity;
> >> + struct cfq_rb_root *st;
> >> + struct cfq_group *__cfqg;
> >>
> >> cfqg->nr_cfqq++;
> >> +
> >> + /*
> >> + * Root group doesn't belongs to any service
> >> + */
> >> + if (cfqg == &cfqd->root_group)
> >> + return;
> >
> > Can we keep root group on cfqd->grp_service_tree? In hierarchical mode
> > there will be only 1 group on grp service tree and in flat mode there
> > can be many.
>
> Keep top service tree different for hierarchical mode and flat mode is just
> fine to me. If you don't strongly object, I'd to keep the current way. :)
I am saying that we should keep one top tree for both hierarchical and flat
mode, not separate trees.
for flat mode everything goes on cfqd->grp_service_tree.
grp_service_tree
/ | \
root test1 test2
for hierarchical mode it will look as follows.
grp_service_tree
|
root
/ \
test1 test2
Or it could look as follows if the user has set use_hier=1 in test2 only.
grp_service_tree
| | |
root test1 test2
|
test3
Thanks
Vivek
* Re: [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
2010-12-15 21:26 ` Vivek Goyal
@ 2010-12-16 2:42 ` Gui Jianfeng
2010-12-16 15:44 ` Vivek Goyal
0 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-16 2:42 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Mon, Dec 13, 2010 at 09:45:07AM +0800, Gui Jianfeng wrote:
>> This patch adds "use_hierarchy" in the root cgroup, without any functionality.
>>
>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>> ---
>> block/blk-cgroup.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--
>> block/blk-cgroup.h | 5 +++-
>> block/cfq-iosched.c | 24 +++++++++++++++++
>> 3 files changed, 97 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 455768a..9747ebb 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -25,7 +25,10 @@
>> static DEFINE_SPINLOCK(blkio_list_lock);
>> static LIST_HEAD(blkio_list);
>>
>> -struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
>> +struct blkio_cgroup blkio_root_cgroup = {
>> + .weight = 2*BLKIO_WEIGHT_DEFAULT,
>> + .use_hierarchy = 1,
>
> Currently flat mode is the default. Let's not change the default. So let's
> start with use_hierarchy = 0.
OK, will do.
>
> Secondly, why don't you make it per cgroup something along the lines of
> memory controller where one can start the hierarchy lower in the cgroup
> chain and not necessarily at the root. This way we can avoid some
> accounting overhead for all the groups which are non-hierarchical.
I'm not sure whether there's an actual use case that needs per cgroup "use_hierarchy".
So for the first step, I just give a global "use_hierarchy" in the root group. If there
are actual requirements that need per cgroup "use_hierarchy", we may add the feature
later.
>
>> + };
>> EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>>
>> static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
>> @@ -1385,10 +1388,73 @@ struct cftype blkio_files[] = {
>> #endif
>> };
>>
>> +static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
>> + struct cftype *cftype)
>> +{
>> + struct blkio_cgroup *blkcg;
>> +
>> + blkcg = cgroup_to_blkio_cgroup(cgroup);
>> + return (u64)blkcg->use_hierarchy;
>> +}
>> +
>> +static int
>> +blkiocg_use_hierarchy_write(struct cgroup *cgroup,
>> + struct cftype *cftype, u64 val)
>> +{
>> + struct blkio_cgroup *blkcg;
>> + struct blkio_group *blkg;
>> + struct hlist_node *n;
>> + struct blkio_policy_type *blkiop;
>> +
>> + blkcg = cgroup_to_blkio_cgroup(cgroup);
>> +
>> + if (val > 1 || !list_empty(&cgroup->children))
>> + return -EINVAL;
>> +
>> + if (blkcg->use_hierarchy == val)
>> + return 0;
>> +
>> + spin_lock(&blkio_list_lock);
>> + blkcg->use_hierarchy = val;
>> +
>> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>> + list_for_each_entry(blkiop, &blkio_list, list) {
>> + /*
>> + * If this policy does not own the blkg, do not change
>> + * cfq group scheduling mode.
>> + */
>> + if (blkiop->plid != blkg->plid)
>> + continue;
>> +
>> + if (blkiop->ops.blkio_update_use_hierarchy_fn)
>> + blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
>> + val);
>
> Should we really allow this? I mean allow changing hierarchy of a group
> when there are already children groups. I think memory controller does
> not allow this. We can design along the same lines. Keep use_hierarchy
> as 0 by default. Allow changing it only if there are no children cgroups.
> Otherwise we shall have to send notifications to subscribing policies
> and then change their structure etc. Lets keep it simple.
Yes, I really don't allow changing use_hierarchy if there are children cgroups.
Please consider the following line in my patch.
if (val > 1 || !list_empty(&cgroup->children))
return -EINVAL;
>
> I was playing with a use_hierarhcy patch for throttling and parts have been
> copied from memory controller. I am attaching that with the mail and see if
> you can make that working.
Thanks
Gui
>
> ---
> block/blk-cgroup.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> block/blk-cgroup.h | 2 +
> 2 files changed, 60 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/blk-cgroup.c
> ===================================================================
> --- linux-2.6.orig/block/blk-cgroup.c 2010-11-19 10:30:27.129704770 -0500
> +++ linux-2.6/block/blk-cgroup.c 2010-11-19 10:30:29.885671705 -0500
> @@ -1214,6 +1214,39 @@ static int blkio_weight_write(struct blk
> return 0;
> }
>
> +static int blkio_throtl_use_hierarchy_write(struct cgroup *cgrp, u64 val)
> +{
> + struct cgroup *parent = cgrp->parent;
> + struct blkio_cgroup *blkcg, *parent_blkcg = NULL;
> + int ret = 0;
> +
> + if (val != 0 && val != 1)
> + return -EINVAL;
> +
> + blkcg = cgroup_to_blkio_cgroup(cgrp);
> + if (parent)
> + parent_blkcg = cgroup_to_blkio_cgroup(parent);
> +
> + cgroup_lock();
> + /*
> + * If parent's use_hierarchy is set, we can't make any modifications
> + * in the child subtrees. If it is unset, then the change can
> + * occur, provided the current cgroup has no children.
> + *
> + * For the root cgroup, parent_blkcg is NULL; we allow the value to be
> + * set if there are no children.
> + */
> + if (!parent_blkcg || !parent_blkcg->throtl_use_hier) {
> + if (list_empty(&cgrp->children))
> + blkcg->throtl_use_hier = val;
> + else
> + ret = -EBUSY;
> + } else
> + ret = -EINVAL;
> + cgroup_unlock();
> + return ret;
> +}
> +
> static u64 blkiocg_file_read_u64 (struct cgroup *cgrp, struct cftype *cft) {
> struct blkio_cgroup *blkcg;
> enum blkio_policy_id plid = BLKIOFILE_POLICY(cft->private);
> @@ -1228,6 +1261,12 @@ static u64 blkiocg_file_read_u64 (struct
> return (u64)blkcg->weight;
> }
> break;
> + case BLKIO_POLICY_THROTL:
> + switch(name) {
> + case BLKIO_THROTL_use_hierarchy:
> + return (u64)blkcg->throtl_use_hier;
> + }
> + break;
> default:
> BUG();
> }
> @@ -1250,6 +1289,12 @@ blkiocg_file_write_u64(struct cgroup *cg
> return blkio_weight_write(blkcg, val);
> }
> break;
> + case BLKIO_POLICY_THROTL:
> + switch(name) {
> + case BLKIO_THROTL_use_hierarchy:
> + return blkio_throtl_use_hierarchy_write(cgrp, val);
> + }
> + break;
> default:
> BUG();
> }
> @@ -1373,6 +1418,13 @@ struct cftype blkio_files[] = {
> BLKIO_THROTL_io_serviced),
> .read_map = blkiocg_file_read_map,
> },
> + {
> + .name = "throttle.use_hierarchy",
> + .private = BLKIOFILE_PRIVATE(BLKIO_POLICY_THROTL,
> + BLKIO_THROTL_use_hierarchy),
> + .read_u64 = blkiocg_file_read_u64,
> + .write_u64 = blkiocg_file_write_u64,
> + },
> #endif /* CONFIG_BLK_DEV_THROTTLING */
>
> #ifdef CONFIG_DEBUG_BLK_CGROUP
> @@ -1470,7 +1522,7 @@ static void blkiocg_destroy(struct cgrou
> static struct cgroup_subsys_state *
> blkiocg_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
> {
> - struct blkio_cgroup *blkcg;
> + struct blkio_cgroup *blkcg, *parent_blkcg = NULL;
> struct cgroup *parent = cgroup->parent;
>
> if (!parent) {
> @@ -1483,11 +1535,16 @@ blkiocg_create(struct cgroup_subsys *sub
> return ERR_PTR(-ENOMEM);
>
> blkcg->weight = BLKIO_WEIGHT_DEFAULT;
> + parent_blkcg = cgroup_to_blkio_cgroup(parent);
> done:
> spin_lock_init(&blkcg->lock);
> INIT_HLIST_HEAD(&blkcg->blkg_list);
>
> INIT_LIST_HEAD(&blkcg->policy_list);
> + if (parent)
> + blkcg->throtl_use_hier = parent_blkcg->throtl_use_hier;
> + else
> + blkcg->throtl_use_hier = 0;
> return &blkcg->css;
> }
>
> Index: linux-2.6/block/blk-cgroup.h
> ===================================================================
> --- linux-2.6.orig/block/blk-cgroup.h 2010-11-19 10:15:56.321149940 -0500
> +++ linux-2.6/block/blk-cgroup.h 2010-11-19 10:30:29.885671705 -0500
> @@ -100,11 +100,13 @@ enum blkcg_file_name_throtl {
> BLKIO_THROTL_write_iops_device,
> BLKIO_THROTL_io_service_bytes,
> BLKIO_THROTL_io_serviced,
> + BLKIO_THROTL_use_hierarchy,
> };
>
> struct blkio_cgroup {
> struct cgroup_subsys_state css;
> unsigned int weight;
> + bool throtl_use_hier;
> spinlock_t lock;
> struct hlist_head blkg_list;
> /*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
* Re: [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
2010-12-16 2:42 ` Gui Jianfeng
@ 2010-12-16 15:44 ` Vivek Goyal
2010-12-17 3:06 ` Gui Jianfeng
0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2010-12-16 15:44 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Thu, Dec 16, 2010 at 10:42:42AM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > On Mon, Dec 13, 2010 at 09:45:07AM +0800, Gui Jianfeng wrote:
> >> This patch adds "use_hierarchy" in Root CGroup without any functionality.
> >>
> >> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
> >> ---
> >> block/blk-cgroup.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >> block/blk-cgroup.h | 5 +++-
> >> block/cfq-iosched.c | 24 +++++++++++++++++
> >> 3 files changed, 97 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> >> index 455768a..9747ebb 100644
> >> --- a/block/blk-cgroup.c
> >> +++ b/block/blk-cgroup.c
> >> @@ -25,7 +25,10 @@
> >> static DEFINE_SPINLOCK(blkio_list_lock);
> >> static LIST_HEAD(blkio_list);
> >>
> >> -struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
> >> +struct blkio_cgroup blkio_root_cgroup = {
> >> + .weight = 2*BLKIO_WEIGHT_DEFAULT,
> >> + .use_hierarchy = 1,
> >
> > Currently flat mode is the default. Let's not change the default. So let's
> > start with use_hierarchy = 0.
>
> OK, will do.
>
> >
> > Secondly, why don't you make it per cgroup something along the lines of
> > memory controller where one can start the hierarchy lower in the cgroup
> > chain and not necessarily at the root. This way we can avoid some
> > accounting overhead for all the groups which are non-hierarchical.
>
> I'm not sure whether there's an actual use case that needs per cgroup "use_hierarchy".
> So for the first step, I just give a global "use_hierarchy" in the root group. If there
> are actual requirements that need per cgroup "use_hierarchy", we may add the feature
> later.
>
I think there is some use case. Currently libvirt creates its own cgroups
for each VM. Depending on what cgroup libvirtd has been placed in when it
started, it starts creating cgroups from there. So depending on the distro,
one might mount the blkio controller at /cgroup/blkio by default and then
libcgroup will create its own cgroups from there.
Now as of today the default is flat, so for the packages which take care of
mounting the blkio controller, I am not expecting them to suddenly change
the default to using hierarchy.
Now if libvirt goes on to create its own cgroups under the root cgroup
(/cgroup/blkio), then libvirt can't switch it to hierarchical even if
it wants to, as children cgroups have already been created under root,
and anyway libvirt is not supposed to control the use_hierarchy
setting of the root group.
So if we allow that a hierarchy can be defined from a child node, then
libvirt can easily do it only for its sub hierarchy.
        pivot
        /   \
     root   libvirtd
             /    \
           vm1    vm2
Here root will have use_hierarchy=0 and libvirtd will have use_hierarchy=1
Secondly, I am beginning to believe that updating the stats in all the
groups of the hierarchy might have significant overhead (though I don't
have the data yet), as you will take blkcg->stats_lock of each cgroup
in the path for each IO completion, and CFQ updates so many stats. So
there also it might make sense to let libvirtd set use_hierarchy=1
if it needs to and incur the additional overhead, but the global default
will not be run with use_hierarchy=1. I think libvirtd mounts the memory
controller as of today with use_hierarchy=0.
Also I don't think it is a lot of extra code to support per cgroup
use_hierarchy. So to me it makes sense to do it right now. I am more
concerned about getting it right now because it is part of the user
interface. If we introduce something now and then change it 2 releases
down the line, then we are stuck with one more convention of
use_hierarchy to support, making life even more complicated.
So I would say do think it through; it should not be a lot of extra
code to support.
> >
> >> + };
> >> EXPORT_SYMBOL_GPL(blkio_root_cgroup);
> >>
> >> static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
> >> @@ -1385,10 +1388,73 @@ struct cftype blkio_files[] = {
> >> #endif
> >> };
> >>
> >> +static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
> >> + struct cftype *cftype)
> >> +{
> >> + struct blkio_cgroup *blkcg;
> >> +
> >> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> >> + return (u64)blkcg->use_hierarchy;
> >> +}
> >> +
> >> +static int
> >> +blkiocg_use_hierarchy_write(struct cgroup *cgroup,
> >> + struct cftype *cftype, u64 val)
> >> +{
> >> + struct blkio_cgroup *blkcg;
> >> + struct blkio_group *blkg;
> >> + struct hlist_node *n;
> >> + struct blkio_policy_type *blkiop;
> >> +
> >> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> >> +
> >> + if (val > 1 || !list_empty(&cgroup->children))
> >> + return -EINVAL;
> >> +
> >> + if (blkcg->use_hierarchy == val)
> >> + return 0;
> >> +
> >> + spin_lock(&blkio_list_lock);
> >> + blkcg->use_hierarchy = val;
> >> +
> >> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
> >> + list_for_each_entry(blkiop, &blkio_list, list) {
> >> + /*
> >> + * If this policy does not own the blkg, do not change
> >> + * cfq group scheduling mode.
> >> + */
> >> + if (blkiop->plid != blkg->plid)
> >> + continue;
> >> +
> >> + if (blkiop->ops.blkio_update_use_hierarchy_fn)
> >> + blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
> >> + val);
> >
> > Should we really allow this? I mean allow changing hierarchy of a group
> > when there are already children groups. I think memory controller does
> > not allow this. We can design along the same lines. Keep use_hierarchy
> > as 0 by default. Allow changing it only if there are no children cgroups.
> > Otherwise we shall have to send notifications to subscribing policies
> > and then change their structure etc. Lets keep it simple.
>
> Yes, I really don't allow changing use_hierarchy if there are children cgroups.
> Please consider the following line in my patch.
>
> if (val > 1 || !list_empty(&cgroup->children))
> return -EINVAL;
If there are no children cgroups, then there can not be any children blkg
and there is no need to send any per blkg notification to each policy?
Thanks
Vivek
* Re: [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
2010-12-16 15:44 ` Vivek Goyal
@ 2010-12-17 3:06 ` Gui Jianfeng
2010-12-17 23:03 ` Vivek Goyal
0 siblings, 1 reply; 41+ messages in thread
From: Gui Jianfeng @ 2010-12-17 3:06 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
Vivek Goyal wrote:
> On Thu, Dec 16, 2010 at 10:42:42AM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Mon, Dec 13, 2010 at 09:45:07AM +0800, Gui Jianfeng wrote:
>>>> This patch adds "use_hierarchy" in Root CGroup without any functionality.
>>>>
>>>> Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
>>>> ---
>>>> block/blk-cgroup.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>>> block/blk-cgroup.h | 5 +++-
>>>> block/cfq-iosched.c | 24 +++++++++++++++++
>>>> 3 files changed, 97 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>>> index 455768a..9747ebb 100644
>>>> --- a/block/blk-cgroup.c
>>>> +++ b/block/blk-cgroup.c
>>>> @@ -25,7 +25,10 @@
>>>> static DEFINE_SPINLOCK(blkio_list_lock);
>>>> static LIST_HEAD(blkio_list);
>>>>
>>>> -struct blkio_cgroup blkio_root_cgroup = { .weight = 2*BLKIO_WEIGHT_DEFAULT };
>>>> +struct blkio_cgroup blkio_root_cgroup = {
>>>> + .weight = 2*BLKIO_WEIGHT_DEFAULT,
>>>> + .use_hierarchy = 1,
>>> Currently flat mode is the default. Let's not change the default. So let's
>>> start with use_hierarchy = 0.
>> OK, will do.
>>
>>> Secondly, why don't you make it per cgroup something along the lines of
>>> memory controller where one can start the hierarchy lower in the cgroup
>>> chain and not necessarily at the root. This way we can avoid some
>>> accounting overhead for all the groups which are non-hierarchical.
>> I'm not sure whether there's an actual use case that needs per cgroup "use_hierarchy".
>> So for the first step, I just give a global "use_hierarchy" in the root group. If there
>> are actual requirements that need per cgroup "use_hierarchy", we may add the feature
>> later.
>>
>
> I think there is some use case. Currently libvirt creates its own cgroups
> for each VM. Depending on what cgroup libvirtd has been placed in when it
> started, it starts creating cgroups from there. So depending on the distro,
> one might mount the blkio controller at /cgroup/blkio by default and then
> libcgroup will create its own cgroups from there.
>
> Now as of today the default is flat, so for the packages which take care of
> mounting the blkio controller, I am not expecting them to suddenly change
> the default to using hierarchy.
>
> Now if libvirt goes on to create its own cgroups under the root cgroup
> (/cgroup/blkio), then libvirt can't switch it to hierarchical even if
> it wants to, as children cgroups have already been created under root,
> and anyway libvirt is not supposed to control the use_hierarchy
> setting of the root group.
>
> So if we allow that a hierarchy can be defined from a child node, then
> libvirt can easily do it only for its sub hierarchy.
>
>         pivot
>         /   \
>      root   libvirtd
>              /    \
>            vm1    vm2
>
> Here root will have use_hierarchy=0 and libvirtd will have use_hierarchy=1
>
> Secondly, I am beginning to believe that updating the stats in all the
> groups of the hierarchy might have significant overhead (though I don't
> have the data yet), as you will take blkcg->stats_lock of each cgroup
> in the path for each IO completion, and CFQ updates so many stats. So
> there also it might make sense to let libvirtd set use_hierarchy=1
> if it needs to and incur the additional overhead, but the global default
> will not be run with use_hierarchy=1. I think libvirtd mounts the memory
> controller as of today with use_hierarchy=0.
>
> Also I don't think it is a lot of extra code to support per cgroup
> use_hierarchy. So to me it makes sense to do it right now. I am more
> concerned about getting it right now because it is part of the user
> interface. If we introduce something now and then change it 2 releases
> down the line, then we are stuck with one more convention of
> use_hierarchy to support, making life even more complicated.
> So I would say do think it through; it should not be a lot of extra
> code to support.
>
>>>> + };
>>>> EXPORT_SYMBOL_GPL(blkio_root_cgroup);
>>>>
>>>> static struct cgroup_subsys_state *blkiocg_create(struct cgroup_subsys *,
>>>> @@ -1385,10 +1388,73 @@ struct cftype blkio_files[] = {
>>>> #endif
>>>> };
>>>>
>>>> +static u64 blkiocg_use_hierarchy_read(struct cgroup *cgroup,
>>>> + struct cftype *cftype)
>>>> +{
>>>> + struct blkio_cgroup *blkcg;
>>>> +
>>>> + blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>> + return (u64)blkcg->use_hierarchy;
>>>> +}
>>>> +
>>>> +static int
>>>> +blkiocg_use_hierarchy_write(struct cgroup *cgroup,
>>>> + struct cftype *cftype, u64 val)
>>>> +{
>>>> + struct blkio_cgroup *blkcg;
>>>> + struct blkio_group *blkg;
>>>> + struct hlist_node *n;
>>>> + struct blkio_policy_type *blkiop;
>>>> +
>>>> + blkcg = cgroup_to_blkio_cgroup(cgroup);
>>>> +
>>>> + if (val > 1 || !list_empty(&cgroup->children))
>>>> + return -EINVAL;
>>>> +
>>>> + if (blkcg->use_hierarchy == val)
>>>> + return 0;
>>>> +
>>>> + spin_lock(&blkio_list_lock);
>>>> + blkcg->use_hierarchy = val;
>>>> +
>>>> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
>>>> + list_for_each_entry(blkiop, &blkio_list, list) {
>>>> + /*
>>>> + * If this policy does not own the blkg, do not change
>>>> + * cfq group scheduling mode.
>>>> + */
>>>> + if (blkiop->plid != blkg->plid)
>>>> + continue;
>>>> +
>>>> + if (blkiop->ops.blkio_update_use_hierarchy_fn)
>>>> + blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
>>>> + val);
>>> Should we really allow this? I mean allow changing hierarchy of a group
>>> when there are already children groups. I think memory controller does
>>> not allow this. We can design along the same lines. Keep use_hierarchy
>>> as 0 by default. Allow changing it only if there are no children cgroups.
>>> Otherwise we shall have to send notifications to subscribing policies
>>> and then change their structure etc. Lets keep it simple.
>> Yes, I really don't allow changing use_hierarchy if there are children cgroups.
>> Please consider the following line in my patch.
>>
>> if (val > 1 || !list_empty(&cgroup->children))
>> return -EINVAL;
>
> If there are no children cgroups, then there can not be any children blkg
> and there is no need to send any per blkg notification to each policy?
Firstly, in my patch the per blkg notification only happens on the root blkg.
Secondly, the root cfqg is put onto "flat_service_tree" in flat mode,
whereas in hierarchical mode it doesn't belong to anybody. When switching, we
have to inform the root cfqg to switch onto or off "flat_service_tree".
Anyway, if we're going to put the root cfqg onto grp_service_tree regardless of
flat or hierarchical mode, this piece of code can go away.
Thanks,
Gui
>
> Thanks
> Vivek
>
* Re: [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality.
2010-12-17 3:06 ` Gui Jianfeng
@ 2010-12-17 23:03 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-17 23:03 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Fri, Dec 17, 2010 at 11:06:46AM +0800, Gui Jianfeng wrote:
[..]
> >>>> +static int
> >>>> +blkiocg_use_hierarchy_write(struct cgroup *cgroup,
> >>>> + struct cftype *cftype, u64 val)
> >>>> +{
> >>>> + struct blkio_cgroup *blkcg;
> >>>> + struct blkio_group *blkg;
> >>>> + struct hlist_node *n;
> >>>> + struct blkio_policy_type *blkiop;
> >>>> +
> >>>> + blkcg = cgroup_to_blkio_cgroup(cgroup);
> >>>> +
> >>>> + if (val > 1 || !list_empty(&cgroup->children))
> >>>> + return -EINVAL;
> >>>> +
> >>>> + if (blkcg->use_hierarchy == val)
> >>>> + return 0;
> >>>> +
> >>>> + spin_lock(&blkio_list_lock);
> >>>> + blkcg->use_hierarchy = val;
> >>>> +
> >>>> + hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
> >>>> + list_for_each_entry(blkiop, &blkio_list, list) {
> >>>> + /*
> >>>> + * If this policy does not own the blkg, do not change
> >>>> + * cfq group scheduling mode.
> >>>> + */
> >>>> + if (blkiop->plid != blkg->plid)
> >>>> + continue;
> >>>> +
> >>>> + if (blkiop->ops.blkio_update_use_hierarchy_fn)
> >>>> + blkiop->ops.blkio_update_use_hierarchy_fn(blkg,
> >>>> + val);
> >>> Should we really allow this? I mean allow changing hierarchy of a group
> >>> when there are already children groups. I think memory controller does
> >>> not allow this. We can design along the same lines. Keep use_hierarchy
> >>> as 0 by default. Allow changing it only if there are no children cgroups.
> >>> Otherwise we shall have to send notifications to subscribing policies
> >>> and then change their structure etc. Lets keep it simple.
> >> Yes, I really don't allow changing use_hierarchy if there are children cgroups.
> >> Please consider the following line in my patch.
> >>
> >> if (val > 1 || !list_empty(&cgroup->children))
> >> return -EINVAL;
> >
> > If there are no children cgroups, then there can not be any children blkg
> > and there is no need to send any per blkg notification to each policy?
>
> Firstly, in my patch the per blkg notification only happens on the root blkg.
> Secondly, the root cfqg is put onto "flat_service_tree" in flat mode,
> whereas in hierarchical mode it doesn't belong to anybody. When switching, we
> have to inform the root cfqg to switch onto or off "flat_service_tree".
>
> Anyway, if we're going to put the root cfqg onto grp_service_tree regardless of
> flat or hierarchical mode, this piece of code can go away.
>
Exactly. Keeping everything on grp_service_tree both for flat and
hierarchical mode will make sure there is no root group moving around
and no notifications when use_hierarchy is set.
Thanks
Vivek
* Re: [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy"
2010-12-13 1:45 ` [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy" Gui Jianfeng
@ 2010-12-20 19:43 ` Vivek Goyal
0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2010-12-20 19:43 UTC (permalink / raw)
To: Gui Jianfeng
Cc: Jens Axboe, Corrado Zoccolo, Chad Talbott, Nauman Rafique,
Divyesh Shah, linux kernel mailing list
On Mon, Dec 13, 2010 at 09:45:14AM +0800, Gui Jianfeng wrote:
[..]
>
> + /* Service tree for cfq group flat scheduling mode. */
> + struct cfq_rb_root flat_service_tree;
> +
> /*
> * The priority currently being served
> */
> @@ -647,12 +650,16 @@ cfq_group_slice(struct cfq_data *cfqd, struct cfq_group *cfqg)
> struct cfq_entity *cfqe = &cfqg->cfqe;
> struct cfq_rb_root *st = cfqe->service_tree;
>
> - if (st)
> - return cfq_target_latency * cfqe->weight
> - / st->total_weight;
> - else
> - /* If this is the root group, give it a full slice. */
> - return cfq_target_latency;
> + if (cfqd->use_hierarchy) {
> + if (st)
> + return cfq_target_latency * cfqe->weight
> + / st->total_weight;
> + else
> + /* If this is the root group, give it a full slice. */
> + return cfq_target_latency;
> + } else {
> + return cfq_target_latency * cfqe->weight / st->total_weight;
> + }
> }
Once you have moved to the notion of an entity and the weight of the
entity, I think you can simplify things a bit and come up with a notion
of an entity slice in a hierarchy, and we can avoid using separate
mechanisms for queues and groups.
There can be multiple ways of doing this and you shall have to see what
is a simple way that works. For queues we were keeping track of the
average number of queues and estimating the slice length that way. You
can try keeping track of the average number of entities in a group or
something like that. But do think of everything now in terms of entities
and simplify the logic a bit.
Thanks
Vivek
Thread overview: 41+ messages (newest: 2010-12-20 19:44 UTC)
[not found] <4CDF7BC5.9080803@cn.fujitsu.com>
[not found] ` <4CDF9CD8.8010207@cn.fujitsu.com>
[not found] ` <20101115193352.GB3396@redhat.com>
2010-11-29 2:32 ` [RFC] [PATCH 3/8] cfq-iosched: Introduce vdisktime and io weight for CFQ queue Gui Jianfeng
[not found] ` <4CDF9CE0.3060606@cn.fujitsu.com>
[not found] ` <20101115194855.GC3396@redhat.com>
2010-11-29 2:34 ` [RFC] [PATCH 4/8] cfq-iosched: Get rid of st->active Gui Jianfeng
[not found] ` <4CDF9D06.6070800@cn.fujitsu.com>
[not found] ` <20101115195428.GE3396@redhat.com>
2010-11-29 2:35 ` [RFC] [PATCH 7/8] cfq-iosched: Enable deep hierarchy in CGgroup Gui Jianfeng
[not found] ` <4CDF9D0D.4060806@cn.fujitsu.com>
[not found] ` <20101115204459.GF3396@redhat.com>
2010-11-29 2:42 ` [RFC] [PATCH 8/8] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
2010-11-29 14:31 ` Vivek Goyal
2010-11-30 1:15 ` Gui Jianfeng
[not found] ` <4CDF9CC6.2040106@cn.fujitsu.com>
[not found] ` <20101115165319.GI30792@redhat.com>
[not found] ` <4CE2718C.6010406@kernel.dk>
2010-12-13 1:44 ` [PATCH 0/8 v2] Introduce CFQ group hierarchical scheduling and "use_hierarchy" interface Gui Jianfeng
2010-12-13 13:36 ` Jens Axboe
2010-12-14 3:30 ` Gui Jianfeng
2010-12-13 14:29 ` Vivek Goyal
2010-12-14 3:06 ` Gui Jianfeng
2010-12-14 3:29 ` Vivek Goyal
[not found] ` <4D01C6AB.9040807@cn.fujitsu.com>
2010-12-13 1:44 ` [PATCH 1/8 v2] cfq-iosched: Introduce cfq_entity for CFQ queue Gui Jianfeng
2010-12-13 15:44 ` Vivek Goyal
2010-12-14 1:30 ` Gui Jianfeng
2010-12-13 1:44 ` [PATCH 2/8 v2] cfq-iosched: Introduce cfq_entity for CFQ group Gui Jianfeng
2010-12-13 16:59 ` Vivek Goyal
2010-12-14 1:33 ` Gui Jianfeng
2010-12-14 1:47 ` Gui Jianfeng
2010-12-13 1:44 ` [PATCH 3/8 v2] cfq-iosched: Introduce vdisktime and io weight for CFQ queue Gui Jianfeng
2010-12-13 16:59 ` Vivek Goyal
2010-12-14 2:41 ` Gui Jianfeng
2010-12-14 2:47 ` Vivek Goyal
2010-12-13 1:44 ` [PATCH 4/8 v2] cfq-iosched: Extract some common code of service tree handling for CFQ queue and CFQ group Gui Jianfeng
2010-12-13 22:11 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 5/8 v2] cfq-iosched: Introduce hierarchical scheduling with CFQ queue and group at the same level Gui Jianfeng
2010-12-14 3:49 ` Vivek Goyal
2010-12-14 6:09 ` Gui Jianfeng
2010-12-15 7:02 ` Gui Jianfeng
2010-12-15 22:04 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 6/8] blkio-cgroup: "use_hierarchy" interface without any functionality Gui Jianfeng
2010-12-15 21:26 ` Vivek Goyal
2010-12-16 2:42 ` Gui Jianfeng
2010-12-16 15:44 ` Vivek Goyal
2010-12-17 3:06 ` Gui Jianfeng
2010-12-17 23:03 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 7/8] cfq-iosched: Add flat mode and switch between two modes by "use_hierarchy" Gui Jianfeng
2010-12-20 19:43 ` Vivek Goyal
2010-12-13 1:45 ` [PATCH 8/8] blkio-cgroup: Document for blkio.use_hierarchy Gui Jianfeng
2010-12-13 15:10 ` Vivek Goyal
2010-12-14 2:52 ` Gui Jianfeng