linux-kernel.vger.kernel.org archive mirror
* [RFC, PATCH 0/2] sched: add multiple hierarchy support to the CFS group scheduler
@ 2008-02-25 14:15 Dhaval Giani
  2008-02-25 14:16 ` [RFC, PATCH 1/2] sched: change the fairness model of " Dhaval Giani
  2008-02-25 14:17 ` [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels Dhaval Giani
  0 siblings, 2 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-02-25 14:15 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Ingo Molnar, Peter Zijlstra
  Cc: Sudhir Kumar, Balbir Singh, Aneesh Kumar KV, lkml, vgoyal, serue, menage

Hi Ingo,

These patches change the fairness model as discussed in
http://lkml.org/lkml/2008/1/30/634

Patch 1 -> Changes the fairness model
Patch 2 -> Allows one to create multiple levels of cgroups

The second patch does not yet handle SMP well; that is the next TODO.
It also changes the behaviour of the fair-user scheduler: the root task
group becomes the parent task group and the other users become its
children.

Thanks,
-- 
regards,
Dhaval

* [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-25 14:15 [RFC, PATCH 0/2] sched: add multiple hierarchy support to the CFS group scheduler Dhaval Giani
@ 2008-02-25 14:16 ` Dhaval Giani
  2008-02-27 20:23   ` Serge E. Hallyn
  2008-02-28 19:42   ` Peter Zijlstra
  2008-02-25 14:17 ` [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels Dhaval Giani
  1 sibling, 2 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-02-25 14:16 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Ingo Molnar, Peter Zijlstra
  Cc: Sudhir Kumar, Balbir Singh, Aneesh Kumar KV, lkml, vgoyal, serue, menage

This patch allows tasks and groups to coexist in the same cfs_rq. With
this change, CFS group scheduling moves from a 1/(1+N) fairness model to
a 1/(M+N) model, where M tasks and N groups exist at the cfs_rq level.
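
To make the 1/(M+N) arithmetic concrete, here is a tiny standalone
sketch (not part of the patch) using the same example numbers as the
comment added to sched_init() below: 10 tasks and two child groups,
all of weight 1024.

	#include <stdio.h>

	int main(void)
	{
		/* every entity on the cfs_rq, task or group, receives
		 * weight/total_weight of the cpu */
		unsigned long task_w = 1024, group_w = 1024;
		int nr_tasks = 10, nr_groups = 2;
		unsigned long total = nr_tasks * task_w + nr_groups * group_w;

		printf("one task : %.2f%%\n", 100.0 * task_w / total);
		printf("one group: %.2f%%\n", 100.0 * group_w / total);
		return 0;
	}

Both come out at 1024/12288 = 8.33%, i.e. a group entity is weighed
exactly like a task sitting on the same cfs_rq.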

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
 kernel/sched.c      |   46 +++++++++++++++++++++
 kernel/sched_fair.c |  113 +++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 137 insertions(+), 22 deletions(-)

Index: linux-2.6.25-rc2/kernel/sched.c
===================================================================
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -224,10 +224,13 @@ struct task_group {
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
+#ifdef CONFIG_USER_SCHED
 /* Default task group's sched entity on each cpu */
 static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
 /* Default task group's cfs_rq on each cpu */
 static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
+#endif
 
 static struct sched_entity *init_sched_entity_p[NR_CPUS];
 static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
@@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
 		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
 	tg->se[cpu] = se;
+	/* se could be NULL for init_task_group */
+	if (!se)
+		return;
+
 	se->cfs_rq = &rq->cfs;
 	se->my_q = cfs_rq;
 	se->load.weight = tg->shares;
@@ -7217,11 +7224,46 @@ void __init sched_init(void)
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		init_task_group.shares = init_task_group_load;
 		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
+#ifdef CONFIG_CGROUP_SCHED
+		/*
+		 * How much cpu bandwidth does init_task_group get?
+		 *
+		 * In case of task-groups formed thr' the cgroup filesystem, it
+		 * gets 100% of the cpu resources in the system. This overall
+		 * system cpu resource is divided among the tasks of
+		 * init_task_group and its child task-groups in a fair manner,
+		 * based on each entity's (task or task-group's) weight
+		 * (se->load.weight).
+		 *
+		 * In other words, if init_task_group has 10 tasks of weight
+		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
+		 * then A0's share of the cpu resource is:
+		 *
+		 * 	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
+		 *
+		 * We achieve this by letting init_task_group's tasks sit
+		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
+		 */
+		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
+#elif defined CONFIG_USER_SCHED
+		/*
+		 * In case of task-groups formed thr' the user id of tasks,
+		 * init_task_group represents tasks belonging to root user.
+		 * Hence it forms a sibling of all subsequent groups formed.
+		 * In this case, init_task_group gets only a fraction of overall
+		 * system cpu resource, based on the weight assigned to root
+		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
+		 * by letting tasks of init_task_group sit in a separate cfs_rq
+		 * (init_cfs_rq) and having one entity represent this group of
+		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
+		 */
 		init_tg_cfs_entry(rq, &init_task_group,
 				&per_cpu(init_cfs_rq, i),
 				&per_cpu(init_sched_entity, i), i, 1);
 
 #endif
+#endif /* CONFIG_FAIR_GROUP_SCHED */
 #ifdef CONFIG_RT_GROUP_SCHED
 		init_task_group.rt_runtime =
 			sysctl_sched_rt_runtime * NSEC_PER_USEC;
@@ -7435,6 +7477,10 @@ static int rebalance_shares(struct sched
 		unsigned long total_load = 0, total_shares;
 		struct task_group *tg = cfs_rq->tg;
 
+		/* Skip this group if there is no associated group entity */
+		if (unlikely(!tg->se[this_cpu]))
+			continue;
+
 		/* Gather total task load of this group across cpus */
 		for_each_cpu_mask(i, sdspan)
 			total_load += tg->cfs_rq[i]->load.weight;
Index: linux-2.6.25-rc2/kernel/sched_fair.c
===================================================================
--- linux-2.6.25-rc2.orig/kernel/sched_fair.c
+++ linux-2.6.25-rc2/kernel/sched_fair.c
@@ -732,6 +732,21 @@ static inline struct sched_entity *paren
 	return se->parent;
 }
 
+/* return the cpu load contributed by a given group on a given cpu */
+static inline unsigned long group_cpu_load(struct task_group *tg, int cpu)
+{
+	struct sched_entity *se = tg->se[cpu], *top_se;
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+
+	if (unlikely(!se))
+		return cfs_rq->load.weight;
+
+	for_each_sched_entity(se)
+		top_se = se;
+
+	return top_se->load.weight;
+}
+
 #define GROUP_IMBALANCE_PCT	20
 
 #else	/* CONFIG_FAIR_GROUP_SCHED */
@@ -1073,6 +1088,17 @@ out_set_cpu:
 }
 #endif /* CONFIG_SMP */
 
+/* return depth at which a sched entity is present in the hierarchy */
+static inline int depth_se(struct sched_entity *se)
+{
+	int depth = 0;
+
+	for_each_sched_entity(se)
+		depth++;
+
+	return depth;
+}
+
 
 /*
  * Preempt the current task with a newly woken task if needed:
@@ -1083,6 +1109,7 @@ static void check_preempt_wakeup(struct 
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	unsigned long gran;
+	int se_depth, pse_depth;
 
 	if (unlikely(rt_prio(p->prio))) {
 		update_rq_clock(rq);
@@ -1100,6 +1127,27 @@ static void check_preempt_wakeup(struct 
 	if (!sched_feat(WAKEUP_PREEMPT))
 		return;
 
+	/*
+	 * preemption test can be made between sibling entities who are in the
+	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
+	 * both tasks untill we find their ancestors who are siblings of common
+	 * parent.
+	 */
+
+	/* First walk up until both entities are at same depth */
+	se_depth = depth_se(se);
+	pse_depth = depth_se(pse);
+
+	while (se_depth > pse_depth) {
+		se_depth--;
+		se = parent_entity(se);
+	}
+
+	while (pse_depth > se_depth) {
+		pse_depth--;
+		pse = parent_entity(pse);
+	}
+
 	while (!is_same_group(se, pse)) {
 		se = parent_entity(se);
 		pse = parent_entity(pse);
@@ -1166,13 +1214,22 @@ static void put_prev_task_fair(struct rq
 static struct task_struct *
 __load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
 {
-	struct task_struct *p;
+	struct task_struct *p = NULL;
+	struct sched_entity *se;
 
 	if (!curr)
 		return NULL;
 
-	p = rb_entry(curr, struct task_struct, se.run_node);
-	cfs_rq->rb_load_balance_curr = rb_next(curr);
+	/* Skip over entities that are not tasks */
+	do {
+		se = rb_entry(curr, struct sched_entity, run_node);
+		curr = rb_next(curr);
+	} while (curr && !entity_is_task(se));
+
+	cfs_rq->rb_load_balance_curr = curr;
+
+	if (entity_is_task(se))
+		p = task_of(se);
 
 	return p;
 }
@@ -1210,21 +1267,28 @@ load_balance_fair(struct rq *this_rq, in
 		struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
 		unsigned long maxload, task_load, group_weight;
 		unsigned long thisload, per_task_load;
-		struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
+		struct task_group *tg = busy_cfs_rq->tg;
+		struct sched_entity *se = tg->se[busiest->cpu],
+				    *this_se = tg->se[this_cpu];
 
 		task_load = busy_cfs_rq->load.weight;
-		group_weight = se->load.weight;
+		if (!task_load)
+			continue;
+
+		group_weight = group_cpu_load(tg, busiest->cpu);
 
 		/*
-		 * 'group_weight' is contributed by tasks of total weight
+		 * 'group_weight' is contributed by entities of total weight
 		 * 'task_load'. To move 'rem_load_move' worth of weight only,
 		 * we need to move a maximum task load of:
 		 *
 		 * 	maxload = (remload / group_weight) * task_load;
 		 */
 		maxload = (rem_load_move * task_load) / group_weight;
+		if (!maxload)
+			continue;
 
-		if (!maxload || !task_load)
+		if (!busy_cfs_rq->nr_running)
 			continue;
 
 		per_task_load = task_load / busy_cfs_rq->nr_running;
@@ -1253,23 +1317,28 @@ load_balance_fair(struct rq *this_rq, in
 					       &cfs_rq_iterator);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-		/*
-		 * load_moved holds the task load that was moved. The
-		 * effective (group) weight moved would be:
-		 * 	load_moved_eff = load_moved/task_load * group_weight;
-		 */
-		load_moved = (group_weight * load_moved) / task_load;
-
 		/* Adjust shares on both cpus to reflect load_moved */
-		group_weight -= load_moved;
-		set_se_shares(se, group_weight);
+		if (likely(se)) {
+			unsigned long load_moved_eff;
+			unsigned long se_shares;
 
-		se = busy_cfs_rq->tg->se[this_cpu];
-		if (!thisload)
-			group_weight = load_moved;
-		else
-			group_weight = se->load.weight + load_moved;
-		set_se_shares(se, group_weight);
+			/*
+			 * load_moved holds the task load that was moved. The
+			 * effective (group) weight moved would be:
+			 *	load_moved_eff = load_moved/task_load *
+			 *					group_weight;
+			 */
+			load_moved_eff = (se->load.weight *
+						 load_moved) / task_load;
+
+			set_se_shares(se, se->load.weight - load_moved_eff);
+			if (!thisload)
+				se_shares = load_moved_eff;
+			else
+				se_shares = this_se->load.weight +
+							load_moved_eff;
+			set_se_shares(this_se, se_shares);
+		}
 #endif
 
 		rem_load_move -= load_moved;
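
For reference, the share-adjustment arithmetic in the hunk above can be
checked with a small standalone sketch (not part of the patch; the
numbers are made up for illustration):

	#include <stdio.h>

	int main(void)
	{
		unsigned long task_load  = 4096; /* task weight on busiest cpu */
		unsigned long se_weight  = 1024; /* group entity weight there  */
		unsigned long load_moved = 1024; /* task load actually pulled  */

		/* same formula as load_moved_eff above */
		unsigned long load_moved_eff =
				(se_weight * load_moved) / task_load;

		printf("effective group weight moved : %lu\n", load_moved_eff);
		printf("source entity weight becomes : %lu\n",
					se_weight - load_moved_eff);
		return 0;
	}

Pulling a quarter of the task load moves a quarter of the group
entity's weight (256 of 1024) from the busiest cpu to this cpu.
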
-- 
regards,
Dhaval

* [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels
  2008-02-25 14:15 [RFC, PATCH 0/2] sched: add multiple hierarchy support to the CFS group scheduler Dhaval Giani
  2008-02-25 14:16 ` [RFC, PATCH 1/2] sched: change the fairness model of " Dhaval Giani
@ 2008-02-25 14:17 ` Dhaval Giani
  2008-02-25 14:17   ` Dhaval Giani
  2008-02-28 19:42   ` Peter Zijlstra
  1 sibling, 2 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-02-25 14:17 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Ingo Molnar, Peter Zijlstra
  Cc: Sudhir Kumar, Balbir Singh, Aneesh Kumar KV, lkml, vgoyal, serue, menage

This patch makes the group scheduler multi-hierarchy aware.
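
As a usage illustration only (not part of the patch; error handling is
omitted and make_nested_group() is a hypothetical helper), a nested
group would now be created by passing the parent task group through the
new sched_create_group() signature:

	#include <linux/sched.h>

	struct task_group *make_nested_group(void)
	{
		struct task_group *parent, *child;

		/* first level: child of the root (init) task group */
		parent = sched_create_group(&init_task_group);

		/* second level: no longer rejected with -EINVAL */
		child = sched_create_group(parent);

		return child;
	}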

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
 include/linux/sched.h |    2 +-
 kernel/sched.c        |   41 ++++++++++++++++++++++++-----------------
 2 files changed, 25 insertions(+), 18 deletions(-)

Index: linux-2.6.25-rc2/include/linux/sched.h
===================================================================
--- linux-2.6.25-rc2.orig/include/linux/sched.h
+++ linux-2.6.25-rc2/include/linux/sched.h
@@ -2031,7 +2031,7 @@ extern void normalize_rt_tasks(void);
 
 extern struct task_group init_task_group;
 
-extern struct task_group *sched_create_group(void);
+extern struct task_group *sched_create_group(struct task_group *parent);
 extern void sched_destroy_group(struct task_group *tg);
 extern void sched_move_task(struct task_struct *tsk);
 #ifdef CONFIG_FAIR_GROUP_SCHED
Index: linux-2.6.25-rc2/kernel/sched.c
===================================================================
--- linux-2.6.25-rc2.orig/kernel/sched.c
+++ linux-2.6.25-rc2/kernel/sched.c
@@ -7155,10 +7155,11 @@ static void init_rt_rq(struct rt_rq *rt_
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
-		struct cfs_rq *cfs_rq, struct sched_entity *se,
-		int cpu, int add)
+static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
+				struct sched_entity *se, int cpu, int add,
+				struct sched_entity *parent)
 {
+	struct rq *rq = cpu_rq(cpu);
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
@@ -7170,7 +7171,11 @@ static void init_tg_cfs_entry(struct rq 
 	if (!se)
 		return;
 
-	se->cfs_rq = &rq->cfs;
+	if (parent == NULL)
+		se->cfs_rq = &rq->cfs;
+	else
+		se->cfs_rq = parent->my_q;
+
 	se->my_q = cfs_rq;
 	se->load.weight = tg->shares;
 	se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
@@ -7244,7 +7249,8 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
-		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
+		init_tg_cfs_entry(&init_task_group, &rq->cfs,
+							NULL, i, 1, NULL);
 		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
 #elif defined CONFIG_USER_SCHED
 		/*
@@ -7260,7 +7266,7 @@ void __init sched_init(void)
 		 */
 		init_tg_cfs_entry(rq, &init_task_group,
 				&per_cpu(init_cfs_rq, i),
-				&per_cpu(init_sched_entity, i), i, 1);
+				&per_cpu(init_sched_entity, i), i, 1, NULL);
 
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7630,7 +7636,8 @@ static void free_fair_sched_group(struct
 	kfree(tg->se);
 }
 
-static int alloc_fair_sched_group(struct task_group *tg)
+static int alloc_fair_sched_group(struct task_group *tg,
+					struct task_group *parent)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se;
@@ -7658,8 +7665,11 @@ static int alloc_fair_sched_group(struct
 				GFP_KERNEL|__GFP_ZERO, cpu_to_node(i));
 		if (!se)
 			goto err;
-
-		init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
+		if (!parent) {
+			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
+		} else {
+			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
+		}
 	}
 
 	return 1;
@@ -7788,7 +7798,7 @@ static void free_sched_group(struct task
 }
 
 /* allocate runqueue etc for a new task group */
-struct task_group *sched_create_group(void)
+struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 	unsigned long flags;
@@ -7798,7 +7808,7 @@ struct task_group *sched_create_group(vo
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
-	if (!alloc_fair_sched_group(tg))
+	if (!alloc_fair_sched_group(tg, parent))
 		goto err;
 
 	if (!alloc_rt_sched_group(tg))
@@ -8049,7 +8059,7 @@ static inline struct task_group *cgroup_
 static struct cgroup_subsys_state *
 cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
-	struct task_group *tg;
+	struct task_group *tg, *parent;
 
 	if (!cgrp->parent) {
 		/* This is early initialization for the top cgroup */
@@ -8057,11 +8067,8 @@ cpu_cgroup_create(struct cgroup_subsys *
 		return &init_task_group.css;
 	}
 
-	/* we support only 1-level deep hierarchical scheduler atm */
-	if (cgrp->parent->parent)
-		return ERR_PTR(-EINVAL);
-
-	tg = sched_create_group();
+	parent = cgroup_tg(cgrp->parent);
+	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
 
-- 
regards,
Dhaval

* Re: [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels
  2008-02-25 14:17 ` [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels Dhaval Giani
@ 2008-02-25 14:17   ` Dhaval Giani
  2008-02-28 19:42   ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-02-25 14:17 UTC (permalink / raw)
  To: Srivatsa Vaddagiri, Ingo Molnar, Peter Zijlstra
  Cc: Sudhir Kumar, Balbir Singh, Aneesh Kumar KV, lkml, vgoyal, serue, menage

Meant 2/2 in $subject.
-- 
regards,
Dhaval

* Re: [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-25 14:16 ` [RFC, PATCH 1/2] sched: change the fairness model of " Dhaval Giani
@ 2008-02-27 20:23   ` Serge E. Hallyn
  2008-02-28 19:42   ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Serge E. Hallyn @ 2008-02-27 20:23 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Peter Zijlstra, Sudhir Kumar,
	Balbir Singh, Aneesh Kumar KV, lkml, vgoyal, serue, menage

Quoting Dhaval Giani (dhaval@linux.vnet.ibm.com):
> This patch allows tasks and groups to exist in the same cfs_rq. With this
> change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
> fairness model where M tasks and N groups exist at the cfs_rq level.
> 
> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |   46 +++++++++++++++++++++
>  kernel/sched_fair.c |  113 +++++++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 137 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6.25-rc2/kernel/sched.c
> ===================================================================
> --- linux-2.6.25-rc2.orig/kernel/sched.c
> +++ linux-2.6.25-rc2/kernel/sched.c
> @@ -224,10 +224,13 @@ struct task_group {
>  };
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +
> +#ifdef CONFIG_USER_SCHED
>  /* Default task group's sched entity on each cpu */
>  static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
>  /* Default task group's cfs_rq on each cpu */
>  static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
> +#endif
> 
>  static struct sched_entity *init_sched_entity_p[NR_CPUS];
>  static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
> @@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
>  		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
> 
>  	tg->se[cpu] = se;
> +	/* se could be NULL for init_task_group */
> +	if (!se)
> +		return;
> +
>  	se->cfs_rq = &rq->cfs;
>  	se->my_q = cfs_rq;
>  	se->load.weight = tg->shares;
> @@ -7217,11 +7224,46 @@ void __init sched_init(void)
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  		init_task_group.shares = init_task_group_load;
>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> +#ifdef CONFIG_CGROUP_SCHED
> +		/*
> +		 * How much cpu bandwidth does init_task_group get?
> +		 *
> +		 * In case of task-groups formed thr' the cgroup filesystem, it
> +		 * gets 100% of the cpu resources in the system. This overall
> +		 * system cpu resource is divided among the tasks of
> +		 * init_task_group and its child task-groups in a fair manner,
> +		 * based on each entity's (task or task-group's) weight
> +		 * (se->load.weight).
> +		 *
> +		 * In other words, if init_task_group has 10 tasks of weight
> +		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
> +		 * then A0's share of the cpu resource is:
> +		 *
> +		 * 	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%

Cool, this is precisely what I'd expect and was hoping to see.

thanks,
-serge

> +		 *
> +		 * We achieve this by letting init_task_group's tasks sit
> +		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> +		 */
> +		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
> +		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
> +#elif defined CONFIG_USER_SCHED
> +		/*
> +		 * In case of task-groups formed thr' the user id of tasks,
> +		 * init_task_group represents tasks belonging to root user.
> +		 * Hence it forms a sibling of all subsequent groups formed.
> +		 * In this case, init_task_group gets only a fraction of overall
> +		 * system cpu resource, based on the weight assigned to root
> +		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
> +		 * by letting tasks of init_task_group sit in a separate cfs_rq
> +		 * (init_cfs_rq) and having one entity represent this group of
> +		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
> +		 */
>  		init_tg_cfs_entry(rq, &init_task_group,
>  				&per_cpu(init_cfs_rq, i),
>  				&per_cpu(init_sched_entity, i), i, 1);
> 
>  #endif
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		init_task_group.rt_runtime =
>  			sysctl_sched_rt_runtime * NSEC_PER_USEC;
> @@ -7435,6 +7477,10 @@ static int rebalance_shares(struct sched
>  		unsigned long total_load = 0, total_shares;
>  		struct task_group *tg = cfs_rq->tg;
> 
> +		/* Skip this group if there is no associated group entity */
> +		if (unlikely(!tg->se[this_cpu]))
> +			continue;
> +
>  		/* Gather total task load of this group across cpus */
>  		for_each_cpu_mask(i, sdspan)
>  			total_load += tg->cfs_rq[i]->load.weight;
> Index: linux-2.6.25-rc2/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.25-rc2.orig/kernel/sched_fair.c
> +++ linux-2.6.25-rc2/kernel/sched_fair.c
> @@ -732,6 +732,21 @@ static inline struct sched_entity *paren
>  	return se->parent;
>  }
> 
> +/* return the cpu load contributed by a given group on a given cpu */
> +static inline unsigned long group_cpu_load(struct task_group *tg, int cpu)
> +{
> +	struct sched_entity *se = tg->se[cpu], *top_se;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
> +
> +	if (unlikely(!se))
> +		return cfs_rq->load.weight;
> +
> +	for_each_sched_entity(se)
> +		top_se = se;
> +
> +	return top_se->load.weight;
> +}
> +
>  #define GROUP_IMBALANCE_PCT	20
> 
>  #else	/* CONFIG_FAIR_GROUP_SCHED */
> @@ -1073,6 +1088,17 @@ out_set_cpu:
>  }
>  #endif /* CONFIG_SMP */
> 
> +/* return depth at which a sched entity is present in the hierarchy */
> +static inline int depth_se(struct sched_entity *se)
> +{
> +	int depth = 0;
> +
> +	for_each_sched_entity(se)
> +		depth++;
> +
> +	return depth;
> +}
> +
> 
>  /*
>   * Preempt the current task with a newly woken task if needed:
> @@ -1083,6 +1109,7 @@ static void check_preempt_wakeup(struct 
>  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>  	struct sched_entity *se = &curr->se, *pse = &p->se;
>  	unsigned long gran;
> +	int se_depth, pse_depth;
> 
>  	if (unlikely(rt_prio(p->prio))) {
>  		update_rq_clock(rq);
> @@ -1100,6 +1127,27 @@ static void check_preempt_wakeup(struct 
>  	if (!sched_feat(WAKEUP_PREEMPT))
>  		return;
> 
> +	/*
> +	 * preemption test can be made between sibling entities who are in the
> +	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> +	 * both tasks untill we find their ancestors who are siblings of common
> +	 * parent.
> +	 */
> +
> +	/* First walk up until both entities are at same depth */
> +	se_depth = depth_se(se);
> +	pse_depth = depth_se(pse);
> +
> +	while (se_depth > pse_depth) {
> +		se_depth--;
> +		se = parent_entity(se);
> +	}
> +
> +	while (pse_depth > se_depth) {
> +		pse_depth--;
> +		pse = parent_entity(pse);
> +	}
> +
>  	while (!is_same_group(se, pse)) {
>  		se = parent_entity(se);
>  		pse = parent_entity(pse);
> @@ -1166,13 +1214,22 @@ static void put_prev_task_fair(struct rq
>  static struct task_struct *
>  __load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
>  {
> -	struct task_struct *p;
> +	struct task_struct *p = NULL;
> +	struct sched_entity *se;
> 
>  	if (!curr)
>  		return NULL;
> 
> -	p = rb_entry(curr, struct task_struct, se.run_node);
> -	cfs_rq->rb_load_balance_curr = rb_next(curr);
> +	/* Skip over entities that are not tasks */
> +	do {
> +		se = rb_entry(curr, struct sched_entity, run_node);
> +		curr = rb_next(curr);
> +	} while (curr && !entity_is_task(se));
> +
> +	cfs_rq->rb_load_balance_curr = curr;
> +
> +	if (entity_is_task(se))
> +		p = task_of(se);
> 
>  	return p;
>  }
> @@ -1210,21 +1267,28 @@ load_balance_fair(struct rq *this_rq, in
>  		struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
>  		unsigned long maxload, task_load, group_weight;
>  		unsigned long thisload, per_task_load;
> -		struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
> +		struct task_group *tg = busy_cfs_rq->tg;
> +		struct sched_entity *se = tg->se[busiest->cpu],
> +				    *this_se = tg->se[this_cpu];
> 
>  		task_load = busy_cfs_rq->load.weight;
> -		group_weight = se->load.weight;
> +		if (!task_load)
> +			continue;
> +
> +		group_weight = group_cpu_load(tg, busiest->cpu);
> 
>  		/*
> -		 * 'group_weight' is contributed by tasks of total weight
> +		 * 'group_weight' is contributed by entities of total weight
>  		 * 'task_load'. To move 'rem_load_move' worth of weight only,
>  		 * we need to move a maximum task load of:
>  		 *
>  		 * 	maxload = (remload / group_weight) * task_load;
>  		 */
>  		maxload = (rem_load_move * task_load) / group_weight;
> +		if (!maxload)
> +			continue;
> 
> -		if (!maxload || !task_load)
> +		if (!busy_cfs_rq->nr_running)
>  			continue;
> 
>  		per_task_load = task_load / busy_cfs_rq->nr_running;
> @@ -1253,23 +1317,28 @@ load_balance_fair(struct rq *this_rq, in
>  					       &cfs_rq_iterator);
> 
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -		/*
> -		 * load_moved holds the task load that was moved. The
> -		 * effective (group) weight moved would be:
> -		 * 	load_moved_eff = load_moved/task_load * group_weight;
> -		 */
> -		load_moved = (group_weight * load_moved) / task_load;
> -
>  		/* Adjust shares on both cpus to reflect load_moved */
> -		group_weight -= load_moved;
> -		set_se_shares(se, group_weight);
> +		if (likely(se)) {
> +			unsigned long load_moved_eff;
> +			unsigned long se_shares;
> 
> -		se = busy_cfs_rq->tg->se[this_cpu];
> -		if (!thisload)
> -			group_weight = load_moved;
> -		else
> -			group_weight = se->load.weight + load_moved;
> -		set_se_shares(se, group_weight);
> +			/*
> +			 * load_moved holds the task load that was moved. The
> +			 * effective (group) weight moved would be:
> +			 *	load_moved_eff = load_moved/task_load *
> +			 *					group_weight;
> +			 */
> +			load_moved_eff = (se->load.weight *
> +						 load_moved) / task_load;
> +
> +			set_se_shares(se, se->load.weight - load_moved_eff);
> +			if (!thisload)
> +				se_shares = load_moved_eff;
> +			else
> +				se_shares = this_se->load.weight +
> +							load_moved_eff;
> +			set_se_shares(this_se, se_shares);
> +		}
>  #endif
> 
>  		rem_load_move -= load_moved;
> -- 
> regards,
> Dhaval

* Re: [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels
  2008-02-25 14:17 ` [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels Dhaval Giani
  2008-02-25 14:17   ` Dhaval Giani
@ 2008-02-28 19:42   ` Peter Zijlstra
  2008-02-29  5:57     ` Dhaval Giani
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2008-02-28 19:42 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage


On Mon, 2008-02-25 at 19:47 +0530, Dhaval Giani wrote:
> This patch makes the group scheduler multi hierarchy aware.

Ok, good thing to do in principle

> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> 
> ---
>  include/linux/sched.h |    2 +-
>  kernel/sched.c        |   41 ++++++++++++++++++++++++-----------------
>  2 files changed, 25 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6.25-rc2/include/linux/sched.h
> ===================================================================
> --- linux-2.6.25-rc2.orig/include/linux/sched.h
> +++ linux-2.6.25-rc2/include/linux/sched.h
> @@ -2031,7 +2031,7 @@ extern void normalize_rt_tasks(void);
>  
>  extern struct task_group init_task_group;
>  
> -extern struct task_group *sched_create_group(void);
> +extern struct task_group *sched_create_group(struct task_group *parent);
>  extern void sched_destroy_group(struct task_group *tg);
>  extern void sched_move_task(struct task_struct *tsk);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> Index: linux-2.6.25-rc2/kernel/sched.c
> ===================================================================
> --- linux-2.6.25-rc2.orig/kernel/sched.c
> +++ linux-2.6.25-rc2/kernel/sched.c
> @@ -7155,10 +7155,11 @@ static void init_rt_rq(struct rt_rq *rt_
>  }
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
> -		struct cfs_rq *cfs_rq, struct sched_entity *se,
> -		int cpu, int add)
> +static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> +				struct sched_entity *se, int cpu, int add,
> +				struct sched_entity *parent)
>  {
> +	struct rq *rq = cpu_rq(cpu);
>  	tg->cfs_rq[cpu] = cfs_rq;
>  	init_cfs_rq(cfs_rq, rq);
>  	cfs_rq->tg = tg;
> @@ -7170,7 +7171,11 @@ static void init_tg_cfs_entry(struct rq 
>  	if (!se)
>  		return;
>  
> -	se->cfs_rq = &rq->cfs;
> +	if (parent == NULL)

!parent ?

> +		se->cfs_rq = &rq->cfs;
> +	else
> +		se->cfs_rq = parent->my_q;
> +
>  	se->my_q = cfs_rq;
>  	se->load.weight = tg->shares;
>  	se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
> @@ -7244,7 +7249,8 @@ void __init sched_init(void)
>  		 * We achieve this by letting init_task_group's tasks sit
>  		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
>  		 */
> -		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
> +		init_tg_cfs_entry(&init_task_group, &rq->cfs,
> +							NULL, i, 1, NULL);
>  		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
>  #elif defined CONFIG_USER_SCHED
>  		/*
> @@ -7260,7 +7266,7 @@ void __init sched_init(void)
>  		 */
>  		init_tg_cfs_entry(rq, &init_task_group,
>  				&per_cpu(init_cfs_rq, i),
> -				&per_cpu(init_sched_entity, i), i, 1);
> +				&per_cpu(init_sched_entity, i), i, 1, NULL);
>  
>  #endif
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> @@ -7630,7 +7636,8 @@ static void free_fair_sched_group(struct
>  	kfree(tg->se);
>  }
>  
> -static int alloc_fair_sched_group(struct task_group *tg)
> +static int alloc_fair_sched_group(struct task_group *tg,
> +					struct task_group *parent)
>  {
>  	struct cfs_rq *cfs_rq;
>  	struct sched_entity *se;
> @@ -7658,8 +7665,11 @@ static int alloc_fair_sched_group(struct
>  				GFP_KERNEL|__GFP_ZERO, cpu_to_node(i));
>  		if (!se)
>  			goto err;
> -
> -		init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
> +		if (!parent) {
> +			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
> +		} else {
> +			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
> +		}

Looks like you got the cases switched, this looks like an instant NULL
deref.
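
(For reference, the corrected branch would presumably read

	if (parent)
		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
	else
		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);

so that parent->se[i] is only dereferenced when a parent actually
exists.)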

>  	}
>  
>  	return 1;
> @@ -7788,7 +7798,7 @@ static void free_sched_group(struct task
>  }
>  
>  /* allocate runqueue etc for a new task group */
> -struct task_group *sched_create_group(void)
> +struct task_group *sched_create_group(struct task_group *parent)
>  {
>  	struct task_group *tg;
>  	unsigned long flags;
> @@ -7798,7 +7808,7 @@ struct task_group *sched_create_group(vo
>  	if (!tg)
>  		return ERR_PTR(-ENOMEM);
>  
> -	if (!alloc_fair_sched_group(tg))
> +	if (!alloc_fair_sched_group(tg, parent))
>  		goto err;
>  
>  	if (!alloc_rt_sched_group(tg))
> @@ -8049,7 +8059,7 @@ static inline struct task_group *cgroup_
>  static struct cgroup_subsys_state *
>  cpu_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
>  {
> -	struct task_group *tg;
> +	struct task_group *tg, *parent;
>  
>  	if (!cgrp->parent) {
>  		/* This is early initialization for the top cgroup */
> @@ -8057,11 +8067,8 @@ cpu_cgroup_create(struct cgroup_subsys *
>  		return &init_task_group.css;
>  	}
>  
> -	/* we support only 1-level deep hierarchical scheduler atm */
> -	if (cgrp->parent->parent)
> -		return ERR_PTR(-EINVAL);
> -
> -	tg = sched_create_group();
> +	parent = cgroup_tg(cgrp->parent);
> +	tg = sched_create_group(parent);
>  	if (IS_ERR(tg))
>  		return ERR_PTR(-ENOMEM);
>  


* Re: [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-25 14:16 ` [RFC, PATCH 1/2] sched: change the fairness model of " Dhaval Giani
  2008-02-27 20:23   ` Serge E. Hallyn
@ 2008-02-28 19:42   ` Peter Zijlstra
  2008-02-29  9:04     ` Dhaval Giani
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2008-02-28 19:42 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage


On Mon, 2008-02-25 at 19:46 +0530, Dhaval Giani wrote:
> This patch allows tasks and groups to exist in the same cfs_rq. With this
> change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
> fairness model where M tasks and N groups exist at the cfs_rq level.

Good. You know I agree with this direction.

> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |   46 +++++++++++++++++++++
>  kernel/sched_fair.c |  113 +++++++++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 137 insertions(+), 22 deletions(-)
> 
> Index: linux-2.6.25-rc2/kernel/sched.c
> ===================================================================
> --- linux-2.6.25-rc2.orig/kernel/sched.c
> +++ linux-2.6.25-rc2/kernel/sched.c
> @@ -224,10 +224,13 @@ struct task_group {
>  };
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +
> +#ifdef CONFIG_USER_SCHED
>  /* Default task group's sched entity on each cpu */
>  static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
>  /* Default task group's cfs_rq on each cpu */
>  static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
> +#endif
>  
>  static struct sched_entity *init_sched_entity_p[NR_CPUS];
>  static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
> @@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
>  		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
>  
>  	tg->se[cpu] = se;
> +	/* se could be NULL for init_task_group */
> +	if (!se)
> +		return;
> +
>  	se->cfs_rq = &rq->cfs;
>  	se->my_q = cfs_rq;
>  	se->load.weight = tg->shares;
> @@ -7217,11 +7224,46 @@ void __init sched_init(void)
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  		init_task_group.shares = init_task_group_load;
>  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> +#ifdef CONFIG_CGROUP_SCHED
> +		/*
> +		 * How much cpu bandwidth does init_task_group get?
> +		 *
> +		 * In case of task-groups formed thr' the cgroup filesystem, it
> +		 * gets 100% of the cpu resources in the system. This overall
> +		 * system cpu resource is divided among the tasks of
> +		 * init_task_group and its child task-groups in a fair manner,
> +		 * based on each entity's (task or task-group's) weight
> +		 * (se->load.weight).
> +		 *
> +		 * In other words, if init_task_group has 10 tasks of weight
> +		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
> +		 * then A0's share of the cpu resource is:
> +		 *
> +		 * 	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
> +		 *
> +		 * We achieve this by letting init_task_group's tasks sit
> +		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> +		 */
> +		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
> +		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);

That seems to agree with that.

> +#elif defined CONFIG_USER_SCHED
> +		/*
> +		 * In case of task-groups formed thr' the user id of tasks,
> +		 * init_task_group represents tasks belonging to root user.
> +		 * Hence it forms a sibling of all subsequent groups formed.
> +		 * In this case, init_task_group gets only a fraction of overall
> +		 * system cpu resource, based on the weight assigned to root
> +		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
> +		 * by letting tasks of init_task_group sit in a separate cfs_rq
> +		 * (init_cfs_rq) and having one entity represent this group of
> +		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
> +		 */
>  		init_tg_cfs_entry(rq, &init_task_group,
>  				&per_cpu(init_cfs_rq, i),
>  				&per_cpu(init_sched_entity, i), i, 1);

But I fail to parse this lengthy comment. What does it do:

    init_group
   /     |    \
uid-0 uid-1000 uid-n

or does it blend uid-0 into the init_group?

>  
>  #endif
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  		init_task_group.rt_runtime =
>  			sysctl_sched_rt_runtime * NSEC_PER_USEC;
> @@ -7435,6 +7477,10 @@ static int rebalance_shares(struct sched
>  		unsigned long total_load = 0, total_shares;
>  		struct task_group *tg = cfs_rq->tg;
>  
> +		/* Skip this group if there is no associated group entity */
> +		if (unlikely(!tg->se[this_cpu]))
> +			continue;

I'm not directly seeing where tg->se[cpu] == NULL comes from..

>  		/* Gather total task load of this group across cpus */
>  		for_each_cpu_mask(i, sdspan)
>  			total_load += tg->cfs_rq[i]->load.weight;
> Index: linux-2.6.25-rc2/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.25-rc2.orig/kernel/sched_fair.c
> +++ linux-2.6.25-rc2/kernel/sched_fair.c
> @@ -732,6 +732,21 @@ static inline struct sched_entity *paren
>  	return se->parent;
>  }
>  
> +/* return the cpu load contributed by a given group on a given cpu */
> +static inline unsigned long group_cpu_load(struct task_group *tg, int cpu)
> +{
> +	struct sched_entity *se = tg->se[cpu], *top_se;
> +	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
> +
> +	if (unlikely(!se))
> +		return cfs_rq->load.weight;
> +
> +	for_each_sched_entity(se)
> +		top_se = se;
> +
> +	return top_se->load.weight;
> +}
> +
>  #define GROUP_IMBALANCE_PCT	20
>  
>  #else	/* CONFIG_FAIR_GROUP_SCHED */
> @@ -1073,6 +1088,17 @@ out_set_cpu:
>  }
>  #endif /* CONFIG_SMP */
>  
> +/* return depth at which a sched entity is present in the hierarchy */
> +static inline int depth_se(struct sched_entity *se)
> +{
> +	int depth = 0;
> +
> +	for_each_sched_entity(se)
> +		depth++;
> +
> +	return depth;
> +}
> +
>  
>  /*
>   * Preempt the current task with a newly woken task if needed:
> @@ -1083,6 +1109,7 @@ static void check_preempt_wakeup(struct 
>  	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
>  	struct sched_entity *se = &curr->se, *pse = &p->se;
>  	unsigned long gran;
> +	int se_depth, pse_depth;
>  
>  	if (unlikely(rt_prio(p->prio))) {
>  		update_rq_clock(rq);
> @@ -1100,6 +1127,27 @@ static void check_preempt_wakeup(struct 
>  	if (!sched_feat(WAKEUP_PREEMPT))
>  		return;
>  
> +	/*
> +	 * preemption test can be made between sibling entities who are in the
> +	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> +	 * both tasks untill we find their ancestors who are siblings of common
> +	 * parent.
> +	 */
> +
> +	/* First walk up until both entities are at same depth */
> +	se_depth = depth_se(se);
> +	pse_depth = depth_se(pse);
> +
> +	while (se_depth > pse_depth) {
> +		se_depth--;
> +		se = parent_entity(se);
> +	}
> +
> +	while (pse_depth > se_depth) {
> +		pse_depth--;
> +		pse = parent_entity(pse);
> +	}

Sad, but needed.. for now..

>  	while (!is_same_group(se, pse)) {
>  		se = parent_entity(se);
>  		pse = parent_entity(pse);
> @@ -1166,13 +1214,22 @@ static void put_prev_task_fair(struct rq
>  static struct task_struct *
>  __load_balance_iterator(struct cfs_rq *cfs_rq, struct rb_node *curr)
>  {
> -	struct task_struct *p;
> +	struct task_struct *p = NULL;
> +	struct sched_entity *se;
>  
>  	if (!curr)
>  		return NULL;
>  
> -	p = rb_entry(curr, struct task_struct, se.run_node);
> -	cfs_rq->rb_load_balance_curr = rb_next(curr);
> +	/* Skip over entities that are not tasks */
> +	do {
> +		se = rb_entry(curr, struct sched_entity, run_node);
> +		curr = rb_next(curr);
> +	} while (curr && !entity_is_task(se));
> +
> +	cfs_rq->rb_load_balance_curr = curr;
> +
> +	if (entity_is_task(se))
> +		p = task_of(se);
>  
>  	return p;
>  }
> @@ -1210,21 +1267,28 @@ load_balance_fair(struct rq *this_rq, in
>  		struct cfs_rq *this_cfs_rq = busy_cfs_rq->tg->cfs_rq[this_cpu];
>  		unsigned long maxload, task_load, group_weight;
>  		unsigned long thisload, per_task_load;
> -		struct sched_entity *se = busy_cfs_rq->tg->se[busiest->cpu];
> +		struct task_group *tg = busy_cfs_rq->tg;
> +		struct sched_entity *se = tg->se[busiest->cpu],
> +				    *this_se = tg->se[this_cpu];
>  
>  		task_load = busy_cfs_rq->load.weight;
> -		group_weight = se->load.weight;
> +		if (!task_load)
> +			continue;
> +
> +		group_weight = group_cpu_load(tg, busiest->cpu);
>  
>  		/*
> -		 * 'group_weight' is contributed by tasks of total weight
> +		 * 'group_weight' is contributed by entities of total weight
>  		 * 'task_load'. To move 'rem_load_move' worth of weight only,
>  		 * we need to move a maximum task load of:
>  		 *
>  		 * 	maxload = (remload / group_weight) * task_load;
>  		 */
>  		maxload = (rem_load_move * task_load) / group_weight;
> +		if (!maxload)
> +			continue;
>  
> -		if (!maxload || !task_load)
> +		if (!busy_cfs_rq->nr_running)
>  			continue;
>  
>  		per_task_load = task_load / busy_cfs_rq->nr_running;
> @@ -1253,23 +1317,28 @@ load_balance_fair(struct rq *this_rq, in
>  					       &cfs_rq_iterator);
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -		/*
> -		 * load_moved holds the task load that was moved. The
> -		 * effective (group) weight moved would be:
> -		 * 	load_moved_eff = load_moved/task_load * group_weight;
> -		 */
> -		load_moved = (group_weight * load_moved) / task_load;
> -
>  		/* Adjust shares on both cpus to reflect load_moved */
> -		group_weight -= load_moved;
> -		set_se_shares(se, group_weight);
> +		if (likely(se)) {
> +			unsigned long load_moved_eff;
> +			unsigned long se_shares;
>  
> -		se = busy_cfs_rq->tg->se[this_cpu];
> -		if (!thisload)
> -			group_weight = load_moved;
> -		else
> -			group_weight = se->load.weight + load_moved;
> -		set_se_shares(se, group_weight);
> +			/*
> +			 * load_moved holds the task load that was moved. The
> +			 * effective (group) weight moved would be:
> +			 *	load_moved_eff = load_moved/task_load *
> +			 *					group_weight;
> +			 */
> +			load_moved_eff = (se->load.weight *
> +						 load_moved) / task_load;
> +
> +			set_se_shares(se, se->load.weight - load_moved_eff);
> +			if (!thisload)
> +				se_shares = load_moved_eff;
> +			else
> +				se_shares = this_se->load.weight +
> +							load_moved_eff;
> +			set_se_shares(this_se, se_shares);
> +		}
>  #endif
>  
>  		rem_load_move -= load_moved;

Didn't pay too much attention to the lb part as you said that needs
work.


* Re: [RFC, PATCH 1/2] sched: allow the CFS group scheduler to have multiple levels
  2008-02-28 19:42   ` Peter Zijlstra
@ 2008-02-29  5:57     ` Dhaval Giani
  0 siblings, 0 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-02-29  5:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage

On Thu, Feb 28, 2008 at 08:42:46PM +0100, Peter Zijlstra wrote:
> 
> On Mon, 2008-02-25 at 19:47 +0530, Dhaval Giani wrote:
> > This patch makes the group scheduler multi hierarchy aware.
> 
> Ok, good thing to do in principle
> 

thanks!

> > Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> > 
> > ---
> >  include/linux/sched.h |    2 +-
> >  kernel/sched.c        |   41 ++++++++++++++++++++++++-----------------
> >  2 files changed, 25 insertions(+), 18 deletions(-)
> > 
> > Index: linux-2.6.25-rc2/include/linux/sched.h
> > ===================================================================
> > --- linux-2.6.25-rc2.orig/include/linux/sched.h
> > +++ linux-2.6.25-rc2/include/linux/sched.h
> > @@ -2031,7 +2031,7 @@ extern void normalize_rt_tasks(void);
> >  
> >  extern struct task_group init_task_group;
> >  
> > -extern struct task_group *sched_create_group(void);
> > +extern struct task_group *sched_create_group(struct task_group *parent);
> >  extern void sched_destroy_group(struct task_group *tg);
> >  extern void sched_move_task(struct task_struct *tsk);
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > Index: linux-2.6.25-rc2/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.25-rc2.orig/kernel/sched.c
> > +++ linux-2.6.25-rc2/kernel/sched.c
> > @@ -7155,10 +7155,11 @@ static void init_rt_rq(struct rt_rq *rt_
> >  }
> >  
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > -static void init_tg_cfs_entry(struct rq *rq, struct task_group *tg,
> > -		struct cfs_rq *cfs_rq, struct sched_entity *se,
> > -		int cpu, int add)
> > +static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> > +				struct sched_entity *se, int cpu, int add,
> > +				struct sched_entity *parent)
> >  {
> > +	struct rq *rq = cpu_rq(cpu);
> >  	tg->cfs_rq[cpu] = cfs_rq;
> >  	init_cfs_rq(cfs_rq, rq);
> >  	cfs_rq->tg = tg;
> > @@ -7170,7 +7171,11 @@ static void init_tg_cfs_entry(struct rq 
> >  	if (!se)
> >  		return;
> >  
> > -	se->cfs_rq = &rq->cfs;
> > +	if (parent == NULL)
> 
> !parent ?
> 

Will do. (Need to start remembering to do this one!)

> > +		se->cfs_rq = &rq->cfs;
> > +	else
> > +		se->cfs_rq = parent->my_q;
> > +
> >  	se->my_q = cfs_rq;
> >  	se->load.weight = tg->shares;
> >  	se->load.inv_weight = div64_64(1ULL<<32, se->load.weight);
> > @@ -7244,7 +7249,8 @@ void __init sched_init(void)
> >  		 * We achieve this by letting init_task_group's tasks sit
> >  		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> >  		 */
> > -		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
> > +		init_tg_cfs_entry(&init_task_group, &rq->cfs,
> > +							NULL, i, 1, NULL);
> >  		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
> >  #elif defined CONFIG_USER_SCHED
> >  		/*
> > @@ -7260,7 +7266,7 @@ void __init sched_init(void)
> >  		 */
> >  		init_tg_cfs_entry(rq, &init_task_group,
> >  				&per_cpu(init_cfs_rq, i),
> > -				&per_cpu(init_sched_entity, i), i, 1);
> > +				&per_cpu(init_sched_entity, i), i, 1, NULL);
> >  
> >  #endif
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> > @@ -7630,7 +7636,8 @@ static void free_fair_sched_group(struct
> >  	kfree(tg->se);
> >  }
> >  
> > -static int alloc_fair_sched_group(struct task_group *tg)
> > +static int alloc_fair_sched_group(struct task_group *tg,
> > +					struct task_group *parent)
> >  {
> >  	struct cfs_rq *cfs_rq;
> >  	struct sched_entity *se;
> > @@ -7658,8 +7665,11 @@ static int alloc_fair_sched_group(struct
> >  				GFP_KERNEL|__GFP_ZERO, cpu_to_node(i));
> >  		if (!se)
> >  			goto err;
> > -
> > -		init_tg_cfs_entry(rq, tg, cfs_rq, se, i, 0);
> > +		if (!parent) {
> > +			init_tg_cfs_entry(tg, cfs_rq, se, i, 0,	parent->se[i]);
> > +		} else {
> > +			init_tg_cfs_entry(tg, cfs_rq, se, i, 0, NULL);
> > +		}
> 
> Looks like you got the cases switched, this looks like an instant NULL
> deref.
> 

eeeks! Yeah, you are right. Funny though how I did not hit it while
testing.
 

-- 
regards,
Dhaval

* Re: [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-28 19:42   ` Peter Zijlstra
@ 2008-02-29  9:04     ` Dhaval Giani
  2008-02-29 10:37       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Dhaval Giani @ 2008-02-29  9:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage

On Thu, Feb 28, 2008 at 08:42:48PM +0100, Peter Zijlstra wrote:
> 
> On Mon, 2008-02-25 at 19:46 +0530, Dhaval Giani wrote:
> > This patch allows tasks and groups to exist in the same cfs_rq. With this
> > change the CFS group scheduling follows a 1/(M+N) model from a 1/(1+N)
> > fairness model where M tasks and N groups exist at the cfs_rq level.
> 
> Good. You know I agree with this direction.
> 
> > Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > ---
> >  kernel/sched.c      |   46 +++++++++++++++++++++
> >  kernel/sched_fair.c |  113 +++++++++++++++++++++++++++++++++++++++++-----------
> >  2 files changed, 137 insertions(+), 22 deletions(-)
> > 
> > Index: linux-2.6.25-rc2/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.25-rc2.orig/kernel/sched.c
> > +++ linux-2.6.25-rc2/kernel/sched.c
> > @@ -224,10 +224,13 @@ struct task_group {
> >  };
> >  
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> > +
> > +#ifdef CONFIG_USER_SCHED
> >  /* Default task group's sched entity on each cpu */
> >  static DEFINE_PER_CPU(struct sched_entity, init_sched_entity);
> >  /* Default task group's cfs_rq on each cpu */
> >  static DEFINE_PER_CPU(struct cfs_rq, init_cfs_rq) ____cacheline_aligned_in_smp;
> > +#endif
> >  
> >  static struct sched_entity *init_sched_entity_p[NR_CPUS];
> >  static struct cfs_rq *init_cfs_rq_p[NR_CPUS];
> > @@ -7163,6 +7166,10 @@ static void init_tg_cfs_entry(struct rq 
> >  		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
> >  
> >  	tg->se[cpu] = se;
> > +	/* se could be NULL for init_task_group */
> > +	if (!se)
> > +		return;
> > +
> >  	se->cfs_rq = &rq->cfs;
> >  	se->my_q = cfs_rq;
> >  	se->load.weight = tg->shares;
> > @@ -7217,11 +7224,46 @@ void __init sched_init(void)
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  		init_task_group.shares = init_task_group_load;
> >  		INIT_LIST_HEAD(&rq->leaf_cfs_rq_list);
> > +#ifdef CONFIG_CGROUP_SCHED
> > +		/*
> > +		 * How much cpu bandwidth does init_task_group get?
> > +		 *
> > +		 * In case of task-groups formed thr' the cgroup filesystem, it
> > +		 * gets 100% of the cpu resources in the system. This overall
> > +		 * system cpu resource is divided among the tasks of
> > +		 * init_task_group and its child task-groups in a fair manner,
> > +		 * based on each entity's (task or task-group's) weight
> > +		 * (se->load.weight).
> > +		 *
> > +		 * In other words, if init_task_group has 10 tasks of weight
> > +		 * 1024) and two child groups A0 and A1 (of weight 1024 each),
> > +		 * then A0's share of the cpu resource is:
> > +		 *
> > +		 * 	A0's bandwidth = 1024 / (10*1024 + 1024 + 1024) = 8.33%
> > +		 *
> > +		 * We achieve this by letting init_task_group's tasks sit
> > +		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> > +		 */
> > +		init_tg_cfs_entry(rq, &init_task_group, &rq->cfs, NULL, i, 1);
> > +		init_tg_rt_entry(rq, &init_task_group, &rq->rt, NULL, i, 1);
> 
> That seems to agree with that.
> 
> > +#elif defined CONFIG_USER_SCHED
> > +		/*
> > +		 * In case of task-groups formed thr' the user id of tasks,
> > +		 * init_task_group represents tasks belonging to root user.
> > +		 * Hence it forms a sibling of all subsequent groups formed.
> > +		 * In this case, init_task_group gets only a fraction of overall
> > +		 * system cpu resource, based on the weight assigned to root
> > +		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
> > +		 * by letting tasks of init_task_group sit in a separate cfs_rq
> > +		 * (init_cfs_rq) and having one entity represent this group of
> > +		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
> > +		 */
> >  		init_tg_cfs_entry(rq, &init_task_group,
> >  				&per_cpu(init_cfs_rq, i),
> >  				&per_cpu(init_sched_entity, i), i, 1);
> 
> But I fail to parse this lengthy comment. What does it do:
> 
>     init_group
>    /     |    \
> uid-0 uid-1000 uid-n
> 
> or does it blend uid-0 into the init_group?
> 

It blends uid-0 (root) into init_group.

<snip>

> > @@ -1100,6 +1127,27 @@ static void check_preempt_wakeup(struct 
> >  	if (!sched_feat(WAKEUP_PREEMPT))
> >  		return;
> >  
> > +	/*
> > +	 * preemption test can be made between sibling entities who are in the
> > +	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> > +	 * both tasks untill we find their ancestors who are siblings of common
> > +	 * parent.
> > +	 */
> > +
> > +	/* First walk up until both entities are at same depth */
> > +	se_depth = depth_se(se);
> > +	pse_depth = depth_se(pse);
> > +
> > +	while (se_depth > pse_depth) {
> > +		se_depth--;
> > +		se = parent_entity(se);
> > +	}
> > +
> > +	while (pse_depth > se_depth) {
> > +		pse_depth--;
> > +		pse = parent_entity(pse);
> > +	}
> 
> Sad, but needed.. for now..
> 

better ideas if any are welcome! Cannot think of anything right now :(
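
For readers following along, the walk boils down to the usual "bring
both nodes to the same depth, then climb in lockstep" ancestor search.
A schematic standalone version follows; node, depth() and
find_siblings() are illustrative names, with a bare parent pointer
standing in for sched_entity:

	struct node {
		struct node *parent;		/* NULL at the root */
	};

	int depth(struct node *n)
	{
		int d = 0;

		for (; n; n = n->parent)
			d++;
		return d;
	}

	/* walk a and b up until they sit on the same level */
	void find_siblings(struct node **a, struct node **b)
	{
		int da = depth(*a), db = depth(*b);

		/* equalise depths first */
		while (da > db) {
			da--;
			*a = (*a)->parent;
		}
		while (db > da) {
			db--;
			*b = (*b)->parent;
		}

		/* then climb both sides until they share a parent */
		while ((*a)->parent != (*b)->parent) {
			*a = (*a)->parent;
			*b = (*b)->parent;
		}
	}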

-- 
regards,
Dhaval

* Re: [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-29  9:04     ` Dhaval Giani
@ 2008-02-29 10:37       ` Peter Zijlstra
  2008-03-04  9:49         ` Dhaval Giani
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2008-02-29 10:37 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage


On Fri, 2008-02-29 at 14:34 +0530, Dhaval Giani wrote:

> > > +#elif defined CONFIG_USER_SCHED
> > > +		/*
> > > +		 * In case of task-groups formed thr' the user id of tasks,
> > > +		 * init_task_group represents tasks belonging to root user.
> > > +		 * Hence it forms a sibling of all subsequent groups formed.
> > > +		 * In this case, init_task_group gets only a fraction of overall
> > > +		 * system cpu resource, based on the weight assigned to root
> > > +		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
> > > +		 * by letting tasks of init_task_group sit in a separate cfs_rq
> > > +		 * (init_cfs_rq) and having one entity represent this group of
> > > +		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
> > > +		 */
> > >  		init_tg_cfs_entry(rq, &init_task_group,
> > >  				&per_cpu(init_cfs_rq, i),
> > >  				&per_cpu(init_sched_entity, i), i, 1);
> > 
> > But I fail to parse this lengthy comment. What does it do:
> > 
> >     init_group
> >    /     |    \
> > uid-0 uid-1000 uid-n
> > 
> > or does it blend uid-0 into the init_group?
> > 
> 
> It blends uid-0 (root) into init_group.

Any particular reason why? It seems to me uid-0 should be treated like
any other uid.

> > > @@ -1100,6 +1127,27 @@ static void check_preempt_wakeup(struct 
> > >  	if (!sched_feat(WAKEUP_PREEMPT))
> > >  		return;
> > >  
> > > +	/*
> > > +	 * preemption test can be made between sibling entities who are in the
> > > +	 * same cfs_rq i.e who have a common parent. Walk up the hierarchy of
> > > +	 * both tasks untill we find their ancestors who are siblings of common
> > > +	 * parent.
> > > +	 */
> > > +
> > > +	/* First walk up until both entities are at same depth */
> > > +	se_depth = depth_se(se);
> > > +	pse_depth = depth_se(pse);
> > > +
> > > +	while (se_depth > pse_depth) {
> > > +		se_depth--;
> > > +		se = parent_entity(se);
> > > +	}
> > > +
> > > +	while (pse_depth > se_depth) {
> > > +		pse_depth--;
> > > +		pse = parent_entity(pse);
> > > +	}
> > 
> > Sad, but needed.. for now..
> > 
> 
> better ideas if any are welcome! Cannot think of anything right now :(

Single rq can do without :-) If only I could get that latency isolation
going...


* Re: [RFC, PATCH 1/2] sched: change the fairness model of the CFS group scheduler
  2008-02-29 10:37       ` Peter Zijlstra
@ 2008-03-04  9:49         ` Dhaval Giani
  0 siblings, 0 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-03-04  9:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Ingo Molnar, Sudhir Kumar, Balbir Singh,
	Aneesh Kumar KV, lkml, vgoyal, serue, menage

On Fri, Feb 29, 2008 at 11:37:55AM +0100, Peter Zijlstra wrote:
> 
> On Fri, 2008-02-29 at 14:34 +0530, Dhaval Giani wrote:
> 
> > > > +#elif defined CONFIG_USER_SCHED
> > > > +		/*
> > > > +		 * In case of task-groups formed thr' the user id of tasks,
> > > > +		 * init_task_group represents tasks belonging to root user.
> > > > +		 * Hence it forms a sibling of all subsequent groups formed.
> > > > +		 * In this case, init_task_group gets only a fraction of overall
> > > > +		 * system cpu resource, based on the weight assigned to root
> > > > +		 * user's cpu share (INIT_TASK_GROUP_LOAD). This is accomplished
> > > > +		 * by letting tasks of init_task_group sit in a separate cfs_rq
> > > > +		 * (init_cfs_rq) and having one entity represent this group of
> > > > +		 * tasks in rq->cfs (i.e init_task_group->se[] != NULL).
> > > > +		 */
> > > >  		init_tg_cfs_entry(rq, &init_task_group,
> > > >  				&per_cpu(init_cfs_rq, i),
> > > >  				&per_cpu(init_sched_entity, i), i, 1);
> > > 
> > > But I fail to parse this lengthy comment. What does it do:
> > > 
> > >     init_group
> > >    /     |    \
> > > uid-0 uid-1000 uid-n
> > > 
> > > or does it blend uid-0 into the init_group?
> > > 
> > 
> > It blends uid-0 (root) into init_group.
> 
> Any particular reason why? It seems to me uid-0 should be treated like
> any other uid.
> 

Ah, I misunderstood your question. We have not changed anything for UID
scheduling as no task can (should) exist at the root level (init_group).
Your initial figure is right, sorry for the confusion.

Thanks,
-- 
regards,
Dhaval

