* [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
@ 2012-12-10  8:22 Alex Shi
  2012-12-10  8:22 ` [PATCH 01/18] sched: select_task_rq_fair clean up Alex Shi
                   ` (18 more replies)
  0 siblings, 19 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

This patchset is temporarily based on the tip/sched/core tree, since it is
more stable than tip/master, and it is easy to rebase onto tip/master.

It includes 3 parts:

1, simplified fork, patches 1~4: simplify the fork/exec/wake logic in
find_idlest_group and select_task_rq_fair. This gives a 10+% gain in
hackbench process and thread performance on our 4-socket SNB EP machine.

2, enable load average in LB, patches 5~9: use the load average in
load balancing, with a runnable load value initialization bug fix and a
new forked task load contribution enhancement.

3, power aware scheduling, patches 10~18:
define 2 new power aware policies, balance and powersaving, and then
try to spread or pack tasks onto CPU units according to the chosen
scheduler policy. This can save considerable power when the number of
tasks in the system is no more than the number of CPUs.

Any comments are appreciated!

Best regards!
Alex

[PATCH 01/18] sched: select_task_rq_fair clean up
[PATCH 02/18] sched: fix find_idlest_group mess logical
[PATCH 03/18] sched: don't need go to smaller sched domain
[PATCH 04/18] sched: remove domain iterations in fork/exec/wake
[PATCH 05/18] sched: load tracking bug fix
[PATCH 06/18] sched: set initial load avg of new forked task as its
[PATCH 07/18] sched: compute runnable load avg in cpu_load and
[PATCH 08/18] sched: consider runnable load average in move_tasks
[PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[PATCH 10/18] sched: add sched_policy in kernel
[PATCH 11/18] sched: add sched_policy and it's sysfs interface
[PATCH 12/18] sched: log the cpu utilization at rq
[PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
[PATCH 14/18] sched: add power/performance balance allowed flag
[PATCH 15/18] sched: don't care if the local group has capacity
[PATCH 16/18] sched: pull all tasks from source group
[PATCH 17/18] sched: power aware load balance,
[PATCH 18/18] sched: lazy powersaving balance

* [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-11  4:23   ` Preeti U Murthy
  2012-12-10  8:22 ` [PATCH 02/18] sched: fix find_idlest_group mess logical Alex Shi
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

It is impossible to miss a task-allowed cpu in an eligible group.

And since find_idlest_group only returns a group other than the one
containing the old cpu, it is also impossible for the new cpu to be the
same as the old cpu.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..df99456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
 
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
-- 
1.7.5.1


* [PATCH 02/18] sched: fix find_idlest_group mess logical
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  2012-12-10  8:22 ` [PATCH 01/18] sched: select_task_rq_fair clean up Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-11  5:08   ` Preeti U Murthy
  2012-12-10  8:22 ` [PATCH 03/18] sched: don't need go to smaller sched domain Alex Shi
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

There are 4 situations in the function:
1, no group allows the task;
	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
2, only the local group allows the task;
	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
3, only non-local groups allow the task;
	so min_load assigned, this_load = 0, idlest != NULL
4, the local group plus another group allow the task;
	so min_load assigned, this_load assigned, idlest != NULL

The current logic returns NULL in the first 3 scenarios, and still
returns NULL in the 4th situation if the idlest group is heavier than
the local group.

Actually, the groups in situations 2 and 3 are also eligible to host
the task. And in the 4th situation it is fine to keep biasing toward
the local group. Hence this patch.
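
For reference, a compact standalone model of what the patched function
returns in each of the four situations above (illustration only; names
are simplified and not taken from the patch):

struct group { unsigned long avg_load; };

/* this_group/this_load: the local group and its load; idlest/min_load:
 * the lowest-load group seen so far (now including the local one). */
static struct group *pick(struct group *this_group, unsigned long this_load,
			  struct group *idlest, unsigned long min_load,
			  unsigned long imbalance)
{
	/* situation 1: both NULL -> NULL; 2: idlest == this_group;
	 * 3: this_group == NULL -> return the non-local idlest */
	if (this_group && idlest != this_group)
		/* situation 4: bias toward the local group again */
		if (100 * this_load < imbalance * min_load)
			idlest = this_group;
	return idlest;
}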

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df99456..b40bc2b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;
+
 	return idlest;
 }
 
-- 
1.7.5.1


* [PATCH 03/18] sched: don't need go to smaller sched domain
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  2012-12-10  8:22 ` [PATCH 01/18] sched: select_task_rq_fair clean up Alex Shi
  2012-12-10  8:22 ` [PATCH 02/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 04/18] sched: remove domain iterations in fork/exec/wake Alex Shi
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

If the parent sched domain has no cpu that allows the task, neither
will its child. So bail out to skip the useless checking.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b40bc2b..05ee54e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3150,10 +3150,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			load_idx = sd->wake_idx;
 
 		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
+		if (!group)
+			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
 
-- 
1.7.5.1


* [PATCH 04/18] sched: remove domain iterations in fork/exec/wake
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (2 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 03/18] sched: don't need go to smaller sched domain Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 05/18] sched: load tracking bug fix Alex Shi
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

The bottom-up cpu search in the domain tree seems to come from commit
3dbd5342074a1e ("sched: multilevel sbe sbf"); its purpose is to balance
tasks over domains at all levels.

This balancing is costly when there are many domains/groups in a large
system. And forcing tasks to spread among different domains may cause
performance problems due to bad locality.

If we remove this code, we get faster fork/exec/wake plus better
balancing across the whole system, which also reduces migrations in
later load balancing.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |   20 +-------------------
 1 files changed, 1 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 05ee54e..1faf89f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3136,15 +3136,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}
 
-	while (sd) {
+	if (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
@@ -3154,18 +3148,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
 	}
 unlock:
 	rcu_read_unlock();
-- 
1.7.5.1


* [PATCH 05/18] sched: load tracking bug fix
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (3 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 04/18] sched: remove domain iterations in fork/exec/wake Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 06/18] sched: set initial load avg of new forked task as its load weight Alex Shi
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise, random values in those fields cause a mess when the new task
is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..e6533e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1534,6 +1534,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.5.1


* [PATCH 06/18] sched: set initial load avg of new forked task as its load weight
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (4 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 05/18] sched: load tracking bug fix Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-21  4:33   ` Namhyung Kim
  2012-12-10  8:22 ` [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

A new task has no runnable sum at its first runnable time, so a burst
of forks puts all the new tasks onto just a few idle cpus: with a zero
load contribution, a cpu that has already received several new tasks
still looks idle to the selection code. Set the initial load avg of a
newly forked task to its load weight to resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    2 +-
 kernel/sched/fair.c   |   13 +++++++++++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5dafac3..093f9cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1058,6 +1058,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e6533e1..96fa5f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1648,7 +1648,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1faf89f..61c8d24 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1277,8 +1277,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1312,6 +1313,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * set the initial load avg of a new task the same as its load,
+	 * in order to avoid a fork burst making a few cpus too heavy
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1476,7 +1483,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_curr(cfs_rq);
 	account_entity_enqueue(cfs_rq, se);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags &
+				(ENQUEUE_WAKEUP | ENQUEUE_NEWTASK));
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -2586,6 +2594,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		cfs_rq->h_nr_running++;
 
 		flags = ENQUEUE_WAKEUP;
+		flags &= ~ENQUEUE_NEWTASK;
 	}
 
 	for_each_sched_entity(se) {
-- 
1.7.5.1


* [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (5 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 06/18] sched: set initial load avg of new forked task as its load weight Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-12  3:57   ` Preeti U Murthy
  2012-12-10  8:22 ` [PATCH 08/18] sched: consider runnable load average in move_tasks Alex Shi
                   ` (11 subsequent siblings)
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

These are the base values used in load balancing; update them with the
rq's runnable load average, so load balancing naturally takes the
runnable load avg into account.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c |    4 ++--
 kernel/sched/fair.c |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96fa5f1..0ecb907 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
 	unsigned long pending_updates;
 
 	/*
@@ -2537,7 +2537,7 @@ static void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
 
 	calc_load_account_active(this_rq);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c8d24..6d893a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
 /*
@@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return rq->cfs.runnable_load_avg / nr_running;
 
 	return 0;
 }
-- 
1.7.5.1


* [PATCH 08/18] sched: consider runnable load average in move_tasks
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (6 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-12  4:41   ` Preeti U Murthy
  2012-12-10  8:22 ` [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

Besides using the runnable load average in the background, move_tasks
is also a key function in load balancing. We need to consider the
runnable load average in it as well, for an apples-to-apples load
comparison.
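
As a rough worked example (made-up numbers, only to illustrate the
task_h_load_avg() calculation added below): a task whose hierarchical
load is 1024 but which was runnable for only half of the tracked period
is counted as 512 by move_tasks after this change.

#include <stdio.h>

int main(void)
{
	unsigned long h_load = 1024;			/* assumed task_h_load(p) */
	unsigned int sum = 23000, period = 46000;	/* ~50% runnable */

	printf("load seen by move_tasks: %lu\n", h_load * sum / period);
	return 0;	/* prints 512 */
}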

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |   11 ++++++++++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d893a6..bbb069c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3741,6 +3741,15 @@ static unsigned long task_h_load(struct task_struct *p);
 
 static const unsigned int sched_nr_migrate_break = 32;
 
+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+	u32 period = p->se.avg.runnable_avg_period;
+	if (!period)
+		return 0;
+
+	return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
+}
+
 /*
  * move_tasks tries to move up to imbalance weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -3776,7 +3785,7 @@ static int move_tasks(struct lb_env *env)
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
-		load = task_h_load(p);
+		load = task_h_load_avg(p);
 
 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
-- 
1.7.5.1


* [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (7 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 08/18] sched: consider runnable load average in move_tasks Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 10/18] sched: add sched_policy in kernel Alex Shi
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |    8 +-------
 kernel/sched/core.c   |    7 +------
 kernel/sched/fair.c   |   13 ++-----------
 kernel/sched/sched.h  |    9 +--------
 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 093f9cd..62dbc74 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1184,13 +1184,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0ecb907..05167f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1526,12 +1526,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 	p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bbb069c..55c7e4f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,8 +882,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3165,12 +3164,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3193,7 +3186,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -5894,9 +5886,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5eca173..0a75a43 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 
1.7.5.1


* [PATCH 10/18] sched: add sched_policy in kernel
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (8 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 11/18] sched: add sched_policy and it's sysfs interface Alex Shi
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

The current scheduler behavior only considers maximizing system
performance, so it tries to spread tasks over more cpu sockets and cpu
cores.

To add power awareness, this patchset introduces 2 new scheduler
policies: powersaving and balance. The old scheduling behavior is kept
as the performance policy.

performance: the current scheduling behavior, tries to spread tasks
                on more CPU sockets or cores.
powersaving: packs tasks into a sched group until every LCPU in the
                group is nearly full.
balance    : packs tasks into a sched group until group_capacity
                number of CPUs is nearly full.

The following patches will enable powersaving scheduling in CFS.
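
A minimal sketch of how later patches in this series use the policy to
choose a packing threshold (simplified, for illustration only; see
patches 13 and 17 for the real code):

#define SCHED_POLICY_PERFORMANCE	(0x1)
#define SCHED_POLICY_POWERSAVING	(0x2)
#define SCHED_POLICY_BALANCE		(0x4)

/* how many tasks a group may take before the policy stops packing into it */
static unsigned long pack_threshold(int policy, unsigned long group_weight,
				    unsigned long group_capacity)
{
	if (policy == SCHED_POLICY_POWERSAVING)
		return group_weight;	/* pack until every LCPU is used */
	if (policy == SCHED_POLICY_BALANCE)
		return group_capacity;	/* pack only up to full-power CPUs */
	return 0;			/* performance: no packing at all */
}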

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  |    2 ++
 kernel/sched/sched.h |    6 ++++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 55c7e4f..2cf8673 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0a75a43..7a5eae4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+#define SCHED_POLICY_BALANCE		(0x4)
+
+extern int __read_mostly sched_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.5.1


* [PATCH 11/18] sched: add sched_policy and it's sysfs interface
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (9 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 10/18] sched: add sched_policy in kernel Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 12/18] sched: log the cpu utilization at rq Alex Shi
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving

This means the sched policy currently in use is 'powersaving'.

Users can change the policy with 'echo':
 echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
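
A minimal userspace sketch for reading the knob (illustration only; it
assumes a kernel with this patch applied and CONFIG_SYSFS enabled):

#include <stdio.h>

int main(void)
{
	char buf[32] = "";
	FILE *f = fopen("/sys/devices/system/cpu/sched_policy/"
			"current_sched_policy", "r");

	if (!f) {
		perror("open current_sched_policy");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("current sched policy: %s", buf);
	fclose(f);
	return 0;
}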

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   24 +++++++
 drivers/base/cpu.c                                 |    2 +
 include/linux/cpu.h                                |    2 +
 kernel/sched/fair.c                                |   71 ++++++++++++++++++++
 4 files changed, 99 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..9c9acbf 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,30 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
+		/sys/devices/system/cpu/sched_policy/available_sched_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS scheduler policy showing and setting interface.
+
+		available_sched_policy shows the 3 available policies:
+		performance, balance and powersaving.
+		current_sched_policy shows the current scheduler policy. Users
+		can change the policy by writing to it.
+
+		The policy decides how the CFS scheduler distributes tasks onto
+		CPU units when the task count is less than the LCPU count.
+
+		performance: try to spread tasks onto more CPU sockets and
+		more CPU cores.
+
+		powersaving: try to pack tasks onto the same core or same CPU
+		until every LCPU is busy.
+
+		balance:     try to pack tasks onto the same core or same CPU
+		until all full-power CPUs are busy. This policy also considers
+		system performance while trying to save power.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 6345294..5f6a573 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -330,4 +330,6 @@ void __init cpu_dev_init(void)
 		panic("Failed to register CPU subsystem");
 
 	cpu_dev_register_generic();
+
+	create_sysfs_sched_policy_group(cpu_subsys.dev_root);
 }
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index ce7a074..b2e9265 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -36,6 +36,8 @@ extern void cpu_remove_dev_attr(struct device_attribute *attr);
 extern int cpu_add_dev_attr_group(struct attribute_group *attrs);
 extern void cpu_remove_dev_attr_group(struct attribute_group *attrs);
 
+extern int create_sysfs_sched_policy_group(struct device *dev);
+
 #ifdef CONFIG_HOTPLUG_CPU
 extern void unregister_cpu(struct cpu *cpu);
 extern ssize_t arch_cpu_probe(const char *, size_t);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cf8673..1b1deb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5871,6 +5871,77 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 
 /* The default scheduler policy is 'performance'. */
 int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	else if (sched_policy == SCHED_POLICY_BALANCE)
+		return sprintf(buf, "balance\n");
+	return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_policy = SCHED_POLICY_POWERSAVING;
+	else if (!strcmp(str_policy, "balance"))
+		sched_policy = SCHED_POLICY_BALANCE;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+						set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+		show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+	&dev_attr_current_sched_policy.attr,
+	&dev_attr_available_sched_policy.attr,
+	NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+	.attrs = sched_policy_default_attrs,
+	.name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+#else
+int __init create_sysfs_sched_policy_group(struct device *dev) { return 0; }
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
-- 
1.7.5.1


* [PATCH 12/18] sched: log the cpu utilization at rq
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (10 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 11/18] sched: add sched_policy and it's sysfs interface Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

The cpu's utilization measures how busy the cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since util is no more than 1, we use its percentage value in later
calculations.

Considering the interval between balancing runs, if the cpu util is
up to 97%, we can treat it as fully loaded in a healthy kernel.

In later power aware scheduling, we care about how busy the cpu is,
not how much load weight it carries. Power consumption is more related
to busy time than to load weight.
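
A small worked example of the rq->util value introduced below (made-up
average numbers, only for illustration):

#include <stdio.h>

#define FULL_UTIL	97	/* threshold used by the later patches */

int main(void)
{
	unsigned int sum = 45000, period = 46000;	/* assumed rq->avg values */
	unsigned int util = sum * 100 / period;

	printf("util = %u%%, treated as full: %s\n",
	       util, util >= FULL_UTIL ? "yes" : "no");
	return 0;	/* 45000 * 100 / 46000 = 97 -> full */
}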

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  |    4 ++++
 kernel/sched/sched.h |    3 +++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b1deb8..4cc1764 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1269,8 +1269,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	int period;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+	rq->util = rq->avg.runnable_avg_sum * 100 / period;
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a5eae4..5247560 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,8 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* Take as full load, if the cpu util is up to 97% */
+#define FULL_UTIL	97
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +483,7 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+	unsigned int util;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.5.1


* [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (11 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 12/18] sched: log the cpu utilization at rq Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 14/18] sched: add power/performance balance allowed flag Alex Shi
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That saves power for the other groups.

The trade off is an extra power aware statistics collection during
group seeking. But since the collection only happens when the power
policy is eligible, the worst case in hackbench testing only drops
about 2% with the powersaving/balance policies. No clear change for
the performance policy.

When the system gets a fork burst, the new tasks' utils may be zero
(rq->util == 0). That makes the new tasks go to a few idle cpus, and
then they get migrated to others in periodic load balance. That is not
helpful for either power or performance.
So this patch doesn't use rq->util to judge whether a cpu has vacancy;
instead it uses the rq's nr_running.

BTW,
I had tried to track the fork burst, e.g. only using nr_running when
the system gets 2 or more forks in the same tick. But that was still
bad, since the runnable load avg tracks about 4s of rq util, so caring
about a single tick is far from enough.
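
A simplified, self-contained model of the group selection added below
(illustration only; names are made up, the real code is
get_sg_power_stats()/get_sd_power_stats() and get_cpu_for_power_policy()):

#include <limits.h>

struct grp { unsigned long nr_running, threshold; };

/* return the busiest group that still has vacancy, or -1 to fall back
 * to the normal performance path */
static int pick_group(const struct grp *g, int nr_groups)
{
	long best_delta = LONG_MAX;
	int i, leader = -1;

	for (i = 0; i < nr_groups; i++) {
		/* nr_running is used instead of rq->util, so a burst of
		 * freshly forked tasks (util == 0) is still accounted */
		long delta = (long)g[i].threshold - (long)g[i].nr_running;

		if (delta > 0 && delta < best_delta) {
			best_delta = delta;
			leader = i;
		}
	}
	return leader;
}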

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |  230 +++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 179 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4cc1764..729f35d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3092,25 +3092,189 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned int  this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned int  busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Varibles of power awaring scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
+
+	unsigned long sum_shared_running;	/* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+			sgs->group_utils += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the doamin.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	int sd_min_delta = INT_MAX;
+	int cpu = task_cpu(p);
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+			continue;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+
+		g_delta = threshold - sgs.group_utils;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_utils += sgs.group_utils;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	unsigned long threshold;
+
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sd->span_weight;
+	else
+		threshold = sds->sd_capacity;
+
+	memset(sds, 0, sizeof(*sds));
+	get_sd_power_stats(sd, p, sds);
+
+	/* still can hold one more task in this domain */
+	if (sds->sd_utils < threshold)
+		return sched_policy;
+
+	return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
+	return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3136,11 +3300,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
 
 	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
+
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
@@ -3950,51 +4123,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-};
-
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
-- 
1.7.5.1


* [PATCH 14/18] sched: add power/performance balance allowed flag
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (12 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 15/18] sched: don't care if the local group has capacity Alex Shi
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

If the cpu's condition is suitable for power balance, power_lb will
be set and perf_lb will be cleared. If the condition is suitable for
performance balance, the values are set the opposite way.

If the domain is suitable for power balance, but the balance should not
be done by this cpu, both perf_lb and power_lb are cleared to wait for
a suitable cpu to do the power balance. That means no balance at all,
neither power balance nor performance balance, in this domain.

This logic will be implemented by the following patches.
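
A short summary of the flag combinations described above, in kernel
comment style (illustration only, not part of the patch):

/*
 *   power_lb  perf_lb   action on this cpu
 *      1         0      do power aware balance
 *      0         1      do performance balance (the default, see below)
 *      0         0      the domain fits power balance, but this cpu
 *                       should not do it: skip balancing and wait for
 *                       a more suitable cpu
 */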

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 729f35d..57a85cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3759,6 +3759,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if power balance needed */
+	int			perf_lb;   /* if performance balance needed */
 };
 
 /*
@@ -4909,6 +4911,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.power_lb	= 0,
+		.perf_lb	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
-- 
1.7.5.1


* [PATCH 15/18] sched: don't care if the local group has capacity
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (13 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 14/18] sched: add power/performance balance allowed flag Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 16/18] sched: pull all tasks from source group Alex Shi
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

In power aware scheduling, we don't care about load weight, and we
don't want to pull tasks just because the local group has capacity.
The local group may have no tasks at all at that time, which is exactly
what power balance hopes for.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57a85cc..fe0ba07 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4496,8 +4496,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
+		if (prefer_sibling && !local_group && sds->this_has_capacity
+				&& env->perf_lb)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.5.1


* [PATCH 16/18] sched: pull all tasks from source group
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (14 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 15/18] sched: don't care if the local group has capacity Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 17/18] sched: power aware load balance, Alex Shi
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

In power balance, we hope some sched groups become fully empty so that
their CPUs can save power. So we want to move all tasks away from them.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe0ba07..27630ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4843,7 +4843,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -4947,7 +4949,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.5.1


* [PATCH 17/18] sched: power aware load balance,
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (15 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 16/18] sched: pull all tasks from source group Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-10  8:22 ` [PATCH 18/18] sched: lazy powersaving balance Alex Shi
  2012-12-11  0:51 ` [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

This patch enables the power aware consideration in load balance.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, packing tasks onto fewer sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to move
dispersed tasks into fewer groups until those groups are full of tasks.

This patch reuses some of Suresh's power saving load balance code.
A summary of the enabling logic:
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balance, just do it
and skip performance load balance. But if the domain is suitable for
power balance while this cpu is not appropriate, stop both
power and performance balance; otherwise do performance load balance.

A test shows the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (data in Watts):
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57~68
Watts, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system has a few sustained tasks, using a power policy can give
a performance/power gain, e.g. a sysbench fileio randrw test with 16
threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c |  128 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 125 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27630ae..e2ba22f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3124,6 +3124,7 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
 	unsigned long min_load_per_task; /* load_per_task in group_min */
 	unsigned int  leader_util;	/* sum utilizations of group_leader */
 	unsigned int  min_util;		/* sum utilizations of group_min */
@@ -4125,6 +4126,111 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
+
+/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (!env->power_lb)
+		return;
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+			(sgs->sum_nr_running == threshold &&
+			sgs->group_utils >= threshold_util)))
+		env->power_lb = 0;
+
+	/*
+	 * Do performance load balance if any group overload or maybe
+	 * potentially overload.
+	 */
+	if (sgs->group_utils > threshold * 100 ||
+			sgs->sum_nr_running > threshold) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold * 100)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4364,6 +4470,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+		sgs->group_utils += rq->util;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4472,6 +4580,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4523,6 +4632,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -4740,6 +4850,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -4917,8 +5040,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -5996,7 +6119,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH 18/18] sched: lazy powersaving balance
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (16 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 17/18] sched: power aware load balance, Alex Shi
@ 2012-12-10  8:22 ` Alex Shi
  2012-12-11  0:51 ` [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  18 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-10  8:22 UTC (permalink / raw)
  To: rob, mingo, peterz
  Cc: gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot

When the number of active tasks in a sched domain waves around the
powersaving scheduling criteria, scheduling will thrash between
powersaving balance and performance balance, bringing unnecessary task
migrations. A typical benchmark that triggers the issue is 'make -j x'.

To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. If there has been no performance LB
in the last 32 load balances and no more than 4 performance LBs in the
last 64, or no LB at all for 8 * max_interval ms, then we accept a
powersaving LB. Otherwise, give up this power aware LB chance.
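
A condensed, user-space illustration of that history check (the masks and
the <= 4 threshold are taken from the patch below; __builtin_popcountll
stands in here for the kernel's hweight64()):

#include <stdint.h>

#define PERF_LB_HH_MASK	0xffffffff00000000ULL	/* balances 33..64 back */
#define PERF_LB_LH_MASK	0xffffffffULL		/* the last 32 balances */

/* shift the history left on every balance; set bit 0 on a performance LB */
static uint64_t record_balance(uint64_t record, int was_perf_lb)
{
	record <<= 1;
	if (was_perf_lb)
		record |= 0x1;
	return record;
}

/* a powersaving LB is accepted only when performance LB has been rare */
static int powersaving_lb_allowed(uint64_t record)
{
	return __builtin_popcountll(record & PERF_LB_HH_MASK) <= 4 &&
	       !(record & PERF_LB_LH_MASK);
}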

With this patch, the worst case for power scheduling -- kbuild -- gets a
similar or even better performance/power value under the balance and
performance policies, while powersaving is worse.

So maybe we had better use the 'balance' policy in general scenarios.

On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x' results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    186.026 /246 21          190.182 /208 25        200.873 /210 23
x = 4    198.883 /145 34          204.856 /120 40        218.843 /116 39
x = 6    208.458 /106 45          214.981 /93 50         233.561 /86 49
x = 8    218.304 /86 53           223.527 /76 58         233.008 /75 57
x = 12   231.829 /71 60           268.98  /55 67         247.013 /60 67
x = 16   262.112 /53 71           267.898 /50 74         344.589 /41 70
x = 32   306.969 /36 90           310.774 /37 86         313.359 /38 83

data format: 175.603 /417 13
	175.603: average Watts
	417: seconds (compile time)
	13: scaled performance/power = 1000000 / time / power
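
As a quick sanity check of the formula with that first row:
1000000 / 417 s / 175.603 W = 13.66..., reported (truncated) as 13.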

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/fair.c   |   66 ++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62dbc74..b2837d5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -930,6 +930,7 @@ struct sched_domain {
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
+	u64	perf_lb_record;	/* performance balance record */
 
 	u64 last_update;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e2ba22f..1cee892 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4231,6 +4231,58 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	}
 }
 
+#define PERF_LB_HH_MASK		0xffffffff00000000ULL
+#define PERF_LB_LH_MASK		0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	env->sd->perf_lb_record <<= 1;
+
+	if (env->perf_lb) {
+		env->sd->perf_lb_record |= 0x1;
+		return 1;
+	}
+
+	/*
+	 * The situation isn't eligible for performance balance. If this_cpu
+	 * is not eligible or the timing is not suitable for lazy powersaving
+	 * balance, we will stop both powersaving and performance balance.
+	 */
+	if (env->power_lb && sds->this == sds->group_leader
+			&& sds->group_leader != sds->group_min) {
+		int interval;
+
+		/* powersaving balance interval set as 8 * max_interval */
+		interval = msecs_to_jiffies(8 * env->sd->max_interval);
+		if (time_after(jiffies, env->sd->last_balance + interval))
+			env->sd->perf_lb_record = 0;
+
+		/*
+		 * An eligible timing is: no performance balance in the last 32
+		 * balances and no more than 4 performance balances in the
+		 * last 64 balances, or no balance within the powersaving
+		 * interval time.
+		 */
+		if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+			&& !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+			env->imbalance = sds->min_load_per_task;
+			return 0;
+		}
+
+	}
+	env->power_lb = 0;
+	sds->group_min = NULL;
+	return 0;
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4817,7 +4869,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 }
 
 /******* find_busiest_group() helpers end here *********************/
-
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance. If there isn't an imbalance, and
@@ -4850,17 +4901,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
-	if (!env->perf_lb && !env->power_lb)
-		return  NULL;
-
-	if (env->power_lb) {
-		if (sds.this == sds.group_leader &&
-				sds.group_leader != sds.group_min) {
-			env->imbalance = sds.min_load_per_task;
-			return sds.group_min;
-		}
-		env->power_lb = 0;
-		return NULL;
+	if (!need_perf_balance(env, &sds)) {
+		return sds.group_min;
 	}
 
 	/*
-- 
1.7.5.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (17 preceding siblings ...)
  2012-12-10  8:22 ` [PATCH 18/18] sched: lazy powersaving balance Alex Shi
@ 2012-12-11  0:51 ` Alex Shi
  2012-12-11 12:10   ` Alex Shi
  18 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-11  0:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot,
	Preeti U Murthy, Arjan van de Ven

On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi <alex.shi@intel.com> wrote:
> This patchset base on tip/sched/core tree temporary, since it is more
> steady than tip/master. and it's easy to rebase on tip/master.
>
> It includes 3 parts changes.
>
> 1, simplified fork, patch 1~4, that simplified the fork/exec/wake log in
> find_idlest_group and select_task_rq_fair. it can increase 10+%
> hackbench process and thread performance on our 4 sockets SNB EP machine.
>
> 2, enable load average into LB, patch 5~9, that using load average in
> load balancing, with a runnable load value industrialization bug fix and
> new fork task load contrib enhancement.
>
> 3, power awareness scheduling, patch 10~18,
> Defined 2 new power aware policy balance and
> powersaving, and then try to spread or shrink tasks on CPU unit
> according the different scheduler policy. That can save much power when
> task number in system is no more then cpu number.

Tried the sysbench fileio test in rndrw mode, with the thread count at half
the number of LCPUs: performance is similar, and about 5~10 Watts of power
are saved on 2-socket SNB EP and NHM EP boxes.

Any comments :)
>
> Any comments are appreciated!
>
> Best regards!
> Alex
>
> [PATCH 01/18] sched: select_task_rq_fair clean up
> [PATCH 02/18] sched: fix find_idlest_group mess logical
> [PATCH 03/18] sched: don't need go to smaller sched domain
> [PATCH 04/18] sched: remove domain iterations in fork/exec/wake
> [PATCH 05/18] sched: load tracking bug fix
> [PATCH 06/18] sched: set initial load avg of new forked task as its
> [PATCH 07/18] sched: compute runnable load avg in cpu_load and
> [PATCH 08/18] sched: consider runnable load average in move_tasks
> [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [PATCH 10/18] sched: add sched_policy in kernel
> [PATCH 11/18] sched: add sched_policy and it's sysfs interface
> [PATCH 12/18] sched: log the cpu utilization at rq
> [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake
> [PATCH 14/18] sched: add power/performance balance allowed flag
> [PATCH 15/18] sched: don't care if the local group has capacity
> [PATCH 16/18] sched: pull all tasks from source group
> [PATCH 17/18] sched: power aware load balance,
> [PATCH 18/18] sched: lazy powersaving balance
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-10  8:22 ` [PATCH 01/18] sched: select_task_rq_fair clean up Alex Shi
@ 2012-12-11  4:23   ` Preeti U Murthy
  2012-12-11  5:28     ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-11  4:23 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
> It is impossible to miss a task allowed cpu in a eligible group.

The one thing I am concerned about here is whether there is a possibility
of the task changing its tsk_cpus_allowed() while this code is running.

I.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed()
for the task changes, perhaps by the user himself, and might no longer
include the cpus in the idle group. After this, find_idlest_cpu() is
called. I mean a race condition, in short. Then we might not have an
eligible cpu in that group, right?

> And since find_idlest_group only return a different group which
> excludes old cpu, it's also imporissible to find a new cpu same as old
> cpu.

This I agree with.

> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 59e072b..df99456 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		}
>  
>  		new_cpu = find_idlest_cpu(group, p, cpu);
> -		if (new_cpu == -1 || new_cpu == cpu) {
> -			/* Now try balancing at a lower domain level of cpu */
> -			sd = sd->child;
> -			continue;
> -		}
>  
>  		/* Now try balancing at a lower domain level of new_cpu */
>  		cpu = new_cpu;
> 
Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
  2012-12-10  8:22 ` [PATCH 02/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2012-12-11  5:08   ` Preeti U Murthy
  2012-12-11  5:29     ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-11  5:08 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
> There is 4 situations in the function:
> 1, no task allowed group;
> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
> 2, only local group task allowed;
> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
> 3, only non-local task group allowed;
> 	so min_load assigned, this_load = 0, idlest != NULL
> 4, local group + another group are task allowed.
> 	so min_load assigned, this_load assigned, idlest != NULL
> 
> Current logical will return NULL in first 3 kinds of scenarios.
> And still return NULL, if idlest group is heavier then the
> local group in the 4th situation.
> 
> Actually, I thought groups in situation 2,3 are also eligible to host
> the task. And in 4th situation, agree to bias toward local group.
> So, has this patch.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c |   12 +++++++++---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df99456..b40bc2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  		  int this_cpu, int load_idx)
>  {
>  	struct sched_group *idlest = NULL, *group = sd->groups;
> +	struct sched_group *this_group = NULL;
>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>  	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>  
> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  
>  		if (local_group) {
>  			this_load = avg_load;
> -		} else if (avg_load < min_load) {
> +			this_group = group;
> +		}
> +		if (avg_load < min_load) {
>  			min_load = avg_load;
>  			idlest = group;
>  		}
>  	} while (group = group->next, group != sd->groups);
>  
> -	if (!idlest || 100*this_load < imbalance*min_load)
> -		return NULL;
> +	if (this_group && idlest != this_group)
> +		/* Bias toward our group again */
> +		if (100*this_load < imbalance*min_load)
> +			idlest = this_group;

If the idlest group is heavier than this_group (or, to put it better, if
the difference between the loads of the local group and the idlest group is
less than a threshold, meaning there is no point in moving load away from
the local group), you return NULL. That already means this_group is chosen
as the candidate group for the task to run on; one does not have to
return it explicitly.

Let me explain:
if find_idlest_group() returns NULL to mark your case 4, it means there
is no group idler than the group to which this_cpu belongs, at that
level of the sched domain. Which is fair enough.

So the question now is which group is the idlest so far under such a
circumstance. It is the group containing this_cpu, i.e. this_group. After
this, sd->child is chosen, which is nothing but this_group (the sd
hierarchy moves towards the cpu it belongs to). The idlest group search
then begins again here.

> +
>  	return idlest;
>  }
>  
> 
Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-11  4:23   ` Preeti U Murthy
@ 2012-12-11  5:28     ` Alex Shi
  2012-12-11  6:30       ` Preeti U Murthy
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-11  5:28 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> It is impossible to miss a task allowed cpu in a eligible group.
> 
> The one thing I am concerned with here is if there is a possibility of
> the task changing its tsk_cpus_allowed() while this code is running.
> 
> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
> for the task changes,perhaps by the user himself,which might not include
> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
> a race condition in short.Then we might not have an eligible cpu in that
> group right?

Your worry makes sense, but the code handles the situation: in
select_task_rq(), the allowed cpus are checked again. If the answer is
no, it will fall back to the old cpu.
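
For reference, that re-check in select_task_rq() looks roughly like this
in kernel/sched/core.c of this era (paraphrased from memory, so treat it
as a sketch rather than the exact code):

	cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

	/* the allowed mask may have changed under us, or the cpu gone offline */
	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
		     !cpu_online(cpu)))
		cpu = select_fallback_rq(task_cpu(p), p);

	return cpu;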
> 
>> And since find_idlest_group only return a different group which
>> excludes old cpu, it's also imporissible to find a new cpu same as old
>> cpu.
> 
> This I agree with.
> 
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c |    5 -----
>>  1 files changed, 0 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 59e072b..df99456 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>  		}
>>  
>>  		new_cpu = find_idlest_cpu(group, p, cpu);
>> -		if (new_cpu == -1 || new_cpu == cpu) {
>> -			/* Now try balancing at a lower domain level of cpu */
>> -			sd = sd->child;
>> -			continue;
>> -		}
>>  
>>  		/* Now try balancing at a lower domain level of new_cpu */
>>  		cpu = new_cpu;
>>
> Regards
> Preeti U Murthy
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
  2012-12-11  5:08   ` Preeti U Murthy
@ 2012-12-11  5:29     ` Alex Shi
  2012-12-11  5:50       ` Preeti U Murthy
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-11  5:29 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> There is 4 situations in the function:
>> 1, no task allowed group;
>> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>> 2, only local group task allowed;
>> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>> 3, only non-local task group allowed;
>> 	so min_load assigned, this_load = 0, idlest != NULL
>> 4, local group + another group are task allowed.
>> 	so min_load assigned, this_load assigned, idlest != NULL
>>
>> Current logical will return NULL in first 3 kinds of scenarios.
>> And still return NULL, if idlest group is heavier then the
>> local group in the 4th situation.
>>
>> Actually, I thought groups in situation 2,3 are also eligible to host
>> the task. And in 4th situation, agree to bias toward local group.
>> So, has this patch.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c |   12 +++++++++---
>>  1 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index df99456..b40bc2b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>  		  int this_cpu, int load_idx)
>>  {
>>  	struct sched_group *idlest = NULL, *group = sd->groups;
>> +	struct sched_group *this_group = NULL;
>>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>>  	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>  
>> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>  
>>  		if (local_group) {
>>  			this_load = avg_load;
>> -		} else if (avg_load < min_load) {
>> +			this_group = group;
>> +		}
>> +		if (avg_load < min_load) {
>>  			min_load = avg_load;
>>  			idlest = group;
>>  		}
>>  	} while (group = group->next, group != sd->groups);
>>  
>> -	if (!idlest || 100*this_load < imbalance*min_load)
>> -		return NULL;
>> +	if (this_group && idlest != this_group)
>> +		/* Bias toward our group again */
>> +		if (100*this_load < imbalance*min_load)
>> +			idlest = this_group;
> 
> If the idlest group is heavier than this_group(or to put it better if
> the difference in the loads of the local group and idlest group is less
> than a threshold,it means there is no point moving the load from the
> local group) you return NULL,that immediately means this_group is chosen
> as the candidate group for the task to run,one does not have to
> explicitly return that.

In situation 4, this_group is not NULL.
> 
> Let me explain:
> find_idlest_group()-if it returns NULL to mark your case4,it means there
> is no idler group than the group to which this_cpu belongs to, at that
> level of sched domain.Which is fair enough.
> 
> So now the question is under such a circumstance which is the idlest
> group so far.It is the group containing this_cpu,i.e.this_group.After
> this sd->child is chosen which is nothing but this_group(sd hierarchy
> moves towards the cpu it belongs to). Again here the idlest group search
> begins.
> 
>> +
>>  	return idlest;
>>  }
>>  
>>
> Regards
> Preeti U Murthy
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
  2012-12-11  5:29     ` Alex Shi
@ 2012-12-11  5:50       ` Preeti U Murthy
  2012-12-11 11:55         ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-11  5:50 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

Hi Alex,
On 12/11/2012 10:59 AM, Alex Shi wrote:
> On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
>> Hi Alex,
>>
>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>> There is 4 situations in the function:
>>> 1, no task allowed group;
>>> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>>> 2, only local group task allowed;
>>> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>>> 3, only non-local task group allowed;
>>> 	so min_load assigned, this_load = 0, idlest != NULL
>>> 4, local group + another group are task allowed.
>>> 	so min_load assigned, this_load assigned, idlest != NULL
>>>
>>> Current logical will return NULL in first 3 kinds of scenarios.
>>> And still return NULL, if idlest group is heavier then the
>>> local group in the 4th situation.
>>>
>>> Actually, I thought groups in situation 2,3 are also eligible to host
>>> the task. And in 4th situation, agree to bias toward local group.
>>> So, has this patch.
>>>
>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>>> ---
>>>  kernel/sched/fair.c |   12 +++++++++---
>>>  1 files changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index df99456..b40bc2b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>  		  int this_cpu, int load_idx)
>>>  {
>>>  	struct sched_group *idlest = NULL, *group = sd->groups;
>>> +	struct sched_group *this_group = NULL;
>>>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>>>  	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>>  
>>> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>  
>>>  		if (local_group) {
>>>  			this_load = avg_load;
>>> -		} else if (avg_load < min_load) {
>>> +			this_group = group;
>>> +		}
>>> +		if (avg_load < min_load) {
>>>  			min_load = avg_load;
>>>  			idlest = group;
>>>  		}
>>>  	} while (group = group->next, group != sd->groups);
>>>  
>>> -	if (!idlest || 100*this_load < imbalance*min_load)
>>> -		return NULL;
>>> +	if (this_group && idlest != this_group)
>>> +		/* Bias toward our group again */
>>> +		if (100*this_load < imbalance*min_load)
>>> +			idlest = this_group;
>>
>> If the idlest group is heavier than this_group(or to put it better if
>> the difference in the loads of the local group and idlest group is less
>> than a threshold,it means there is no point moving the load from the
>> local group) you return NULL,that immediately means this_group is chosen
>> as the candidate group for the task to run,one does not have to
>> explicitly return that.
> 
> In situation 4, this_group is not NULL.

True. The NULL return value of find_idlest_group() indicates that there is
no idler group than the local group (the group to which the cpu belongs);
it does not indicate that there is no host group for the task. If this is
the case, select_task_rq_fair() falls back to the group (sd->child) to
which the cpu chosen in the previous iteration belongs. This is nothing
but this_group in the current iteration.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-11  5:28     ` Alex Shi
@ 2012-12-11  6:30       ` Preeti U Murthy
  2012-12-11 11:53         ` Alex Shi
  2012-12-21  4:28         ` Namhyung Kim
  0 siblings, 2 replies; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-11  6:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 10:58 AM, Alex Shi wrote:
> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>> Hi Alex,
>>
>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>> It is impossible to miss a task allowed cpu in a eligible group.
>>
>> The one thing I am concerned with here is if there is a possibility of
>> the task changing its tsk_cpus_allowed() while this code is running.
>>
>> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
>> for the task changes,perhaps by the user himself,which might not include
>> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
>> a race condition in short.Then we might not have an eligible cpu in that
>> group right?
> 
> your worry make sense, but the code handle the situation, in
> select_task_rq(), it will check the cpu allowed again. if the answer is
> no, it will fallback to old cpu.
>>
>>> And since find_idlest_group only return a different group which
>>> excludes old cpu, it's also imporissible to find a new cpu same as old
>>> cpu.

I doubt this will work correctly. Consider the following situation: the
sched domain hierarchy begins with an sd that encloses both socket1 and
socket2:

cpu0 cpu1  | cpu2 cpu3
-----------|-------------
 socket1   |  socket2

old cpu = cpu1

Iteration 1:
1. find_idlest_group() returns socket2 as the idlest.
2. The task changes its tsk_cpus_allowed to {0,1}.
3. find_idlest_cpu() returns cpu2.

* without your patch
   1. The condition after find_idlest_cpu() returns -1, and sd->child is
chosen, which happens to be socket1.
   2. In the next iteration, find_idlest_group() and find_idlest_cpu()
will probably choose cpu0, which happens to be idler than cpu1 and is
in tsk_cpus_allowed.

* with your patch
   1. The condition after find_idlest_cpu() does not exist, therefore
a sched domain to which cpu2 belongs is chosen; this is socket2 (under
the for_each_domain() loop).
   2. In the next iteration, find_idlest_group() returns NULL, because
there is no cpu which intersects with tsk_cpus_allowed.
   3. In select_task_rq(), the fallback cpu is chosen even though an idle
cpu existed.

So my concern is that although select_task_rq() checks
tsk_cpus_allowed(), you might end up choosing a different path of
sched_domains compared to without this patch, as shown above.

In short, without the "if (new_cpu == -1)" condition we might get misled
into doing unnecessary iterations over the wrong sched domains in
select_task_rq_fair(). (Think about situations where not all the cpus of
socket2 are disallowed by the task; then there will be more iterations on
the wrong path of sched_domains before exit, compared to what is shown
above.)

Regards
Preeti U Murthy



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-11  6:30       ` Preeti U Murthy
@ 2012-12-11 11:53         ` Alex Shi
  2012-12-12  5:26           ` Preeti U Murthy
  2012-12-21  4:28         ` Namhyung Kim
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-11 11:53 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 02:30 PM, Preeti U Murthy wrote:
> On 12/11/2012 10:58 AM, Alex Shi wrote:
>> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>>> Hi Alex,
>>>
>>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>>> It is impossible to miss a task allowed cpu in a eligible group.
>>>
>>> The one thing I am concerned with here is if there is a possibility of
>>> the task changing its tsk_cpus_allowed() while this code is running.
>>>
>>> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
>>> for the task changes,perhaps by the user himself,which might not include
>>> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
>>> a race condition in short.Then we might not have an eligible cpu in that
>>> group right?
>>
>> your worry make sense, but the code handle the situation, in
>> select_task_rq(), it will check the cpu allowed again. if the answer is
>> no, it will fallback to old cpu.
>>>
>>>> And since find_idlest_group only return a different group which
>>>> excludes old cpu, it's also imporissible to find a new cpu same as old
>>>> cpu.
> 
> I doubt this will work correctly.Consider the following situation:sched
> domain begins with sd that encloses both socket1 and socket2
> 
> cpu0 cpu1  | cpu2 cpu3
> -----------|-------------
>  socket1   |  socket2
> 
> old cpu = cpu1
> 
> Iteration1:
> 1.find_idlest_group() returns socket2 to be idlest.
> 2.task changes tsk_allowed_cpus to 0,1
> 3.find_idlest_cpu() returns cpu2
> 
> * without your patch
>    1.the condition after find_idlest_cpu() returns -1,and sd->child is
> chosen which happens to be socket1
>    2.in the next iteration, find_idlest_group() and find_idlest_cpu()
> will probably choose cpu0 which happens to be idler than cpu1,which is
> in tsk_allowed_cpu.

Thanks for the question, Preeti! :)

Yes, with more iterations you have a better chance of finding a
task-allowed cpu in select_task_rq_fair. But how often does that
situation happen, and how much gain do you get here?
As the LCPU count increases, many iterations cause a scalability issue;
that is what the simplified fork part of this patchset is for, and why it
gives a 10% performance gain on hackbench process/thread.

And if you insist on not missing your chance in strf()
(select_task_rq_fair), the current iteration is still not enough. How do
you know the idlest cpu is still the idlest after this function finishes?
How do you ensure the allowed cpus won't change again?

A quick snapshot is enough for balancing here; we still have periodic
balancing.

> 
> * with your patch
>    1.the condition after find_idlest_cpu() does not exist,therefore
> a sched domain to which cpu2 belongs to is chosen.this is socket2.(under
> the for_each_domain() loop).
>    2.in the next iteration, find_idlest_group() return NULL,because
> there is no cpu which intersects with tsk_allowed_cpus.
>    3.in select task rq,the fallback cpu is chosen even when an idle cpu
> existed.
> 
> So my concern is though select_task_rq() checks the
> tsk_allowed_cpus(),you might end up choosing a different path of
> sched_domains compared to without this patch as shown above.
> 
> In short without the "if(new_cpu==-1)" condition we might get misled
> doing unnecessary iterations over the wrong sched domains in
> select_task_rq_fair().(Think about situations when not all the cpus of
> socket2 are disallowed by the task,then there will more iterations in

After reading the first 4 patches, I believe you will find the patchset is
trying to reduce iterations, not increase them.

> the wrong path of sched_domains before exit,compared to what is shown
> above.)
> 
> Regards
> Preeti U Murthy
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] sched: fix find_idlest_group mess logical
  2012-12-11  5:50       ` Preeti U Murthy
@ 2012-12-11 11:55         ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-11 11:55 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 01:50 PM, Preeti U Murthy wrote:
> Hi Alex,
> On 12/11/2012 10:59 AM, Alex Shi wrote:
>> On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
>>> Hi Alex,
>>>
>>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>>> There is 4 situations in the function:
>>>> 1, no task allowed group;
>>>> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>>>> 2, only local group task allowed;
>>>> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>>>> 3, only non-local task group allowed;
>>>> 	so min_load assigned, this_load = 0, idlest != NULL
>>>> 4, local group + another group are task allowed.
>>>> 	so min_load assigned, this_load assigned, idlest != NULL
>>>>
>>>> Current logical will return NULL in first 3 kinds of scenarios.
>>>> And still return NULL, if idlest group is heavier then the
>>>> local group in the 4th situation.
>>>>
>>>> Actually, I thought groups in situation 2,3 are also eligible to host
>>>> the task. And in 4th situation, agree to bias toward local group.
>>>> So, has this patch.
>>>>
>>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>>>> ---
>>>>  kernel/sched/fair.c |   12 +++++++++---
>>>>  1 files changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index df99456..b40bc2b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>>  		  int this_cpu, int load_idx)
>>>>  {
>>>>  	struct sched_group *idlest = NULL, *group = sd->groups;
>>>> +	struct sched_group *this_group = NULL;
>>>>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>>>>  	int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>>>  
>>>> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>>>  
>>>>  		if (local_group) {
>>>>  			this_load = avg_load;
>>>> -		} else if (avg_load < min_load) {
>>>> +			this_group = group;
>>>> +		}
>>>> +		if (avg_load < min_load) {
>>>>  			min_load = avg_load;
>>>>  			idlest = group;
>>>>  		}
>>>>  	} while (group = group->next, group != sd->groups);
>>>>  
>>>> -	if (!idlest || 100*this_load < imbalance*min_load)
>>>> -		return NULL;
>>>> +	if (this_group && idlest != this_group)
>>>> +		/* Bias toward our group again */
>>>> +		if (100*this_load < imbalance*min_load)
>>>> +			idlest = this_group;
>>>
>>> If the idlest group is heavier than this_group(or to put it better if
>>> the difference in the loads of the local group and idlest group is less
>>> than a threshold,it means there is no point moving the load from the
>>> local group) you return NULL,that immediately means this_group is chosen
>>> as the candidate group for the task to run,one does not have to
>>> explicitly return that.
>>
>> In situation 4, this_group is not NULL.
> 
> True.The return value of find_idlest_group() indicates that there is no
> other idle group other than the local group(the group to which cpu
> belongs to). it does not indicate that there is no host group for the
> task.If this is the case,select_task_rq_fair() falls back to the
> group(sd->child) to which the cpu chosen in the previous iteration
> belongs to,This is nothing but this_group in the current iteration.

Sorry, I didn't get you here.
> 
> Regards
> Preeti U Murthy
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11  0:51 ` [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
@ 2012-12-11 12:10   ` Alex Shi
  2012-12-11 15:48     ` Borislav Petkov
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-11 12:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot,
	Preeti U Murthy, Arjan van de Ven

On 12/11/2012 08:51 AM, Alex Shi wrote:
> On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi <alex.shi@intel.com> wrote:
>> This patchset base on tip/sched/core tree temporary, since it is more
>> steady than tip/master. and it's easy to rebase on tip/master.
>>
>> It includes 3 parts changes.
>>
>> 1, simplified fork, patch 1~4, that simplified the fork/exec/wake log in
>> find_idlest_group and select_task_rq_fair. it can increase 10+%
>> hackbench process and thread performance on our 4 sockets SNB EP machine.
>>
>> 2, enable load average into LB, patch 5~9, that using load average in
>> load balancing, with a runnable load value industrialization bug fix and
>> new fork task load contrib enhancement.
>>
>> 3, power awareness scheduling, patch 10~18,
>> Defined 2 new power aware policy balance and
>> powersaving, and then try to spread or shrink tasks on CPU unit
>> according the different scheduler policy. That can save much power when
>> task number in system is no more then cpu number.
> 
> tried with sysbench fileio test rndrw mode, with half thread of LCPU number,
> performance is similar, power can save about 5~10 Watts on 2 sockets SNB EP
> and NHM EP boxes.

Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving and
balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On a NHM EP box
         powersaving               balance   	         performance
x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76

On a 2 sockets SNB EP box.
         powersaving               balance   	         performance
x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44

data format is: 166.516 /88 68
        166.516: average Watts
        88: seconds (compress time)
        68: scaled performance/power = 1000000 / time / power



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 12:10   ` Alex Shi
@ 2012-12-11 15:48     ` Borislav Petkov
  2012-12-11 16:03       ` Arjan van de Ven
  0 siblings, 1 reply; 56+ messages in thread
From: Borislav Petkov @ 2012-12-11 15:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: Alex Shi, rob, mingo, peterz, gregkh, andre.przywara, rjw,
	paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot, Preeti U Murthy, Arjan van de Ven

On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
> Another testing of parallel compress with pigz on Linus' git tree.
> results show we get much better performance/power with powersaving and
> balance policy:
> 
> testing command:
> #pigz -k -c  -p$x -r linux* &> /dev/null
> 
> On a NHM EP box
>          powersaving               balance   	         performance
> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76

This looks funny: so "performance" is eating less watts than
"powersaving" and "balance" on NHM. Could it be that the average watts
measurements on NHM are not correct/precise..? On SNB they look as
expected, according to your scheme.

Also, shouldn't you have the shortest compress times with "performance"?

> 
> On a 2 sockets SNB EP box.
>          powersaving               balance   	         performance
> x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
> x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
> x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44

Ditto here, compress times with "performance" are not the shortest. Or
does "performance" mean something else? :-)

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 15:48     ` Borislav Petkov
@ 2012-12-11 16:03       ` Arjan van de Ven
  2012-12-11 16:13         ` Borislav Petkov
  0 siblings, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2012-12-11 16:03 UTC (permalink / raw)
  To: Borislav Petkov, Alex Shi, Alex Shi, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot, Preeti U Murthy

On 12/11/2012 7:48 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>> Another testing of parallel compress with pigz on Linus' git tree.
>> results show we get much better performance/power with powersaving and
>> balance policy:
>>
>> testing command:
>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>
>> On a NHM EP box
>>           powersaving               balance   	         performance
>> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
>> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>
> This looks funny: so "performance" is eating less watts than
> "powersaving" and "balance" on NHM. Could it be that the average watts
> measurements on NHM are not correct/precise..? On SNB they look as
> expected, according to your scheme.

well... it's not always beneficial to group or to spread out
it depends on cache behavior mostly which is best


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 16:03       ` Arjan van de Ven
@ 2012-12-11 16:13         ` Borislav Petkov
  2012-12-11 16:40           ` Arjan van de Ven
  2012-12-12  1:14           ` Alex Shi
  0 siblings, 2 replies; 56+ messages in thread
From: Borislav Petkov @ 2012-12-11 16:13 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Alex Shi, Alex Shi, rob, mingo, peterz, gregkh, andre.przywara,
	rjw, paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot, Preeti U Murthy

On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
> >On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
> >>Another testing of parallel compress with pigz on Linus' git tree.
> >>results show we get much better performance/power with powersaving and
> >>balance policy:
> >>
> >>testing command:
> >>#pigz -k -c  -p$x -r linux* &> /dev/null
> >>
> >>On a NHM EP box
> >>          powersaving               balance   	         performance
> >>x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
> >>x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
> >
> >This looks funny: so "performance" is eating less watts than
> >"powersaving" and "balance" on NHM. Could it be that the average watts
> >measurements on NHM are not correct/precise..? On SNB they look as
> >expected, according to your scheme.
> 
> well... it's not always beneficial to group or to spread out
> it depends on cache behavior mostly which is best

Let me try to understand what this means: so "performance" above with
8 threads means that those threads are spread out across more than one
socket, no?

If so, this would mean that you have a smaller amount of tasks on each
socket, thus the smaller wattage.

The "powersaving" method OTOH fills up the one socket up to the brim,
thus the slightly higher consumption due to all threads being occupied.

Is that it?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 16:13         ` Borislav Petkov
@ 2012-12-11 16:40           ` Arjan van de Ven
  2012-12-12  9:52             ` Amit Kucheria
  2012-12-12 14:41             ` Borislav Petkov
  2012-12-12  1:14           ` Alex Shi
  1 sibling, 2 replies; 56+ messages in thread
From: Arjan van de Ven @ 2012-12-11 16:40 UTC (permalink / raw)
  To: Borislav Petkov, Alex Shi, Alex Shi, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot, Preeti U Murthy

On 12/11/2012 8:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>           powersaving               balance   	         performance
>>>> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
>>>> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
>
> Is that it?

not sure.

by and large, power efficiency is the same as performance efficiency, with some twists.
or to reword that to be more clear
if you waste performance due to something that becomes inefficient, you're wasting power as well.
now, you might have some hardware effects that can then save you power... but those effects
then first need to overcome the waste from the performance inefficiency... and that almost never happens.

for example, if you have two workloads that each fit barely inside the last level cache...
it's much more efficient to spread these over two sockets... where each has its own full LLC
to use.
If you'd group these together, both would thrash the cache all the time and run inefficient --> bad for power.

now, on the other hand, if you have two threads of a process that share a bunch of data structures,
and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot,
running inefficient --> bad for power.


having said all this, if you have two tasks that don't have such cache effects, the most efficient way
of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that.
But this assumes the tasks don't compete with resources much on the HT level, and achieve good scaling.
and this still has to compete with "race to halt", because if you're done quicker, you can put the memory
in self refresh quicker.

none of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never mind adding power.

my personal gut feeling is that we should just optimize this scheduler stuff for performance, and that
we're going to be doing quite well on power already if we achieve that.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 16:13         ` Borislav Petkov
  2012-12-11 16:40           ` Arjan van de Ven
@ 2012-12-12  1:14           ` Alex Shi
  1 sibling, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-12  1:14 UTC (permalink / raw)
  To: Borislav Petkov, Arjan van de Ven, Alex Shi, rob, mingo, peterz,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot, Preeti U Murthy

On 12/12/2012 12:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving               balance   	         performance
>>>> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
>>>> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise..? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
> 
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
> 
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
> 
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
> 

As Arjan said, we know the performance increase should be due to the
cache sharing in the LLC.
As to the power consumption difference between powersaving and
performance: when we burn 2 sockets' worth of CPU, the cpu load is not
100%, so some LCPUs still have time to go idle or to run at a low
frequency, which also saves some power.
That's just the general situation; different hardware and different CPUs
may have different tuning in the CPU package, core, uncore parts, etc.
So with different benchmarks, the results also differ.


> Is that it?
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-10  8:22 ` [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
@ 2012-12-12  3:57   ` Preeti U Murthy
  2012-12-12  5:52     ` Alex Shi
  2012-12-13  8:45     ` Alex Shi
  0 siblings, 2 replies; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-12  3:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

Hi Alex,
On 12/10/2012 01:52 PM, Alex Shi wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c |    4 ++--
>  kernel/sched/fair.c |    4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96fa5f1..0ecb907 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  void update_idle_cpu_load(struct rq *this_rq)
>  {
>  	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> -	unsigned long load = this_rq->load.weight;
> +	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
>  	unsigned long pending_updates;
>  
>  	/*
> @@ -2537,7 +2537,7 @@ static void update_cpu_load_active(struct rq *this_rq)
>  	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>  	 */
>  	this_rq->last_load_update_tick = jiffies;
> -	__update_cpu_load(this_rq, this_rq->load.weight, 1);
> +	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
>  
>  	calc_load_account_active(this_rq);
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 61c8d24..6d893a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  /* Used instead of source_load when we know the type == 0 */
>  static unsigned long weighted_cpuload(const int cpu)
>  {
> -	return cpu_rq(cpu)->load.weight;
> +	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;

I was wondering why you have typecast cfs.runnable_load_avg to
unsigned long. Have you looked into why it was declared as u64 in the
first place?

>  }
>  
>  /*
> @@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>  	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>  
>  	if (nr_running)
> -		return rq->load.weight / nr_running;
> +		return rq->cfs.runnable_load_avg / nr_running;

rq->cfs.runnable_load_avg is of type u64. You will need a typecast here
as well, right? And how does this division work, given that the return
type is unsigned long?
>  
>  	return 0;
>  }
> 

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] sched: consider runnable load average in move_tasks
  2012-12-10  8:22 ` [PATCH 08/18] sched: consider runnable load average in move_tasks Alex Shi
@ 2012-12-12  4:41   ` Preeti U Murthy
  2012-12-12  6:26     ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-12  4:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

Hi Alex,
On 12/10/2012 01:52 PM, Alex Shi wrote:
> Except using runnable load average in background, move_tasks is also
> the key functions in load balance. We need consider the runnable load
> average in it in order to the apple to apple load comparison.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c |   11 ++++++++++-
>  1 files changed, 10 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d893a6..bbb069c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3741,6 +3741,15 @@ static unsigned long task_h_load(struct task_struct *p);
>  
>  static const unsigned int sched_nr_migrate_break = 32;
>  
> +static unsigned long task_h_load_avg(struct task_struct *p)
> +{
> +	u32 period = p->se.avg.runnable_avg_period;
> +	if (!period)
> +		return 0;
> +
> +	return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
                        ^^^^^^^^^^^^
This might result in an overflow, considering you are multiplying two
32-bit integers. Below is how this is handled in
__update_task_entity_contrib() in kernel/sched/fair.c:

u32 contrib;
/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
contrib /= (se->avg.runnable_avg_period + 1);
se->avg.load_avg_contrib = scale_load(contrib);
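In other words, widening to u64 before the multiply would avoid it; a
rough, untested sketch:

static unsigned long task_h_load_avg(struct task_struct *p)
{
	u32 period = p->se.avg.runnable_avg_period;
	u64 load;

	if (!period)
		return 0;

	/* widen first so the multiply cannot overflow 32 bits */
	load = (u64)task_h_load(p) * p->se.avg.runnable_avg_sum;

	return div_u64(load, period);
}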

Also, why can't p->se.load_avg_contrib be used directly as the return
value of task_h_load_avg(), since it is already updated in
update_task_entity_contrib() and update_group_entity_contrib()?
> +}
> +
>  /*
>   * move_tasks tries to move up to imbalance weighted load from busiest to
>   * this_rq, as part of a balancing operation within domain "sd".
> @@ -3776,7 +3785,7 @@ static int move_tasks(struct lb_env *env)
>  		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>  			goto next;
>  
> -		load = task_h_load(p);
> +		load = task_h_load_avg(p);
>  
>  		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
>  			goto next;
> 

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-11 11:53         ` Alex Shi
@ 2012-12-12  5:26           ` Preeti U Murthy
  0 siblings, 0 replies; 56+ messages in thread
From: Preeti U Murthy @ 2012-12-12  5:26 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/11/2012 05:23 PM, Alex Shi wrote:
> On 12/11/2012 02:30 PM, Preeti U Murthy wrote:
>> On 12/11/2012 10:58 AM, Alex Shi wrote:
>>> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>>>> Hi Alex,
>>>>
>>>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>>>> It is impossible to miss a task allowed cpu in a eligible group.
>>>>
>>>> The one thing I am concerned with here is if there is a possibility of
>>>> the task changing its tsk_cpus_allowed() while this code is running.
>>>>
>>>> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
>>>> for the task changes,perhaps by the user himself,which might not include
>>>> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
>>>> a race condition in short.Then we might not have an eligible cpu in that
>>>> group right?
>>>
>>> your worry make sense, but the code handle the situation, in
>>> select_task_rq(), it will check the cpu allowed again. if the answer is
>>> no, it will fallback to old cpu.
>>>>
>>>>> And since find_idlest_group only return a different group which
>>>>> excludes old cpu, it's also imporissible to find a new cpu same as old
>>>>> cpu.
>>
>> I doubt this will work correctly.Consider the following situation:sched
>> domain begins with sd that encloses both socket1 and socket2
>>
>> cpu0 cpu1  | cpu2 cpu3
>> -----------|-------------
>>  socket1   |  socket2
>>
>> old cpu = cpu1
>>
>> Iteration1:
>> 1.find_idlest_group() returns socket2 to be idlest.
>> 2.task changes tsk_allowed_cpus to 0,1
>> 3.find_idlest_cpu() returns cpu2
>>
>> * without your patch
>>    1.the condition after find_idlest_cpu() returns -1,and sd->child is
>> chosen which happens to be socket1
>>    2.in the next iteration, find_idlest_group() and find_idlest_cpu()
>> will probably choose cpu0 which happens to be idler than cpu1,which is
>> in tsk_allowed_cpu.
> 
> Thanks for question Preeti! :)
> 
> Yes, with more iteration you has more possibility to get task allowed
> cpu in select_task_rq_fair. but how many opportunity the situation
> happened?  how much gain you get here?
> With LCPU increasing many many iterations cause scalability issue. that
> is the simplified forking patchset for. and that why 10% performance
> gain on hackbench process/thread.
> 
> and if you insist want not to miss your chance in strf(), the current
> iteration is still not enough. How you know the idlest cpu is still
> idlest after this function finished? how to ensure the allowed cpu won't
> be changed again?
> 
> A quick snapshot is enough in balancing here. we still has periodic
> balacning.

Hmm ok, let me look at this more closely.
> 
>>
>> * with your patch
>>    1.the condition after find_idlest_cpu() does not exist,therefore
>> a sched domain to which cpu2 belongs to is chosen.this is socket2.(under
>> the for_each_domain() loop).
>>    2.in the next iteration, find_idlest_group() return NULL,because
>> there is no cpu which intersects with tsk_allowed_cpus.
>>    3.in select task rq,the fallback cpu is chosen even when an idle cpu
>> existed.
>>
>> So my concern is though select_task_rq() checks the
>> tsk_allowed_cpus(),you might end up choosing a different path of
>> sched_domains compared to without this patch as shown above.
>>
>> In short without the "if(new_cpu==-1)" condition we might get misled
>> doing unnecessary iterations over the wrong sched domains in
>> select_task_rq_fair().(Think about situations when not all the cpus of
>> socket2 are disallowed by the task,then there will more iterations in
> 
> After read the first 4 patches, believe you will find the patchset is
> trying to reduce iterations, not increase them.

Right, sorry about not noticing this.
> 
>> the wrong path of sched_domains before exit,compared to what is shown
>> above.)
>>
Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-12  3:57   ` Preeti U Murthy
@ 2012-12-12  5:52     ` Alex Shi
  2012-12-13  8:45     ` Alex Shi
  1 sibling, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-12  5:52 UTC (permalink / raw)
  To: Preeti U Murthy, pjt
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, vincent.guittot

On 12/12/2012 11:57 AM, Preeti U Murthy wrote:
> Hi Alex,
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> They are the base values in load balance, update them with rq runnable
>> load average, then the load balance will consider runnable load avg
>> naturally.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/core.c |    4 ++--
>>  kernel/sched/fair.c |    4 ++--
>>  2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 96fa5f1..0ecb907 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>  void update_idle_cpu_load(struct rq *this_rq)
>>  {
>>  	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> -	unsigned long load = this_rq->load.weight;
>> +	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
>>  	unsigned long pending_updates;
>>  
>>  	/*
>> @@ -2537,7 +2537,7 @@ static void update_cpu_load_active(struct rq *this_rq)
>>  	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>>  	 */
>>  	this_rq->last_load_update_tick = jiffies;
>> -	__update_cpu_load(this_rq, this_rq->load.weight, 1);
>> +	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
>>  
>>  	calc_load_account_active(this_rq);
>>  }
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 61c8d24..6d893a6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>  /* Used instead of source_load when we know the type == 0 */
>>  static unsigned long weighted_cpuload(const int cpu)
>>  {
>> -	return cpu_rq(cpu)->load.weight;
>> +	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
> 
> I was wondering why you have typecast cfs.runnable_load_avg to
> unsigned long. Have you looked into why it was declared as u64 in the
> first place?

PJT:
Could we change cfs.runnable_load_avg to unsigned long, since it is an
unsigned long value multiplied by a value less than 1?

> 
>>  }
>>  
>>  /*
>> @@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>>  	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>>  
>>  	if (nr_running)
>> -		return rq->load.weight / nr_running;
>> +		return rq->cfs.runnable_load_avg / nr_running;
> 
> rq->cfs.runnable_load_avg is of type u64. You will need a typecast here
> as well, right? And how does this division work, given that the return
> type is unsigned long?

Yes, an explicit cast is better.
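Something like this, for example (just an untested sketch). Note that a
plain '/' on the u64 would generally not even link on 32-bit, since gcc
emits a call to __udivdi3 that the kernel does not provide, so it is
either an explicit truncation or div_u64() from <linux/math64.h>:

	if (nr_running)
		/* truncate to unsigned long before the division */
		return (unsigned long)rq->cfs.runnable_load_avg / nr_running;

	/* or, keeping the full u64 value: */
	if (nr_running)
		return div_u64(rq->cfs.runnable_load_avg, nr_running);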
>>  
>>  	return 0;
>>  }
>>
> 
> Regards
> Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] sched: consider runnable load average in move_tasks
  2012-12-12  4:41   ` Preeti U Murthy
@ 2012-12-12  6:26     ` Alex Shi
  2012-12-21  4:43       ` Namhyung Kim
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-12  6:26 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/12/2012 12:41 PM, Preeti U Murthy wrote:
> Hi Alex,
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> Except using runnable load average in background, move_tasks is also
>> the key functions in load balance. We need consider the runnable load
>> average in it in order to the apple to apple load comparison.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c |   11 ++++++++++-
>>  1 files changed, 10 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6d893a6..bbb069c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3741,6 +3741,15 @@ static unsigned long task_h_load(struct task_struct *p);
>>  
>>  static const unsigned int sched_nr_migrate_break = 32;
>>  
>> +static unsigned long task_h_load_avg(struct task_struct *p)
>> +{
>> +	u32 period = p->se.avg.runnable_avg_period;
>> +	if (!period)
>> +		return 0;
>> +
>> +	return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
>                         ^^^^^^^^^^^^
> This might result in an overflow, considering you are multiplying two
> 32-bit integers. Below is how this is handled in
> __update_task_entity_contrib() in kernel/sched/fair.c:
> 
> u32 contrib;
> /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
> contrib /= (se->avg.runnable_avg_period + 1);
> se->avg.load_avg_contrib = scale_load(contrib);

scale_load_down() does nothing here.
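With SCHED_LOAD_RESOLUTION at its default of 0 (the increased resolution
is currently compiled out), those macros are just identity, roughly:

/* kernel/sched/sched.h, approximately */
# define SCHED_LOAD_RESOLUTION	0
# define scale_load(w)		(w)
# define scale_load_down(w)	(w)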
> 
> Also, why can't p->se.load_avg_contrib be used directly as the return
> value of task_h_load_avg(), since it is already updated in
> update_task_entity_contrib() and update_group_entity_contrib()?

No, only non-task entities go through update_group_entity_contrib(),
not task entities.


>> +}
>> +
>>  /*
>>   * move_tasks tries to move up to imbalance weighted load from busiest to
>>   * this_rq, as part of a balancing operation within domain "sd".
>> @@ -3776,7 +3785,7 @@ static int move_tasks(struct lb_env *env)
>>  		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>>  			goto next;
>>  
>> -		load = task_h_load(p);
>> +		load = task_h_load_avg(p);
>>  
>>  		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
>>  			goto next;
>>
> 
> Regards
> Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 16:40           ` Arjan van de Ven
@ 2012-12-12  9:52             ` Amit Kucheria
  2012-12-12 13:55               ` Alex Shi
  2012-12-12 14:41             ` Borislav Petkov
  1 sibling, 1 reply; 56+ messages in thread
From: Amit Kucheria @ 2012-12-12  9:52 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Borislav Petkov, Alex Shi, Alex Shi, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot, Preeti U Murthy

On Tue, Dec 11, 2012 at 10:10 PM, Arjan van de Ven
<arjan@linux.intel.com> wrote:
> On 12/11/2012 8:13 AM, Borislav Petkov wrote:
>>
>> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>>>
>>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>>>
>>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>>>
>>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>>> results show we get much better performance/power with powersaving and
>>>>> balance policy:
>>>>>
>>>>> testing command:
>>>>> #pigz -k -c  -p$x -r linux* &> /dev/null
>>>>>
>>>>> On a NHM EP box
>>>>>           powersaving               balance              performance
>>>>> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103
>>>>> 58
>>>>> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>>>>
>>>>
>>>> This looks funny: so "performance" is eating less watts than
>>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>>> measurements on NHM are not correct/precise..? On SNB they look as
>>>> expected, according to your scheme.
>>>
>>>
>>> well... it's not always beneficial to group or to spread out
>>> it depends on cache behavior mostly which is best
>>
>>
>> Let me try to understand what this means: so "performance" above with
>> 8 threads means that those threads are spread out across more than one
>> socket, no?
>>
>> If so, this would mean that you have a smaller amount of tasks on each
>> socket, thus the smaller wattage.
>>
>> The "powersaving" method OTOH fills up the one socket up to the brim,
>> thus the slightly higher consumption due to all threads being occupied.
>>
>> Is that it?
>
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency, with
> some twists.
> or to reword that to be more clear
> if you waste performance due to something that becomes inefficient, you're
> wasting power as well.
> now, you might have some hardware effects that can then save you power...
> but those effects
> then first need to overcome the waste from the performance inefficiency...
> and that almost never happens.
>
> for example, if you have two workloads that each fit barely inside the last
> level cache...
> it's much more efficient to spread these over two sockets... where each has
> its own full LLC
> to use.
> If you'd group these together, both would thrash the cache all the time and
> run inefficient --> bad for power.
>
> now, on the other hand, if you have two threads of a process that share a
> bunch of data structures,
> and you'd spread these over 2 sockets, you end up bouncing data between the
> two sockets a lot,
> running inefficient --> bad for power.
>

Agree with all of the above. However..

> having said all this, if you have to tasks that don't have such cache
> effects, the most efficient way
> of running things will be on 2 hyperthreading halves... it's very hard to
> beat the power efficiency of that.

.. there are alternatives to hyperthreading. On ARM's big.LITTLE
architecture you could simply schedule them on the LITTLE cores. The
big cores just can't beat the power efficiency of the LITTLE ones even
with 'race to halt' that you allude to below. And usecases like mp3
playback simply don't require the kind of performance that the big
cores can offer.

> But this assumes the tasks don't compete with resources much on the HT
> level, and achieve good scaling.
> and this still has to compete with "race to halt", because if you're done
> quicker, you can put the memory
> in self refresh quicker.
>
> none of this stuff is easy for humans or computer programs to determine
> ahead of time... or sometimes even afterwards.
> heck, even for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler stuff
> for performance, and that
> we're going to be doing quite well on power already if we achieve that.

If Linux is to continue to work efficiently on heterogeneous
multi-processing platforms, it needs to provide scheduling mechanisms
that can be exploited as per the demands of the HW architecture. An
example is the "small task packing (and spreading)" for which Vincent
Guittot has posted a patchset[1] earlier and so has Alex now.

[1] http://lwn.net/Articles/518834/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-12  9:52             ` Amit Kucheria
@ 2012-12-12 13:55               ` Alex Shi
  2012-12-12 14:21                 ` Vincent Guittot
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-12 13:55 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Arjan van de Ven, Borislav Petkov, Alex Shi, rob, mingo, peterz,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot, Preeti U Murthy

>>>>
>>>>
>>>> well... it's not always beneficial to group or to spread out
>>>> it depends on cache behavior mostly which is best
>>>
>>>
>>> Let me try to understand what this means: so "performance" above with
>>> 8 threads means that those threads are spread out across more than one
>>> socket, no?
>>>
>>> If so, this would mean that you have a smaller amount of tasks on each
>>> socket, thus the smaller wattage.
>>>
>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>> thus the slightly higher consumption due to all threads being occupied.
>>>
>>> Is that it?
>>
>>
>> not sure.
>>
>> by and large, power efficiency is the same as performance efficiency, with
>> some twists.
>> or to reword that to be more clear
>> if you waste performance due to something that becomes inefficient, you're
>> wasting power as well.
>> now, you might have some hardware effects that can then save you power...
>> but those effects
>> then first need to overcome the waste from the performance inefficiency...
>> and that almost never happens.
>>
>> for example, if you have two workloads that each fit barely inside the last
>> level cache...
>> it's much more efficient to spread these over two sockets... where each has
>> its own full LLC
>> to use.
>> If you'd group these together, both would thrash the cache all the time and
>> run inefficient --> bad for power.
>>
>> now, on the other hand, if you have two threads of a process that share a
>> bunch of data structures,
>> and you'd spread these over 2 sockets, you end up bouncing data between the
>> two sockets a lot,
>> running inefficient --> bad for power.
>>
>
> Agree with all of the above. However..
>
>> having said all this, if you have to tasks that don't have such cache
>> effects, the most efficient way
>> of running things will be on 2 hyperthreading halves... it's very hard to
>> beat the power efficiency of that.
>
> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
> architecture you could simply schedule them on the LITTLE cores. The
> big cores just can't beat the power efficiency of the LITTLE ones even
> with 'race to halt' that you allude to below. And usecases like mp3
> playback simply don't require the kind of performance that the big
> cores can offer.
>
>> But this assumes the tasks don't compete with resources much on the HT
>> level, and achieve good scaling.
>> and this still has to compete with "race to halt", because if you're done
>> quicker, you can put the memory
>> in self refresh quicker.
>>
>> none of this stuff is easy for humans or computer programs to determine
>> ahead of time... or sometimes even afterwards.
>> heck, even for just performance it's really really hard already, never mind
>> adding power.
>>
>> my personal gut feeling is that we should just optimize this scheduler stuff
>> for performance, and that
>> we're going to be doing quite well on power already if we achieve that.
>
> If Linux is to continue to work efficiently on heterogeneous
> multi-processing platforms, it needs to provide scheduling mechanisms
> that can be exploited as per the demands of the HW architecture.

Linus definitely disagrees with such ideas. :) So we need to summarise
the logic that holds across all hardware.

> example is the "small task packing (and spreading)" for which Vincent
> Guittot has posted a patchset[1] earlier and so has Alex now.

Sure. I just think my patchset should already handle the 'small task
packing' scenario. Would you guys like to give it a try?
>
> [1] http://lwn.net/Articles/518834/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-12 13:55               ` Alex Shi
@ 2012-12-12 14:21                 ` Vincent Guittot
  2012-12-13  2:51                   ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Vincent Guittot @ 2012-12-12 14:21 UTC (permalink / raw)
  To: Alex Shi
  Cc: Amit Kucheria, Arjan van de Ven, Borislav Petkov, Alex Shi, rob,
	mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, pjt, Preeti U Murthy

On 12 December 2012 14:55, Alex Shi <lkml.alex@gmail.com> wrote:
>>>>>
>>>>>
>>>>> well... it's not always beneficial to group or to spread out
>>>>> it depends on cache behavior mostly which is best
>>>>
>>>>
>>>> Let me try to understand what this means: so "performance" above with
>>>> 8 threads means that those threads are spread out across more than one
>>>> socket, no?
>>>>
>>>> If so, this would mean that you have a smaller amount of tasks on each
>>>> socket, thus the smaller wattage.
>>>>
>>>> The "powersaving" method OTOH fills up the one socket up to the brim,
>>>> thus the slightly higher consumption due to all threads being occupied.
>>>>
>>>> Is that it?
>>>
>>>
>>> not sure.
>>>
>>> by and large, power efficiency is the same as performance efficiency, with
>>> some twists.
>>> or to reword that to be more clear
>>> if you waste performance due to something that becomes inefficient, you're
>>> wasting power as well.
>>> now, you might have some hardware effects that can then save you power...
>>> but those effects
>>> then first need to overcome the waste from the performance inefficiency...
>>> and that almost never happens.
>>>
>>> for example, if you have two workloads that each fit barely inside the last
>>> level cache...
>>> it's much more efficient to spread these over two sockets... where each has
>>> its own full LLC
>>> to use.
>>> If you'd group these together, both would thrash the cache all the time and
>>> run inefficient --> bad for power.
>>>
>>> now, on the other hand, if you have two threads of a process that share a
>>> bunch of data structures,
>>> and you'd spread these over 2 sockets, you end up bouncing data between the
>>> two sockets a lot,
>>> running inefficient --> bad for power.
>>>
>>
>> Agree with all of the above. However..
>>
>>> having said all this, if you have to tasks that don't have such cache
>>> effects, the most efficient way
>>> of running things will be on 2 hyperthreading halves... it's very hard to
>>> beat the power efficiency of that.
>>
>> .. there are alternatives to hyperthreading. On ARM's big.LITTLE
>> architecture you could simply schedule them on the LITTLE cores. The
>> big cores just can't beat the power efficiency of the LITTLE ones even
>> with 'race to halt' that you allude to below. And usecases like mp3
>> playback simply don't require the kind of performance that the big
>> cores can offer.
>>
>>> But this assumes the tasks don't compete with resources much on the HT
>>> level, and achieve good scaling.
>>> and this still has to compete with "race to halt", because if you're done
>>> quicker, you can put the memory
>>> in self refresh quicker.
>>>
>>> none of this stuff is easy for humans or computer programs to determine
>>> ahead of time... or sometimes even afterwards.
>>> heck, even for just performance it's really really hard already, never mind
>>> adding power.
>>>
>>> my personal gut feeling is that we should just optimize this scheduler stuff
>>> for performance, and that
>>> we're going to be doing quite well on power already if we achieve that.
>>
>> If Linux is to continue to work efficiently on heterogeneous
>> multi-processing platforms, it needs to provide scheduling mechanisms
>> that can be exploited as per the demands of the HW architecture.
>
> Linus definitely disagree such ideas. :) So, need to summaries the
> logical beyond all hardware.
>
>> example is the "small task packing (and spreading)" for which Vincent
>> Guittot has posted a patchset[1] earlier and so has Alex now.
>
> Sure. I just thought my patchset should handled the 'small task
> packing' scenario. Could you guy like to have a try?

Hi Alex,

Yes, I will give your patchset a try when I have some spare time.

Vincent

>>
>> [1] http://lwn.net/Articles/518834/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-11 16:40           ` Arjan van de Ven
  2012-12-12  9:52             ` Amit Kucheria
@ 2012-12-12 14:41             ` Borislav Petkov
  2012-12-13  3:07               ` Alex Shi
  1 sibling, 1 reply; 56+ messages in thread
From: Borislav Petkov @ 2012-12-12 14:41 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Alex Shi, Alex Shi, rob, mingo, peterz, gregkh, andre.przywara,
	rjw, paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot, Preeti U Murthy

On Tue, Dec 11, 2012 at 08:40:40AM -0800, Arjan van de Ven wrote:

> >Let me try to understand what this means: so "performance" above with
> >8 threads means that those threads are spread out across more than one
> >socket, no?
> >
> >If so, this would mean that you have a smaller amount of tasks on each
> >socket, thus the smaller wattage.
> >
> >The "powersaving" method OTOH fills up the one socket up to the brim,
> >thus the slightly higher consumption due to all threads being occupied.
> >
> >Is that it?
>
> not sure.
>
> by and large, power efficiency is the same as performance efficiency,
> with some twists. or to reword that to be more clear if you waste
> performance due to something that becomes inefficient, you're wasting
> power as well. now, you might have some hardware effects that can
> then save you power... but those effects then first need to overcome
> the waste from the performance inefficiency... and that almost never
> happens.
>
> for example, if you have two workloads that each fit barely inside
> the last level cache... it's much more efficient to spread these over
> two sockets... where each has its own full LLC to use. If you'd group
> these together, both would thrash the cache all the time and run
> inefficient --> bad for power.

Hmm, are you saying that powering up the second socket so that the
working set fully fits in the LLC still uses less power than going out
to memory and bringing those lines back in?

I'd say there's a breakeven point depending on the workload duration, no?

Which means that we need to be able to look into the future in order to
know what to do... ;-/

> now, on the other hand, if you have two threads of a process that
> share a bunch of data structures, and you'd spread these over 2
> sockets, you end up bouncing data between the two sockets a lot,
> running inefficient --> bad for power.

Yeah, that should be addressed by the NUMA patches people are working on
right now.

> having said all this, if you have to tasks that don't have such
> cache effects, the most efficient way of running things will be on 2
> hyperthreading halves... it's very hard to beat the power efficiency
> of that. But this assumes the tasks don't compete with resources much
> on the HT level, and achieve good scaling. and this still has to
> compete with "race to halt", because if you're done quicker, you can
> put the memory in self refresh quicker.

Right, how are we addressing the breakeven in that case? AFAIK, we
do schedule them now on two different cores (not HT threads, i.e. no
resource sharing besides L2) so that we get done faster, i.e. race to
idle in the performance case. And in the powersavings' case we leave
them as tightly packed as possible.

> none of this stuff is easy for humans or computer programs to
> determine ahead of time... or sometimes even afterwards. heck, even
> for just performance it's really really hard already, never mind
> adding power.
>
> my personal gut feeling is that we should just optimize this scheduler
> stuff for performance, and that we're going to be doing quite well on
> power already if we achieve that.

Probably. I wonder if there is a way to measure power consumption of
different workloads in perf and then run those with different scheduling
policies.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-12 14:21                 ` Vincent Guittot
@ 2012-12-13  2:51                   ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-13  2:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Alex Shi, Amit Kucheria, Arjan van de Ven, Borislav Petkov, rob,
	mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker, akpm,
	paulmck, linux-kernel, pjt, Preeti U Murthy

On 12/12/2012 10:21 PM, Vincent Guittot wrote:
>>> >> If Linux is to continue to work efficiently on heterogeneous
>>> >> multi-processing platforms, it needs to provide scheduling mechanisms
>>> >> that can be exploited as per the demands of the HW architecture.
>> >
>> > Linus definitely disagrees with such ideas. :) So we need to summarise
>> > the logic that holds across all hardware.
>> >
>>> >> example is the "small task packing (and spreading)" for which Vincent
>>> >> Guittot has posted a patchset[1] earlier and so has Alex now.
>> >
>> > Sure. I just think my patchset should already handle the 'small task
>> > packing' scenario. Would you guys like to give it a try?
> Hi Alex,
> 
> Yes, I will do a try with your patchset when i will have some spare time

Thanks Vincent! The balance and powersaving policies should both show an effect there.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-12 14:41             ` Borislav Petkov
@ 2012-12-13  3:07               ` Alex Shi
  2012-12-13 11:35                 ` Borislav Petkov
  0 siblings, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-13  3:07 UTC (permalink / raw)
  To: Borislav Petkov, Arjan van de Ven, Alex Shi, rob, mingo, peterz,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot, Preeti U Murthy


>> now, on the other hand, if you have two threads of a process that
>> share a bunch of data structures, and you'd spread these over 2
>> sockets, you end up bouncing data between the two sockets a lot,
>> running inefficient --> bad for power.
> 
> Yeah, that should be addressed by the NUMA patches people are working on
> right now.


Yes, as to the balance/powersaving policies, we can pack tasks tightly
first, and then NUMA balancing will make the memory follow us.

BTW, NUMA balancing is more about pages in memory, not the LLC.
> 
>> having said all this, if you have to tasks that don't have such
>> cache effects, the most efficient way of running things will be on 2
>> hyperthreading halves... it's very hard to beat the power efficiency
>> of that. But this assumes the tasks don't compete with resources much
>> on the HT level, and achieve good scaling. and this still has to
>> compete with "race to halt", because if you're done quicker, you can
>> put the memory in self refresh quicker.
> 
> Right, how are we addressing the breakeven in that case? AFAIK, we
> do schedule them now on two different cores (not HT threads, i.e. no
> resource sharing besides L2) so that we get done faster, i.e. race to

that's what the balance policy is for. :)
> idle in the performance case. And in the powersavings' case we leave
> them as tightly packed as possible.
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-12  3:57   ` Preeti U Murthy
  2012-12-12  5:52     ` Alex Shi
@ 2012-12-13  8:45     ` Alex Shi
  2012-12-21  4:35       ` Namhyung Kim
  1 sibling, 1 reply; 56+ messages in thread
From: Alex Shi @ 2012-12-13  8:45 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On 12/12/2012 11:57 AM, Preeti U Murthy wrote:
> Hi Alex,
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> They are the base values in load balance, update them with rq runnable
>> load average, then the load balance will consider runnable load avg
>> naturally.
>>

updated with UP config fix:


==========
>From d271c93b40411660dd0e54d99946367c87002cc8 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Sat, 17 Nov 2012 13:56:11 +0800
Subject: [PATCH 07/18] sched: compute runnable load avg in cpu_load and
 cpu_avg_load_per_task

They are the base values in load balance, update them with rq runnable
load average, then the load balance will consider runnable load avg
naturally.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 8 ++++++--
 kernel/sched/fair.c | 4 ++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96fa5f1..d306a84 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-	unsigned long load = this_rq->load.weight;
+	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
 	unsigned long pending_updates;
 
 	/*
@@ -2537,8 +2537,12 @@ static void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-	__update_cpu_load(this_rq, this_rq->load.weight, 1);
 
+#ifdef CONFIG_SMP
+	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
+#else
+	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+#endif
 	calc_load_account_active(this_rq);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c8d24..9ca917c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2680,7 +2680,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
 /*
@@ -2727,7 +2727,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return (unsigned long)rq->cfs.runnable_load_avg / nr_running;
 
 	return 0;
 }
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-13  3:07               ` Alex Shi
@ 2012-12-13 11:35                 ` Borislav Petkov
  2012-12-14  1:56                   ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Borislav Petkov @ 2012-12-13 11:35 UTC (permalink / raw)
  To: Alex Shi
  Cc: Arjan van de Ven, Alex Shi, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot, Preeti U Murthy

On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
> >> now, on the other hand, if you have two threads of a process that
> >> share a bunch of data structures, and you'd spread these over 2
> >> sockets, you end up bouncing data between the two sockets a lot,
> >> running inefficient --> bad for power.
> >
> > Yeah, that should be addressed by the NUMA patches people are
> > working on right now.
>
> Yes, as to the balance/powersaving policies, we can pack tasks tightly
> first, and then NUMA balancing will make the memory follow us.
>
> BTW, NUMA balancing is more about pages in memory, not the LLC.

Sure, let's look at the worst and best cases:

* worst case: you have memory shared by multiple threads on one node
*and* working set doesn't fit in LLC.

Here, if you pack threads tightly only on one node, you still suffer the
working set kicking out parts of itself out of LLC.

If you spread threads around, you still cannot avoid the LLC thrashing
because the LLC of the node containing the shared memory needs to cache
all those transactions. *In* *addition*, you get the cross-node traffic
because the shared pages are on the first node.

Major suckage.

Does it matter? I don't know. It can be decided on a case-by-case basis.
If people care about singlethread perf, they would likely want to spread
around and buy in the cross-node traffic.

If they care for power, then maybe they don't want to turn on the second
socket yet.

* the optimal case is where memory follows threads and gets spread
around such that LLC doesn't get thrashed and cross-node traffic gets
avoided.

Now, you can think of all those other scenarios in between :-/

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling
  2012-12-13 11:35                 ` Borislav Petkov
@ 2012-12-14  1:56                   ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-14  1:56 UTC (permalink / raw)
  To: Borislav Petkov, Arjan van de Ven, Alex Shi, rob, mingo, peterz,
	gregkh, andre.przywara, rjw, paul.gortmaker, akpm, paulmck,
	linux-kernel, pjt, vincent.guittot, Preeti U Murthy

On 12/13/2012 07:35 PM, Borislav Petkov wrote:
> On Thu, Dec 13, 2012 at 11:07:43AM +0800, Alex Shi wrote:
>>>> now, on the other hand, if you have two threads of a process that
>>>> share a bunch of data structures, and you'd spread these over 2
>>>> sockets, you end up bouncing data between the two sockets a lot,
>>>> running inefficient --> bad for power.
>>>
>>> Yeah, that should be addressed by the NUMA patches people are
>>> working on right now.
>>
>> Yes, as to the balance/powersaving policies, we can pack tasks tightly
>> first, and then NUMA balancing will make the memory follow us.
>>
>> BTW, NUMA balancing is more about pages in memory, not the LLC.
> 
> Sure, let's look at the worst and best cases:
> 
> * worst case: you have memory shared by multiple threads on one node
> *and* working set doesn't fit in LLC.
> 
> Here, if you pack threads tightly only on one node, you still suffer the
> working set kicking out parts of itself out of LLC.
> 
> If you spread threads around, you still cannot avoid the LLC thrashing
> because the LLC of the node containing the shared memory needs to cache
> all those transactions. *In* *addition*, you get the cross-node traffic
> because the shared pages are on the first node.
> 
> Major suckage.
> 
> Does it matter? I don't know. It can be decided on a case-by-case basis.
> If people care about singlethread perf, they would likely want to spread
> around and buy in the cross-node traffic.
> 
> If they care for power, then maybe they don't want to turn on the second
> socket yet.
> 
> * the optimal case is where memory follows threads and gets spread
> around such that LLC doesn't get thrashed and cross-node traffic gets
> avoided.
> 
> Now, you can think of all those other scenarios in between :-/

You are right. Thanks for the explanation! :)

Actually, what I meant to say is that NUMA balancing targets pages on
different nodes' memory, though of course it may also improve LLC performance.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-11  6:30       ` Preeti U Murthy
  2012-12-11 11:53         ` Alex Shi
@ 2012-12-21  4:28         ` Namhyung Kim
  2012-12-23 12:17           ` Alex Shi
  1 sibling, 1 reply; 56+ messages in thread
From: Namhyung Kim @ 2012-12-21  4:28 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, rob, mingo, peterz, gregkh, andre.przywara, rjw,
	paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot

Hi,

On Tue, 11 Dec 2012 12:00:55 +0530, Preeti U. Murthy wrote:
> On 12/11/2012 10:58 AM, Alex Shi wrote:
>> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>>> Hi Alex,
>>>
>>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>>> It is impossible to miss a task allowed cpu in a eligible group.
>>>
>>> The one thing I am concerned with here is if there is a possibility of
>>> the task changing its tsk_cpus_allowed() while this code is running.
>>>
>>> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
>>> for the task changes,perhaps by the user himself,which might not include
>>> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
>>> a race condition in short.Then we might not have an eligible cpu in that
>>> group right?
>> 
>> your worry make sense, but the code handle the situation, in
>> select_task_rq(), it will check the cpu allowed again. if the answer is
>> no, it will fallback to old cpu.
>>>
>>>> And since find_idlest_group only return a different group which
>>>> excludes old cpu, it's also imporissible to find a new cpu same as old
>>>> cpu.
>
> I doubt this will work correctly.Consider the following situation:sched
> domain begins with sd that encloses both socket1 and socket2
>
> cpu0 cpu1  | cpu2 cpu3
> -----------|-------------
>  socket1   |  socket2
>
> old cpu = cpu1
>
> Iteration1:
> 1.find_idlest_group() returns socket2 to be idlest.
> 2.task changes tsk_allowed_cpus to 0,1
> 3.find_idlest_cpu() returns cpu2

AFAIK, tsk->cpus_allowed cannot be changed during this operation since
it's protected by tsk->pi_lock. I can see the following comment:

kernel/sched/core.c:
/*
 * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
 */
static inline
int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
{
	int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);



Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 06/18] sched: set initial load avg of new forked task as its load weight
  2012-12-10  8:22 ` [PATCH 06/18] sched: set initial load avg of new forked task as its load weight Alex Shi
@ 2012-12-21  4:33   ` Namhyung Kim
  2012-12-23 12:00     ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Namhyung Kim @ 2012-12-21  4:33 UTC (permalink / raw)
  To: Alex Shi
  Cc: rob, mingo, peterz, gregkh, andre.przywara, rjw, paul.gortmaker,
	akpm, paulmck, linux-kernel, pjt, vincent.guittot

On Mon, 10 Dec 2012 16:22:22 +0800, Alex Shi wrote:
> New task has no runnable sum at its first runnable time, that make
> burst forking just select few idle cpus to put tasks.
> Set initial load avg of new forked task as its load weight to resolve
> this issue.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  include/linux/sched.h |    1 +
>  kernel/sched/core.c   |    2 +-
>  kernel/sched/fair.c   |   13 +++++++++++--
>  3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5dafac3..093f9cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1058,6 +1058,7 @@ struct sched_domain;
>  #else
>  #define ENQUEUE_WAKING		0
>  #endif
> +#define ENQUEUE_NEWTASK		8
>  
>  #define DEQUEUE_SLEEP		1
>  
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e6533e1..96fa5f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1648,7 +1648,7 @@ void wake_up_new_task(struct task_struct *p)
>  #endif
>  
>  	rq = __task_rq_lock(p);
> -	activate_task(rq, p, 0);
> +	activate_task(rq, p, ENQUEUE_NEWTASK);
>  	p->on_rq = 1;
>  	trace_sched_wakeup_new(p, true);
>  	check_preempt_curr(rq, p, WF_FORK);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1faf89f..61c8d24 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1277,8 +1277,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  /* Add the load generated by se into cfs_rq's child load-average */
>  static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  						  struct sched_entity *se,
> -						  int wakeup)
> +						  int flags)
>  {
> +	int wakeup = flags & ENQUEUE_WAKEUP;
>  	/*
>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>  	 * migration we use a negative decay count to track the remote decays
> @@ -1312,6 +1313,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  		update_entity_load_avg(se, 0);
>  	}
>  
> +	/*
> +	 * set the initial load avg of new task same as its load
> +	 * in order to avoid brust fork make few cpu too heavier
> +	 */
> +	if (flags & ENQUEUE_NEWTASK)
> +		se->avg.load_avg_contrib = se->load.weight;
>  	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>  	/* we force update consideration on load-balancer moves */
>  	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
> @@ -1476,7 +1483,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 */
>  	update_curr(cfs_rq);
>  	account_entity_enqueue(cfs_rq, se);
> -	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
> +	enqueue_entity_load_avg(cfs_rq, se, flags &
> +				(ENQUEUE_WAKEUP | ENQUEUE_NEWTASK));

It seems that just passing 'flags' is enough.

>  
>  	if (flags & ENQUEUE_WAKEUP) {
>  		place_entity(cfs_rq, se, 0);
> @@ -2586,6 +2594,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		cfs_rq->h_nr_running++;
>  
>  		flags = ENQUEUE_WAKEUP;
> +		flags &= ~ENQUEUE_NEWTASK;

Why is this needed?

Thanks,
Namhyung


>  	}
>  
>  	for_each_sched_entity(se) {

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-13  8:45     ` Alex Shi
@ 2012-12-21  4:35       ` Namhyung Kim
  2012-12-23 11:42         ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Namhyung Kim @ 2012-12-21  4:35 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, rob, mingo, peterz, gregkh, andre.przywara, rjw,
	paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot

On Thu, 13 Dec 2012 16:45:44 +0800, Alex Shi wrote:
> On 12/12/2012 11:57 AM, Preeti U Murthy wrote:
>> Hi Alex,
>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>> They are the base values in load balance, update them with rq runnable
>>> load average, then the load balance will consider runnable load avg
>>> naturally.
>>>
>
> updated with UP config fix:
>
>
> ==========
>> From d271c93b40411660dd0e54d99946367c87002cc8 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Sat, 17 Nov 2012 13:56:11 +0800
> Subject: [PATCH 07/18] sched: compute runnable load avg in cpu_load and
>  cpu_avg_load_per_task
>
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 8 ++++++--
>  kernel/sched/fair.c | 4 ++--
>  2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96fa5f1..d306a84 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  void update_idle_cpu_load(struct rq *this_rq)
>  {
>  	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> -	unsigned long load = this_rq->load.weight;
> +	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;

So shouldn't this line be guarded with CONFIG_SMP too?
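I.e. something along these lines (untested, just the declaration), so
that !SMP builds keep using load.weight:

#ifdef CONFIG_SMP
	/* cfs.runnable_load_avg is only maintained on SMP builds */
	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
#else
	unsigned long load = this_rq->load.weight;
#endif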

Thanks,
Namhyung


>  	unsigned long pending_updates;
>  
>  	/*
> @@ -2537,8 +2537,12 @@ static void update_cpu_load_active(struct rq *this_rq)
>  	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>  	 */
>  	this_rq->last_load_update_tick = jiffies;
> -	__update_cpu_load(this_rq, this_rq->load.weight, 1);
>  
> +#ifdef CONFIG_SMP
> +	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
> +#else
> +	__update_cpu_load(this_rq, this_rq->load.weight, 1);
> +#endif
>  	calc_load_account_active(this_rq);
>  }

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] sched: consider runnable load average in move_tasks
  2012-12-12  6:26     ` Alex Shi
@ 2012-12-21  4:43       ` Namhyung Kim
  2012-12-23 12:29         ` Alex Shi
  0 siblings, 1 reply; 56+ messages in thread
From: Namhyung Kim @ 2012-12-21  4:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, rob, mingo, peterz, gregkh, andre.przywara, rjw,
	paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot

On Wed, 12 Dec 2012 14:26:44 +0800, Alex Shi wrote:
> On 12/12/2012 12:41 PM, Preeti U Murthy wrote:
>> Also, why can't p->se.load_avg_contrib be used directly as the return
>> value of task_h_load_avg(), since it is already updated in
>> update_task_entity_contrib() and update_group_entity_contrib()?
>
> No, only non-task entities go through update_group_entity_contrib(),
> not task entities.

???

But task entity goes to __update_task_entity_contrib()?


/* Compute the current contribution to load_avg by se, return any delta */
static long __update_entity_load_avg_contrib(struct sched_entity *se)
{
	long old_contrib = se->avg.load_avg_contrib;

	if (entity_is_task(se)) {
		__update_task_entity_contrib(se);
	} else {
		__update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
		__update_group_entity_contrib(se);
	}

	return se->avg.load_avg_contrib - old_contrib;
}


Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2012-12-21  4:35       ` Namhyung Kim
@ 2012-12-23 11:42         ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-23 11:42 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alex Shi, Preeti U Murthy, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot

>> @@ -2487,7 +2487,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>>  void update_idle_cpu_load(struct rq *this_rq)
>>  {
>>       unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
>> -     unsigned long load = this_rq->load.weight;
>> +     unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
>
> So shouldn't this line be guarded with CONFIG_SMP too?

Thanks for the reminder. Yes, I already found this problem and plan to
resend the patch based on the latest tree.

>
> Thanks,

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 06/18] sched: set initial load avg of new forked task as its load weight
  2012-12-21  4:33   ` Namhyung Kim
@ 2012-12-23 12:00     ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-23 12:00 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alex Shi, rob, mingo, peterz, gregkh, andre.przywara, rjw,
	paul.gortmaker, akpm, paulmck, linux-kernel, pjt,
	vincent.guittot

>>       update_curr(cfs_rq);
>>       account_entity_enqueue(cfs_rq, se);
>> -     enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
>> +     enqueue_entity_load_avg(cfs_rq, se, flags &
>> +                             (ENQUEUE_WAKEUP | ENQUEUE_NEWTASK));
>
> It seems that just passing 'flags' is enough.

Uh, yes, that's true. Will remove this.
>
>>
>>       if (flags & ENQUEUE_WAKEUP) {
>>               place_entity(cfs_rq, se, 0);
>> @@ -2586,6 +2594,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>               cfs_rq->h_nr_running++;
>>
>>               flags = ENQUEUE_WAKEUP;
>> +             flags &= ~ENQUEUE_NEWTASK;
>
> Why is this needed?

Uh, not needed; will remove this too. Thanks a lot!

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] sched: select_task_rq_fair clean up
  2012-12-21  4:28         ` Namhyung Kim
@ 2012-12-23 12:17           ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-23 12:17 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Preeti U Murthy, Alex Shi, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot

On Fri, Dec 21, 2012 at 12:28 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> Hi,
>
> On Tue, 11 Dec 2012 12:00:55 +0530, Preeti U. Murthy wrote:
>> On 12/11/2012 10:58 AM, Alex Shi wrote:
>>> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>>>> Hi Alex,
>>>>
>>>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>>>> It is impossible to miss a task allowed cpu in a eligible group.
>>>>
>>>> The one thing I am concerned with here is if there is a possibility of
>>>> the task changing its tsk_cpus_allowed() while this code is running.
>>>>
>>>> i.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed()
>>>> for the task changes, perhaps by the user himself, which might not include
>>>> the cpus in the idle group. After this, find_idlest_cpu() is called. I mean
>>>> a race condition, in short. Then we might not have an eligible cpu in that
>>>> group, right?
>>>
>>> Your worry makes sense, but the code handles this situation: in
>>> select_task_rq(), it will check the allowed cpus again. If the answer is
>>> no, it will fall back to the old cpu.
>>>>
>>>>> And since find_idlest_group only return a different group which
>>>>> excludes old cpu, it's also imporissible to find a new cpu same as old
>>>>> cpu.
>>
>> I doubt this will work correctly. Consider the following situation: the sched
>> domain begins with an sd that encloses both socket1 and socket2:
>>
>> cpu0 cpu1  | cpu2 cpu3
>> -----------|-------------
>>  socket1   |  socket2
>>
>> old cpu = cpu1
>>
>> Iteration 1:
>> 1. find_idlest_group() returns socket2 as the idlest.
>> 2. the task changes tsk_cpus_allowed() to {0,1}
>> 3. find_idlest_cpu() returns cpu2
>
> AFAIK The tsk->cpus_allowed cannot be changed during the operation since
> it's protected by tsk->pi_lock.  I can see the following comment:

You are right. I misunderstood some comments in wake_up_new_task().
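
For reference, the fallback check mentioned earlier in this thread looks
roughly like the sketch below (paraphrased from memory, not a verbatim copy
of the tree this series is based on):

	cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

	/* the chosen cpu may be disallowed or offline; if so, fall back
	 * starting from the task's previous cpu */
	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
		     !cpu_online(cpu)))
		cpu = select_fallback_rq(task_cpu(p), p);

Together with p->pi_lock being held across the selection (which is what keeps
->cpus_allowed stable, as you point out), that should cover the scenario
Preeti described.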


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] sched: consider runnable load average in move_tasks
  2012-12-21  4:43       ` Namhyung Kim
@ 2012-12-23 12:29         ` Alex Shi
  0 siblings, 0 replies; 56+ messages in thread
From: Alex Shi @ 2012-12-23 12:29 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Alex Shi, Preeti U Murthy, rob, mingo, peterz, gregkh,
	andre.przywara, rjw, paul.gortmaker, akpm, paulmck, linux-kernel,
	pjt, vincent.guittot

On Fri, Dec 21, 2012 at 12:43 PM, Namhyung Kim <namhyung@kernel.org> wrote:
> On Wed, 12 Dec 2012 14:26:44 +0800, Alex Shi wrote:
>> On 12/12/2012 12:41 PM, Preeti U Murthy wrote:
>>> Also why can't p->se.load_avg_contrib be used directly? as a return
>>> value for task_h_load_avg? since this is already updated in
>>> update_task_entity_contrib and update_group_entity_contrib.
>>
>> No, only a non-task entity goes to update_group_entity_contrib(), not a task
>> entity.
>
> ???
>
> But task entity goes to __update_task_entity_contrib()?

Yes, but if the task is in a task_group, its load weight needs to be
reweighted by its task_group's fraction. We cannot use the raw load weight
in the balancing calculation, so move_tasks() uses task_h_load().
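
To make the reweighting concrete, the hierarchical load of a task boils down
to something like the sketch below (the helper name task_h_load_avg_sketch()
is invented for illustration; the actual patch may compute h_load
differently):

static unsigned long task_h_load_avg_sketch(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(p);

	/*
	 * Scale the task's tracked contribution by the share of the
	 * hierarchy's load that its cfs_rq represents, so tasks sitting
	 * in different task_groups stay comparable in move_tasks().
	 * The +1 avoids a division by zero on an idle cfs_rq.
	 */
	return div64_u64((u64)p->se.avg.load_avg_contrib * cfs_rq->h_load,
			 cfs_rq->runnable_load_avg + 1);
}

For a task attached to the root cfs_rq the fraction is roughly 1, so this
degenerates to the plain load_avg_contrib; the reweighting only really
matters with FAIR_GROUP_SCHED.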
>
>
> /* Compute the current contribution to load_avg by se, return any delta */
> static long __update_entity_load_avg_contrib(struct sched_entity *se)
> {
>         long old_contrib = se->avg.load_avg_contrib;
>
>         if (entity_is_task(se)) {
>                 __update_task_entity_contrib(se);
>         } else {
>                 __update_tg_runnable_avg(&se->avg, group_cfs_rq(se));
>                 __update_group_entity_contrib(se);
>         }
>
>         return se->avg.load_avg_contrib - old_contrib;
> }
>
>
> Thanks,
> Namhyung
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2012-12-23 12:29 UTC | newest]

Thread overview: 56+ messages
2012-12-10  8:22 [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
2012-12-10  8:22 ` [PATCH 01/18] sched: select_task_rq_fair clean up Alex Shi
2012-12-11  4:23   ` Preeti U Murthy
2012-12-11  5:28     ` Alex Shi
2012-12-11  6:30       ` Preeti U Murthy
2012-12-11 11:53         ` Alex Shi
2012-12-12  5:26           ` Preeti U Murthy
2012-12-21  4:28         ` Namhyung Kim
2012-12-23 12:17           ` Alex Shi
2012-12-10  8:22 ` [PATCH 02/18] sched: fix find_idlest_group mess logical Alex Shi
2012-12-11  5:08   ` Preeti U Murthy
2012-12-11  5:29     ` Alex Shi
2012-12-11  5:50       ` Preeti U Murthy
2012-12-11 11:55         ` Alex Shi
2012-12-10  8:22 ` [PATCH 03/18] sched: don't need go to smaller sched domain Alex Shi
2012-12-10  8:22 ` [PATCH 04/18] sched: remove domain iterations in fork/exec/wake Alex Shi
2012-12-10  8:22 ` [PATCH 05/18] sched: load tracking bug fix Alex Shi
2012-12-10  8:22 ` [PATCH 06/18] sched: set initial load avg of new forked task as its load weight Alex Shi
2012-12-21  4:33   ` Namhyung Kim
2012-12-23 12:00     ` Alex Shi
2012-12-10  8:22 ` [PATCH 07/18] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
2012-12-12  3:57   ` Preeti U Murthy
2012-12-12  5:52     ` Alex Shi
2012-12-13  8:45     ` Alex Shi
2012-12-21  4:35       ` Namhyung Kim
2012-12-23 11:42         ` Alex Shi
2012-12-10  8:22 ` [PATCH 08/18] sched: consider runnable load average in move_tasks Alex Shi
2012-12-12  4:41   ` Preeti U Murthy
2012-12-12  6:26     ` Alex Shi
2012-12-21  4:43       ` Namhyung Kim
2012-12-23 12:29         ` Alex Shi
2012-12-10  8:22 ` [PATCH 09/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2012-12-10  8:22 ` [PATCH 10/18] sched: add sched_policy in kernel Alex Shi
2012-12-10  8:22 ` [PATCH 11/18] sched: add sched_policy and it's sysfs interface Alex Shi
2012-12-10  8:22 ` [PATCH 12/18] sched: log the cpu utilization at rq Alex Shi
2012-12-10  8:22 ` [PATCH 13/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
2012-12-10  8:22 ` [PATCH 14/18] sched: add power/performance balance allowed flag Alex Shi
2012-12-10  8:22 ` [PATCH 15/18] sched: don't care if the local group has capacity Alex Shi
2012-12-10  8:22 ` [PATCH 16/18] sched: pull all tasks from source group Alex Shi
2012-12-10  8:22 ` [PATCH 17/18] sched: power aware load balance, Alex Shi
2012-12-10  8:22 ` [PATCH 18/18] sched: lazy powersaving balance Alex Shi
2012-12-11  0:51 ` [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
2012-12-11 12:10   ` Alex Shi
2012-12-11 15:48     ` Borislav Petkov
2012-12-11 16:03       ` Arjan van de Ven
2012-12-11 16:13         ` Borislav Petkov
2012-12-11 16:40           ` Arjan van de Ven
2012-12-12  9:52             ` Amit Kucheria
2012-12-12 13:55               ` Alex Shi
2012-12-12 14:21                 ` Vincent Guittot
2012-12-13  2:51                   ` Alex Shi
2012-12-12 14:41             ` Borislav Petkov
2012-12-13  3:07               ` Alex Shi
2012-12-13 11:35                 ` Borislav Petkov
2012-12-14  1:56                   ` Alex Shi
2012-12-12  1:14           ` Alex Shi
