* [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling
@ 2013-01-05  8:37 Alex Shi
  2013-01-05  8:37 ` [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
                   ` (22 more replies)
  0 siblings, 23 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

The patch set is based on Linus' tree and includes 3 parts.
1, bug fixes and fork/wake balancing clean up, patches 1~6.
The first patch removes one domain level. Patches 2~6 simplify fork/wake
balancing; they increase hackbench performance by 10+% on our 4-socket
SNB EP machine.

V3 change:
a, added the first patch to remove one domain level on the x86 platform.
b, some small changes according to Namhyung Kim's comments, thanks!

2, bug fixes for load average and its use in load balancing, patches 7~12.
These use the load average in load balancing, together with a fix for
the initial runnable load value.

V3 change:
a, use rq->cfs.runnable_load_avg as the cpu load, not
rq->avg.load_avg_contrib, since the latter needs much time to accumulate
for a newly forked task.
b, a build issue fixed thanks to Namhyung Kim's reminder.

3, power aware scheduling, patches 13~22.
This subset implements my previous power aware scheduling proposal:
https://lkml.org/lkml/2012/8/13/139
It defines 2 new power aware policies, balance and powersaving, and then
tries to spread or pack tasks at each sched group level according to the
chosen scheduler policy. That can save much power when the number of
tasks in the system is no more than the number of LCPUs.

V3 change:
a, take nr_running into account when considering the max potential
utilization in periodic power balancing.
b, try to exec/wake small tasks on a running cpu instead of an idle cpu.

Thanks for the comments on the previous version. Any more comments are appreciated!

-- Thanks Alex



* [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 02/22] sched: select_task_rq_fair clean up Alex Shi
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
by the original commit b5d978e0c7e79a, and was removed carelessly when
the obsolete power scheduler was cleaned up. Commit 6956dc568 then
restored the flag on the CPU domain only. That works, but it introduces
an extra domain level, since it makes the MC and CPU domains differ.

So, restore the flag on the MC domain too, to remove a domain level on
the x86 platform.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/topology.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_PREFER_SIBLING			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
-- 
1.7.12



* [PATCH v3 02/22] sched: select_task_rq_fair clean up
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  2013-01-05  8:37 ` [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-11  4:57   ` Preeti U Murthy
  2013-01-05  8:37 ` [PATCH v3 03/22] sched: fix find_idlest_group mess logical Alex Shi
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

It is impossible to miss a task-allowed cpu in an eligible group.

And since find_idlest_group only returns a group different from the
local one, which excludes the old cpu, it is also impossible for the new
cpu to be the same as the old cpu.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..6d3a95d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
 
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
-- 
1.7.12



* [PATCH v3 03/22] sched: fix find_idlest_group mess logical
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
  2013-01-05  8:37 ` [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
  2013-01-05  8:37 ` [PATCH v3 02/22] sched: select_task_rq_fair clean up Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-11  4:59   ` Preeti U Murthy
  2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

There are 4 situations in the function:
1, no group allows the task;
	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
2, only the local group allows the task;
	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
3, only non-local groups allow the task;
	so min_load assigned, this_load = 0, idlest != NULL
4, the local group plus another group allow the task;
	so min_load assigned, this_load assigned, idlest != NULL

The current logic returns NULL in the first 3 scenarios, and still
returns NULL in the 4th situation if the idlest group is heavier than
the local group.

Actually, I think the groups in situations 2 and 3 are also eligible to
host the task. And in the 4th situation it is fine to keep the bias
toward the local group. Hence this patch.
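
To make the intended rule concrete, here is a minimal standalone
userspace sketch of the selection (group loads and imbalance_pct are
made up; this is not the kernel code):

#include <stdio.h>

#define NGROUPS 3

int main(void)
{
	/* avg_load per group; index 0 plays the role of the local group */
	unsigned long load[NGROUPS] = { 900, 1000, 850 };
	unsigned long imbalance = 100 + (125 - 100) / 2; /* imbalance_pct = 125 */
	unsigned long min_load = -1UL, this_load = load[0];
	int idlest = -1, i;

	/* pick the overall least loaded group, local group included */
	for (i = 0; i < NGROUPS; i++) {
		if (load[i] < min_load) {
			min_load = load[i];
			idlest = i;
		}
	}

	/* keep the bias toward the local group unless another group is
	 * clearly idler than it (the same 100*this_load vs
	 * imbalance*min_load test as in the patch) */
	if (idlest != 0 && 100 * this_load < imbalance * min_load)
		idlest = 0;

	printf("chosen group: %d\n", idlest);	/* prints 0 here */
	return 0;
}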

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3a95d..3c7b09a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;
+
 	return idlest;
 }
 
-- 
1.7.12



* [PATCH v3 04/22] sched: don't need go to smaller sched domain
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (2 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 03/22] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-09 17:38   ` Morten Rasmussen
  2013-01-11  5:02   ` Preeti U Murthy
  2013-01-05  8:37 ` [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake Alex Shi
                   ` (18 subsequent siblings)
  22 siblings, 2 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

If the parent sched domain has no cpu that allows the task, none will be
found in its child either. So bail out to avoid useless checking.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c7b09a..ecfbf8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			load_idx = sd->wake_idx;
 
 		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
+		if (!group)
+			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
 
-- 
1.7.12



* [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (3 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-09 18:21   ` Morten Rasmussen
  2013-01-05  8:37 ` [PATCH v3 06/22] sched: load tracking bug fix Alex Shi
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

I guess the bottom-up cpu search in the domain tree comes from
commit 3dbd5342074a1e "sched: multilevel sbe sbf"; the purpose was to
balance tasks over domains at all levels.

This balancing costs a lot if there are many domains/groups in a large
system. And forcing tasks to spread among different domains may cause
performance issues due to bad locality.

If we remove this code, we get quicker fork/exec/wake plus better
balancing across the whole system, which also reduces migrations in
future load balancing.

This patch increases hackbench performance by 10+% on my 4-socket
NHM and SNB machines.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}
 
-	while (sd) {
+	if (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
 	}
 unlock:
 	rcu_read_unlock();
-- 
1.7.12



* [PATCH v3 06/22] sched: load tracking bug fix
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (4 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 07/22] sched: set initial load avg of new forked task Alex Shi
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise, random values in these variables cause a mess when the new
task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..66c1718 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.12



* [PATCH v3 07/22] sched: set initial load avg of new forked task
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (5 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 06/22] sched: load tracking bug fix Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-11  5:10   ` Preeti U Murthy
  2013-01-05  8:37 ` [PATCH v3 08/22] sched: update cpu load after task_tick Alex Shi
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

A new task has no runnable sum at the time it first becomes runnable,
which makes burst forking select only a few idle cpus to place tasks on.
Set the initial load avg of a newly forked task to its load weight to
resolve this issue.
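
A toy userspace illustration of the effect, with two fake cpus and
made-up numbers (not the kernel code): with new_task_contrib left at 0,
all four forked tasks land on cpu 0; with the patched value, the load
weight of a nice-0 task (1024), they alternate between the two cpus.

#include <stdio.h>

int main(void)
{
	unsigned long runnable_load_avg[2] = { 0, 0 };
	/* contribution a freshly forked task adds to its cpu:
	 * 0 before this patch, its load weight (1024) after it */
	unsigned long new_task_contrib = 1024;
	int i;

	for (i = 0; i < 4; i++) {
		/* place each forked task on the cpu that looks least loaded */
		int target = runnable_load_avg[0] <= runnable_load_avg[1] ? 0 : 1;
		printf("task %d -> cpu %d\n", i, target);
		runnable_load_avg[target] += new_task_contrib;
	}
	return 0;
}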

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..fb7aab5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66c1718..66ce1f1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1705,7 +1705,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 895a3f4..5c545e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * set the initial load avg of new task same as its load
+	 * in order to avoid brust fork make few cpu too heavier
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
-- 
1.7.12



* [PATCH v3 08/22] sched: update cpu load after task_tick.
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (6 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 07/22] sched: set initial load avg of new forked task Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 66ce1f1..06d27af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2700,8 +2700,8 @@ void scheduler_tick(void)
 
 	raw_spin_lock(&rq->lock);
 	update_rq_clock(rq);
-	update_cpu_load_active(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
+	update_cpu_load_active(rq);
 	raw_spin_unlock(&rq->lock);
 
 	perf_event_task_tick();
-- 
1.7.12



* [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (7 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 08/22] sched: update cpu load after task_tick Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:56   ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 10/22] sched: consider runnable load average in move_tasks Alex Shi
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

These are the base values used in load balancing; update them with the
rq runnable load average, and load balancing will then naturally take
the runnable load avg into account.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 8 ++++++++
 kernel/sched/fair.c | 4 ++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 06d27af..5feed5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2544,7 +2544,11 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
+#else
 	unsigned long load = this_rq->load.weight;
+#endif
 	unsigned long pending_updates;
 
 	/*
@@ -2594,7 +2598,11 @@ static void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
+#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
+#else
 	__update_cpu_load(this_rq, this_rq->load.weight, 1);
+#endif
 
 	calc_load_account_active(this_rq);
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c545e4..84a6517 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2906,7 +2906,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
 {
-	return cpu_rq(cpu)->load.weight;
+	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
 }
 
 /*
@@ -2953,7 +2953,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
-		return rq->load.weight / nr_running;
+		return (unsigned long)rq->cfs.runnable_load_avg / nr_running;
 
 	return 0;
 }
-- 
1.7.12



* [PATCH v3 10/22] sched: consider runnable load average in move_tasks
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (8 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 11/22] sched: consider runnable load average in effective_load Alex Shi
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

Besides using the runnable load average in the background, move_tasks is
also a key function in load balancing. We need to consider the runnable
load average in it in order to get an apples-to-apples load comparison.
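
A worked example of the scaling done by the new task_h_load_avg()
helper (the numbers are assumed, for illustration only):

#include <stdio.h>

int main(void)
{
	unsigned long h_load = 1024;		/* assumed task_h_load() value */
	unsigned long runnable_avg_sum = 12000;	/* assumed: runnable ~25% */
	unsigned long runnable_avg_period = 48000;

	/* same scaling as task_h_load_avg() in this patch: 1024 * 1/4 = 256 */
	printf("load = %lu\n", h_load * runnable_avg_sum / runnable_avg_period);
	return 0;
}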

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84a6517..cab62aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3967,6 +3967,15 @@ static unsigned long task_h_load(struct task_struct *p);
 
 static const unsigned int sched_nr_migrate_break = 32;
 
+static unsigned long task_h_load_avg(struct task_struct *p)
+{
+	u32 period = p->se.avg.runnable_avg_period;
+	if (!period)
+		return 0;
+
+	return task_h_load(p) * p->se.avg.runnable_avg_sum / period;
+}
+
 /*
  * move_tasks tries to move up to imbalance weighted load from busiest to
  * this_rq, as part of a balancing operation within domain "sd".
@@ -4002,7 +4011,7 @@ static int move_tasks(struct lb_env *env)
 		if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 			goto next;
 
-		load = task_h_load(p);
+		load = task_h_load_avg(p);
 
 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
 			goto next;
-- 
1.7.12



* [PATCH v3 11/22] sched: consider runnable load average in effective_load
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (9 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 10/22] sched: consider runnable load average in move_tasks Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-10 11:28   ` Morten Rasmussen
  2013-01-05  8:37 ` [PATCH v3 12/22] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

effective_load calculates the load change as seen from the
root_task_group. It needs to be multiplied by the cfs_rq's
tg_runnable_contrib when we switch to runnable load average balancing.
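
A small worked example of the added scaling (assuming NICE_0_SHIFT is
10, i.e. a scale of 1024; the other numbers are made up):

#include <stdio.h>

int main(void)
{
	long wl = 1024;			/* load change seen at the root */
	long tg_runnable_contrib = 512;	/* cfs_rq runnable roughly 50% */

	/* same shift as the patch: wl * contrib >> NICE_0_SHIFT */
	printf("scaled wl = %ld\n", (wl * tg_runnable_contrib) >> 10); /* 512 */
	return 0;
}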

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cab62aa..247d6a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2982,7 +2982,8 @@ static void task_waking_fair(struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 /*
- * effective_load() calculates the load change as seen from the root_task_group
+ * effective_load() calculates the runnable load average change as seen from
+ * the root_task_group
  *
  * Adding load to a group doesn't make a group heavier, but can cause movement
  * of group shares between cpus. Assuming the shares were perfectly aligned one
@@ -3030,13 +3031,17 @@ static void task_waking_fair(struct task_struct *p)
  * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
  * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
  * 4/7) times the weight of the group.
+ *
+ * After get effective_load of the load moving, will multiple the cpu own
+ * cfs_rq's runnable contrib of root_task_group.
  */
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
 	if (!tg->parent)	/* the trivial, non-cgroup case */
-		return wl;
+		return wl * tg->cfs_rq[cpu]->tg_runnable_contrib
+						>> NICE_0_SHIFT;
 
 	for_each_sched_entity(se) {
 		long w, W;
@@ -3084,7 +3089,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 		wg = 0;
 	}
 
-	return wl;
+	return wl * tg->cfs_rq[cpu]->tg_runnable_contrib >> NICE_0_SHIFT;
 }
 #else
 
-- 
1.7.12



* [PATCH v3 12/22] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (10 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 11/22] sched: consider runnable load average in effective_load Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 13/22] sched: add sched_policy in kernel Alex Shi
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

This reverts commit f4e26b120b9de84cb627bc7361ba43cfdc51341f,
and also drops some other CONFIG_FAIR_GROUP_SCHED checks.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  8 +-------
 kernel/sched/core.c   | 11 +++--------
 kernel/sched/fair.c   | 13 ++-----------
 kernel/sched/sched.h  |  9 +--------
 4 files changed, 7 insertions(+), 34 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb7aab5..2b309c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5feed5e..e8cc6b8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1550,12 +1550,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 	p->se.avg.decay_count = 0;
@@ -2544,7 +2539,7 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
 void update_idle_cpu_load(struct rq *this_rq)
 {
 	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
 #else
 	unsigned long load = this_rq->load.weight;
@@ -2598,7 +2593,7 @@ static void update_cpu_load_active(struct rq *this_rq)
 	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
 	 */
 	this_rq->last_load_update_tick = jiffies;
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
 #else
 	__update_cpu_load(this_rq, this_rq->load.weight, 1);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 247d6a8..6fa10ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3396,12 +3395,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3424,7 +3417,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6125,9 +6117,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 
1.7.12



* [PATCH v3 13/22] sched: add sched_policy in kernel
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (11 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 12/22] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface Alex Shi
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

The current scheduler behaviour only considers system performance, so it
tries to spread tasks onto more cpu sockets and cpu cores.

To add power awareness, the patchset introduces 2 new kinds of scheduler
policy: powersaving and balance; the old scheduling behaviour is kept as
performance.

performance: the current scheduling behaviour, try to spread tasks
                on more CPU sockets or cores.
powersaving: shrink tasks into a sched group until all LCPUs in the
                group are nearly full.
balance    : shrink tasks into a sched group until group_capacity
                number of CPUs are nearly full.

The following patches will enable powersaving scheduling in CFS.
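
A rough standalone sketch of how the two new policies differ, showing
only the packing-threshold idea with made-up group numbers; the
group_can_take_more() helper is an invented name, and the real checks
come in the later patches of this series:

#include <stdio.h>

#define SCHED_POLICY_PERFORMANCE	(0x1)
#define SCHED_POLICY_POWERSAVING	(0x2)
#define SCHED_POLICY_BALANCE		(0x4)

/* can this group take one more task without exceeding the policy's limit? */
static int group_can_take_more(int policy, unsigned int nr_tasks,
			       unsigned int group_weight,   /* nr of LCPUs */
			       unsigned int group_capacity) /* full-power CPUs */
{
	if (policy == SCHED_POLICY_POWERSAVING)
		return nr_tasks < group_weight;
	if (policy == SCHED_POLICY_BALANCE)
		return nr_tasks < group_capacity;
	return 0;	/* performance: never pack */
}

int main(void)
{
	/* e.g. one socket with 8 LCPUs (4 cores + HT), capacity taken as 4 */
	unsigned int nr_tasks;

	for (nr_tasks = 3; nr_tasks <= 9; nr_tasks += 3)
		printf("%u tasks: powersaving packs=%d, balance packs=%d\n",
		       nr_tasks,
		       group_can_take_more(SCHED_POLICY_POWERSAVING, nr_tasks, 8, 4),
		       group_can_take_more(SCHED_POLICY_BALANCE, nr_tasks, 8, 4));
	return 0;
}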

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  | 2 ++
 kernel/sched/sched.h | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fa10ac..f24aca6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6100,6 +6100,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae3511e..66b08a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+#define SCHED_POLICY_BALANCE		(0x4)
+
+extern int __read_mostly sched_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.12



* [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (12 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 13/22] sched: add sched_policy in kernel Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-14  6:53   ` Namhyung Kim
  2013-01-05  8:37 ` [PATCH v3 15/22] sched: log the cpu utilization at rq Alex Shi
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving

This means the sched policy currently in use is 'powersaving'.

The user can change the policy with the 'echo' command:
 echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 24 +++++++
 kernel/sched/fair.c                                | 76 ++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..9c9acbf 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,30 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
+		/sys/devices/system/cpu/sched_policy/available_sched_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS scheduler policy showing and setting interface.
+
+		available_sched_policy shows there are 3 kinds of policy now:
+		performance, balance and powersaving.
+		current_sched_policy shows current scheduler policy. And user
+		can change the policy by writing it.
+
+		Policy decides that CFS scheduler how to distribute tasks onto
+		which CPU unit when tasks number less than LCPU number in system
+
+		performance: try to spread tasks onto more CPU sockets,
+		more CPU cores.
+
+		powersaving: try to shrink tasks onto same core or same CPU
+		until every LCPUs are busy.
+
+		balance:     try to shrink tasks onto same core or same CPU
+		until full powered CPUs are busy. This policy also consider
+		system performance when try to save power.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f24aca6..ee015b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6102,6 +6102,82 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 
 /* The default scheduler policy is 'performance'. */
 int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	else if (sched_policy == SCHED_POLICY_BALANCE)
+		return sprintf(buf, "balance\n");
+	return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_policy = SCHED_POLICY_POWERSAVING;
+	else if (!strcmp(str_policy, "balance"))
+		sched_policy = SCHED_POLICY_BALANCE;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ *  * Sysfs setup bits:
+ *   */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+						set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+		show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+	&dev_attr_current_sched_policy.attr,
+	&dev_attr_available_sched_policy.attr,
+	NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+	.attrs = sched_policy_default_attrs,
+	.name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+
+static int __init sched_policy_sysfs_init(void)
+{
+	return create_sysfs_sched_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
-- 
1.7.12



* [PATCH v3 15/22] sched: log the cpu utilization at rq
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (13 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-10 11:40   ` Morten Rasmussen
  2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

The cpu's utilization measures how busy the cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since the util is no more than 1, we use its percentage value in later
calculations, and define FULL_UTIL as 99%.

In the later power aware scheduling, we care about how busy the cpu is,
not how heavy its load weight is. Power consumption is more closely
related to busy time than to load weight.
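
A tiny standalone example of the percentage computation (the sched_avg
numbers are assumed sample values, not taken from a real trace):

#include <stdio.h>

#define FULL_UTIL	99

int main(void)
{
	unsigned int runnable_avg_sum = 35000;	  /* assumed */
	unsigned int runnable_avg_period = 47000; /* assumed */
	unsigned int util = runnable_avg_sum * 100 / runnable_avg_period;

	printf("util = %u%%, full = %s\n", util,
	       util >= FULL_UTIL ? "yes" : "no");	/* 74%, not full */
	return 0;
}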

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/debug.c | 1 +
 kernel/sched/fair.c  | 4 ++++
 kernel/sched/sched.h | 4 ++++
 3 files changed, 9 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cd3c1b..e4035f7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -318,6 +318,7 @@ do {									\
 
 	P(ttwu_count);
 	P(ttwu_local);
+	P(util);
 
 #undef P
 #undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee015b8..7bfbd69 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	u32 period;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+	rq->util = rq->avg.runnable_avg_sum * 100 / period;
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66b08a1..3c6e803 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* Take as full load, if the cpu percentage util is up to 99 */
+#define FULL_UTIL	99
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +484,7 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+	unsigned int util;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.12



* [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (14 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 15/22] sched: log the cpu utilization at rq Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-10 15:01   ` Morten Rasmussen
  2013-01-14  7:03   ` Namhyung Kim
  2013-01-05  8:37 ` [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing Alex Shi
                   ` (6 subsequent siblings)
  22 siblings, 2 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That saves power for the other groups.

The trade-off is adding power aware statistics collection to the group
search. But since the collection only happens when the power scheduling
eligibility condition holds, the worst case of hackbench testing only
drops about 2% with the powersaving/balance policies. There is no clear
change for the performance policy.

I had tried to use the rq load avg utilisation in this balancing, but
the utilisation needs much time to accumulate, which makes it unfit for
any burst balancing. So I use nr_running as the instant rq utilisation.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 230 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 179 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7bfbd69..8d0d3af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3323,25 +3323,189 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned int  this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned int  busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Varibles of power awaring scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
+
+	unsigned long sum_shared_running;	/* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		sgs->group_utils += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the doamin.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	int sd_min_delta = INT_MAX;
+	int cpu = task_cpu(p);
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+			continue;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+
+		g_delta = threshold - sgs.group_utils;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_utils += sgs.group_utils;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	unsigned long threshold;
+
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	memset(sds, 0, sizeof(*sds));
+	get_sd_power_stats(sd, p, sds);
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sd->span_weight;
+	else
+		threshold = sds->sd_capacity;
+
+	/* still can hold one more task in this domain */
+	if (sds->sd_utils < threshold)
+		return sched_policy;
+
+	return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+	return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3367,11 +3531,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
 
 	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
+
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
@@ -4181,51 +4354,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-};
-
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
-- 
1.7.12



* [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (15 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-10 17:17   ` Morten Rasmussen
  2013-01-05  8:37 ` [PATCH v3 18/22] sched: add power/performance balance allowed flag Alex Shi
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

If the woken/exec'd task is small enough, util < 12.5%, it gets the
chance to be packed onto a cpu which is busy but still has room to
handle it.
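
A toy check of the packing criterion used below (userspace sketch, not
the kernel code). The factor of 8 is what encodes the 12.5% limit: with
FULL_UTIL = 99, a task whose putil is 13% or more gives putil * 8 > 99,
so it can never fit anywhere.

#include <stdio.h>

#define FULL_UTIL	99

int main(void)
{
	int rq_util = 40, nr_running = 1;	/* an assumed busy-but-not-full cpu */
	int putil;

	for (putil = 5; putil <= 15; putil += 5) {
		int vacancy = FULL_UTIL - (rq_util * nr_running + putil * 8);
		printf("putil=%2d%% -> vacancy=%d (%s)\n", putil, vacancy,
		       vacancy > 0 ? "packable here" : "not packable here");
	}
	return 0;
}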

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d0d3af..0596e81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+{
+	unsigned vacancy, min_vacancy = UINT_MAX;
+	int idlest = -1;
+	int i;
+	/* percentage the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		struct rq *rq = cpu_rq(i);
+		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only pack task which putil < 12.5% */
+		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
+
+		/* bias toward local cpu */
+		if (vacancy > 0 && (i == this_cpu))
+			return i;
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			idlest = i;
+		}
+	}
+	return idlest;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int fork)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (!fork)
+			new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						flags & SD_BALANCE_FORK);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 18/22] sched: add power/performance balance allowed flag
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (16 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 19/22] sched: pull all tasks from source group Alex Shi
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

If the cpu's condition is suitable for power balance, power_lb
will be set and perf_lb will be cleared. If the condition is suitable
for performance balance, the values are set the other way around.

If the domain is suitable for power balance but the balance should not
be done by this cpu, both perf_lb and power_lb are cleared to wait for a
suitable cpu to do the power balance. That means no balancing at all,
neither power balance nor performance balance, in this domain.

This logic will be implemented by the following patches.
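
Purely as an illustration of the intended flag combinations (the enum
and helper below are invented for this summary and are not part of the
patch):

enum lb_action { LB_NONE, LB_POWER, LB_PERF };

enum lb_action lb_action(int power_lb, int perf_lb)
{
	if (power_lb && !perf_lb)
		return LB_POWER;	/* pack tasks to save power */
	if (!power_lb && perf_lb)
		return LB_PERF;		/* usual performance balance */
	return LB_NONE;			/* leave balancing to a better placed cpu */
}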

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0596e81..380c8cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4029,6 +4029,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if power balance needed */
+	int			perf_lb;   /* if performance balance needed */
 };
 
 /*
@@ -5179,6 +5181,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.power_lb	= 0,
+		.perf_lb	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 19/22] sched: pull all tasks from source group
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (17 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 18/22] sched: add power/performance balance allowed flag Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 20/22] sched: don't care if the local group has capacity Alex Shi
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

In power balance, we want some sched groups to end up fully empty so
that their CPUs can save power. So, we want to move all tasks away from
them.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 380c8cf..d43fe6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5109,7 +5109,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -5213,7 +5215,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 20/22] sched: don't care if the local group has capacity
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (18 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 19/22] sched: pull all tasks from source group Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 21/22] sched: power aware load balance, Alex Shi
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

In power aware scheduling, we don't care about load weight and we
don't want to pull tasks just because the local group has capacity.
The local group may have no tasks at the time, and that is exactly what
power balance hopes for.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d43fe6a..cf4955c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4766,8 +4766,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
+		if (prefer_sibling && !local_group && sds->this_has_capacity
+				&& env->perf_lb)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 21/22] sched: power aware load balance,
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (19 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 20/22] sched: don't care if the local group has capacity Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-05  8:37 ` [PATCH v3 22/22] sched: lazy powersaving balance Alex Shi
  2013-01-09 17:16 ` [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Morten Rasmussen
  22 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

This patch enables the power aware consideration in load balance.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, shrinking tasks onto fewer sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups until those groups are full of tasks.

This patch reuses some of Suresh's power saving load balance code.
The enabling logic in summary (see the sketch after this list):
1, Collect power aware scheduler statistics during performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balance, just do it
and skip performance load balance. If the domain is suitable for power
balance but this cpu is not the appropriate one, stop both power and
performance balance; otherwise do performance load balance.
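
For step 2, a rough user-space model of the decision (the group ids and
the flag values here are invented; the real check is added to
find_busiest_group() in the hunk below):

#include <stdio.h>

int main(void)
{
	/* flags as initialised for periodic balance by this patch */
	int perf_lb = 0, power_lb = 1;

	/* stand-ins for the groups tracked in sd_lb_stats */
	int this_group = 1;	/* the balancing cpu's own group */
	int group_leader = 1;	/* busiest group that still has room */
	int group_min = 3;	/* least loaded group, to be drained */

	if (power_lb && this_group == group_leader && group_leader != group_min)
		printf("power balance: pull everything from group %d\n", group_min);
	else if (perf_lb)
		printf("performance balance as usual\n");
	else
		printf("not the right cpu: no balance at all in this domain\n");
	return 0;
}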

A test can show the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (the data is in Watts):
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57~68 Watts,
since the HT capacity and core capacity are both 1.

On a SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system has only a few long-running tasks, the power policies can
give a performance/power gain, e.g. in the sysbench fileio randrw test
with 16 threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 124 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf4955c..c82536f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3355,6 +3355,7 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
 	unsigned long min_load_per_task; /* load_per_task in group_min */
 	unsigned int  leader_util;	/* sum utilizations of group_leader */
 	unsigned int  min_util;		/* sum utilizations of group_min */
@@ -4395,6 +4396,106 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
+
+/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (!env->power_lb)
+		return;
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+		sgs->group_utils + FULL_UTIL > threshold_util))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->group_utils > threshold_util) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold_util)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4634,6 +4735,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		/* accumulate the maximum potential util */
+		if (!nr_running)
+			nr_running = 1;
+		sgs->group_utils += rq->util * nr_running;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4742,6 +4849,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4793,6 +4901,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -5010,6 +5119,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5187,8 +5309,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -6266,7 +6388,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v3 22/22] sched: lazy powersaving balance
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (20 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 21/22] sched: power aware load balance, Alex Shi
@ 2013-01-05  8:37 ` Alex Shi
  2013-01-14  8:39   ` Namhyung Kim
  2013-01-09 17:16 ` [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Morten Rasmussen
  22 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:37 UTC (permalink / raw)
  To: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, linux-kernel, alex.shi

When the number of active tasks in a sched domain waves around the
powersaving scheduling criteria, scheduling will thrash between
powersaving balance and performance balance, bringing unnecessary task
migrations. A typical benchmark that generates the issue is 'make -j x'.

To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. A powersaving LB is accepted only
when there was no performance LB in the last 32 load balancing runs and
no more than 4 performance LBs in the last 64 runs, or when there was no
LB at all for 8 times max_interval ms. Otherwise, give up this power
aware LB chance.
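
A small user-space model of the history check, using the same masks as
the patch below; the sample record value is invented:

#include <stdio.h>
#include <stdint.h>

#define PERF_LB_HH_MASK	0xffffffff00000000ULL	/* balances 33..64 ago */
#define PERF_LB_LH_MASK	0xffffffffULL		/* last 32 balances */

static int popcount64(uint64_t x)
{
	int n = 0;

	for (; x; x &= x - 1)
		n++;
	return n;
}

int main(void)
{
	/* three performance balances, all of them more than 32 balances ago */
	uint64_t record = 0x0000000700000000ULL;

	if (popcount64(record & PERF_LB_HH_MASK) <= 4 &&
	    !(record & PERF_LB_LH_MASK))
		printf("history quiet enough, accept a powersaving balance\n");
	else
		printf("recent performance balancing, skip powersaving this time\n");
	return 0;
}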

With this patch, the worst case for power scheduling -- kbuild -- gets
a similar or even better performance/power value under the balance
policy than under the performance policy, while powersaving is worse.

So, maybe we'd better use the 'balance' policy in general scenarios.

On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x' results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    186.026 /246 21          190.182 /208 25        200.873 /210 23
x = 4    198.883 /145 34          204.856 /120 40        218.843 /116 39
x = 6    208.458 /106 45          214.981 /93 50         233.561 /86 49
x = 8    218.304 /86 53           223.527 /76 58         233.008 /75 57
x = 12   231.829 /71 60           268.98  /55 67         247.013 /60 67
x = 16   262.112 /53 71           267.898 /50 74         344.589 /41 70
x = 32   306.969 /36 90           310.774 /37 86         313.359 /38 83

How to read the data, e.g. 175.603 /417 13:
	175.603: average Watts
	417: seconds (compile time)
	13:  scaled performance/power = 1000000 / time / power
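
As a quick check of the formula: for the first powersaving entry above,
1000000 / 417 / 175.603 works out to roughly 13.7, which is reported as
13 in the table (higher means better performance per watt).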

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 67 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2b309c6..b0354a5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -941,6 +941,7 @@ struct sched_domain {
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
+	u64	perf_lb_record;	/* performance balance record */
 
 	u64 last_update;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c82536f..604d0ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4496,6 +4496,58 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	}
 }
 
+#define PERF_LB_HH_MASK		0xffffffff00000000ULL
+#define PERF_LB_LH_MASK		0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	env->sd->perf_lb_record <<= 1;
+
+	if (env->perf_lb) {
+		env->sd->perf_lb_record |= 0x1;
+		return 1;
+	}
+
+	/*
+	 * The situation isn't eligible for performance balance. If this_cpu
+	 * is not eligible or the timing is not suitable for lazy powersaving
+	 * balance, we will stop both powersaving and performance balance.
+	 */
+	if (env->power_lb && sds->this == sds->group_leader
+			&& sds->group_leader != sds->group_min) {
+		int interval;
+
+		/* powersaving balance interval set as 8 * max_interval */
+		interval = msecs_to_jiffies(8 * env->sd->max_interval);
+		if (time_after(jiffies, env->sd->last_balance + interval))
+			env->sd->perf_lb_record = 0;
+
+		/*
+		 * An eligible timing is: no performance balance in the last 32
+		 * balances and no more than 4 performance balances in the
+		 * last 64 balances, or no balance at all within the
+		 * powersaving interval.
+		 */
+		if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+			&& !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+			env->imbalance = sds->min_load_per_task;
+			return 0;
+		}
+
+	}
+	env->power_lb = 0;
+	sds->group_min = NULL;
+	return 0;
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -5086,7 +5138,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 }
 
 /******* find_busiest_group() helpers end here *********************/
-
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance. If there isn't an imbalance, and
@@ -5119,18 +5170,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
-	if (!env->perf_lb && !env->power_lb)
-		return  NULL;
-
-	if (env->power_lb) {
-		if (sds.this == sds.group_leader &&
-				sds.group_leader != sds.group_min) {
-			env->imbalance = sds.min_load_per_task;
-			return sds.group_min;
-		}
-		env->power_lb = 0;
-		return NULL;
-	}
+	if (!need_perf_balance(env, &sds))
+		return sds.group_min;
 
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-05  8:37 ` [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
@ 2013-01-05  8:56   ` Alex Shi
  2013-01-06  7:54     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-05  8:56 UTC (permalink / raw)
  To: pjt
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/05/2013 04:37 PM, Alex Shi wrote:
> They are the base values in load balance, update them with rq runnable
> load average, then the load balance will consider runnable load avg
> naturally.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 8 ++++++++
>  kernel/sched/fair.c | 4 ++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06d27af..5feed5e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2544,7 +2544,11 @@ static void __update_cpu_load(struct rq *this_rq, unsigned long this_load,
>  void update_idle_cpu_load(struct rq *this_rq)
>  {
>  	unsigned long curr_jiffies = ACCESS_ONCE(jiffies);
> +#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> +	unsigned long load = (unsigned long)this_rq->cfs.runnable_load_avg;
> +#else
>  	unsigned long load = this_rq->load.weight;
> +#endif
>  	unsigned long pending_updates;
>  
>  	/*
> @@ -2594,7 +2598,11 @@ static void update_cpu_load_active(struct rq *this_rq)
>  	 * See the mess around update_idle_cpu_load() / update_cpu_load_nohz().
>  	 */
>  	this_rq->last_load_update_tick = jiffies;
> +#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> +	__update_cpu_load(this_rq, this_rq->cfs.runnable_load_avg, 1);
> +#else
>  	__update_cpu_load(this_rq, this_rq->load.weight, 1);
> +#endif
>  
>  	calc_load_account_active(this_rq);
>  }
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5c545e4..84a6517 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2906,7 +2906,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  /* Used instead of source_load when we know the type == 0 */
>  static unsigned long weighted_cpuload(const int cpu)
>  {
> -	return cpu_rq(cpu)->load.weight;
> +	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;

The above line change causes the aim9 multitask benchmark to drop about
10% in performance on many x86 machines. Profiling just shows more
cpuidle enter calls.
The testing command:

#( echo $hostname ; echo test ; echo 1 ; echo 2000 ; echo 2 ; echo 2000
; echo 100 ) | ./multitask -nl

The oprofile output here:
with this patch set
101978 total                                      0.0134
 54406 cpuidle_wrap_enter                       499.1376
  2098 __do_page_fault                            2.0349
  1976 rwsem_wake                                29.0588
  1824 finish_task_switch                        12.4932
  1560 copy_user_generic_string                  24.3750
  1346 clear_page_c                              84.1250
  1249 unmap_single_vma                           0.6885
  1141 copy_page_rep                             71.3125
  1093 anon_vma_interval_tree_insert              8.1567

3.8-rc2
 68982 total                                      0.0090
 22166 cpuidle_wrap_enter                       203.3578
  2188 rwsem_wake                                32.1765
  2136 __do_page_fault                            2.0718
  1920 finish_task_switch                        13.1507
  1724 poll_idle                                 15.2566
  1433 copy_user_generic_string                  22.3906
  1237 clear_page_c                              77.3125
  1222 unmap_single_vma                           0.6736
  1053 anon_vma_interval_tree_insert              7.8582

Without the load avg in periodic balancing, each cpu is weighted with
the load of all its tasks.

With the new load tracking, we only update the cfs_rq load avg when a
task is enqueued/dequeued, and we only update the current task in
scheduler_tick. I am wondering whether the sampling is a bit too sparse.

What's your opinion on this, Paul?


>  }
>  
>  /*
> @@ -2953,7 +2953,7 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>  	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>  
>  	if (nr_running)
> -		return rq->load.weight / nr_running;
> +		return (unsigned long)rq->cfs.runnable_load_avg / nr_running;
>  
>  	return 0;
>  }
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-05  8:56   ` Alex Shi
@ 2013-01-06  7:54     ` Alex Shi
  2013-01-06 18:31       ` Linus Torvalds
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-06  7:54 UTC (permalink / raw)
  To: Alex Shi, pjt, mingo
  Cc: peterz, tglx, akpm, arjan, bp, namhyung, efault, vincent.guittot,
	gregkh, preeti, linux-kernel, Linus Torvalds


>>  static unsigned long weighted_cpuload(const int cpu)
>>  {
>> -	return cpu_rq(cpu)->load.weight;
>> +	return (unsigned long)cpu_rq(cpu)->cfs.runnable_load_avg;
> 
> Above line change cause aim9 multitask benchmark drop about 10%
> performance on many x86 machines. Profile just show there are more
> cpuidle enter called.
> The testing command:
> 
> #( echo $hostname ; echo test ; echo 1 ; echo 2000 ; echo 2 ; echo 2000
> ; echo 100 ) | ./multitask -nl
> 
> The oprofile output here:
> with this patch set
> 101978 total                                      0.0134
>  54406 cpuidle_wrap_enter                       499.1376
>   2098 __do_page_fault                            2.0349
>   1976 rwsem_wake                                29.0588
>   1824 finish_task_switch                        12.4932
>   1560 copy_user_generic_string                  24.3750
>   1346 clear_page_c                              84.1250
>   1249 unmap_single_vma                           0.6885
>   1141 copy_page_rep                             71.3125
>   1093 anon_vma_interval_tree_insert              8.1567
> 
> 3.8-rc2
>  68982 total                                      0.0090
>  22166 cpuidle_wrap_enter                       203.3578
>   2188 rwsem_wake                                32.1765
>   2136 __do_page_fault                            2.0718
>   1920 finish_task_switch                        13.1507
>   1724 poll_idle                                 15.2566
>   1433 copy_user_generic_string                  22.3906
>   1237 clear_page_c                              77.3125
>   1222 unmap_single_vma                           0.6736
>   1053 anon_vma_interval_tree_insert              7.8582
> 
> Without load avg in periodic balancing, each cpu will weighted with all
> tasks load.
> 
> with new load tracking, we just update the cfs_rq load avg with each
> task at enqueue/dequeue moment, and with just update current task in
> scheduler_tick. I am wondering if it's the sample is a bit rare.
> 
> What's your opinion of this, Paul?
> 

Ingo & Paul:

I just looked into the aim9 benchmark. In this case it forks 2000 tasks;
after all tasks are ready, aim9 gives a signal and then all the tasks
burst awake and run until all are finished.
Since each of the tasks finishes very quickly, an imbalanced empty cpu
may go to sleep until a regular balancing run gives it some new tasks.
That causes the performance drop and the extra idle entering.

By design, the load avg needs time to accumulate its load weight. So,
it's hard to find a way to resolve this problem.

As to other scheduler related benchmarks, like kbuild, specjbb2005,
hackbench, sysbench etc, I didn't find a clear improvement or regression
from the load avg balancing.

Any comments on this problem?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-06  7:54     ` Alex Shi
@ 2013-01-06 18:31       ` Linus Torvalds
  2013-01-07  7:00         ` Preeti U Murthy
                           ` (2 more replies)
  0 siblings, 3 replies; 91+ messages in thread
From: Linus Torvalds @ 2013-01-06 18:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton, Arjan van de Ven, Borislav Petkov, namhyung,
	Mike Galbraith, Vincent Guittot, Greg Kroah-Hartman, preeti,
	Linux Kernel Mailing List

On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>
> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
> after all tasks ready, aim9 give a signal than all tasks burst waking up
> and run until all finished.
> Since each of tasks are finished very quickly, a imbalanced empty cpu
> may goes to sleep till a regular balancing give it some new tasks. That
> causes the performance dropping. cause more idle entering.

Sounds like for AIM (and possibly for other really bursty loads), we
might want to do some load-balancing at wakeup time by *just* looking
at the number of running tasks, rather than at the load average. Hmm?

The load average is fundamentally always going to run behind a bit,
and while you want to use it for long-term balancing, a short-term you
might want to do just a "if we have a huge amount of runnable
processes, do a load balancing *now*". Where "huge amount" should
probably be relative to the long-term load balancing (ie comparing the
number of runnable processes on this CPU right *now* with the load
average over the last second or so would show a clear spike, and a
reason for quick action).

           Linus

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-06 18:31       ` Linus Torvalds
@ 2013-01-07  7:00         ` Preeti U Murthy
  2013-01-08 14:27         ` Alex Shi
  2013-01-11  6:31         ` Alex Shi
  2 siblings, 0 replies; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-07  7:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alex Shi, Paul Turner, Ingo Molnar, Peter Zijlstra,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Mike Galbraith, Vincent Guittot,
	Greg Kroah-Hartman, Linux Kernel Mailing List

Hi everyone,

On 01/07/2013 12:01 AM, Linus Torvalds wrote:
> On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>>
>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>> and run until all finished.
>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>> may goes to sleep till a regular balancing give it some new tasks. That
>> causes the performance dropping. cause more idle entering.
> 
> Sounds like for AIM (and possibly for other really bursty loads), we
> might want to do some load-balancing at wakeup time by *just* looking
> at the number of running tasks, rather than at the load average. Hmm?

During wake ups, the load average is not even queried, is it? wake_affine() is called
to decide in the affinity of which cpu (prev/waking) the task should go. But after that
select_idle_sibling() simply sees if there is an idle cpu to offload the task to.

Looks like only in the periodic load balancing can we correct this scenario as of now,
as pointed out below.

> 
> The load average is fundamentally always going to run behind a bit,
> and while you want to use it for long-term balancing, a short-term you
> might want to do just a "if we have a huge amount of runnable
> processes, do a load balancing *now*". Where "huge amount" should
> probably be relative to the long-term load balancing (ie comparing the
> number of runnable processes on this CPU right *now* with the load
> average over the last second or so would show a clear spike, and a
> reason for quick action).
> 
>            Linus
> 

Earlier I had posted a patch,to address this.
https://lkml.org/lkml/2012/10/25/156
update_sd_pick_busiest() checks whether a sched group has too many running tasks
to be offloaded.

--------------START_PATCH-------------------------------------------------


The scenario which led to this patch is shown below:
Consider Task1 and Task2 to be a long running task
and Tasks 3,4,5,6 to be short running tasks

			Task3
			Task4
Task1			Task5
Task2			Task6
------			------
SCHED_GRP1		SCHED_GRP2

The normal load calculation would qualify SCHED_GRP2 as
the candidate for sd->busiest due to the following loads
that it calculates:

SCHED_GRP1:2048
SCHED_GRP2:4096

The calculation based on PJT's metric would probably qualify SCHED_GRP1
as the candidate for sd->busiest due to the following loads that it
calculates:

SCHED_GRP1:3200
SCHED_GRP2:1156

This patch aims to strike a balance between the loads of the
group and the number of tasks running on the group to decide the
busiest group in the sched_domain.

This means we will need to use the PJT's metrics but with an
additional constraint.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c |   25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e02dad4..aafa3c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -165,7 +165,8 @@ void sched_init_granularity(void)
 #else
 # define WMULT_CONST	(1UL << 32)
 #endif
-
+#define NR_THRESHOLD 2
+#define LOAD_THRESHOLD 1
 #define WMULT_SHIFT	32

 /*
@@ -4169,6 +4170,7 @@ struct sd_lb_stats {
 	/* Statistics of the busiest group */
 	unsigned int  busiest_idle_cpus;
 	unsigned long max_load;
+	u64 max_sg_load; /* Equivalent of max_load but calculated using pjt's metric*/
 	unsigned long busiest_load_per_task;
 	unsigned long busiest_nr_running;
 	unsigned long busiest_group_capacity;
@@ -4628,8 +4630,24 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 				   struct sched_group *sg,
 				   struct sg_lb_stats *sgs)
 {
-	if (sgs->avg_load <= sds->max_load)
-		return false;
+	/* Use PJT's metrics to qualify a sched_group as busy
+ 	 *
+ 	 * But a low load sched group may be queueing up many tasks
+ 	 * So before dismissing a sched group with lesser load,ensure
+ 	 * that the number of processes on it is checked if it is
+ 	 * not too less loaded than the max load so far
+ 	 *
+ 	 * But as of now as LOAD_THRESHOLD is 1,this check is a nop.
+ 	 * But we could vary LOAD_THRESHOLD suitably to bring in this check
+ 	 */
+	if (sgs->avg_cfs_runnable_load <= sds->max_sg_load) {
+		if (sgs->avg_cfs_runnable_load > LOAD_THRESHOLD * sds->max_sg_load) {
+			if (sgs->sum_nr_running <= (NR_THRESHOLD + sds->busiest_nr_running))
+				return false;
+		} else {
+			return false;
+		}
+	}

 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;
@@ -4708,6 +4726,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->this_idle_cpus = sgs.idle_cpus;
 		} else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
 			sds->max_load = sgs.avg_load;
+			sds->max_sg_load = sgs.avg_cfs_runnable_load;
 			sds->busiest = sg;
 			sds->busiest_nr_running = sgs.sum_nr_running;
 			sds->busiest_idle_cpus = sgs.idle_cpus;


Regards
Preeti U Murthy


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-06 18:31       ` Linus Torvalds
  2013-01-07  7:00         ` Preeti U Murthy
@ 2013-01-08 14:27         ` Alex Shi
  2013-01-11  6:31         ` Alex Shi
  2 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-08 14:27 UTC (permalink / raw)
  To: Linus Torvalds, Paul Turner, Ingo Molnar
  Cc: Peter Zijlstra, Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Mike Galbraith, Vincent Guittot,
	Greg Kroah-Hartman, preeti, Linux Kernel Mailing List

On 01/07/2013 02:31 AM, Linus Torvalds wrote:
> On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>>
>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>> and run until all finished.
>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>> may goes to sleep till a regular balancing give it some new tasks. That
>> causes the performance dropping. cause more idle entering.
> 
> Sounds like for AIM (and possibly for other really bursty loads), we
> might want to do some load-balancing at wakeup time by *just* looking
> at the number of running tasks, rather than at the load average. Hmm?

Many thanks for your suggestions! :)

It's worth trying to use the instant load -- nr_running -- in wake
balancing, and I will try this. But in this case, I tried to print the
sleeping tasks with print_task() in sched/debug.c, and found that the
2000 tasks were forked on just 2 LCPUs, which are in different cpu
sockets, both with and without this load avg patch.

So, I am wondering if it's worth considering the sleeping tasks' load in
fork/wake balancing. Has anyone considered this before?

===
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
 {
        if (rq->curr == p)
                SEQ_printf(m, "R");
+       else if (!p->on_rq)
+               SEQ_printf(m, "S");
        else
                SEQ_printf(m, " ");
...
@@ -166,13 +170,14 @@ static void print_rq(struct seq_file *m, struct rq
*rq, int rq_cpu)
        read_lock_irqsave(&tasklist_lock, flags);

        do_each_thread(g, p) {
-               if (!p->on_rq || task_cpu(p) != rq_cpu)
+               if (task_cpu(p) != rq_cpu)
                        continue;
===

> 
> The load average is fundamentally always going to run behind a bit,
> and while you want to use it for long-term balancing, a short-term you
> might want to do just a "if we have a huge amount of runnable
> processes, do a load balancing *now*". Where "huge amount" should
> probably be relative to the long-term load balancing (ie comparing the
> number of runnable processes on this CPU right *now* with the load
> average over the last second or so would show a clear spike, and a
> reason for quick action).

Many thanks for the suggestion!
Will try it. :)
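
A very rough sketch of such a spike check, purely for illustration (the
2x threshold and the variable names are invented, nothing here is from
the patch set):

#include <stdio.h>

int main(void)
{
	unsigned int nr_running = 16;	/* runnable on this cpu right now */
	unsigned int avg_running = 3;	/* longer-term average, same scale */

	/* a clear spike of runnable tasks over the long-term picture */
	if (nr_running > 2 * avg_running + 1)
		printf("burst detected, balance now\n");
	else
		printf("no spike, leave it to the regular balancer\n");
	return 0;
}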
> 
>            Linus
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling
  2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
                   ` (21 preceding siblings ...)
  2013-01-05  8:37 ` [PATCH v3 22/22] sched: lazy powersaving balance Alex Shi
@ 2013-01-09 17:16 ` Morten Rasmussen
  2013-01-10  3:49   ` Alex Shi
  22 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-09 17:16 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

Hi Alex,

On Sat, Jan 05, 2013 at 08:37:29AM +0000, Alex Shi wrote:
> The patch set base on Linus tree, includes 3 parts,
> 1, bug fix and fork/wake balancing clean up. patch 1~6,
> the first patch remove one domain level. patch 2~6 simplified fork/wake
> balancing, it can increase 10+% hackbench performance on our 4 sockets
> SNB EP machine.
> 
> V3 change:
> a, added the first patch to remove one domain level on x86 platform.
> b, some small changes according to Namhyung Kim's comments, thanks!
> 
> 2, bug fix for load average and implement it into LB, patch 7~12,
> That using load average in load balancing, with a initial runnable load
> value bug fix.
> 
> V3 change:
> a, use rq->cfs.runnable_load_avg as cpu load not
> rq->avg.load_avg_contrib, since the latter need much time to accumulate
> for new forked task,
> b, a build issue fixed with Namhyung Kim's reminder.
> 
> 3, power awareness scheduling, patch 13~22,
> The subset implement my previous power aware scheduling proposal:
> https://lkml.org/lkml/2012/8/13/139
> It defines 2 new power aware policy balance and powersaving, and then
> try to spread or pack tasks on each of sched group level according the
> different scheduler policy. That can save much power when task number in
> system is no more then LCPU number.

Interesting stuff. I have read through your patches, but it is still not
clear to me what metrics you use to determine whether a sched group is
fully utilized or if it can be used for packing more tasks. Is it based on
nr_running or PJT's tracked load or both? How is the threshold defined?

Best regards,
Morten

> 
> V3 change:
> a, engaged nr_running in max potential utils consideration in periodic
> power balancing.
> b, try exec/wake small tasks on running cpu not idle cpu.
> 
> Thanks comments on previous version. and Any more comments are appreciated!
> 
> -- Thanks Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 04/22] sched: don't need go to smaller sched domain
  2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
@ 2013-01-09 17:38   ` Morten Rasmussen
  2013-01-10  3:16     ` Mike Galbraith
  2013-01-11  5:02   ` Preeti U Murthy
  1 sibling, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-09 17:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:33AM +0000, Alex Shi wrote:
> If parent sched domain has no task allowed cpu find. neither find in
> it's child. So, go out to save useless checking.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3c7b09a..ecfbf8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  			load_idx = sd->wake_idx;
>  
>  		group = find_idlest_group(sd, p, cpu, load_idx);

The previous patch changed the behavior of find_idlest_group() to
returning the local group if it is suitable. This effectively means that
you remove the recursive search for a suitable idle sched group. You
could as well merge find_idlest_group() and find_idlest_cpu() to avoid
iterating through the cpus of the same sched group twice.

Morten

> -		if (!group) {
> -			sd = sd->child;
> -			continue;
> -		}
> +		if (!group)
> +			goto unlock;
>  
>  		new_cpu = find_idlest_cpu(group, p, cpu);
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-05  8:37 ` [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake Alex Shi
@ 2013-01-09 18:21   ` Morten Rasmussen
  2013-01-11  2:46     ` Alex Shi
  2013-01-11  4:56     ` Preeti U Murthy
  0 siblings, 2 replies; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-09 18:21 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> Guess the search cpu from bottom to up in domain tree come from
> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
> balancing over tasks on all level domains.
> 
> This balancing cost much if there has many domain/groups in a large
> system. And force spreading task among different domains may cause
> performance issue due to bad locality.
> 
> If we remove this code, we will get quick fork/exec/wake, plus better
> balancing among whole system, that also reduce migrations in future
> load balancing.
> 
> This patch increases 10+% performance of hackbench on my 4 sockets
> NHM and SNB machines.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 20 +-------------------
>  1 file changed, 1 insertion(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ecfbf8e..895a3f4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		goto unlock;
>  	}
>  
> -	while (sd) {
> +	if (sd) {
>  		int load_idx = sd->forkexec_idx;
>  		struct sched_group *group;
> -		int weight;
> -
> -		if (!(sd->flags & sd_flag)) {
> -			sd = sd->child;
> -			continue;
> -		}
>  
>  		if (sd_flag & SD_BALANCE_WAKE)
>  			load_idx = sd->wake_idx;
> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  			goto unlock;
>  
>  		new_cpu = find_idlest_cpu(group, p, cpu);
> -
> -		/* Now try balancing at a lower domain level of new_cpu */
> -		cpu = new_cpu;
> -		weight = sd->span_weight;
> -		sd = NULL;
> -		for_each_domain(cpu, tmp) {
> -			if (weight <= tmp->span_weight)
> -				break;
> -			if (tmp->flags & sd_flag)
> -				sd = tmp;
> -		}
> -		/* while loop will break here if sd == NULL */

I agree that this should be a major optimization. I just can't figure
out why the existing recursive search for an idle cpu switches to the
new cpu near the end and then starts a search for an idle cpu in the new
cpu's domain. Is this to handle some exotic sched domain configurations?
If so, they probably wouldn't work with your optimizations.

Morten

>  	}
>  unlock:
>  	rcu_read_unlock();
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 04/22] sched: don't need go to smaller sched domain
  2013-01-09 17:38   ` Morten Rasmussen
@ 2013-01-10  3:16     ` Mike Galbraith
  0 siblings, 0 replies; 91+ messages in thread
From: Mike Galbraith @ 2013-01-10  3:16 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	vincent.guittot, gregkh, preeti, linux-kernel

On Wed, 2013-01-09 at 17:38 +0000, Morten Rasmussen wrote: 
> On Sat, Jan 05, 2013 at 08:37:33AM +0000, Alex Shi wrote:
> > If parent sched domain has no task allowed cpu find. neither find in
> > it's child. So, go out to save useless checking.
> > 
> > Signed-off-by: Alex Shi <alex.shi@intel.com>
> > ---
> >  kernel/sched/fair.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3c7b09a..ecfbf8e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >  			load_idx = sd->wake_idx;
> >  
> >  		group = find_idlest_group(sd, p, cpu, load_idx);
> 
> The previous patch changed the behavior of find_idlest_group() to
> returning the local group if it is suitable. This effectively means that
> you remove the recursive search for a suitable idle sched group. You
> could as well merge find_idlest_group() and find_idlest_cpu() to avoid
> iterating through the cpus of the same sched group twice.

find_idlest_* could stop when seeing 0 too, can't get much more idle.

-Mike


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling
  2013-01-09 17:16 ` [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Morten Rasmussen
@ 2013-01-10  3:49   ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-10  3:49 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

>>
>> 3, power awareness scheduling, patch 13~22,
>> The subset implement my previous power aware scheduling proposal:
>> https://lkml.org/lkml/2012/8/13/139
>> It defines 2 new power aware policy balance and powersaving, and then
>> try to spread or pack tasks on each of sched group level according the
>> different scheduler policy. That can save much power when task number in
>> system is no more then LCPU number.
> 
> Interesting stuff. I have read through your patches, but it is still not
> clear to me what metrics you use to determine whether a sched group is
> fully utilized or if it can be used for packing more tasks. Is it based on
> nr_running or PJT's tracked load or both? How is the threshold defined?

Thanks for the review, Morten!

cpu utilisation = rq->util * (rq->nr_running ? rq->nr_running : 1),
where rq->util = running time / whole period.

If nr_running == 2 and util == 99%, the potential max 'utilisation' is
99 * 2 = 198, because both tasks have the possibility to run full time.

group utils = sum of every cpu's utilisation.
Take a 2 LCPU group where cpu A has nr_running 0 and cpu B has util 99%
with 3 tasks: the group utils = A's util + 99 * 3, which is bigger than
the threshold = 99% * 2.

The above calculation is biased toward performance, and that is our
purpose.
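
To put the example into code, a small user-space sketch loosely
mirroring update_sd_lb_power_stats() in patch 21, using the powersaving
threshold (group_weight * FULL_UTIL) and the numbers from the 2 LCPU
example above:

#include <stdio.h>

#define FULL_UTIL 99

int main(void)
{
	/* max potential utilisation per cpu: util * max(nr_running, 1) */
	unsigned int util_a = 0 * 1;	/* idle cpu still counts as one slot */
	unsigned int util_b = 99 * 3;	/* 99% busy cpu with 3 runnable tasks */

	unsigned int group_utils = util_a + util_b;	/* 297 */
	unsigned int threshold_util = 2 * FULL_UTIL;	/* 2 LCPUs -> 198 */

	if (group_utils > threshold_util)
		printf("group overloaded (%u > %u): do performance balance\n",
		       group_utils, threshold_util);
	else
		printf("group has room (%u <= %u): candidate for packing\n",
		       group_utils, threshold_util);
	return 0;
}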


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 11/22] sched: consider runnable load average in effective_load
  2013-01-05  8:37 ` [PATCH v3 11/22] sched: consider runnable load average in effective_load Alex Shi
@ 2013-01-10 11:28   ` Morten Rasmussen
  2013-01-11  3:26     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-10 11:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:40AM +0000, Alex Shi wrote:
> effective_load calculates the load change as seen from the
> root_task_group. It needs to multiple cfs_rq's tg_runnable_contrib
> when we turn to runnable load average balance.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cab62aa..247d6a8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2982,7 +2982,8 @@ static void task_waking_fair(struct task_struct *p)
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  /*
> - * effective_load() calculates the load change as seen from the root_task_group
> + * effective_load() calculates the runnable load average change as seen from
> + * the root_task_group
>   *
>   * Adding load to a group doesn't make a group heavier, but can cause movement
>   * of group shares between cpus. Assuming the shares were perfectly aligned one
> @@ -3030,13 +3031,17 @@ static void task_waking_fair(struct task_struct *p)
>   * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
>   * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
>   * 4/7) times the weight of the group.
> + *
> + * After get effective_load of the load moving, will multiple the cpu own
> + * cfs_rq's runnable contrib of root_task_group.
>   */
>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>  {
>  	struct sched_entity *se = tg->se[cpu];
>  
>  	if (!tg->parent)	/* the trivial, non-cgroup case */
> -		return wl;
> +		return wl * tg->cfs_rq[cpu]->tg_runnable_contrib
> +						>> NICE_0_SHIFT;

Why do we need to scale the load of the task (wl) by runnable_contrib
when the task is in the root task group? Wouldn't the load change still
just be wl?

>  
>  	for_each_sched_entity(se) {
>  		long w, W;
> @@ -3084,7 +3089,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>  		wg = 0;
>  	}
>  
> -	return wl;
> +	return wl * tg->cfs_rq[cpu]->tg_runnable_contrib >> NICE_0_SHIFT;

I believe that effective_load() is only used in wake_affine() to compare
load scenarios of the same task group. Since the task group is the same
the effective load is scaled by the same factor and should not make any
difference?

Also, in wake_affine() the result of effective_load() is added with
target_load() which is load.weight of the cpu and not a tracked load
based on runnable_avg_*/contrib?

Finally, you have not scaled the result of effective_load() in the
function used when FAIR_GROUP_SCHED is disabled. Should that be scaled
too?

Morten

>  }
>  #else
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 15/22] sched: log the cpu utilization at rq
  2013-01-05  8:37 ` [PATCH v3 15/22] sched: log the cpu utilization at rq Alex Shi
@ 2013-01-10 11:40   ` Morten Rasmussen
  2013-01-11  3:30     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-10 11:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:44AM +0000, Alex Shi wrote:
> The cpu's utilization is to measure how busy is the cpu.
>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>                 / cpu_rq(cpu)->avg.runnable_avg_period;
> 
> Since the util is no more than 1, we use its percentage value in later
> caculations. And set the the FULL_UTIL as 99%.
> 
> In later power aware scheduling, we are sensitive for how busy of the
> cpu, not how weight of its load. As to power consuming, it is more
> related with busy time, not the load weight.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/debug.c | 1 +
>  kernel/sched/fair.c  | 4 ++++
>  kernel/sched/sched.h | 4 ++++
>  3 files changed, 9 insertions(+)
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 2cd3c1b..e4035f7 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -318,6 +318,7 @@ do {									\
>  
>  	P(ttwu_count);
>  	P(ttwu_local);
> +	P(util);
>  
>  #undef P
>  #undef P64
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ee015b8..7bfbd69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>  
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> +	u32 period;
>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;

The existing tg->runnable_avg and cfs_rq->tg_runnable_contrib variables
both hold
rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period scaled by
NICE_0_LOAD (1024). Why not use one of the existing variables instead of
introducing a new one?

Morten

>  }
>  
>  /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 66b08a1..3c6e803 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
>  
>  #endif /* CONFIG_SMP */
>  
> +/* Take as full load, if the cpu percentage util is up to 99 */
> +#define FULL_UTIL	99
> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -481,6 +484,7 @@ struct rq {
>  #endif
>  
>  	struct sched_avg avg;
> +	unsigned int util;
>  };
>  
>  static inline int cpu_of(struct rq *rq)
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-01-10 15:01   ` Morten Rasmussen
  2013-01-11  7:08     ` Alex Shi
  2013-01-14  7:03   ` Namhyung Kim
  1 sibling, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-10 15:01 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:45AM +0000, Alex Shi wrote:
> This patch add power aware scheduling in fork/exec/wake. It try to
> select cpu from the busiest while still has utilization group. That's
> will save power for other groups.
> 
> The trade off is adding a power aware statistics collection in group
> seeking. But since the collection just happened in power scheduling
> eligible condition, the worst case of hackbench testing just drops
> about 2% with powersaving/balance policy. No clear change for
> performance policy.
> 
> I had tried to use rq load avg utilisation in this balancing, but since
> the utilisation need much time to accumulate itself. It's unfit for any
> burst balancing. So I use nr_running as instant rq utilisation.

So you effectively use a mix of nr_running (counting tasks) and PJT's
tracked load for balancing?

The problem of the slow reaction time of the tracked load of a cpu/rq is an
interesting one. Would it be possible to use it if you maintained a
sched group runnable_load_avg similar to cfs_rq->runnable_load_avg, where
the load contribution of a task is added when the task is enqueued and
removed again if it migrates to another cpu?
This way you would know the new load of the sched group/domain instantly
when you migrate a task there. It might not be precise, as the load
contribution of the task to some extent depends on the load of the cpu
where it is running. But it would probably be a fair estimate, which is
quite likely to be better than just counting tasks (nr_running).
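
A simpler on-demand variant of the same idea could look roughly like
this (untested sketch; the helper name group_runnable_load() is made up,
and it just sums the existing per-cpu runnable_load_avg at selection
time instead of maintaining a per-group counter):

/*
 * Sketch only: cfs_rq->runnable_load_avg already has each task's
 * load_avg_contrib added at enqueue and removed at dequeue, so this
 * sum reacts instantly to task placement even though the per-task
 * signal itself ramps up slowly.
 */
static u64 group_runnable_load(struct sched_group *group)
{
	u64 load = 0;
	int i;

	for_each_cpu(i, sched_group_cpus(group))
		load += cpu_rq(i)->cfs.runnable_load_avg;

	return load;
}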

> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 230 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 179 insertions(+), 51 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7bfbd69..8d0d3af 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3323,25 +3323,189 @@ done:
>  }
>  
>  /*
> - * sched_balance_self: balance the current task (running on cpu) in domains
> + * sd_lb_stats - Structure to store the statistics of a sched_domain
> + *		during load balancing.
> + */
> +struct sd_lb_stats {
> +	struct sched_group *busiest; /* Busiest group in this sd */
> +	struct sched_group *this;  /* Local group in this sd */
> +	unsigned long total_load;  /* Total load of all groups in sd */
> +	unsigned long total_pwr;   /*	Total power of all groups in sd */
> +	unsigned long avg_load;	   /* Average load across all groups in sd */
> +
> +	/** Statistics of this group */
> +	unsigned long this_load;
> +	unsigned long this_load_per_task;
> +	unsigned long this_nr_running;
> +	unsigned int  this_has_capacity;
> +	unsigned int  this_idle_cpus;
> +
> +	/* Statistics of the busiest group */
> +	unsigned int  busiest_idle_cpus;
> +	unsigned long max_load;
> +	unsigned long busiest_load_per_task;
> +	unsigned long busiest_nr_running;
> +	unsigned long busiest_group_capacity;
> +	unsigned int  busiest_has_capacity;
> +	unsigned int  busiest_group_weight;
> +
> +	int group_imb; /* Is there imbalance in this sd */
> +
> +	/* Varibles of power awaring scheduling */
> +	unsigned int  sd_utils;	/* sum utilizations of this domain */
> +	unsigned long sd_capacity;	/* capacity of this domain */
> +	struct sched_group *group_leader; /* Group which relieves group_min */
> +	unsigned long min_load_per_task; /* load_per_task in group_min */
> +	unsigned int  leader_util;	/* sum utilizations of group_leader */
> +	unsigned int  min_util;		/* sum utilizations of group_min */
> +};
> +
> +/*
> + * sg_lb_stats - stats of a sched_group required for load_balancing
> + */
> +struct sg_lb_stats {
> +	unsigned long avg_load; /*Avg load across the CPUs of the group */
> +	unsigned long group_load; /* Total load over the CPUs of the group */
> +	unsigned long sum_nr_running; /* Nr tasks running in the group */
> +	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
> +	unsigned long group_capacity;
> +	unsigned long idle_cpus;
> +	unsigned long group_weight;
> +	int group_imb; /* Is there an imbalance in the group ? */
> +	int group_has_capacity; /* Is there extra capacity in the group? */
> +	unsigned int group_utils;	/* sum utilizations of group */
> +
> +	unsigned long sum_shared_running;	/* 0 on non-NUMA */
> +};
> +
> +static inline int
> +fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
> +
> +/*
> + * Try to collect the task running number and capacity of the group.
> + */
> +static void get_sg_power_stats(struct sched_group *group,
> +	struct sched_domain *sd, struct sg_lb_stats *sgs)
> +{
> +	int i;
> +
> +	for_each_cpu(i, sched_group_cpus(group)) {
> +		struct rq *rq = cpu_rq(i);
> +
> +		sgs->group_utils += rq->nr_running;

The utilization of the sched group is the number of tasks active on the
runqueues of the cpus in the group.

> +	}
> +
> +	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> +						SCHED_POWER_SCALE);
> +	if (!sgs->group_capacity)
> +		sgs->group_capacity = fix_small_capacity(sd, group);
> +	sgs->group_weight = group->group_weight;

If the cpus in the sched group have the default cpu_power, then group_capacity =
group_weight = the number of cpus in the group. The cpu_power of each cpu
needs to be significantly higher or lower than the default to make
group_capacity different from group_weight, or you need many cpus in
the sched group.

> +}
> +
> +/*
> + * Try to collect the task running number and capacity of the doamin.
> + */
> +static void get_sd_power_stats(struct sched_domain *sd,
> +		struct task_struct *p, struct sd_lb_stats *sds)
> +{
> +	struct sched_group *group;
> +	struct sg_lb_stats sgs;
> +	int sd_min_delta = INT_MAX;
> +	int cpu = task_cpu(p);
> +
> +	group = sd->groups;
> +	do {
> +		long g_delta;
> +		unsigned long threshold;
> +
> +		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
> +			continue;
> +
> +		memset(&sgs, 0, sizeof(sgs));
> +		get_sg_power_stats(group, sd, &sgs);
> +
> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
> +			threshold = sgs.group_weight;
> +		else
> +			threshold = sgs.group_capacity;

Is group_capacity larger or smaller than group_weight on your platform?

> +
> +		g_delta = threshold - sgs.group_utils;
> +
> +		if (g_delta > 0 && g_delta < sd_min_delta) {
> +			sd_min_delta = g_delta;
> +			sds->group_leader = group;

If I understand correctly, you pack tasks onto the sched group with the least
spare capacity? Capacity in this context means a low task count, not
actual spare cpu time.

Morten

> +		}
> +
> +		sds->sd_utils += sgs.group_utils;
> +		sds->total_pwr += group->sgp->power;
> +	} while  (group = group->next, group != sd->groups);
> +
> +	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
> +						SCHED_POWER_SCALE);
> +}
> +
> +/*
> + * Execute power policy if this domain is not full.
> + */
> +static inline int get_sd_sched_policy(struct sched_domain *sd,
> +	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
> +{
> +	unsigned long threshold;
> +
> +	if (sched_policy == SCHED_POLICY_PERFORMANCE)
> +		return SCHED_POLICY_PERFORMANCE;
> +
> +	memset(sds, 0, sizeof(*sds));
> +	get_sd_power_stats(sd, p, sds);
> +
> +	if (sched_policy == SCHED_POLICY_POWERSAVING)
> +		threshold = sd->span_weight;
> +	else
> +		threshold = sds->sd_capacity;
> +
> +	/* still can hold one more task in this domain */
> +	if (sds->sd_utils < threshold)
> +		return sched_policy;
> +
> +	return SCHED_POLICY_PERFORMANCE;
> +}
> +
> +/*
> + * If power policy is eligible for this domain, and it has task allowed cpu.
> + * we will select CPU from this domain.
> + */
> +static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> +		struct task_struct *p, struct sd_lb_stats *sds)
> +{
> +	int policy;
> +	int new_cpu = -1;
> +
> +	policy = get_sd_sched_policy(sd, cpu, p, sds);
> +	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> +		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> +
> +	return new_cpu;
> +}
> +
> +/*
> + * select_task_rq_fair: balance the current task (running on cpu) in domains
>   * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
>   * SD_BALANCE_EXEC.
>   *
> - * Balance, ie. select the least loaded group.
> - *
>   * Returns the target CPU number, or the same CPU if no balancing is needed.
>   *
>   * preempt must be disabled.
>   */
>  static int
> -select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> +select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>  	int cpu = smp_processor_id();
>  	int prev_cpu = task_cpu(p);
>  	int new_cpu = cpu;
>  	int want_affine = 0;
> -	int sync = wake_flags & WF_SYNC;
> +	int sync = flags & WF_SYNC;
> +	struct sd_lb_stats sds;
>  
>  	if (p->nr_cpus_allowed == 1)
>  		return prev_cpu;
> @@ -3367,11 +3531,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  			break;
>  		}
>  
> -		if (tmp->flags & sd_flag)
> +		if (tmp->flags & sd_flag) {
>  			sd = tmp;
> +
> +			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> +			if (new_cpu != -1)
> +				goto unlock;
> +		}
>  	}
>  
>  	if (affine_sd) {
> +		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> +		if (new_cpu != -1)
> +			goto unlock;
> +
>  		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>  			prev_cpu = cpu;
>  
> @@ -4181,51 +4354,6 @@ static unsigned long task_h_load(struct task_struct *p)
>  #endif
>  
>  /********** Helpers for find_busiest_group ************************/
> -/*
> - * sd_lb_stats - Structure to store the statistics of a sched_domain
> - * 		during load balancing.
> - */
> -struct sd_lb_stats {
> -	struct sched_group *busiest; /* Busiest group in this sd */
> -	struct sched_group *this;  /* Local group in this sd */
> -	unsigned long total_load;  /* Total load of all groups in sd */
> -	unsigned long total_pwr;   /*	Total power of all groups in sd */
> -	unsigned long avg_load;	   /* Average load across all groups in sd */
> -
> -	/** Statistics of this group */
> -	unsigned long this_load;
> -	unsigned long this_load_per_task;
> -	unsigned long this_nr_running;
> -	unsigned long this_has_capacity;
> -	unsigned int  this_idle_cpus;
> -
> -	/* Statistics of the busiest group */
> -	unsigned int  busiest_idle_cpus;
> -	unsigned long max_load;
> -	unsigned long busiest_load_per_task;
> -	unsigned long busiest_nr_running;
> -	unsigned long busiest_group_capacity;
> -	unsigned long busiest_has_capacity;
> -	unsigned int  busiest_group_weight;
> -
> -	int group_imb; /* Is there imbalance in this sd */
> -};
> -
> -/*
> - * sg_lb_stats - stats of a sched_group required for load_balancing
> - */
> -struct sg_lb_stats {
> -	unsigned long avg_load; /*Avg load across the CPUs of the group */
> -	unsigned long group_load; /* Total load over the CPUs of the group */
> -	unsigned long sum_nr_running; /* Nr tasks running in the group */
> -	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
> -	unsigned long group_capacity;
> -	unsigned long idle_cpus;
> -	unsigned long group_weight;
> -	int group_imb; /* Is there an imbalance in the group ? */
> -	int group_has_capacity; /* Is there extra capacity in the group? */
> -};
> -
>  /**
>   * get_sd_load_idx - Obtain the load index for a given sched domain.
>   * @sd: The sched_domain whose load_idx is to be obtained.
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-05  8:37 ` [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing Alex Shi
@ 2013-01-10 17:17   ` Morten Rasmussen
  2013-01-11  3:47     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-10 17:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat, Jan 05, 2013 at 08:37:46AM +0000, Alex Shi wrote:
> If the wake/exec task is small enough, utils < 12.5%, it will
> has the chance to be packed into a cpu which is busy but still has space to
> handle it.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 45 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8d0d3af..0596e81 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
>  }
>  
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> +{
> +	unsigned vacancy, min_vacancy = UINT_MAX;

unsigned int?

> +	int idlest = -1;
> +	int i;
> +	/* percentage the task's util */
> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
> +				/ (p->se.avg.runnable_avg_period + 1);

Alternatively you could use se.avg.load_avg_contrib which is the same
ratio scaled by the task priority (se->load.weight). In the above
expression you don't take priority into account.

> +
> +	/* Traverse only the allowed CPUs */
> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> +		struct rq *rq = cpu_rq(i);
> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> +		/* only pack task which putil < 12.5% */
> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);

I can't follow this expression.

The variables can have the following values:
FULL_UTIL  = 99
rq->util   = [0..99]
nr_running = [1..inf]
putil      = [0..99]

Why multiply rq->util by nr_running?

Let's take an example where rq->util = 50, nr_running = 2, and putil =
10. In this case the value of putil doesn't really matter as vacancy
would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
However, with rq->util = 50 there should be plenty of spare cpu time to
take another task.

Also, why multiply putil by 8? rq->util must be very close to 0 for
vacancy to be positive if putil is close to 12 (12.5%).

The vacancy variable is declared unsigned, so it will underflow instead
of becoming negative. Is this intentional?

I may be missing something, but could the expression be something like
the below instead?

Create a putil < 12.5% check before the loop. There is no reason to
recheck it every iteration. Then:

vacancy = FULL_UTIL - (rq->util + putil)

should be enough?
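
Roughly something like the below (an untested sketch that reuses
FULL_UTIL, rq->util and the putil calculation from your patch):

/*
 * Sketch only: check the ~12.5% threshold once before the loop and
 * compute the vacancy as a signed value from the cpu utilization plus
 * the task utilization alone.
 */
static int
find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
	int vacancy, min_vacancy = INT_MAX;
	int leader = -1;
	int i;
	int putil = p->se.avg.runnable_avg_sum * 100
			/ (p->se.avg.runnable_avg_period + 1);

	/* only try to pack genuinely small tasks */
	if (putil > FULL_UTIL / 8)
		return -1;

	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
		struct rq *rq = cpu_rq(i);

		vacancy = FULL_UTIL - ((int)rq->util + putil);
		if (vacancy <= 0)
			continue;

		/* bias toward the local cpu */
		if (i == this_cpu)
			return i;

		if (vacancy < min_vacancy) {
			min_vacancy = vacancy;
			leader = i;
		}
	}
	return leader;
}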

> +
> +		/* bias toward local cpu */
> +		if (vacancy > 0 && (i == this_cpu))
> +			return i;
> +
> +		if (vacancy > 0 && vacancy < min_vacancy) {
> +			min_vacancy = vacancy;
> +			idlest = i;

"idlest" may be a bit misleading here as you actually select busiest cpu
that have enough spare capacity to take the task.

Morten

> +		}
> +	}
> +	return idlest;
> +}
> +
> +/*
>   * If power policy is eligible for this domain, and it has task allowed cpu.
>   * we will select CPU from this domain.
>   */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> -		struct task_struct *p, struct sd_lb_stats *sds)
> +		struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>  	int policy;
>  	int new_cpu = -1;
>  
>  	policy = get_sd_sched_policy(sd, cpu, p, sds);
> -	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> -		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> +	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> +		if (!fork)
> +			new_cpu = find_leader_cpu(sds->group_leader, p, cpu);
> +		/* for fork balancing and a little busy task */
> +		if (new_cpu == -1)
> +			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> +	}
>  	return new_cpu;
>  }
>  
> @@ -3534,14 +3572,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
>  		if (tmp->flags & sd_flag) {
>  			sd = tmp;
>  
> -			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> +			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> +						flags & SD_BALANCE_FORK);
>  			if (new_cpu != -1)
>  				goto unlock;
>  		}
>  	}
>  
>  	if (affine_sd) {
> -		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> +		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>  		if (new_cpu != -1)
>  			goto unlock;
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-09 18:21   ` Morten Rasmussen
@ 2013-01-11  2:46     ` Alex Shi
  2013-01-11 10:07       ` Morten Rasmussen
  2013-01-11  4:56     ` Preeti U Murthy
  1 sibling, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11  2:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
>>  		new_cpu = find_idlest_cpu(group, p, cpu);
>> > -
>> > -		/* Now try balancing at a lower domain level of new_cpu */
>> > -		cpu = new_cpu;
>> > -		weight = sd->span_weight;
>> > -		sd = NULL;
>> > -		for_each_domain(cpu, tmp) {
>> > -			if (weight <= tmp->span_weight)
>> > -				break;
>> > -			if (tmp->flags & sd_flag)
>> > -				sd = tmp;
>> > -		}
>> > -		/* while loop will break here if sd == NULL */
> I agree that this should be a major optimization. I just can't figure
> out why the existing recursive search for an idle cpu switches to the
> new cpu near the end and then starts a search for an idle cpu in the new
> cpu's domain. Is this to handle some exotic sched domain configurations?
> If so, they probably wouldn't work with your optimizations.

I did not find an odd configuration that asks for the old logic.

According to Documentation/scheduler/sched-domains.txt, maybe never:
"A domain's span MUST be a superset of it child's span (this restriction
could be relaxed if the need arises), and a base domain for CPU i MUST
span at least i."  etc. etc.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 11/22] sched: consider runnable load average in effective_load
  2013-01-10 11:28   ` Morten Rasmussen
@ 2013-01-11  3:26     ` Alex Shi
  2013-01-14 12:01       ` Morten Rasmussen
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11  3:26 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/10/2013 07:28 PM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:40AM +0000, Alex Shi wrote:
>> effective_load calculates the load change as seen from the
>> root_task_group. It needs to multiple cfs_rq's tg_runnable_contrib
>> when we turn to runnable load average balance.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index cab62aa..247d6a8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2982,7 +2982,8 @@ static void task_waking_fair(struct task_struct *p)
>>  
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  /*
>> - * effective_load() calculates the load change as seen from the root_task_group
>> + * effective_load() calculates the runnable load average change as seen from
>> + * the root_task_group
>>   *
>>   * Adding load to a group doesn't make a group heavier, but can cause movement
>>   * of group shares between cpus. Assuming the shares were perfectly aligned one
>> @@ -3030,13 +3031,17 @@ static void task_waking_fair(struct task_struct *p)
>>   * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
>>   * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
>>   * 4/7) times the weight of the group.
>> + *
>> + * After get effective_load of the load moving, will multiple the cpu own
>> + * cfs_rq's runnable contrib of root_task_group.
>>   */
>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>  {
>>  	struct sched_entity *se = tg->se[cpu];
>>  
>>  	if (!tg->parent)	/* the trivial, non-cgroup case */
>> -		return wl;
>> +		return wl * tg->cfs_rq[cpu]->tg_runnable_contrib
>> +						>> NICE_0_SHIFT;
> 
> Why do we need to scale the load of the task (wl) by runnable_contrib
> when the task is in the root task group? Wouldn't the load change still
> just be wl?
> 

Here, wl is the load weight; tg_runnable_contrib brings in the runnable time,
so the weight is scaled by how busy the cpu's cfs_rq actually is.
>>  
>>  	for_each_sched_entity(se) {
>>  		long w, W;
>> @@ -3084,7 +3089,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>  		wg = 0;
>>  	}
>>  
>> -	return wl;
>> +	return wl * tg->cfs_rq[cpu]->tg_runnable_contrib >> NICE_0_SHIFT;
> 
> I believe that effective_load() is only used in wake_affine() to compare
> load scenarios of the same task group. Since the task group is the same
> the effective load is scaled by the same factor and should not make any
> difference?
> 
> Also, in wake_affine() the result of effective_load() is added with
> target_load() which is load.weight of the cpu and not a tracked load
> based on runnable_avg_*/contrib?
> 
> Finally, you have not scaled the result of effective_load() in the
> function used when FAIR_GROUP_SCHED is disabled. Should that be scaled
> too?

It should be, thanks for the reminder.

The wakeup path is not good for the burst-wakeup benchmark. I am thinking of
rewriting this part.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 15/22] sched: log the cpu utilization at rq
  2013-01-10 11:40   ` Morten Rasmussen
@ 2013-01-11  3:30     ` Alex Shi
  2013-01-14 13:59       ` Morten Rasmussen
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11  3:30 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/10/2013 07:40 PM, Morten Rasmussen wrote:
>> >  #undef P64
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index ee015b8..7bfbd69 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>> >  
>> >  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>> >  {
>> > +	u32 period;
>> >  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>> >  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
>> > +
>> > +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
>> > +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
> The existing tg->runnable_avg and cfs_rq->tg_runnable_contrib variables
> both holds
> rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period scaled by
> NICE_0_LOAD (1024). Why not use one of the existing variables instead of
> introducing a new one?

We want a rq variable that reflects the utilization of the cpu, not of
the tg.
-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-10 17:17   ` Morten Rasmussen
@ 2013-01-11  3:47     ` Alex Shi
  2013-01-14  7:13       ` Namhyung Kim
  2013-01-14 17:00       ` Morten Rasmussen
  0 siblings, 2 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-11  3:47 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:46AM +0000, Alex Shi wrote:
>> If the wake/exec task is small enough, utils < 12.5%, it will
>> has the chance to be packed into a cpu which is busy but still has space to
>> handle it.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 45 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d0d3af..0596e81 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
>>  }
>>  
>>  /*
>> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
>> + * among the cpus in group.
>> + */
>> +static int
>> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> +{
>> +	unsigned vacancy, min_vacancy = UINT_MAX;
> 
> unsigned int?

yes
> 
>> +	int idlest = -1;
>> +	int i;
>> +	/* percentage the task's util */
>> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
>> +				/ (p->se.avg.runnable_avg_period + 1);
> 
> Alternatively you could use se.avg.load_avg_contrib which is the same
> ratio scaled by the task priority (se->load.weight). In the above
> expression you don't take priority into account.

Sure, but this seems more directly meaningful.
> 
>> +
>> +	/* Traverse only the allowed CPUs */
>> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>> +		struct rq *rq = cpu_rq(i);
>> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
>> +
>> +		/* only pack task which putil < 12.5% */
>> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> 
> I can't follow this expression.
> 
> The variables can have the following values:
> FULL_UTIL  = 99
> rq->util   = [0..99]
> nr_running = [1..inf]
> putil      = [0..99]
> 
> Why multiply rq->util by nr_running?
> 
> Let's take an example where rq->util = 50, nr_running = 2, and putil =
> 10. In this case the value of putil doesn't really matter as vacancy
> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> However, with rq->util = 50 there should be plenty of spare cpu time to
> take another task.

For this example, the util is not full, maybe because it just woke up;
it may still end up running full time. So I try to give it a
large guessed load.
> 
> Also, why multiply putil by 8? rq->util must be very close to 0 for
> vacancy to be positive if putil is close to 12 (12.5%).

I just want to pack small-util tasks, since packing can possibly hurt
performance.
> 
> The vacancy variable is declared unsigned, so it will underflow instead
> of becoming negative. Is this intentional?

Oops, my mistake.
> 
> I may be missing something, but could the expression be something like
> the below instead?
> 
> Create a putil < 12.5% check before the loop. There is no reason to
> recheck it every iteration. Then:
> 
> vacancy = FULL_UTIL - (rq->util + putil)
> 
> should be enough?
> 
>> +
>> +		/* bias toward local cpu */
>> +		if (vacancy > 0 && (i == this_cpu))
>> +			return i;
>> +
>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>> +			min_vacancy = vacancy;
>> +			idlest = i;
> 
> "idlest" may be a bit misleading here as you actually select busiest cpu
> that have enough spare capacity to take the task.

Um, change to leader_cpu?
> 
> Morten
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-09 18:21   ` Morten Rasmussen
  2013-01-11  2:46     ` Alex Shi
@ 2013-01-11  4:56     ` Preeti U Murthy
  2013-01-11  8:01       ` li guang
                         ` (2 more replies)
  1 sibling, 3 replies; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-11  4:56 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, linux-kernel

Hi Morten,Alex

On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
>> Guess the search cpu from bottom to up in domain tree come from
>> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
>> balancing over tasks on all level domains.
>>
>> This balancing cost much if there has many domain/groups in a large
>> system. And force spreading task among different domains may cause
>> performance issue due to bad locality.
>>
>> If we remove this code, we will get quick fork/exec/wake, plus better
>> balancing among whole system, that also reduce migrations in future
>> load balancing.
>>
>> This patch increases 10+% performance of hackbench on my 4 sockets
>> NHM and SNB machines.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/fair.c | 20 +-------------------
>>  1 file changed, 1 insertion(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ecfbf8e..895a3f4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>  		goto unlock;
>>  	}
>>  
>> -	while (sd) {
>> +	if (sd) {
>>  		int load_idx = sd->forkexec_idx;
>>  		struct sched_group *group;
>> -		int weight;
>> -
>> -		if (!(sd->flags & sd_flag)) {
>> -			sd = sd->child;
>> -			continue;
>> -		}
>>  
>>  		if (sd_flag & SD_BALANCE_WAKE)
>>  			load_idx = sd->wake_idx;
>> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>  			goto unlock;
>>  
>>  		new_cpu = find_idlest_cpu(group, p, cpu);
>> -
>> -		/* Now try balancing at a lower domain level of new_cpu */
>> -		cpu = new_cpu;
>> -		weight = sd->span_weight;
>> -		sd = NULL;
>> -		for_each_domain(cpu, tmp) {
>> -			if (weight <= tmp->span_weight)
>> -				break;
>> -			if (tmp->flags & sd_flag)
>> -				sd = tmp;
>> -		}
>> -		/* while loop will break here if sd == NULL */
> 
> I agree that this should be a major optimization. I just can't figure
> out why the existing recursive search for an idle cpu switches to the
> new cpu near the end and then starts a search for an idle cpu in the new
> cpu's domain. Is this to handle some exotic sched domain configurations?
> If so, they probably wouldn't work with your optimizations.

Let me explain my understanding of why the recursive search is the way
it is.

 _________________________  sd0
|                         |
|  ___sd1__   ___sd2__    |
| |        | |        |   |
| | sgx    | |  sga   |   |
| | sgy    | |  sgb   |   |
| |________| |________|   |
|_________________________|

What the current recursive search is doing is this (assuming we start with
sd0, the top level sched domain whose flags are rightly set): we find
that sd1 is the idlest group, and a cpux1 in sgx is the idlest cpu.

We could have ideally stopped the search here. But the problem with this
is that there is a possibility that sgx is more loaded than sgy, meaning
the cpus in sgx are heavily imbalanced; say there are two cpus cpux1 and
cpux2 in sgx, where cpux2 is heavily loaded and cpux1 has recently gotten
idle and load balancing has not come to its rescue yet. According to the
search above, cpux1 is idle, but it is *not the right candidate for
scheduling the forked task, it is the right candidate for relieving the load
from cpux2* due to cache locality etc.

Therefore in the next recursive search we go one step inside sd1, the
chosen idlest group candidate, which also happens to be the *next level
sched domain for cpux1, the chosen idle cpu*. It then perhaps returns sgy as
the idlest, if the situation there happens to be better than what I have
described for sgx, and an appropriate cpu there is chosen.

So in short, a bird's eye view of a large sched domain to choose the cpu
would be very short sighted; we could end up creating imbalances within
lower level sched domains. To avoid this the recursive search plays safe
and chooses the best idle group after viewing the large sched domain in
detail.

Therefore even I feel that this patch should be implemented only after
thorough tests.



> Morten

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 02/22] sched: select_task_rq_fair clean up
  2013-01-05  8:37 ` [PATCH v3 02/22] sched: select_task_rq_fair clean up Alex Shi
@ 2013-01-11  4:57   ` Preeti U Murthy
  0 siblings, 0 replies; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-11  4:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, linux-kernel

On 01/05/2013 02:07 PM, Alex Shi wrote:
> It is impossible to miss a task allowed cpu in a eligible group.
> 
> And since find_idlest_group only return a different group which
> excludes old cpu, it's also imporissible to find a new cpu same as old
> cpu.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5eea870..6d3a95d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		}
> 
>  		new_cpu = find_idlest_cpu(group, p, cpu);
> -		if (new_cpu == -1 || new_cpu == cpu) {
> -			/* Now try balancing at a lower domain level of cpu */
> -			sd = sd->child;
> -			continue;
> -		}
> 
>  		/* Now try balancing at a lower domain level of new_cpu */
>  		cpu = new_cpu;
> 
Reviewed-by:Preeti U Murthy


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 03/22] sched: fix find_idlest_group mess logical
  2013-01-05  8:37 ` [PATCH v3 03/22] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-01-11  4:59   ` Preeti U Murthy
  0 siblings, 0 replies; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-11  4:59 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, linux-kernel

On 01/05/2013 02:07 PM, Alex Shi wrote:
> There is 4 situations in the function:
> 1, no task allowed group;
> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
> 2, only local group task allowed;
> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
> 3, only non-local task group allowed;
> 	so min_load assigned, this_load = 0, idlest != NULL
> 4, local group + another group are task allowed.
> 	so min_load assigned, this_load assigned, idlest != NULL
> 
> Current logical will return NULL in first 3 kinds of scenarios.
> And still return NULL, if idlest group is heavier then the
> local group in the 4th situation.
> 
> Actually, I thought groups in situation 2,3 are also eligible to host
> the task. And in 4th situation, agree to bias toward local group.
> So, has this patch.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d3a95d..3c7b09a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  		  int this_cpu, int load_idx)
>  {
>  	struct sched_group *idlest = NULL, *group = sd->groups;
> +	struct sched_group *this_group = NULL;
>  	unsigned long min_load = ULONG_MAX, this_load = 0;
>  	int imbalance = 100 + (sd->imbalance_pct-100)/2;
> 
> @@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> 
>  		if (local_group) {
>  			this_load = avg_load;
> -		} else if (avg_load < min_load) {
> +			this_group = group;
> +		}
> +		if (avg_load < min_load) {
>  			min_load = avg_load;
>  			idlest = group;
>  		}
>  	} while (group = group->next, group != sd->groups);
> 
> -	if (!idlest || 100*this_load < imbalance*min_load)
> -		return NULL;
> +	if (this_group && idlest != this_group)
> +		/* Bias toward our group again */
> +		if (100*this_load < imbalance*min_load)
> +			idlest = this_group;
> +
>  	return idlest;
>  }
> 
Reviewed-by:Preeti U Murthy


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 04/22] sched: don't need go to smaller sched domain
  2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
  2013-01-09 17:38   ` Morten Rasmussen
@ 2013-01-11  5:02   ` Preeti U Murthy
  1 sibling, 0 replies; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-11  5:02 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, linux-kernel

On 01/05/2013 02:07 PM, Alex Shi wrote:
> If parent sched domain has no task allowed cpu find. neither find in
> it's child. So, go out to save useless checking.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3c7b09a..ecfbf8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  			load_idx = sd->wake_idx;
> 
>  		group = find_idlest_group(sd, p, cpu, load_idx);
> -		if (!group) {
> -			sd = sd->child;
> -			continue;
> -		}
> +		if (!group)
> +			goto unlock;
> 
>  		new_cpu = find_idlest_cpu(group, p, cpu);
> 
Reviewed-by:Preeti U Murthy


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 07/22] sched: set initial load avg of new forked task
  2013-01-05  8:37 ` [PATCH v3 07/22] sched: set initial load avg of new forked task Alex Shi
@ 2013-01-11  5:10   ` Preeti U Murthy
  2013-01-11  5:44     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Preeti U Murthy @ 2013-01-11  5:10 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, linux-kernel

On 01/05/2013 02:07 PM, Alex Shi wrote:
> New task has no runnable sum at its first runnable time, that make
> burst forking just select few idle cpus to put tasks.
> Set initial load avg of new forked task as its load weight to resolve
> this issue.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   |  2 +-
>  kernel/sched/fair.c   | 11 +++++++++--
>  3 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 206bb08..fb7aab5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1069,6 +1069,7 @@ struct sched_domain;
>  #else
>  #define ENQUEUE_WAKING		0
>  #endif
> +#define ENQUEUE_NEWTASK		8
> 
>  #define DEQUEUE_SLEEP		1
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 66c1718..66ce1f1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1705,7 +1705,7 @@ void wake_up_new_task(struct task_struct *p)
>  #endif
> 
>  	rq = __task_rq_lock(p);
> -	activate_task(rq, p, 0);
> +	activate_task(rq, p, ENQUEUE_NEWTASK);
>  	p->on_rq = 1;
>  	trace_sched_wakeup_new(p, true);
>  	check_preempt_curr(rq, p, WF_FORK);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 895a3f4..5c545e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  /* Add the load generated by se into cfs_rq's child load-average */
>  static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  						  struct sched_entity *se,
> -						  int wakeup)
> +						  int flags)
>  {
> +	int wakeup = flags & ENQUEUE_WAKEUP;
>  	/*
>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>  	 * migration we use a negative decay count to track the remote decays
> @@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  		update_entity_load_avg(se, 0);
>  	}
> 
> +	/*
> +	 * set the initial load avg of new task same as its load
> +	 * in order to avoid brust fork make few cpu too heavier
> +	 */
> +	if (flags & ENQUEUE_NEWTASK)
> +		se->avg.load_avg_contrib = se->load.weight;
>  	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
>  	/* we force update consideration on load-balancer moves */
>  	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
> @@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> -	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
> +	enqueue_entity_load_avg(cfs_rq, se, flags);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
> 
I had seen in my experiments that forked tasks with an initial load of
0 would adversely affect the runqueue lengths. Since the load of
these tasks takes some time to pick up, the cpus on which the forked
tasks are scheduled could be candidates for "dst_cpu" many times, and
the runqueue lengths increase considerably.

This patch solves this issue by making the forked tasks contribute
actively to the runqueue load.

Reviewed-by:Preeti U Murthy


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 07/22] sched: set initial load avg of new forked task
  2013-01-11  5:10   ` Preeti U Murthy
@ 2013-01-11  5:44     ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-11  5:44 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, linux-kernel

On 01/11/2013 01:10 PM, Preeti U Murthy wrote:
>> >  	update_curr(cfs_rq);
>> > -	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
>> > +	enqueue_entity_load_avg(cfs_rq, se, flags);
>> >  	account_entity_enqueue(cfs_rq, se);
>> >  	update_cfs_shares(cfs_rq);
>> > 
> I had seen in my experiments, that the forked tasks with initial load to
> be 0,would adversely affect the runqueue lengths.Since the load for
> these tasks to pick up takes some time,the cpus on which the forked
> tasks are scheduled, could be candidates for "dst_cpu" many times and
> the runqueue lengths increase considerably.
> 
> This patch solves this issue by making the forked tasks contribute
> actively to the runqueue load.
> 
> Reviewed-by:Preeti U Murthy
> 

Thanks for review, Preeti! :)


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-06 18:31       ` Linus Torvalds
  2013-01-07  7:00         ` Preeti U Murthy
  2013-01-08 14:27         ` Alex Shi
@ 2013-01-11  6:31         ` Alex Shi
  2013-01-21 14:47           ` Alex Shi
  2 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11  6:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Andrew Morton, Arjan van de Ven, Borislav Petkov, namhyung,
	Mike Galbraith, Vincent Guittot, Greg Kroah-Hartman, preeti,
	Linux Kernel Mailing List

On 01/07/2013 02:31 AM, Linus Torvalds wrote:
> On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>>
>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>> and run until all finished.
>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>> may goes to sleep till a regular balancing give it some new tasks. That
>> causes the performance dropping. cause more idle entering.
> 
> Sounds like for AIM (and possibly for other really bursty loads), we
> might want to do some load-balancing at wakeup time by *just* looking
> at the number of running tasks, rather than at the load average. Hmm?
> 
> The load average is fundamentally always going to run behind a bit,
> and while you want to use it for long-term balancing, a short-term you
> might want to do just a "if we have a huge amount of runnable
> processes, do a load balancing *now*". Where "huge amount" should
> probably be relative to the long-term load balancing (ie comparing the
> number of runnable processes on this CPU right *now* with the load
> average over the last second or so would show a clear spike, and a
> reason for quick action).
> 

Sorry for the late response!

I just wrote a patch following your suggestion, but there is no clear improvement for this case.
I also tried changing the burst checking interval, also with no clear help.

If I totally give up the runnable load in periodic balancing, the performance recovers 60%
of the loss.

I will try to optimize wake up balancing over the weekend.

Nice weekend!
Alex

---
>From 8f6f7317568a7bd8497e7a6e8d9afcc2b4e93a7e Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Wed, 9 Jan 2013 23:16:57 +0800
Subject: [PATCH] sched: use instant load weight in burst regular load balance

Runnable load tracking needs much time to accumulate the runnable
load, so when the system wakes up many sleeping tasks in a burst, it needs
more time to balance them well. This patch tries to catch such a scenario
and uses the instant load instead of the runnable load to do the balance.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/debug.c  |  1 +
 kernel/sched/fair.c   | 55 +++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sysctl.c       |  7 +++++++
 4 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0354a5..f6cf1b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2032,6 +2032,7 @@ extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
 extern unsigned int sysctl_sched_child_runs_first;
+extern unsigned int sysctl_sched_burst_check_ms;
 
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e4035f7..d06fc3c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -380,6 +380,7 @@ static int sched_debug_show(struct seq_file *m, void *v)
 	PN(sysctl_sched_latency);
 	PN(sysctl_sched_min_granularity);
 	PN(sysctl_sched_wakeup_granularity);
+	PN(sysctl_sched_burst_check_ms);
 	P(sysctl_sched_child_runs_first);
 	P(sysctl_sched_features);
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 604d0ee..875e7af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4032,6 +4032,7 @@ struct lb_env {
 	unsigned int		loop_max;
 	int			power_lb;  /* if power balance needed */
 	int			perf_lb;   /* if performance balance needed */
+	int			has_burst;
 };
 
 /*
@@ -4729,6 +4730,37 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 	return 0;
 }
 
+DEFINE_PER_CPU(unsigned long, next_check);
+DEFINE_PER_CPU(unsigned int, last_running);
+
+/* do burst check no less than this interval */
+unsigned int sysctl_sched_burst_check_ms = 1000UL;
+
+/**
+ * check_burst - check if tasks bursts up on this cpu.
+ * @env: The load balancing environment.
+ */
+static void check_burst(struct lb_env *env)
+{
+	int cpu;
+	unsigned int curr_running, prev_running, interval;
+
+	cpu = env->dst_cpu;
+	curr_running = cpu_rq(cpu)->nr_running;
+	prev_running = per_cpu(last_running, cpu);
+	interval = sysctl_sched_burst_check_ms;
+
+	per_cpu(last_running, cpu) = curr_running;
+
+	env->has_burst = 0;
+	if (time_after_eq(jiffies, per_cpu(next_check, cpu))) {
+		per_cpu(next_check, cpu) = jiffies + msecs_to_jiffies(interval);
+		/* find a spike since the last balance on the cpu */
+		if (curr_running > 2 + (prev_running << 2))
+			env->has_burst = 1;
+	}
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -4770,9 +4802,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 				balance_cpu = i;
 			}
 
-			load = target_load(i, load_idx);
+			if (env->has_burst)
+				load = rq->load.weight;
+			else
+				load = target_load(i, load_idx);
 		} else {
-			load = source_load(i, load_idx);
+			if (env->has_burst)
+				load = rq->load.weight;
+			else
+				load = source_load(i, load_idx);
 			if (load > max_cpu_load)
 				max_cpu_load = load;
 			if (min_cpu_load > load)
@@ -4786,7 +4824,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
-		sgs->sum_weighted_load += weighted_cpuload(i);
+		if (env->has_burst)
+			sgs->sum_weighted_load += cpu_rq(i)->load.weight;
+		else
+			sgs->sum_weighted_load += weighted_cpuload(i);
 
 		/* accumulate the maximum potential util */
 		if (!nr_running)
@@ -5164,6 +5205,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 
 	memset(&sds, 0, sizeof(sds));
 
+	check_burst(env);
+
 	/*
 	 * Compute the various statistics relavent for load balancing at
 	 * this level.
@@ -5270,7 +5313,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		rq = cpu_rq(i);
-		wl = weighted_cpuload(i);
+		if (env->has_burst)
+			wl = rq->load.weight;
+		else
+			wl = weighted_cpuload(i);
 
 		/*
 		 * When comparing with imbalance, use weighted_cpuload()
@@ -5352,6 +5398,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.cpus		= cpus,
 		.power_lb	= 1,
 		.perf_lb	= 0,
+		.has_burst	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c88878d..25262b8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -350,6 +350,13 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one,
 	},
+	{
+		.procname	= "sched_burst_check_ms",
+		.data		= &sysctl_sched_burst_check_ms,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 #endif /* CONFIG_SMP */
 #ifdef CONFIG_NUMA_BALANCING
 	{
-- 
1.7.12

>            Linus
> 


^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-10 15:01   ` Morten Rasmussen
@ 2013-01-11  7:08     ` Alex Shi
  2013-01-14 16:09       ` Morten Rasmussen
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11  7:08 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> On Sat, Jan 05, 2013 at 08:37:45AM +0000, Alex Shi wrote:
>> This patch add power aware scheduling in fork/exec/wake. It try to
>> select cpu from the busiest while still has utilization group. That's
>> will save power for other groups.
>>
>> The trade off is adding a power aware statistics collection in group
>> seeking. But since the collection just happened in power scheduling
>> eligible condition, the worst case of hackbench testing just drops
>> about 2% with powersaving/balance policy. No clear change for
>> performance policy.
>>
>> I had tried to use rq load avg utilisation in this balancing, but since
>> the utilisation need much time to accumulate itself. It's unfit for any
>> burst balancing. So I use nr_running as instant rq utilisation.
> 
> So you effective use a mix of nr_running (counting tasks) and PJT's
> tracked load for balancing?

no, just task number here.
> 
> The problem of slow reaction time of the tracked load a cpu/rq is an
> interesting one. Would it be possible to use it if you maintained a
> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> load contribution of a tasks is added when a task is enqueued and
> removed again if it migrates to another cpu?
> This way you would know the new load of the sched group/domain instantly
> when you migrate a task there. It might not be precise as the load
> contribution of the task to some extend depends on the load of the cpu
> where it is running. But it would probably be a fair estimate, which is
> quite likely to be better than just counting tasks (nr_running).

For the power-saving scenario, it requires the task number to be less than the
LCPU number and does not care about the load weight, since whatever the
load weight, a task can only burn one LCPU.

>> +
>> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
>> +			threshold = sgs.group_weight;
>> +		else
>> +			threshold = sgs.group_capacity;
> 
> Is group_capacity larger or smaller than group_weight on your platform?

I guess most of your confusion comes from capacity != weight here.

On most Intel CPUs, a cpu core's power (with 2 HT) is usually 1178, just a
bit bigger than the normal cpu power of 1024; but the capacity is still 1,
while the group weight is 2.
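
Just to spell out the arithmetic (a trivial userspace illustration, not
kernel code):

#include <stdio.h>

#define SCHED_POWER_SCALE	1024
#define DIV_ROUND_CLOSEST(x, d)	(((x) + ((d) / 2)) / (d))

int main(void)
{
	unsigned long power = 1178;	/* one core with 2 HT siblings */

	/* capacity rounds to 1, while group_weight (logical cpus) is 2 */
	printf("capacity=%lu weight=%d\n",
	       DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE), 2);
	return 0;
}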


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11  4:56     ` Preeti U Murthy
@ 2013-01-11  8:01       ` li guang
  2013-01-11 14:56         ` Alex Shi
  2013-01-11 10:54       ` Morten Rasmussen
  2013-01-16  5:43       ` Alex Shi
  2 siblings, 1 reply; 91+ messages in thread
From: li guang @ 2013-01-11  8:01 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Morten Rasmussen, Alex Shi, mingo, peterz, tglx, akpm, arjan, bp,
	pjt, namhyung, efault, vincent.guittot, gregkh, linux-kernel

On Fri, 2013-01-11 at 10:26 +0530, Preeti U Murthy wrote:
> Hi Morten,Alex
> 
> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> >> Guess the search cpu from bottom to up in domain tree come from
> >> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
> >> balancing over tasks on all level domains.
> >>
> >> This balancing cost much if there has many domain/groups in a large
> >> system. And force spreading task among different domains may cause
> >> performance issue due to bad locality.
> >>
> >> If we remove this code, we will get quick fork/exec/wake, plus better
> >> balancing among whole system, that also reduce migrations in future
> >> load balancing.
> >>
> >> This patch increases 10+% performance of hackbench on my 4 sockets
> >> NHM and SNB machines.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@intel.com>
> >> ---
> >>  kernel/sched/fair.c | 20 +-------------------
> >>  1 file changed, 1 insertion(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ecfbf8e..895a3f4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>  		goto unlock;
> >>  	}
> >>  
> >> -	while (sd) {
> >> +	if (sd) {
> >>  		int load_idx = sd->forkexec_idx;
> >>  		struct sched_group *group;
> >> -		int weight;
> >> -
> >> -		if (!(sd->flags & sd_flag)) {
> >> -			sd = sd->child;
> >> -			continue;
> >> -		}
> >>  
> >>  		if (sd_flag & SD_BALANCE_WAKE)
> >>  			load_idx = sd->wake_idx;
> >> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>  			goto unlock;
> >>  
> >>  		new_cpu = find_idlest_cpu(group, p, cpu);
> >> -
> >> -		/* Now try balancing at a lower domain level of new_cpu */
> >> -		cpu = new_cpu;
> >> -		weight = sd->span_weight;
> >> -		sd = NULL;
> >> -		for_each_domain(cpu, tmp) {
> >> -			if (weight <= tmp->span_weight)
> >> -				break;
> >> -			if (tmp->flags & sd_flag)
> >> -				sd = tmp;
> >> -		}
> >> -		/* while loop will break here if sd == NULL */
> > 
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing is (assuming we start with
> sd0-the top level sched domain whose flags are rightly set). we find
> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
> 
> We could have ideally stopped the search here.But the problem with this
> is that there is a possibility that sgx is more loaded than sgy; meaning
> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet.According to the
> search above, cpux1 is idle,but is *not the right candidate for
> scheduling forked task,it is the right candidate for relieving the load
> from cpux2* due to cache locality etc.

This corner case may occur after "[PATCH v3 03/22] sched: fix
find_idlest_group mess logical" brought in the local sched_group bias,
assuming balancing runs on cpux2.
Ideally, find_idlest_group should find the real idlest group (in this
case sgy); then this patch is reasonable.

> 
> Therefore in the next recursive search we go one step inside sd1-the
> chosen idlest group candidate,which also happens to be the *next level
> sched domain for cpux1-the chosen idle cpu*. It then returns sgy as the
> idlest perhaps,if the situation happens to be better than what i have
> described for sgx and an appropriate cpu there is chosen.
> 
> So in short a bird's eye view of a large sched domain to choose the cpu
> would be very short sighted,we could end up creating imbalances within
> lower level sched domains.To avoid this the recursive search plays safe
> and chooses the best idle group after viewing the large sched domain in
> detail.
> 
> Therefore even i feel that this patch should be implemented after
> thorough tests.
> 
> 
> 
> > Morten
> 
> Regards
> Preeti U Murthy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
regards!
li guang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11  2:46     ` Alex Shi
@ 2013-01-11 10:07       ` Morten Rasmussen
  2013-01-11 14:50         ` Alex Shi
  2013-01-14  8:55         ` li guang
  0 siblings, 2 replies; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-11 10:07 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
> On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
> >>  		new_cpu = find_idlest_cpu(group, p, cpu);
> >> > -
> >> > -		/* Now try balancing at a lower domain level of new_cpu */
> >> > -		cpu = new_cpu;
> >> > -		weight = sd->span_weight;
> >> > -		sd = NULL;
> >> > -		for_each_domain(cpu, tmp) {
> >> > -			if (weight <= tmp->span_weight)
> >> > -				break;
> >> > -			if (tmp->flags & sd_flag)
> >> > -				sd = tmp;
> >> > -		}
> >> > -		/* while loop will break here if sd == NULL */
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> I did not find odd configuration that asking for old logical.
> 
> According to Documentation/scheduler/sched-domains.txt, Maybe never.
> "A domain's span MUST be a superset of it child's span (this restriction
> could be relaxed if the need arises), and a base domain for CPU i MUST
> span at least i."  etc. etc.

The reason for my suspicion is the SD_OVERLAP flag, which has something
to do overlapping sched domains. I haven't looked into what it does or
how it works. I'm just wondering if this optimization will affect the
use of that flag.

Morten

> 
> 
> -- 
> Thanks Alex
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11  4:56     ` Preeti U Murthy
  2013-01-11  8:01       ` li guang
@ 2013-01-11 10:54       ` Morten Rasmussen
  2013-01-16  5:43       ` Alex Shi
  2 siblings, 0 replies; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-11 10:54 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, linux-kernel

Hi Preeti,

On Fri, Jan 11, 2013 at 04:56:09AM +0000, Preeti U Murthy wrote:
> Hi Morten,Alex
> 
> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> >> Guess the search cpu from bottom to up in domain tree come from
> >> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
> >> balancing over tasks on all level domains.
> >>
> >> This balancing cost much if there has many domain/groups in a large
> >> system. And force spreading task among different domains may cause
> >> performance issue due to bad locality.
> >>
> >> If we remove this code, we will get quick fork/exec/wake, plus better
> >> balancing among whole system, that also reduce migrations in future
> >> load balancing.
> >>
> >> This patch increases 10+% performance of hackbench on my 4 sockets
> >> NHM and SNB machines.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@intel.com>
> >> ---
> >>  kernel/sched/fair.c | 20 +-------------------
> >>  1 file changed, 1 insertion(+), 19 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ecfbf8e..895a3f4 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>  		goto unlock;
> >>  	}
> >>  
> >> -	while (sd) {
> >> +	if (sd) {
> >>  		int load_idx = sd->forkexec_idx;
> >>  		struct sched_group *group;
> >> -		int weight;
> >> -
> >> -		if (!(sd->flags & sd_flag)) {
> >> -			sd = sd->child;
> >> -			continue;
> >> -		}
> >>  
> >>  		if (sd_flag & SD_BALANCE_WAKE)
> >>  			load_idx = sd->wake_idx;
> >> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>  			goto unlock;
> >>  
> >>  		new_cpu = find_idlest_cpu(group, p, cpu);
> >> -
> >> -		/* Now try balancing at a lower domain level of new_cpu */
> >> -		cpu = new_cpu;
> >> -		weight = sd->span_weight;
> >> -		sd = NULL;
> >> -		for_each_domain(cpu, tmp) {
> >> -			if (weight <= tmp->span_weight)
> >> -				break;
> >> -			if (tmp->flags & sd_flag)
> >> -				sd = tmp;
> >> -		}
> >> -		/* while loop will break here if sd == NULL */
> > 
> > I agree that this should be a major optimization. I just can't figure
> > out why the existing recursive search for an idle cpu switches to the
> > new cpu near the end and then starts a search for an idle cpu in the new
> > cpu's domain. Is this to handle some exotic sched domain configurations?
> > If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing is (assuming we start with
> sd0-the top level sched domain whose flags are rightly set). we find
> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
> 
> We could have ideally stopped the search here.But the problem with this
> is that there is a possibility that sgx is more loaded than sgy; meaning
> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet.According to the
> search above, cpux1 is idle,but is *not the right candidate for
> scheduling forked task,it is the right candidate for relieving the load
> from cpux2* due to cache locality etc.
> 
> Therefore in the next recursive search we go one step inside sd1-the
> chosen idlest group candidate,which also happens to be the *next level
> sched domain for cpux1-the chosen idle cpu*. It then returns sgy as the
> idlest perhaps,if the situation happens to be better than what i have
> described for sgx and an appropriate cpu there is chosen.
> 
> So in short a bird's eye view of a large sched domain to choose the cpu
> would be very short sighted,we could end up creating imbalances within
> lower level sched domains.To avoid this the recursive search plays safe
> and chooses the best idle group after viewing the large sched domain in
> detail.

Thanks for your explanation. I see your point that the first search may
end at a high level in the sched domain and pick a cpu in a very
unbalanced group. The extra search will then try to put things right.

This patch set removes the recursive search completely. So the overall
balance policy is changed from trying to achieve equal load across all
groups to always putting tasks on the most idle cpu, regardless of the
load of its group.

I'm not sure if this is a good or bad move. It is quicker.

Regards,
Morten

> 
> Therefore even i feel that this patch should be implemented after
> thorough tests.
> 
> 
> 
> > Morten
> 
> Regards
> Preeti U Murthy
> 
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11 10:07       ` Morten Rasmussen
@ 2013-01-11 14:50         ` Alex Shi
  2013-01-14  8:55         ` li guang
  1 sibling, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-11 14:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/11/2013 06:07 PM, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
>> On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
>>>>  		new_cpu = find_idlest_cpu(group, p, cpu);
>>>>> -
>>>>> -		/* Now try balancing at a lower domain level of new_cpu */
>>>>> -		cpu = new_cpu;
>>>>> -		weight = sd->span_weight;
>>>>> -		sd = NULL;
>>>>> -		for_each_domain(cpu, tmp) {
>>>>> -			if (weight <= tmp->span_weight)
>>>>> -				break;
>>>>> -			if (tmp->flags & sd_flag)
>>>>> -				sd = tmp;
>>>>> -		}
>>>>> -		/* while loop will break here if sd == NULL */
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> I did not find odd configuration that asking for old logical.
>>
>> According to Documentation/scheduler/sched-domains.txt, Maybe never.
>> "A domain's span MUST be a superset of it child's span (this restriction
>> could be relaxed if the need arises), and a base domain for CPU i MUST
>> span at least i."  etc. etc.
> 
> The reason for my suspicion is the SD_OVERLAP flag, which has something
> to do overlapping sched domains. I haven't looked into what it does or
> how it works. I'm just wondering if this optimization will affect the
> use of that flag.

I don't know of any machine that has this flag, but as long as some cpus
merely overlap rather than sitting alone without any domain, the patch
won't miss an eligible cpu.
> 
> Morten
> 
>>
>>
>> -- 
>> Thanks Alex
>>
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11  8:01       ` li guang
@ 2013-01-11 14:56         ` Alex Shi
  2013-01-14  9:03           ` li guang
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-11 14:56 UTC (permalink / raw)
  To: li guang
  Cc: Preeti U Murthy, Morten Rasmussen, mingo, peterz, tglx, akpm,
	arjan, bp, pjt, namhyung, efault, vincent.guittot, gregkh,
	linux-kernel

On 01/11/2013 04:01 PM, li guang wrote:
> 在 2013-01-11五的 10:26 +0530,Preeti U Murthy写道:
>> Hi Morten,Alex
>>
>> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
>>>> Guess the search cpu from bottom to up in domain tree come from
>>>> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
>>>> balancing over tasks on all level domains.
>>>>
>>>> This balancing cost much if there has many domain/groups in a large
>>>> system. And force spreading task among different domains may cause
>>>> performance issue due to bad locality.
>>>>
>>>> If we remove this code, we will get quick fork/exec/wake, plus better
>>>> balancing among whole system, that also reduce migrations in future
>>>> load balancing.
>>>>
>>>> This patch increases 10+% performance of hackbench on my 4 sockets
>>>> NHM and SNB machines.
>>>>
>>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>>>> ---
>>>>  kernel/sched/fair.c | 20 +-------------------
>>>>  1 file changed, 1 insertion(+), 19 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index ecfbf8e..895a3f4 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>>>  		goto unlock;
>>>>  	}
>>>>  
>>>> -	while (sd) {
>>>> +	if (sd) {
>>>>  		int load_idx = sd->forkexec_idx;
>>>>  		struct sched_group *group;
>>>> -		int weight;
>>>> -
>>>> -		if (!(sd->flags & sd_flag)) {
>>>> -			sd = sd->child;
>>>> -			continue;
>>>> -		}
>>>>  
>>>>  		if (sd_flag & SD_BALANCE_WAKE)
>>>>  			load_idx = sd->wake_idx;
>>>> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>>>>  			goto unlock;
>>>>  
>>>>  		new_cpu = find_idlest_cpu(group, p, cpu);
>>>> -
>>>> -		/* Now try balancing at a lower domain level of new_cpu */
>>>> -		cpu = new_cpu;
>>>> -		weight = sd->span_weight;
>>>> -		sd = NULL;
>>>> -		for_each_domain(cpu, tmp) {
>>>> -			if (weight <= tmp->span_weight)
>>>> -				break;
>>>> -			if (tmp->flags & sd_flag)
>>>> -				sd = tmp;
>>>> -		}
>>>> -		/* while loop will break here if sd == NULL */
>>>
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> Let me explain my understanding of why the recursive search is the way
>> it is.
>>
>>  _________________________  sd0
>> |                         |
>> |  ___sd1__   ___sd2__    |
>> | |        | |        |   |
>> | | sgx    | |  sga   |   |
>> | | sgy    | |  sgb   |   |
>> | |________| |________|   |
>> |_________________________|
>>
>> What the current recursive search is doing is (assuming we start with
>> sd0-the top level sched domain whose flags are rightly set). we find
>> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
>>
>> We could have ideally stopped the search here.But the problem with this
>> is that there is a possibility that sgx is more loaded than sgy; meaning
>> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
>> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
>> idle and load balancing has not come to its rescue yet.According to the
>> search above, cpux1 is idle,but is *not the right candidate for
>> scheduling forked task,it is the right candidate for relieving the load
>> from cpux2* due to cache locality etc.
> 
> This corner case may occur after "[PATCH v3 03/22] sched: fix
> find_idlest_group mess logical" brought in the local sched_group bias,
> and assume balancing runs on cpux2.
> ideally,  find_idlest_group should find the real idlest(this case: sgy),
> then, this patch is reasonable.
> 

Sure, but it seems a bit hard to keep descending into the idlest group.

And the old logic really costs too much: on my 2 socket NHM/SNB servers,
hackbench gains 2~5% performance, with no clear performance change on
kbuild/aim7 etc.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface
  2013-01-05  8:37 ` [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface Alex Shi
@ 2013-01-14  6:53   ` Namhyung Kim
  2013-01-14  8:11     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Namhyung Kim @ 2013-01-14  6:53 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

Hi Alex,

Just a few nitpickings..


On Sat,  5 Jan 2013 16:37:43 +0800, Alex Shi wrote:
> This patch add the power aware scheduler knob into sysfs:
>
> $cat /sys/devices/system/cpu/sched_policy/available_sched_policy
> performance powersaving balance
> $cat /sys/devices/system/cpu/sched_policy/current_sched_policy
> powersaving
>
> This means the using sched policy is 'powersaving'.
>
> User can change the policy by commend 'echo':
>  echo performance > /sys/devices/system/cpu/current_sched_policy
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 24 +++++++
>  kernel/sched/fair.c                                | 76 ++++++++++++++++++++++
>  2 files changed, 100 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 6943133..9c9acbf 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -53,6 +53,30 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
>  		the system.  Information writtento the file to remove CPU's
>  		is architecture specific.
>  
> +What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
> +		/sys/devices/system/cpu/sched_policy/available_sched_policy
> +Date:		Oct 2012
> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
> +Description:	CFS scheduler policy showing and setting interface.
> +
> +		available_sched_policy shows there are 3 kinds of policy now:
> +		performance, balance and powersaving.
> +		current_sched_policy shows current scheduler policy. And user
> +		can change the policy by writing it.
> +
> +		Policy decides that CFS scheduler how to distribute tasks onto
> +		which CPU unit when tasks number less than LCPU number in system
> +
> +		performance: try to spread tasks onto more CPU sockets,
> +		more CPU cores.
> +
> +		powersaving: try to shrink tasks onto same core or same CPU
> +		until every LCPUs are busy.
> +
> +		balance:     try to shrink tasks onto same core or same CPU
> +		until full powered CPUs are busy. This policy also consider
> +		system performance when try to save power.
> +
>  What:		/sys/devices/system/cpu/cpu#/node
>  Date:		October 2009
>  Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f24aca6..ee015b8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6102,6 +6102,82 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
>  
>  /* The default scheduler policy is 'performance'. */
>  int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t show_available_sched_policy(struct device *dev,
> +		struct device_attribute *attr,
> +		char *buf)

This line can be combined to the above line.


> +{
> +	return sprintf(buf, "performance balance powersaving\n");
> +}
> +
> +static ssize_t show_current_sched_policy(struct device *dev,
> +		struct device_attribute *attr,
> +		char *buf)

Ditto.


> +{
> +	if (sched_policy == SCHED_POLICY_PERFORMANCE)
> +		return sprintf(buf, "performance\n");
> +	else if (sched_policy == SCHED_POLICY_POWERSAVING)
> +		return sprintf(buf, "powersaving\n");
> +	else if (sched_policy == SCHED_POLICY_BALANCE)
> +		return sprintf(buf, "balance\n");
> +	return 0;
> +}
> +
> +static ssize_t set_sched_policy(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	unsigned int ret = -EINVAL;
> +	char    str_policy[16];
> +
> +	ret = sscanf(buf, "%15s", str_policy);
> +	if (ret != 1)
> +		return -EINVAL;
> +
> +	if (!strcmp(str_policy, "performance"))
> +		sched_policy = SCHED_POLICY_PERFORMANCE;
> +	else if (!strcmp(str_policy, "powersaving"))
> +		sched_policy = SCHED_POLICY_POWERSAVING;
> +	else if (!strcmp(str_policy, "balance"))
> +		sched_policy = SCHED_POLICY_BALANCE;
> +	else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +/*
> + *  * Sysfs setup bits:
> + *   */

Unneeded asterisks.

Thanks,
Namhyung


> +static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
> +						set_sched_policy);
> +
> +static DEVICE_ATTR(available_sched_policy, 0444,
> +		show_available_sched_policy, NULL);
> +
> +static struct attribute *sched_policy_default_attrs[] = {
> +	&dev_attr_current_sched_policy.attr,
> +	&dev_attr_available_sched_policy.attr,
> +	NULL
> +};
> +static struct attribute_group sched_policy_attr_group = {
> +	.attrs = sched_policy_default_attrs,
> +	.name = "sched_policy",
> +};
> +
> +int __init create_sysfs_sched_policy_group(struct device *dev)
> +{
> +	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
> +}
> +
> +static int __init sched_policy_sysfs_init(void)
> +{
> +	return create_sysfs_sched_policy_group(cpu_subsys.dev_root);
> +}
> +
> +core_initcall(sched_policy_sysfs_init);
> +#endif /* CONFIG_SYSFS */
> +
>  /*
>   * All the scheduling class methods:
>   */

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
  2013-01-10 15:01   ` Morten Rasmussen
@ 2013-01-14  7:03   ` Namhyung Kim
  2013-01-14  8:30     ` Alex Shi
  1 sibling, 1 reply; 91+ messages in thread
From: Namhyung Kim @ 2013-01-14  7:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat,  5 Jan 2013 16:37:45 +0800, Alex Shi wrote:
> This patch add power aware scheduling in fork/exec/wake. It try to
> select cpu from the busiest while still has utilization group. That's
> will save power for other groups.
>
> The trade off is adding a power aware statistics collection in group
> seeking. But since the collection just happened in power scheduling
> eligible condition, the worst case of hackbench testing just drops
> about 2% with powersaving/balance policy. No clear change for
> performance policy.
>
> I had tried to use rq load avg utilisation in this balancing, but since
> the utilisation need much time to accumulate itself. It's unfit for any
> burst balancing. So I use nr_running as instant rq utilisation.
>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
[snip]
> +/*
> + * Try to collect the task running number and capacity of the doamin.
> + */
> +static void get_sd_power_stats(struct sched_domain *sd,
> +		struct task_struct *p, struct sd_lb_stats *sds)
> +{
> +	struct sched_group *group;
> +	struct sg_lb_stats sgs;
> +	int sd_min_delta = INT_MAX;
> +	int cpu = task_cpu(p);
> +
> +	group = sd->groups;
> +	do {
> +		long g_delta;
> +		unsigned long threshold;
> +
> +		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
> +			continue;

Why?

That means only local group's stat will be accounted for this domain,
right?  Is it your intension?

Thanks,
Namhyung


> +
> +		memset(&sgs, 0, sizeof(sgs));
> +		get_sg_power_stats(group, sd, &sgs);
> +
> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
> +			threshold = sgs.group_weight;
> +		else
> +			threshold = sgs.group_capacity;
> +
> +		g_delta = threshold - sgs.group_utils;
> +
> +		if (g_delta > 0 && g_delta < sd_min_delta) {
> +			sd_min_delta = g_delta;
> +			sds->group_leader = group;
> +		}
> +
> +		sds->sd_utils += sgs.group_utils;
> +		sds->total_pwr += group->sgp->power;
> +	} while  (group = group->next, group != sd->groups);
> +
> +	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
> +						SCHED_POWER_SCALE);
> +}

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-11  3:47     ` Alex Shi
@ 2013-01-14  7:13       ` Namhyung Kim
  2013-01-16  6:11         ` Alex Shi
  2013-01-14 17:00       ` Morten Rasmussen
  1 sibling, 1 reply; 91+ messages in thread
From: Namhyung Kim @ 2013-01-14  7:13 UTC (permalink / raw)
  To: Alex Shi
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	efault, vincent.guittot, gregkh, preeti, linux-kernel

On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>> On Sat, Jan 05, 2013 at 08:37:46AM +0000, Alex Shi wrote:
>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>> has the chance to be packed into a cpu which is busy but still has space to
>>> handle it.
>>>
>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>>> ---
[snip]
>> I may be missing something, but could the expression be something like
>> the below instead?
>> 
>> Create a putil < 12.5% check before the loop. There is no reason to
>> recheck it every iteration. Then:

Agreed.  I'd also suggest that the local cpu check be moved before the
loop, so it can be chosen without walking the loop at all if it's vacant
enough (see the sketch further down).

>> 
>> vacancy = FULL_UTIL - (rq->util + putil)
>> 
>> should be enough?
>> 
>>> +
>>> +		/* bias toward local cpu */
>>> +		if (vacancy > 0 && (i == this_cpu))
>>> +			return i;
>>> +
>>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>>> +			min_vacancy = vacancy;
>>> +			idlest = i;
>> 
>> "idlest" may be a bit misleading here as you actually select busiest cpu
>> that have enough spare capacity to take the task.
>
> Um, change to leader_cpu?

vacantest? ;-)
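
Putting Morten's simplified vacancy expression and these pre-loop checks
together, a rough, untested sketch of the helper might look like this:

	static int
	find_leader_cpu(struct sched_group *group, struct task_struct *p,
			int this_cpu)
	{
		unsigned int vacancy, min_vacancy = UINT_MAX;
		int leader = -1;
		int i;
		/* percentage of the task's util */
		unsigned int putil = p->se.avg.runnable_avg_sum * 100
					/ (p->se.avg.runnable_avg_period + 1);

		/* only consider packing tasks below 12.5% util */
		if (putil * 8 >= FULL_UTIL)
			return -1;

		/* bias toward the local cpu, checked before the loop */
		if (cpumask_test_cpu(this_cpu, sched_group_cpus(group)) &&
		    cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
		    cpu_rq(this_cpu)->util + putil < FULL_UTIL)
			return this_cpu;

		/* Traverse only the allowed CPUs */
		for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
			struct rq *rq = cpu_rq(i);

			if (rq->util + putil >= FULL_UTIL)
				continue;	/* no room on this cpu */

			/* pick the busiest cpu that still has room */
			vacancy = FULL_UTIL - (rq->util + putil);
			if (vacancy < min_vacancy) {
				min_vacancy = vacancy;
				leader = i;
			}
		}
		return leader;
	}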

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface
  2013-01-14  6:53   ` Namhyung Kim
@ 2013-01-14  8:11     ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-14  8:11 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 02:53 PM, Namhyung Kim wrote:
> Hi Alex,
> 
> Just a few nitpickings..

Got it. Thanks a lot!
> 
> 
> On Sat,  5 Jan 2013 16:37:43 +0800, Alex Shi wrote:
>> This patch add the power aware scheduler knob into sysfs:
>>
>> $cat /sys/devices/system/cpu/sched_policy/available_sched_policy
>> performance powersaving balance
>> $cat /sys/devices/system/cpu/sched_policy/current_sched_policy
>> powersaving
>>
>> This means the using sched policy is 'powersaving'.
>>
>> User can change the policy by commend 'echo':
>>  echo performance > /sys/devices/system/cpu/current_sched_policy
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  Documentation/ABI/testing/sysfs-devices-system-cpu | 24 +++++++
>>  kernel/sched/fair.c                                | 76 ++++++++++++++++++++++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
>> index 6943133..9c9acbf 100644
>> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
>> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
>> @@ -53,6 +53,30 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
>>  		the system.  Information writtento the file to remove CPU's
>>  		is architecture specific.
>>  
>> +What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
>> +		/sys/devices/system/cpu/sched_policy/available_sched_policy
>> +Date:		Oct 2012
>> +Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
>> +Description:	CFS scheduler policy showing and setting interface.
>> +
>> +		available_sched_policy shows there are 3 kinds of policy now:
>> +		performance, balance and powersaving.
>> +		current_sched_policy shows current scheduler policy. And user
>> +		can change the policy by writing it.
>> +
>> +		Policy decides that CFS scheduler how to distribute tasks onto
>> +		which CPU unit when tasks number less than LCPU number in system
>> +
>> +		performance: try to spread tasks onto more CPU sockets,
>> +		more CPU cores.
>> +
>> +		powersaving: try to shrink tasks onto same core or same CPU
>> +		until every LCPUs are busy.
>> +
>> +		balance:     try to shrink tasks onto same core or same CPU
>> +		until full powered CPUs are busy. This policy also consider
>> +		system performance when try to save power.
>> +
>>  What:		/sys/devices/system/cpu/cpu#/node
>>  Date:		October 2009
>>  Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f24aca6..ee015b8 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6102,6 +6102,82 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
>>  
>>  /* The default scheduler policy is 'performance'. */
>>  int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
>> +
>> +#ifdef CONFIG_SYSFS
>> +static ssize_t show_available_sched_policy(struct device *dev,
>> +		struct device_attribute *attr,
>> +		char *buf)
> 
> This line can be combined to the above line.
> 
> 
>> +{
>> +	return sprintf(buf, "performance balance powersaving\n");
>> +}
>> +
>> +static ssize_t show_current_sched_policy(struct device *dev,
>> +		struct device_attribute *attr,
>> +		char *buf)
> 
> Ditto.
> 
> 
>> +{
>> +	if (sched_policy == SCHED_POLICY_PERFORMANCE)
>> +		return sprintf(buf, "performance\n");
>> +	else if (sched_policy == SCHED_POLICY_POWERSAVING)
>> +		return sprintf(buf, "powersaving\n");
>> +	else if (sched_policy == SCHED_POLICY_BALANCE)
>> +		return sprintf(buf, "balance\n");
>> +	return 0;
>> +}
>> +
>> +static ssize_t set_sched_policy(struct device *dev,
>> +		struct device_attribute *attr, const char *buf, size_t count)
>> +{
>> +	unsigned int ret = -EINVAL;
>> +	char    str_policy[16];
>> +
>> +	ret = sscanf(buf, "%15s", str_policy);
>> +	if (ret != 1)
>> +		return -EINVAL;
>> +
>> +	if (!strcmp(str_policy, "performance"))
>> +		sched_policy = SCHED_POLICY_PERFORMANCE;
>> +	else if (!strcmp(str_policy, "powersaving"))
>> +		sched_policy = SCHED_POLICY_POWERSAVING;
>> +	else if (!strcmp(str_policy, "balance"))
>> +		sched_policy = SCHED_POLICY_BALANCE;
>> +	else
>> +		return -EINVAL;
>> +
>> +	return count;
>> +}
>> +
>> +/*
>> + *  * Sysfs setup bits:
>> + *   */
> 
> Unneeded asterisks.
> 
> Thanks,
> Namhyung
> 
> 
>> +static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
>> +						set_sched_policy);
>> +
>> +static DEVICE_ATTR(available_sched_policy, 0444,
>> +		show_available_sched_policy, NULL);
>> +
>> +static struct attribute *sched_policy_default_attrs[] = {
>> +	&dev_attr_current_sched_policy.attr,
>> +	&dev_attr_available_sched_policy.attr,
>> +	NULL
>> +};
>> +static struct attribute_group sched_policy_attr_group = {
>> +	.attrs = sched_policy_default_attrs,
>> +	.name = "sched_policy",
>> +};
>> +
>> +int __init create_sysfs_sched_policy_group(struct device *dev)
>> +{
>> +	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
>> +}
>> +
>> +static int __init sched_policy_sysfs_init(void)
>> +{
>> +	return create_sysfs_sched_policy_group(cpu_subsys.dev_root);
>> +}
>> +
>> +core_initcall(sched_policy_sysfs_init);
>> +#endif /* CONFIG_SYSFS */
>> +
>>  /*
>>   * All the scheduling class methods:
>>   */


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-14  7:03   ` Namhyung Kim
@ 2013-01-14  8:30     ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-14  8:30 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel


>> +/*
>> + * Try to collect the task running number and capacity of the doamin.
>> + */
>> +static void get_sd_power_stats(struct sched_domain *sd,
>> +		struct task_struct *p, struct sd_lb_stats *sds)
>> +{
>> +	struct sched_group *group;
>> +	struct sg_lb_stats sgs;
>> +	int sd_min_delta = INT_MAX;
>> +	int cpu = task_cpu(p);
>> +
>> +	group = sd->groups;
>> +	do {
>> +		long g_delta;
>> +		unsigned long threshold;
>> +
>> +		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
>> +			continue;
> 
> Why?
> 
> That means only local group's stat will be accounted for this domain,
> right?  Is it your intension?
> 

Uh, thanks a lot for finding this bug!
It is a mistake; it should be:
+               if (!cpumask_intersects(sched_group_cpus(group),
+                                       tsk_cpus_allowed(p)))
+                       continue;

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 22/22] sched: lazy powersaving balance
  2013-01-05  8:37 ` [PATCH v3 22/22] sched: lazy powersaving balance Alex Shi
@ 2013-01-14  8:39   ` Namhyung Kim
  2013-01-14  8:45     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Namhyung Kim @ 2013-01-14  8:39 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Sat,  5 Jan 2013 16:37:51 +0800, Alex Shi wrote:
> +	/*
> +	 * The situtatoin isn't egligible for performance balance. If this_cpu

s/situtatoin/situation/

s/egligible/eligible/

> +	 * is not egligible or the timing is not suitable for lazy powersaving

Thanks,
Namhyung


> +	 * balance, we will stop both powersaving and performance balance.
> +	 */

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 22/22] sched: lazy powersaving balance
  2013-01-14  8:39   ` Namhyung Kim
@ 2013-01-14  8:45     ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-14  8:45 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 04:39 PM, Namhyung Kim wrote:
> On Sat,  5 Jan 2013 16:37:51 +0800, Alex Shi wrote:
>> +	/*
>> +	 * The situtatoin isn't egligible for performance balance. If this_cpu
> 
> s/situtatoin/situation/
> 
> s/egligible/eligible/
> 

Thanks a lot!
>> +	 * is not egligible or the timing is not suitable for lazy powersaving
> 
> Thanks,
> Namhyung
> 
> 
>> +	 * balance, we will stop both powersaving and performance balance.
>> +	 */


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11 10:07       ` Morten Rasmussen
  2013-01-11 14:50         ` Alex Shi
@ 2013-01-14  8:55         ` li guang
  2013-01-14  9:18           ` Alex Shi
  1 sibling, 1 reply; 91+ messages in thread
From: li guang @ 2013-01-14  8:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, linux-kernel

在 2013-01-11五的 10:07 +0000,Morten Rasmussen写道:
> On Fri, Jan 11, 2013 at 02:46:31AM +0000, Alex Shi wrote:
> > On 01/10/2013 02:21 AM, Morten Rasmussen wrote:
> > >>  		new_cpu = find_idlest_cpu(group, p, cpu);
> > >> > -
> > >> > -		/* Now try balancing at a lower domain level of new_cpu */
> > >> > -		cpu = new_cpu;
> > >> > -		weight = sd->span_weight;
> > >> > -		sd = NULL;
> > >> > -		for_each_domain(cpu, tmp) {
> > >> > -			if (weight <= tmp->span_weight)
> > >> > -				break;
> > >> > -			if (tmp->flags & sd_flag)
> > >> > -				sd = tmp;
> > >> > -		}
> > >> > -		/* while loop will break here if sd == NULL */
> > > I agree that this should be a major optimization. I just can't figure
> > > out why the existing recursive search for an idle cpu switches to the
> > > new cpu near the end and then starts a search for an idle cpu in the new
> > > cpu's domain. Is this to handle some exotic sched domain configurations?
> > > If so, they probably wouldn't work with your optimizations.
> > 
> > I did not find odd configuration that asking for old logical.
> > 
> > According to Documentation/scheduler/sched-domains.txt, Maybe never.
> > "A domain's span MUST be a superset of it child's span (this restriction
> > could be relaxed if the need arises), and a base domain for CPU i MUST
> > span at least i."  etc. etc.
> 
> The reason for my suspicion is the SD_OVERLAP flag, which has something
> to do overlapping sched domains. I haven't looked into what it does or
> how it works. I'm just wondering if this optimization will affect the
> use of that flag.

It seems it did: SD_OVERLAP will not work after this change, though this
flag is probably scarcely used, because this optimization assumes every
sched-domain's span is a superset of its child domain's span.
Isn't it, Alex?

> 
> Morten
> 
> > 
> > 
> > -- 
> > Thanks Alex
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
regards!
li guang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11 14:56         ` Alex Shi
@ 2013-01-14  9:03           ` li guang
  2013-01-15  2:34             ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: li guang @ 2013-01-14  9:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Morten Rasmussen, mingo, peterz, tglx, akpm,
	arjan, bp, pjt, namhyung, efault, vincent.guittot, gregkh,
	linux-kernel

在 2013-01-11五的 22:56 +0800,Alex Shi写道:
> On 01/11/2013 04:01 PM, li guang wrote:
> > 在 2013-01-11五的 10:26 +0530,Preeti U Murthy写道:
> >> Hi Morten,Alex
> >>
> >> On 01/09/2013 11:51 PM, Morten Rasmussen wrote:
> >>> On Sat, Jan 05, 2013 at 08:37:34AM +0000, Alex Shi wrote:
> >>>> Guess the search cpu from bottom to up in domain tree come from
> >>>> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
> >>>> balancing over tasks on all level domains.
> >>>>
> >>>> This balancing cost much if there has many domain/groups in a large
> >>>> system. And force spreading task among different domains may cause
> >>>> performance issue due to bad locality.
> >>>>
> >>>> If we remove this code, we will get quick fork/exec/wake, plus better
> >>>> balancing among whole system, that also reduce migrations in future
> >>>> load balancing.
> >>>>
> >>>> This patch increases 10+% performance of hackbench on my 4 sockets
> >>>> NHM and SNB machines.
> >>>>
> >>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
> >>>> ---
> >>>>  kernel/sched/fair.c | 20 +-------------------
> >>>>  1 file changed, 1 insertion(+), 19 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index ecfbf8e..895a3f4 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>>>  		goto unlock;
> >>>>  	}
> >>>>  
> >>>> -	while (sd) {
> >>>> +	if (sd) {
> >>>>  		int load_idx = sd->forkexec_idx;
> >>>>  		struct sched_group *group;
> >>>> -		int weight;
> >>>> -
> >>>> -		if (!(sd->flags & sd_flag)) {
> >>>> -			sd = sd->child;
> >>>> -			continue;
> >>>> -		}
> >>>>  
> >>>>  		if (sd_flag & SD_BALANCE_WAKE)
> >>>>  			load_idx = sd->wake_idx;
> >>>> @@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> >>>>  			goto unlock;
> >>>>  
> >>>>  		new_cpu = find_idlest_cpu(group, p, cpu);
> >>>> -
> >>>> -		/* Now try balancing at a lower domain level of new_cpu */
> >>>> -		cpu = new_cpu;
> >>>> -		weight = sd->span_weight;
> >>>> -		sd = NULL;
> >>>> -		for_each_domain(cpu, tmp) {
> >>>> -			if (weight <= tmp->span_weight)
> >>>> -				break;
> >>>> -			if (tmp->flags & sd_flag)
> >>>> -				sd = tmp;
> >>>> -		}
> >>>> -		/* while loop will break here if sd == NULL */
> >>>
> >>> I agree that this should be a major optimization. I just can't figure
> >>> out why the existing recursive search for an idle cpu switches to the
> >>> new cpu near the end and then starts a search for an idle cpu in the new
> >>> cpu's domain. Is this to handle some exotic sched domain configurations?
> >>> If so, they probably wouldn't work with your optimizations.
> >>
> >> Let me explain my understanding of why the recursive search is the way
> >> it is.
> >>
> >>  _________________________  sd0
> >> |                         |
> >> |  ___sd1__   ___sd2__    |
> >> | |        | |        |   |
> >> | | sgx    | |  sga   |   |
> >> | | sgy    | |  sgb   |   |
> >> | |________| |________|   |
> >> |_________________________|
> >>
> >> What the current recursive search is doing is (assuming we start with
> >> sd0-the top level sched domain whose flags are rightly set). we find
> >> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
> >>
> >> We could have ideally stopped the search here.But the problem with this
> >> is that there is a possibility that sgx is more loaded than sgy; meaning
> >> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
> >> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
> >> idle and load balancing has not come to its rescue yet.According to the
> >> search above, cpux1 is idle,but is *not the right candidate for
> >> scheduling forked task,it is the right candidate for relieving the load
> >> from cpux2* due to cache locality etc.
> > 
> > This corner case may occur after "[PATCH v3 03/22] sched: fix
> > find_idlest_group mess logical" brought in the local sched_group bias,
> > and assume balancing runs on cpux2.
> > ideally,  find_idlest_group should find the real idlest(this case: sgy),
> > then, this patch is reasonable.
> > 
> 
> Sure. but seems it is a bit hard to go down the idlest group.
> 
> and the old logical is real cost too much, on my 2 socket NHM/SNB
> server, hackbench can increase 2~5% performance. and no clean
> performance on kbuild/aim7 etc.

What about removing the local group bias?


-- 
regards!
li guang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-14  8:55         ` li guang
@ 2013-01-14  9:18           ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-14  9:18 UTC (permalink / raw)
  To: li guang
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 04:55 PM, li guang wrote:
>>>>>> > > >> > -		/* while loop will break here if sd == NULL */
>>>> > > > I agree that this should be a major optimization. I just can't figure
>>>> > > > out why the existing recursive search for an idle cpu switches to the
>>>> > > > new cpu near the end and then starts a search for an idle cpu in the new
>>>> > > > cpu's domain. Is this to handle some exotic sched domain configurations?
>>>> > > > If so, they probably wouldn't work with your optimizations.
>>> > > 
>>> > > I did not find odd configuration that asking for old logical.
>>> > > 
>>> > > According to Documentation/scheduler/sched-domains.txt, Maybe never.
>>> > > "A domain's span MUST be a superset of it child's span (this restriction
>>> > > could be relaxed if the need arises), and a base domain for CPU i MUST
>>> > > span at least i."  etc. etc.
>> > 
>> > The reason for my suspicion is the SD_OVERLAP flag, which has something
>> > to do overlapping sched domains. I haven't looked into what it does or
>> > how it works. I'm just wondering if this optimization will affect the
>> > use of that flag.
> seems it did, SD_OVERLAP will not work after this change,
> though this flag is maybe scarcely used.
> because, this optimization assume all sched-domains span 
> is super-set over child domain.
> isn't it? Alex.
> 

As I understand it, overlap just means some cpu may appear in 2 or more
same-level sub domains. If so, this change won't miss cpus.

Am I right, Peter?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 11/22] sched: consider runnable load average in effective_load
  2013-01-11  3:26     ` Alex Shi
@ 2013-01-14 12:01       ` Morten Rasmussen
  2013-01-16  5:30         ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-14 12:01 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Fri, Jan 11, 2013 at 03:26:59AM +0000, Alex Shi wrote:
> On 01/10/2013 07:28 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:40AM +0000, Alex Shi wrote:
> >> effective_load calculates the load change as seen from the
> >> root_task_group. It needs to multiple cfs_rq's tg_runnable_contrib
> >> when we turn to runnable load average balance.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@intel.com>
> >> ---
> >>  kernel/sched/fair.c | 11 ++++++++---
> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index cab62aa..247d6a8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -2982,7 +2982,8 @@ static void task_waking_fair(struct task_struct *p)
> >>  
> >>  #ifdef CONFIG_FAIR_GROUP_SCHED
> >>  /*
> >> - * effective_load() calculates the load change as seen from the root_task_group
> >> + * effective_load() calculates the runnable load average change as seen from
> >> + * the root_task_group
> >>   *
> >>   * Adding load to a group doesn't make a group heavier, but can cause movement
> >>   * of group shares between cpus. Assuming the shares were perfectly aligned one
> >> @@ -3030,13 +3031,17 @@ static void task_waking_fair(struct task_struct *p)
> >>   * Therefore the effective change in loads on CPU 0 would be 5/56 (3/8 - 2/7)
> >>   * times the weight of the group. The effect on CPU 1 would be -4/56 (4/8 -
> >>   * 4/7) times the weight of the group.
> >> + *
> >> + * After get effective_load of the load moving, will multiple the cpu own
> >> + * cfs_rq's runnable contrib of root_task_group.
> >>   */
> >>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> >>  {
> >>  	struct sched_entity *se = tg->se[cpu];
> >>  
> >>  	if (!tg->parent)	/* the trivial, non-cgroup case */
> >> -		return wl;
> >> +		return wl * tg->cfs_rq[cpu]->tg_runnable_contrib
> >> +						>> NICE_0_SHIFT;
> > 
> > Why do we need to scale the load of the task (wl) by runnable_contrib
> > when the task is in the root task group? Wouldn't the load change still
> > just be wl?
> > 
> 
> Here, wl is the load weight, and runnable_contrib brings in the runnable time.

Yes, wl is the load weight of the task. But I don't understand why you
multiply it by the tg_runnable_contrib of the group you want to insert
it into. Since effective_load() is supposed to return the load change
caused by adding the task to the cpu, it would make more sense if you
multiplied by the runnable_avg_sum / runnable_avg_period of the task in
question.
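
For the trivial, non-cgroup case that could look roughly like the below,
assuming the waking task p were passed down to effective_load() (just a
sketch of the idea, not a drop-in change):

	if (!tg->parent) {	/* the trivial, non-cgroup case */
		u32 period = p->se.avg.runnable_avg_period + 1;

		/* scale the weight by the task's own runnable ratio */
		return wl * p->se.avg.runnable_avg_sum / period;
	}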

Morten

> >>  
> >>  	for_each_sched_entity(se) {
> >>  		long w, W;
> >> @@ -3084,7 +3089,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> >>  		wg = 0;
> >>  	}
> >>  
> >> -	return wl;
> >> +	return wl * tg->cfs_rq[cpu]->tg_runnable_contrib >> NICE_0_SHIFT;
> > 
> > I believe that effective_load() is only used in wake_affine() to compare
> > load scenarios of the same task group. Since the task group is the same
> > the effective load is scaled by the same factor and should not make any
> > difference?
> > 
> > Also, in wake_affine() the result of effective_load() is added with
> > target_load() which is load.weight of the cpu and not a tracked load
> > based on runnable_avg_*/contrib?
> > 
> > Finally, you have not scaled the result of effective_load() in the
> > function used when FAIR_GROUP_SCHED is disabled. Should that be scaled
> > too?
> 
> it should be, thanks reminder.
> 
> the wake up is not good for burst wakeup benchmark. I am thinking to
> rewrite this part.
> 
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 15/22] sched: log the cpu utilization at rq
  2013-01-11  3:30     ` Alex Shi
@ 2013-01-14 13:59       ` Morten Rasmussen
  2013-01-16  5:53         ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-14 13:59 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Fri, Jan 11, 2013 at 03:30:30AM +0000, Alex Shi wrote:
> On 01/10/2013 07:40 PM, Morten Rasmussen wrote:
> >> >  #undef P64
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index ee015b8..7bfbd69 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
> >> >  
> >> >  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
> >> >  {
> >> > +	u32 period;
> >> >  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
> >> >  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
> >> > +
> >> > +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> >> > +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
> > The existing tg->runnable_avg and cfs_rq->tg_runnable_contrib variables
> > both holds
> > rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period scaled by
> > NICE_0_LOAD (1024). Why not use one of the existing variables instead of
> > introducing a new one?
> 
> We want an rq variable that reflects the utilization of the cpu, not of
> the tg.

It is the same thing for the root tg. You use exactly the same variables
for calculating rq->util as are used to calculate both tg->runnable_avg
and cfs_rq->tg_runnable_contrib in __update_tg_runnable_avg(). The only
difference is that you scale by 100 while __update_tg_runnable_avg()
scales by NICE_0_LOAD.
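
Put differently, for the root tg the new rq->util could presumably be
derived from the value you already maintain, something along the lines
of (assumed equivalence on my side, not code from the patch):

	/* both are runnable_avg_sum / runnable_avg_period, one scaled
	 * by 100, the other by NICE_0_LOAD (1 << NICE_0_SHIFT) */
	rq->util = (rq->cfs.tg_runnable_contrib * 100) >> NICE_0_SHIFT;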

Morten

> -- 
> Thanks Alex
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-11  7:08     ` Alex Shi
@ 2013-01-14 16:09       ` Morten Rasmussen
  2013-01-16  6:02         ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-14 16:09 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Fri, Jan 11, 2013 at 07:08:45AM +0000, Alex Shi wrote:
> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:45AM +0000, Alex Shi wrote:
> >> This patch add power aware scheduling in fork/exec/wake. It try to
> >> select cpu from the busiest while still has utilization group. That's
> >> will save power for other groups.
> >>
> >> The trade off is adding a power aware statistics collection in group
> >> seeking. But since the collection just happened in power scheduling
> >> eligible condition, the worst case of hackbench testing just drops
> >> about 2% with powersaving/balance policy. No clear change for
> >> performance policy.
> >>
> >> I had tried to use rq load avg utilisation in this balancing, but since
> >> the utilisation need much time to accumulate itself. It's unfit for any
> >> burst balancing. So I use nr_running as instant rq utilisation.
> > 
> > So you effective use a mix of nr_running (counting tasks) and PJT's
> > tracked load for balancing?
> 
> no, just task number here.
> > 
> > The problem of slow reaction time of the tracked load a cpu/rq is an
> > interesting one. Would it be possible to use it if you maintained a
> > sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> > load contribution of a tasks is added when a task is enqueued and
> > removed again if it migrates to another cpu?
> > This way you would know the new load of the sched group/domain instantly
> > when you migrate a task there. It might not be precise as the load
> > contribution of the task to some extend depends on the load of the cpu
> > where it is running. But it would probably be a fair estimate, which is
> > quite likely to be better than just counting tasks (nr_running).
> 
> For power consideration scenario, it ask task number less than Lcpu
> number, don't care the load weight, since whatever the load weight, the
> task only can burn one LCPU.
> 

True, but you miss opportunities for power saving when you have many
light tasks (> LCPU). Currently, the sd_utils < threshold check will go
for SCHED_POLICY_PERFORMANCE if the number of tasks (sd_utils) is greater
than the domain weight/capacity, irrespective of the actual load caused
by those tasks.

If you used tracked task load weight for sd_utils instead you would be
able to go for power saving in scenarios with many light tasks as well.
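
As a very rough sketch of that idea (names and scaling are only my
assumptions, not something taken from the patch set):

	/* per group: accumulate tracked runnable load instead of nr_running */
	sgs->group_utils += cpu_rq(i)->cfs.runnable_load_avg;

	/* and compare against the group capacity expressed in load units */
	threshold = sgs.group_capacity * NICE_0_LOAD;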

> >> +
> >> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
> >> +			threshold = sgs.group_weight;
> >> +		else
> >> +			threshold = sgs.group_capacity;
> > 
> > Is group_capacity larger or smaller than group_weight on your platform?
> 
> Guess most of your confusing come from the capacity != weight here.
> 
> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
> just bigger than a normal cpu power - 1024. but the capacity is still 1,
> while the group weight is 2.
> 

Thanks for clarifying. To the best of my knowledge there are no
guidelines for how to specify cpu power so it may be a bit dangerous to
assume that capacity < weight when capacity is based on cpu power.

You could have architectures where the cpu power of each LCPU (HT, core,
cpu, whatever LCPU is on the particular platform) is greater than 1024
for most LCPUs. In that case, the capacity < weight assumption fails.
Also, on non-HT systems it is quite likely that you will have capacity =
weight.

Morten

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-11  3:47     ` Alex Shi
  2013-01-14  7:13       ` Namhyung Kim
@ 2013-01-14 17:00       ` Morten Rasmussen
  2013-01-16  7:32         ` Alex Shi
  1 sibling, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-14 17:00 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Fri, Jan 11, 2013 at 03:47:03AM +0000, Alex Shi wrote:
> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
> > On Sat, Jan 05, 2013 at 08:37:46AM +0000, Alex Shi wrote:
> >> If the wake/exec task is small enough, utils < 12.5%, it will
> >> has the chance to be packed into a cpu which is busy but still has space to
> >> handle it.
> >>
> >> Signed-off-by: Alex Shi <alex.shi@intel.com>
> >> ---
> >>  kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++------
> >>  1 file changed, 45 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8d0d3af..0596e81 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -3471,19 +3471,57 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
> >>  }
> >>  
> >>  /*
> >> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> >> + * among the cpus in group.
> >> + */
> >> +static int
> >> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> >> +{
> >> +	unsigned vacancy, min_vacancy = UINT_MAX;
> > 
> > unsigned int?
> 
> yes
> > 
> >> +	int idlest = -1;
> >> +	int i;
> >> +	/* percentage the task's util */
> >> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
> >> +				/ (p->se.avg.runnable_avg_period + 1);
> > 
> > Alternatively you could use se.avg.load_avg_contrib which is the same
> > ratio scaled by the task priority (se->load.weight). In the above
> > expression you don't take priority into account.
> 
> sure. but this seems more directly of meaningful.
> > 
> >> +
> >> +	/* Traverse only the allowed CPUs */
> >> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> >> +		struct rq *rq = cpu_rq(i);
> >> +		int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> >> +
> >> +		/* only pack task which putil < 12.5% */
> >> +		vacancy = FULL_UTIL - (rq->util * nr_running + putil * 8);
> > 
> > I can't follow this expression.
> > 
> > The variables can have the following values:
> > FULL_UTIL  = 99
> > rq->util   = [0..99]
> > nr_running = [1..inf]
> > putil      = [0..99]
> > 
> > Why multiply rq->util by nr_running?
> > 
> > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> > 10. In this case the value of putil doesn't really matter as vacancy
> > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> > However, with rq->util = 50 there should be plenty of spare cpu time to
> > take another task.
> 
> for this example, the util is not full maybe due to it was just wake up,
> it still is possible like to run full time. So, I try to give it the
> large guess load.

I don't see why rq->util should be treated differently depending on the
number of tasks causing the load. rq->util = 50 means that the cpu is
busy about 50% of the time no matter how many tasks contribute to that
load.

If nr_running = 1 instead in my example, you would consider the cpu
vacant if putil = 6, but if nr_running > 1 you would not. Why should the
two scenarios be treated differently?

> > 
> > Also, why multiply putil by 8? rq->util must be very close to 0 for
> > vacancy to be positive if putil is close to 12 (12.5%).
> 
> just want to pack small util tasks, since packing is possible to hurt
> performance.

I agree that packing may affect performance. But why don't you reduce
FULL_UTIL instead of multiplying by 8? With the current expression you will
not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
pack a 6% task even if rq->util = 50 and the resulting cpu load is much
higher.
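
Working through those two cases with FULL_UTIL = 99 and nr_running = 1
makes the asymmetry explicit:

	/* 10% task onto rq->util = 20:  99 - (20 + 10 * 8) = -1  -> not packed (load would be ~30) */
	/*  6% task onto rq->util = 50:  99 - (50 +  6 * 8) =  1  -> packed     (load becomes ~56)  */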

> > 
> > The vacancy variable is declared unsigned, so it will underflow instead
> > of becoming negative. Is this intentional?
> 
> ops, my mistake.
> > 
> > I may be missing something, but could the expression be something like
> > the below instead?
> > 
> > Create a putil < 12.5% check before the loop. There is no reason to
> > recheck it every iteration. Then:
> > 
> > vacancy = FULL_UTIL - (rq->util + putil)
> > 
> > should be enough?
> > 
> >> +
> >> +		/* bias toward local cpu */
> >> +		if (vacancy > 0 && (i == this_cpu))
> >> +			return i;
> >> +
> >> +		if (vacancy > 0 && vacancy < min_vacancy) {
> >> +			min_vacancy = vacancy;
> >> +			idlest = i;
> > 
> > "idlest" may be a bit misleading here as you actually select busiest cpu
> > that have enough spare capacity to take the task.
> 
> Um, change to leader_cpu?

Fine by me.

Morten

> > 
> > Morten
> > 
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-14  9:03           ` li guang
@ 2013-01-15  2:34             ` Alex Shi
  2013-01-16  1:54               ` li guang
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-15  2:34 UTC (permalink / raw)
  To: li guang
  Cc: Preeti U Murthy, Morten Rasmussen, mingo, peterz, tglx, akpm,
	arjan, bp, pjt, namhyung, efault, vincent.guittot, gregkh,
	linux-kernel

On 01/14/2013 05:03 PM, li guang wrote:
>>> > > This corner case may occur after "[PATCH v3 03/22] sched: fix
>>> > > find_idlest_group mess logical" brought in the local sched_group bias,
>>> > > and assume balancing runs on cpux2.
>>> > > ideally,  find_idlest_group should find the real idlest(this case: sgy),
>>> > > then, this patch is reasonable.
>>> > > 
>> > 
>> > Sure. but seems it is a bit hard to go down the idlest group.
>> > 
>> > and the old logical is real cost too much, on my 2 socket NHM/SNB
>> > server, hackbench can increase 2~5% performance. and no clean
>> > performance on kbuild/aim7 etc.
> what about remove local group bias?


Is there any theoretical benefit for a non-local group? Usually, biasing
toward the local group gives a cache locality benefit.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-15  2:34             ` Alex Shi
@ 2013-01-16  1:54               ` li guang
  0 siblings, 0 replies; 91+ messages in thread
From: li guang @ 2013-01-16  1:54 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Morten Rasmussen, mingo, peterz, tglx, akpm,
	arjan, bp, pjt, namhyung, efault, vincent.guittot, gregkh,
	linux-kernel

在 2013-01-15二的 10:34 +0800,Alex Shi写道:
> On 01/14/2013 05:03 PM, li guang wrote:
> >>> > > This corner case may occur after "[PATCH v3 03/22] sched: fix
> >>> > > find_idlest_group mess logical" brought in the local sched_group bias,
> >>> > > and assume balancing runs on cpux2.
> >>> > > ideally,  find_idlest_group should find the real idlest(this case: sgy),
> >>> > > then, this patch is reasonable.
> >>> > > 
> >> > 
> >> > Sure. but seems it is a bit hard to go down the idlest group.
> >> > 
> >> > and the old logical is real cost too much, on my 2 socket NHM/SNB
> >> > server, hackbench can increase 2~5% performance. and no clean
> >> > performance on kbuild/aim7 etc.
> > what about remove local group bias?
> 
> 
> Any theory profit for non local group? Usually, bias toward local group
> will has cache locality profit.
> 

But the disadvantage is missing the correct balance sometimes.
If, as you said before, it mostly brings in better performance,
that's fine; I have no strong statistics for a more profound analysis.


-- 
regards!
li guang


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 11/22] sched: consider runnable load average in effective_load
  2013-01-14 12:01       ` Morten Rasmussen
@ 2013-01-16  5:30         ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-16  5:30 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 08:01 PM, Morten Rasmussen wrote:
>>>>  static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>>> > >>  {
>>>> > >>  	struct sched_entity *se = tg->se[cpu];
>>>> > >>  
>>>> > >>  	if (!tg->parent)	/* the trivial, non-cgroup case */
>>>> > >> -		return wl;
>>>> > >> +		return wl * tg->cfs_rq[cpu]->tg_runnable_contrib
>>>> > >> +						>> NICE_0_SHIFT;
>>> > > 
>>> > > Why do we need to scale the load of the task (wl) by runnable_contrib
>>> > > when the task is in the root task group? Wouldn't the load change still
>>> > > just be wl?
>>> > > 
>> > 
>> > Here, wl is the load weight, runnable_contrib engaged the runnable time.
> Yes, wl is the load weight of the task. But I don't understand why you
> multiply it with the tg_runnable_contrib of the group you want to insert
> it into. Since effective_load() is supposed to return the load change
> caused by adding the task to the cpu it would make more sense if you
> multiplied with the task runnable_avg_sum / runnable_avg_period of the
> task in question.
> 

I was considering that the task would follow the cpu's runnable time,
due to throttling etc.
But maybe it is a bit early to consider this; using the task's runnable
avg seems better.
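
A minimal sketch of that direction (only a sketch: effective_load() does not
take the task today, so the caller would have to pass the waking task or its
ratio in; the field names are from the load-tracking patches):

	/* scale a load delta by the task's own runnable ratio */
	static long scale_by_task_ratio(struct task_struct *p, long wl)
	{
		u32 period = p->se.avg.runnable_avg_period + 1;

		return div_s64((s64)wl * p->se.avg.runnable_avg_sum, period);
	}

	/* the !tg->parent case would then return scale_by_task_ratio(p, wl)
	 * instead of wl scaled by the cpu's tg_runnable_contrib */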

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-11  4:56     ` Preeti U Murthy
  2013-01-11  8:01       ` li guang
  2013-01-11 10:54       ` Morten Rasmussen
@ 2013-01-16  5:43       ` Alex Shi
  2013-01-16  7:41         ` Alex Shi
  2 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-16  5:43 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, linux-kernel

-		/* while loop will break here if sd == NULL */
>>
>> I agree that this should be a major optimization. I just can't figure
>> out why the existing recursive search for an idle cpu switches to the
>> new cpu near the end and then starts a search for an idle cpu in the new
>> cpu's domain. Is this to handle some exotic sched domain configurations?
>> If so, they probably wouldn't work with your optimizations.
> 
> Let me explain my understanding of why the recursive search is the way
> it is.
> 
>  _________________________  sd0
> |                         |
> |  ___sd1__   ___sd2__    |
> | |        | |        |   |
> | | sgx    | |  sga   |   |
> | | sgy    | |  sgb   |   |
> | |________| |________|   |
> |_________________________|
> 
> What the current recursive search is doing is (assuming we start with
> sd0-the top level sched domain whose flags are rightly set). we find
> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
> 
> We could have ideally stopped the search here.But the problem with this
> is that there is a possibility that sgx is more loaded than sgy; meaning
> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
> idle and load balancing has not come to its rescue yet.According to the
> search above, cpux1 is idle,but is *not the right candidate for
> scheduling forked task,it is the right candidate for relieving the load
> from cpux2* due to cache locality etc.

The problem still exists in the current code. It still goes to cpux1,
then goes up to sgx to seek the idlest group ... idlest cpu, and comes
back to cpux1 again. Nothing helps.



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 15/22] sched: log the cpu utilization at rq
  2013-01-14 13:59       ` Morten Rasmussen
@ 2013-01-16  5:53         ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-16  5:53 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 09:59 PM, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 03:30:30AM +0000, Alex Shi wrote:
>> On 01/10/2013 07:40 PM, Morten Rasmussen wrote:
>>>>>  #undef P64
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index ee015b8..7bfbd69 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>>>>>  
>>>>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>>>>  {
>>>>> +	u32 period;
>>>>>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>>>>>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
>>>>> +
>>>>> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
>>>>> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
>>> The existing tg->runnable_avg and cfs_rq->tg_runnable_contrib variables
>>> both holds
>>> rq->avg.runnable_avg_sum / rq->avg.runnable_avg_period scaled by
>>> NICE_0_LOAD (1024). Why not use one of the existing variables instead of
>>> introducing a new one?
>>
>> we want to a rq variable that reflect the utilization of the cpu, not of
>> the tg
> 
> It is the same thing for the root tg. You use exactly the same variables
> for calculating rq->util as is used to calculate both tg->runnable_avg and
> cfs_rq->tg_runnable_contrib in __update_tg_runnable_avg(). The only
> difference is that you scale by 100 while __update_tg_runnable_avg()
> scale by NICE_0_LOAD.

Yes, the root tg->runnable_avg has the same meaning, but a normal tg does
not, and more importantly it is hidden behind CONFIG_FAIR_GROUP_SCHED.
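
(For the root tg the two are the same ratio with different scaling, so one
could roughly be recovered from the other when CONFIG_FAIR_GROUP_SCHED is
set; just a back-of-the-envelope relation, not a proposal:)

	/* both are runnable_avg_sum / runnable_avg_period, scaled differently:
	 *   rq->util                    ~ ratio * 100
	 *   cfs_rq->tg_runnable_contrib ~ ratio * NICE_0_LOAD (1024)
	 * so roughly:
	 *   rq->util ~= (rq->cfs.tg_runnable_contrib * 100) >> NICE_0_SHIFT
	 */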

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-14 16:09       ` Morten Rasmussen
@ 2013-01-16  6:02         ` Alex Shi
  2013-01-16 14:27           ` Morten Rasmussen
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-16  6:02 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
> On Fri, Jan 11, 2013 at 07:08:45AM +0000, Alex Shi wrote:
>> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:45AM +0000, Alex Shi wrote:
>>>> This patch add power aware scheduling in fork/exec/wake. It try to
>>>> select cpu from the busiest while still has utilization group. That's
>>>> will save power for other groups.
>>>>
>>>> The trade off is adding a power aware statistics collection in group
>>>> seeking. But since the collection just happened in power scheduling
>>>> eligible condition, the worst case of hackbench testing just drops
>>>> about 2% with powersaving/balance policy. No clear change for
>>>> performance policy.
>>>>
>>>> I had tried to use rq load avg utilisation in this balancing, but since
>>>> the utilisation need much time to accumulate itself. It's unfit for any
>>>> burst balancing. So I use nr_running as instant rq utilisation.
>>>
>>> So you effective use a mix of nr_running (counting tasks) and PJT's
>>> tracked load for balancing?
>>
>> no, just task number here.
>>>
>>> The problem of slow reaction time of the tracked load a cpu/rq is an
>>> interesting one. Would it be possible to use it if you maintained a
>>> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
>>> load contribution of a tasks is added when a task is enqueued and
>>> removed again if it migrates to another cpu?
>>> This way you would know the new load of the sched group/domain instantly
>>> when you migrate a task there. It might not be precise as the load
>>> contribution of the task to some extend depends on the load of the cpu
>>> where it is running. But it would probably be a fair estimate, which is
>>> quite likely to be better than just counting tasks (nr_running).
>>
>> For power consideration scenario, it ask task number less than Lcpu
>> number, don't care the load weight, since whatever the load weight, the
>> task only can burn one LCPU.
>>
> 
> True, but you miss the opportunities for power saving when you have many
> light tasks (> LCPU). Currently, the sd_utils < threshold check will go
> for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
> than the domain weight/capacity irrespective of the actual load caused
> by those tasks.
> 
> If you used tracked task load weight for sd_utils instead you would be
> able to go for power saving in scenarios with many light tasks as well.

Yes, that's right from the power point of view. But from the performance
point of view, it's better to spread tasks onto different LCPUs to save
context switch cost. And if the cpu usage is nearly full, we don't know
whether some tasks really want more cpu time.
Even in the power sched policy, we still want to get better performance
if it's possible. :)
> 
>>>> +
>>>> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
>>>> +			threshold = sgs.group_weight;
>>>> +		else
>>>> +			threshold = sgs.group_capacity;
>>>
>>> Is group_capacity larger or smaller than group_weight on your platform?
>>
>> Guess most of your confusing come from the capacity != weight here.
>>
>> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
>> just bigger than a normal cpu power - 1024. but the capacity is still 1,
>> while the group weight is 2.
>>
> 
> Thanks for clarifying. To the best of my knowledge there are no
> guidelines for how to specify cpu power so it may be a bit dangerous to
> assume that capacity < weight when capacity is based on cpu power.

Sure. I also just got that from the code, and don't know how other arches
differentiate them.
But currently this cpu power concept seems to work fine.
> 
> You could have architectures where the cpu power of each LCPU (HT, core,
> cpu, whatever LCPU is on the particular platform) is greater than 1024
> for most LCPUs. In that case, the capacity < weight assumption fails.
> Also, on non-HT systems it is quite likely that you will have capacity =
> weight.

yes.
> 
> Morten
> 
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-14  7:13       ` Namhyung Kim
@ 2013-01-16  6:11         ` Alex Shi
  2013-01-16 12:52           ` Namhyung Kim
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-16  6:11 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	efault, vincent.guittot, gregkh, preeti, linux-kernel

On 01/14/2013 03:13 PM, Namhyung Kim wrote:
> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>> On 01/11/2013 01:17 AM, Morten Rasmussen wrote:
>>> On Sat, Jan 05, 2013 at 08:37:46AM +0000, Alex Shi wrote:
>>>> If the wake/exec task is small enough, utils < 12.5%, it will
>>>> has the chance to be packed into a cpu which is busy but still has space to
>>>> handle it.
>>>>
>>>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>>>> ---
> [snip]
>>> I may be missing something, but could the expression be something like
>>> the below instead?
>>>
>>> Create a putil < 12.5% check before the loop. There is no reason to
>>> recheck it every iteration. Then:
> 
> Agreed.  Also suggest that the checking local cpu can also be moved
> before the loop so that it can be used without going through the loop if
> it's vacant enough.

Yes, thanks for the suggestion!
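
A rough sketch with both checks hoisted out of the loop (signed vacancy and
the "leader" naming included; rq->util and FULL_UTIL are the fields/macros
from this patchset, the rest is only illustrative, not a final version):

	static int
	find_leader_cpu(struct sched_group *group, struct task_struct *p,
			int this_cpu)
	{
		int vacancy, min_vacancy = INT_MAX;
		int leader = -1;
		int i;
		/* the task's util as a percentage */
		int putil = p->se.avg.runnable_avg_sum * 100
				/ (p->se.avg.runnable_avg_period + 1);

		/* only try to pack tasks below ~12.5% util */
		if (putil * 8 >= FULL_UTIL)
			return -1;

		/* bias toward the local cpu if it has room for the task */
		if (cpumask_test_cpu(this_cpu, sched_group_cpus(group)) &&
		    cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
		    FULL_UTIL - (int)cpu_rq(this_cpu)->util - putil > 0)
			return this_cpu;

		/* otherwise pick the busiest cpu that still has enough room */
		for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
			vacancy = FULL_UTIL - (int)cpu_rq(i)->util - putil;

			if (vacancy > 0 && vacancy < min_vacancy) {
				min_vacancy = vacancy;
				leader = i;
			}
		}
		return leader;
	}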
> 
>>>
>>> vacancy = FULL_UTIL - (rq->util + putil)
>>>
>>> should be enough?
>>>
>>>> +
>>>> +		/* bias toward local cpu */
>>>> +		if (vacancy > 0 && (i == this_cpu))
>>>> +			return i;
>>>> +
>>>> +		if (vacancy > 0 && vacancy < min_vacancy) {
>>>> +			min_vacancy = vacancy;
>>>> +			idlest = i;
>>>
>>> "idlest" may be a bit misleading here as you actually select busiest cpu
>>> that have enough spare capacity to take the task.
>>
>> Um, change to leader_cpu?
> 
> vacantest? ;-)

It's hard to find that word in Google. Are you sure it is better than leader_cpu?  :)
> 
> Thanks,
> Namhyung
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-14 17:00       ` Morten Rasmussen
@ 2013-01-16  7:32         ` Alex Shi
  2013-01-16 15:08           ` Morten Rasmussen
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-16  7:32 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>> Why multiply rq->util by nr_running?
>>> > > 
>>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>> > > 10. In this case the value of putil doesn't really matter as vacancy
>>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>> > > However, with rq->util = 50 there should be plenty of spare cpu time to
>>> > > take another task.
>> > 
>> > for this example, the util is not full maybe due to it was just wake up,
>> > it still is possible like to run full time. So, I try to give it the
>> > large guess load.
> I don't see why rq->util should be treated different depending on the
> number of tasks causing the load. rq->util = 50 means that the cpu is
> busy about 50% of the time no matter how many tasks contibute to that
> load.
> 
> If nr_running = 1 instead in my example, you would consider the cpu
> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> two scenarios be treated differently?
> 
>>> > > 
>>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
>>> > > vacancy to be positive if putil is close to 12 (12.5%).
>> > 
>> > just want to pack small util tasks, since packing is possible to hurt
>> > performance.
> I agree that packing may affect performance. But why don't you reduce
> FULL_UTIL instead of multiplying by 8? With current expression you will
> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> higher.
> 

Yes, the threshold has no strong theoretical or experimental support. I had
tried cyclictest, which Vincent used, but that case's load avg is too small
to be caught, so I just use half of Vincent's value, 12.5%. If you have a
more reasonable value, let me know.

As to why nr_running is engaged as a multiplier, it's based on 2 reasons:
1, the load avg/util needs 345ms to accumulate to 100%. So even if a task
is consuming full cpu time, there is still a 345ms window with rq->util
below 100%.
2, if there are more tasks, like 2 tasks running on one cpu, there is
potentially demand to burn 200% cpu time, while the biggest
rq->util is still 100%.

Figuring out a precise util is complicated and costs much, so I do
this simple calculation. It is not very precise, but it is efficient and
more biased toward performance.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake
  2013-01-16  5:43       ` Alex Shi
@ 2013-01-16  7:41         ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-16  7:41 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, linux-kernel

On 01/16/2013 01:43 PM, Alex Shi wrote:
> -		/* while loop will break here if sd == NULL */
>>>
>>> I agree that this should be a major optimization. I just can't figure
>>> out why the existing recursive search for an idle cpu switches to the
>>> new cpu near the end and then starts a search for an idle cpu in the new
>>> cpu's domain. Is this to handle some exotic sched domain configurations?
>>> If so, they probably wouldn't work with your optimizations.
>>
>> Let me explain my understanding of why the recursive search is the way
>> it is.
>>
>>  _________________________  sd0
>> |                         |
>> |  ___sd1__   ___sd2__    |
>> | |        | |        |   |
>> | | sgx    | |  sga   |   |
>> | | sgy    | |  sgb   |   |
>> | |________| |________|   |
>> |_________________________|
>>
>> What the current recursive search is doing is (assuming we start with
>> sd0-the top level sched domain whose flags are rightly set). we find
>> that sd1 is the idlest group,and a cpux1 in sgx is the idlest cpu.
>>
>> We could have ideally stopped the search here.But the problem with this
>> is that there is a possibility that sgx is more loaded than sgy; meaning
>> the cpus in sgx are heavily imbalanced;say there are two cpus cpux1 and
>> cpux2 in sgx,where cpux2 is heavily loaded and cpux1 has recently gotten
>> idle and load balancing has not come to its rescue yet.According to the
>> search above, cpux1 is idle,but is *not the right candidate for
>> scheduling forked task,it is the right candidate for relieving the load
>> from cpux2* due to cache locality etc.
> 
> The problem still exists on the current code. It still goes to cpux1.
> and then goes up to sgx to seek idlest group ... idlest cpu, and back to
> cpux1 again. nothing help.
> 
> 

To resolve the problem, I have tried walking the domains from the top down. But testing
shows aim9/hackbench performance is not good on our SNB EP, and there is no change on other platforms.
---
@@ -3351,51 +3363,33 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 
 
-       while (sd) {
+       for_each_lower_domain(sd) {
                int load_idx = sd->forkexec_idx;
-               struct sched_group *group;
-               int weight;
-
-               if (!(sd->flags & sd_flag)) {
-                       sd = sd->child;
-                       continue;
-               }
+               int local = 0;
 
                if (sd_flag & SD_BALANCE_WAKE)
                        load_idx = sd->wake_idx;
 
-               group = find_idlest_group(sd, p, cpu, load_idx);
-               if (!group) {
-                       sd = sd->child;
-                       continue;
-               }
-
-               new_cpu = find_idlest_cpu(group, p, cpu);
-               if (new_cpu == -1 || new_cpu == cpu) {
-                       /* Now try balancing at a lower domain level of cpu */
-                       sd = sd->child;
+               group = find_idlest_group(sd, p, cpu, load_idx, &local);
+               if (local)
                        continue;
-               }
+               if (!group)
+                       goto unlock;
 
-               /* Now try balancing at a lower domain level of new_cpu */
-               cpu = new_cpu;
-               weight = sd->span_weight;
-               sd = NULL;
-               for_each_domain(cpu, tmp) {
-                       if (weight <= tmp->span_weight)
-                               break;
-                       if (tmp->flags & sd_flag)
+               /* go down from non-local group */
+               for_each_domain(group_first_cpu(group), tmp)
+                       if (cpumask_equal(sched_domain_span(tmp),
+                                               sched_group_cpus(group))) {
                                sd = tmp;
-               }
-               /* while loop will break here if sd == NULL */
+                               break;
+                       }
        }
+       if (group)
+               new_cpu = find_idlest_cpu(group, p, cpu);
 unlock:
        rcu_read_unlock();



-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-16  6:11         ` Alex Shi
@ 2013-01-16 12:52           ` Namhyung Kim
  0 siblings, 0 replies; 91+ messages in thread
From: Namhyung Kim @ 2013-01-16 12:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	efault, vincent.guittot, gregkh, preeti, linux-kernel

On Wed, 16 Jan 2013 14:11:53 +0800, Alex Shi wrote:
> On 01/14/2013 03:13 PM, Namhyung Kim wrote:
>> On Fri, 11 Jan 2013 11:47:03 +0800, Alex Shi wrote:
>>> Um, change to leader_cpu?
>> 
>> vacantest? ;-)
>
> hard to the ward in google. are you sure it is better than leader_cpu?  :)

Nop.  My English is just bad. :)

But I'm not sure the "leader" is the right one.  Why is it called
"leader" in the first place?

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-16  6:02         ` Alex Shi
@ 2013-01-16 14:27           ` Morten Rasmussen
  2013-01-17  5:47             ` Namhyung Kim
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-16 14:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Wed, Jan 16, 2013 at 06:02:21AM +0000, Alex Shi wrote:
> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
> > On Fri, Jan 11, 2013 at 07:08:45AM +0000, Alex Shi wrote:
> >> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
> >>> On Sat, Jan 05, 2013 at 08:37:45AM +0000, Alex Shi wrote:
> >>>> This patch add power aware scheduling in fork/exec/wake. It try to
> >>>> select cpu from the busiest while still has utilization group. That's
> >>>> will save power for other groups.
> >>>>
> >>>> The trade off is adding a power aware statistics collection in group
> >>>> seeking. But since the collection just happened in power scheduling
> >>>> eligible condition, the worst case of hackbench testing just drops
> >>>> about 2% with powersaving/balance policy. No clear change for
> >>>> performance policy.
> >>>>
> >>>> I had tried to use rq load avg utilisation in this balancing, but since
> >>>> the utilisation need much time to accumulate itself. It's unfit for any
> >>>> burst balancing. So I use nr_running as instant rq utilisation.
> >>>
> >>> So you effective use a mix of nr_running (counting tasks) and PJT's
> >>> tracked load for balancing?
> >>
> >> no, just task number here.
> >>>
> >>> The problem of slow reaction time of the tracked load a cpu/rq is an
> >>> interesting one. Would it be possible to use it if you maintained a
> >>> sched group runnable_load_avg similar to cfs_rq->runnable_load_avg where
> >>> load contribution of a tasks is added when a task is enqueued and
> >>> removed again if it migrates to another cpu?
> >>> This way you would know the new load of the sched group/domain instantly
> >>> when you migrate a task there. It might not be precise as the load
> >>> contribution of the task to some extend depends on the load of the cpu
> >>> where it is running. But it would probably be a fair estimate, which is
> >>> quite likely to be better than just counting tasks (nr_running).
> >>
> >> For power consideration scenario, it ask task number less than Lcpu
> >> number, don't care the load weight, since whatever the load weight, the
> >> task only can burn one LCPU.
> >>
> > 
> > True, but you miss the opportunities for power saving when you have many
> > light tasks (> LCPU). Currently, the sd_utils < threshold check will go
> > for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
> > than the domain weight/capacity irrespective of the actual load caused
> > by those tasks.
> > 
> > If you used tracked task load weight for sd_utils instead you would be
> > able to go for power saving in scenarios with many light tasks as well.
> 
> yes, that's right on power consideration. but for performance consider,
> it's better to spread tasks on different LCPU to save CS cost. And if
> the cpu usage is nearly full, we don't know if some tasks real want more
> cpu time.

If the cpu is nearly full according to its tracked load it should not be
used for packing more tasks. It is the nearly idle scenario that I am
more interested in. If you have lots of tasks with tracked load < 10%, then
why not pack them? The performance impact should be minimal.

Furthermore, nr_running is just a snapshot of the current runqueue
status. The combination of runnable and blocked load should give a
better overall view of the cpu loads.

> Even in the power sched policy, we still want to get better performance
> if it's possible. :)

I agree if it comes for free in terms of power. In my opinion it is
acceptable to sacrifice a bit of performance to save power when using a
power sched policy as long as the performance regression can be
justified by the power savings. It will of course depend on the system
and its usage how to trade off power and performance. My point is just that
with multiple sched policies (performance, balance and power as you
propose) it should be acceptable to focus on power for the power policy
and let users that only/mostly care about performance use the balance or
performance policy.

> > 
> >>>> +
> >>>> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
> >>>> +			threshold = sgs.group_weight;
> >>>> +		else
> >>>> +			threshold = sgs.group_capacity;
> >>>
> >>> Is group_capacity larger or smaller than group_weight on your platform?
> >>
> >> Guess most of your confusing come from the capacity != weight here.
> >>
> >> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
> >> just bigger than a normal cpu power - 1024. but the capacity is still 1,
> >> while the group weight is 2.
> >>
> > 
> > Thanks for clarifying. To the best of my knowledge there are no
> > guidelines for how to specify cpu power so it may be a bit dangerous to
> > assume that capacity < weight when capacity is based on cpu power.
> 
> Sure. I also just got them from code. and don't know other arch how to
> different them.
> but currently, seems this cpu power concept works fine.

Yes, it seems to work fine for your test platform. I just want to
highlight that the assumption you make might not be valid for other
architectures. I know that cpu power is not widely used, but that may
change with the increasing focus on power aware scheduling.

Morten

> > 
> > You could have architectures where the cpu power of each LCPU (HT, core,
> > cpu, whatever LCPU is on the particular platform) is greater than 1024
> > for most LCPUs. In that case, the capacity < weight assumption fails.
> > Also, on non-HT systems it is quite likely that you will have capacity =
> > weight.
> 
> yes.
> > 
> > Morten
> > 
> > 
> 
> 
> -- 
> Thanks Alex
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-16  7:32         ` Alex Shi
@ 2013-01-16 15:08           ` Morten Rasmussen
  2013-01-18 14:06             ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Morten Rasmussen @ 2013-01-16 15:08 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Wed, Jan 16, 2013 at 07:32:49AM +0000, Alex Shi wrote:
> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
> >>> Why multiply rq->util by nr_running?
> >>> > > 
> >>> > > Let's take an example where rq->util = 50, nr_running = 2, and putil =
> >>> > > 10. In this case the value of putil doesn't really matter as vacancy
> >>> > > would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
> >>> > > However, with rq->util = 50 there should be plenty of spare cpu time to
> >>> > > take another task.
> >> > 
> >> > for this example, the util is not full maybe due to it was just wake up,
> >> > it still is possible like to run full time. So, I try to give it the
> >> > large guess load.
> > I don't see why rq->util should be treated different depending on the
> > number of tasks causing the load. rq->util = 50 means that the cpu is
> > busy about 50% of the time no matter how many tasks contibute to that
> > load.
> > 
> > If nr_running = 1 instead in my example, you would consider the cpu
> > vacant if putil = 6, but if nr_running > 1 you would not. Why should the
> > two scenarios be treated differently?
> > 
> >>> > > 
> >>> > > Also, why multiply putil by 8? rq->util must be very close to 0 for
> >>> > > vacancy to be positive if putil is close to 12 (12.5%).
> >> > 
> >> > just want to pack small util tasks, since packing is possible to hurt
> >> > performance.
> > I agree that packing may affect performance. But why don't you reduce
> > FULL_UTIL instead of multiplying by 8? With current expression you will
> > not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
> > pack a 6% task even if rq->util = 50 and the resulting cpu load is much
> > higher.
> > 
> 
> Yes, the threshold has no strong theory or experiment support. I had
> tried cyclitest which Vicent used, the case's load avg is too small to
> be caught. so just use half of Vicent value as 12.5%. If you has more
> reasonable value, let me know.
> 
> As to nr_running engaged as multiple mode. it's base on 2 reasons.
> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
> cost full cpu time, it still has 345ms with rq->util < 1.

I agree that load avg may not be accurate, especially for new tasks. But
why use it if you don't trust its value anyway?

The load avg (sum/period) of a new task will reach 100% instantly if the
task is consuming all the cpu time it can get. An old task can reach 50%
within 32ms. So you should fairly quickly be able to see if it is a
light task or not. You may under-estimate its load in the beginning, but
only for a very short time.

> 2, if there are more tasks, like 2 tasks running on one cpu, it's
> possible to has capacity to burn 200% cpu time, while the biggest
> rq->util is still 100%.

If you want to have a better metric for how much cpu time the task on
the runqueue could potentially use, I would suggest using
cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
on the runqueue. It would give you 200% in your example above.

On the other hand, I think rq->util is fine for this purpose. If
rq->util < 100% you know for sure that cpu is not fully utilized no
matter how many tasks you have on the runqueue. So as long as rq->util
is well below 100% (like < 50%) it should be safe to pack more small
tasks on that cpu even if it has multiple tasks running already.
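
As a sketch, that criterion could be as simple as the following (rq->util
and FULL_UTIL are from this patchset; the 50% cut-off is just the
illustrative number from above):

	/* pack only while the cpu is clearly under-utilized and the task fits */
	static bool can_pack_on(struct rq *rq, int putil)
	{
		return rq->util < FULL_UTIL / 2 &&
		       (int)rq->util + putil < FULL_UTIL;
	}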

> 
> Consider to figure out precise utils is complicate and cost much. I do
> this simple calculation. It is not very precise, but it is efficient and
> more bias toward performance.

It is indeed very biased towards performance. I would prefer more focus
on saving power in a power scheduling policy :)

Morten

> 
> -- 
> Thanks Alex


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-16 14:27           ` Morten Rasmussen
@ 2013-01-17  5:47             ` Namhyung Kim
  2013-01-18 13:41               ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Namhyung Kim @ 2013-01-17  5:47 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Alex Shi, mingo, peterz, tglx, akpm, arjan, bp, pjt, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On Wed, 16 Jan 2013 14:27:30 +0000, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 06:02:21AM +0000, Alex Shi wrote:
>> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
>> > On Fri, Jan 11, 2013 at 07:08:45AM +0000, Alex Shi wrote:
>> >> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>> >> For power consideration scenario, it ask task number less than Lcpu
>> >> number, don't care the load weight, since whatever the load weight, the
>> >> task only can burn one LCPU.
>> >>
>> > 
>> > True, but you miss the opportunities for power saving when you have many
>> > light tasks (> LCPU). Currently, the sd_utils < threshold check will go
>> > for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
>> > than the domain weight/capacity irrespective of the actual load caused
>> > by those tasks.
>> > 
>> > If you used tracked task load weight for sd_utils instead you would be
>> > able to go for power saving in scenarios with many light tasks as well.
>> 
>> yes, that's right on power consideration. but for performance consider,
>> it's better to spread tasks on different LCPU to save CS cost. And if
>> the cpu usage is nearly full, we don't know if some tasks real want more
>> cpu time.
>
> If the cpu is nearly full according to its tracked load it should not be
> used for packing more tasks. It is the nearly idle scenario that I am
> more interested in. If you have lots of task with tracked load <10% then
> why not pack them. The performance impact should be minimal.
>
> Furthermore, nr_running is just a snapshot of the current runqueue
> status. The combination of runnable and blocked load should give a
> better overall view of the cpu loads.

I have a feeling that a power-aware scheduling policy has to deal only
with utilization.  Of course that only works under a certain threshold;
if it is exceeded, it must switch to another policy which cares about the
load weight/average.  Just throwing out an idea. :)

>
>> Even in the power sched policy, we still want to get better performance
>> if it's possible. :)
>
> I agree if it comes for free in terms of power. In my opinion it is
> acceptable to sacrifice a bit of performance to save power when using a
> power sched policy as long as the performance regression can be
> justified by the power savings. It will of course depend on the system
> and its usage how trade-off power and performance. My point is just that
> with multiple sched policies (performance, balance and power as you
> propose) it should be acceptable to focus on power for the power policy
> and let users that only/mostly care about performance use the balance or
> performance policy.

Agreed.

>
>> > 
>> >>>> +
>> >>>> +		if (sched_policy == SCHED_POLICY_POWERSAVING)
>> >>>> +			threshold = sgs.group_weight;
>> >>>> +		else
>> >>>> +			threshold = sgs.group_capacity;
>> >>>
>> >>> Is group_capacity larger or smaller than group_weight on your platform?
>> >>
>> >> Guess most of your confusing come from the capacity != weight here.
>> >>
>> >> In most of Intel CPU, a cpu core's power(with 2 HT) is usually 1178, it
>> >> just bigger than a normal cpu power - 1024. but the capacity is still 1,
>> >> while the group weight is 2.
>> >>
>> > 
>> > Thanks for clarifying. To the best of my knowledge there are no
>> > guidelines for how to specify cpu power so it may be a bit dangerous to
>> > assume that capacity < weight when capacity is based on cpu power.
>> 
>> Sure. I also just got them from code. and don't know other arch how to
>> different them.
>> but currently, seems this cpu power concept works fine.
>
> Yes, it seems to work fine for your test platform. I just want to
> highlight that the assumption you make might not be valid for other
> architectures. I know that cpu power is not widely used, but that may
> change with the increasing focus on power aware scheduling.

AFAIK on ARM big.LITTLE, a big cpu will have a cpu power more than
1024.  I'm sure Morten knows way more than me on this. :)

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake
  2013-01-17  5:47             ` Namhyung Kim
@ 2013-01-18 13:41               ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-18 13:41 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Morten Rasmussen, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	efault, vincent.guittot, gregkh, preeti, linux-kernel

On 01/17/2013 01:47 PM, Namhyung Kim wrote:
> On Wed, 16 Jan 2013 14:27:30 +0000, Morten Rasmussen wrote:
>> On Wed, Jan 16, 2013 at 06:02:21AM +0000, Alex Shi wrote:
>>> On 01/15/2013 12:09 AM, Morten Rasmussen wrote:
>>>> On Fri, Jan 11, 2013 at 07:08:45AM +0000, Alex Shi wrote:
>>>>> On 01/10/2013 11:01 PM, Morten Rasmussen wrote:
>>>>> For power consideration scenario, it ask task number less than Lcpu
>>>>> number, don't care the load weight, since whatever the load weight, the
>>>>> task only can burn one LCPU.
>>>>>
>>>>
>>>> True, but you miss the opportunities for power saving when you have many
>>>> light tasks (> LCPU). Currently, the sd_utils < threshold check will go
>>>> for SCHED_POLICY_PERFORMANCE if the number tasks (sd_utils) is greater
>>>> than the domain weight/capacity irrespective of the actual load caused
>>>> by those tasks.
>>>>
>>>> If you used tracked task load weight for sd_utils instead you would be
>>>> able to go for power saving in scenarios with many light tasks as well.
>>>
>>> yes, that's right on power consideration. but for performance consider,
>>> it's better to spread tasks on different LCPU to save CS cost. And if
>>> the cpu usage is nearly full, we don't know if some tasks real want more
>>> cpu time.
>>
>> If the cpu is nearly full according to its tracked load it should not be
>> used for packing more tasks. It is the nearly idle scenario that I am
>> more interested in. If you have lots of task with tracked load <10% then
>> why not pack them. The performance impact should be minimal.

I had tried using runnable utils with many methods, including a way similar
to the later regular balance. But as I discussed with Mike before, a burst
wakeup has no time to accumulate util, so tasks were packed onto a few cpus
and then pulled away in the regular balance, which caused many performance
benchmarks to drop much.

So I'd rather assume a very new task will keep busy. If it does not, we
can still pull it away in the regular balance.
>>
>> Furthermore, nr_running is just a snapshot of the current runqueue
>> status. The combination of runnable and blocked load should give a
>> better overall view of the cpu loads.
> 
> I have a feeling that power aware scheduling policy has to deal only
> with the utilization.  Of course it only works under a certain threshold
> and if it's exceeded must be changed to other policy which cares the
> load weight/average.  Just throwing an idea. :)
> 
>>
>>> Even in the power sched policy, we still want to get better performance
>>> if it's possible. :)
>>
>> I agree if it comes for free in terms of power. In my opinion it is
>> acceptable to sacrifice a bit of performance to save power when using a
>> power sched policy as long as the performance regression can be
>> justified by the power savings. It will of course depend on the system
>> and its usage how trade-off power and performance. My point is just that
>> with multiple sched policies (performance, balance and power as you
>> propose) it should be acceptable to focus on power for the power policy
>> and let users that only/mostly care about performance use the balance or
>> performance policy.
> 
> Agreed.
> 

Firstly, I hope the 'balance' policy can be used widely on servers,
thus it's better not to hurt performance.

Secondly, 'race to idle' is one of the patchset's assumptions: if we can
finish the tasks earlier, we can save more power.

Last but not least, if the patchset is merged, we can do more tuning on
the 'power' policy. :)
>>>>
>>>> Thanks for clarifying. To the best of my knowledge there are no
>>>> guidelines for how to specify cpu power so it may be a bit dangerous to
>>>> assume that capacity < weight when capacity is based on cpu power.
>>>
>>> Sure. I also just got them from code. and don't know other arch how to
>>> different them.
>>> but currently, seems this cpu power concept works fine.
>>
>> Yes, it seems to work fine for your test platform. I just want to
>> highlight that the assumption you make might not be valid for other
>> architectures. I know that cpu power is not widely used, but that may
>> change with the increasing focus on power aware scheduling.

cpu_power is defined and used in generic code. I saw arm and powerpc
use it a lot in their arch code.

Anyway, would you like to share which arch doesn't fit this?
> 
> AFAIK on ARM big.LITTLE, a big cpu will have a cpu power more than
> 1024.  I'm sure Morten knows way more than me on this. :)
> 
> Thanks,
> Namhyung
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing
  2013-01-16 15:08           ` Morten Rasmussen
@ 2013-01-18 14:06             ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-18 14:06 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, linux-kernel

On 01/16/2013 11:08 PM, Morten Rasmussen wrote:
> On Wed, Jan 16, 2013 at 07:32:49AM +0000, Alex Shi wrote:
>> On 01/15/2013 01:00 AM, Morten Rasmussen wrote:
>>>>> Why multiply rq->util by nr_running?
>>>>>>>
>>>>>>> Let's take an example where rq->util = 50, nr_running = 2, and putil =
>>>>>>> 10. In this case the value of putil doesn't really matter as vacancy
>>>>>>> would be negative anyway since FULL_UTIL - rq->util * nr_running is -1.
>>>>>>> However, with rq->util = 50 there should be plenty of spare cpu time to
>>>>>>> take another task.
>>>>>
>>>>> for this example, the util is not full maybe due to it was just wake up,
>>>>> it still is possible like to run full time. So, I try to give it the
>>>>> large guess load.
>>> I don't see why rq->util should be treated different depending on the
>>> number of tasks causing the load. rq->util = 50 means that the cpu is
>>> busy about 50% of the time no matter how many tasks contibute to that
>>> load.
>>>
>>> If nr_running = 1 instead in my example, you would consider the cpu
>>> vacant if putil = 6, but if nr_running > 1 you would not. Why should the
>>> two scenarios be treated differently?
>>>
>>>>>>>
>>>>>>> Also, why multiply putil by 8? rq->util must be very close to 0 for
>>>>>>> vacancy to be positive if putil is close to 12 (12.5%).
>>>>>
>>>>> just want to pack small util tasks, since packing is possible to hurt
>>>>> performance.
>>> I agree that packing may affect performance. But why don't you reduce
>>> FULL_UTIL instead of multiplying by 8? With current expression you will
>>> not pack a 10% task if rq->util = 20 and nr_running = 1, but you would
>>> pack a 6% task even if rq->util = 50 and the resulting cpu load is much
>>> higher.
>>>
>>
>> Yes, the threshold has no strong theory or experiment support. I had
>> tried cyclitest which Vicent used, the case's load avg is too small to
>> be caught. so just use half of Vicent value as 12.5%. If you has more
>> reasonable value, let me know.
>>
>> As to nr_running engaged as multiple mode. it's base on 2 reasons.
>> 1, load avg/util need 345ms to accumulate as 100%. so, if a tasks is
>> cost full cpu time, it still has 345ms with rq->util < 1.
> 
> I agree that load avg may not be accurate, especially for new tasks. But
> why use it if you don't trust its value anyway?
> 
> The load avg (sum/period) of a new task will reach 100% instantly if the
> task is consuming all the cpu time it can get. An old task can reach 50%
> within 32ms. So you should fairly quickly be able to see if it is a
> light task or not. You may under-estimate its load in the beginning, but
> only for a very short time.

This packing is done at wakeup, so there isn't even 'a very short time' here :)
> 
>> 2, if there are more tasks, like 2 tasks running on one cpu, it's
>> possible to has capacity to burn 200% cpu time, while the biggest
>> rq->util is still 100%.
> 
> If you want to have a better metric for how much cpu time the task on
> the runqueue could potentially use, I would suggest using
> cfs_rq->runnable_load_avg which is the load_avg_contrib sum of all tasks
> on the runqueue. It would give you 200% in your example above.

runnable_load_avg also needs much time to accumulate its value; it is not
better than util here.
> 
> On the other hand, I think rq->util is fine for this purpose. If
> rq->util < 100% you know for sure that cpu is not fully utilized no
> matter how many tasks you have on the runqueue. So as long as rq->util
> is well below 100% (like < 50%) it should be safe to pack more small
> tasks on that cpu even if it has multiple tasks running already.
> 
>>
>> Consider to figure out precise utils is complicate and cost much. I do
>> this simple calculation. It is not very precise, but it is efficient and
>> more bias toward performance.
> 
> It is indeed very biased towards performance. I would prefer more focus
> on saving power in a power scheduling policy :)
> 

Agreed, and I don't refuse to change the criteria for power. :) But
without reliable benchmarks or data, everything is a guess.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-11  6:31         ` Alex Shi
@ 2013-01-21 14:47           ` Alex Shi
  2013-01-22  3:20             ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-21 14:47 UTC (permalink / raw)
  To: Paul Turner, Ingo Molnar, Peter Zijlstra
  Cc: Linus Torvalds, Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Mike Galbraith, Vincent Guittot,
	Greg Kroah-Hartman, preeti, Linux Kernel Mailing List

On 01/11/2013 02:31 PM, Alex Shi wrote:
> On 01/07/2013 02:31 AM, Linus Torvalds wrote:
>> On Sat, Jan 5, 2013 at 11:54 PM, Alex Shi <alex.shi@intel.com> wrote:
>>>
>>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>>> and run until all finished.
>>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>>> may goes to sleep till a regular balancing give it some new tasks. That
>>> causes the performance dropping. cause more idle entering.
>>
>> Sounds like for AIM (and possibly for other really bursty loads), we
>> might want to do some load-balancing at wakeup time by *just* looking
>> at the number of running tasks, rather than at the load average. Hmm?
>>
>> The load average is fundamentally always going to run behind a bit,
>> and while you want to use it for long-term balancing, a short-term you
>> might want to do just a "if we have a huge amount of runnable
>> processes, do a load balancing *now*". Where "huge amount" should
>> probably be relative to the long-term load balancing (ie comparing the
>> number of runnable processes on this CPU right *now* with the load
>> average over the last second or so would show a clear spike, and a
>> reason for quick action).
>>
> 
> Sorry for response late!
> 
> Just written a patch following your suggestion, but no clear improvement for this case.
> I also tried change the burst checking interval, also no clear help.
> 
> If I totally give up runnable load in periodic balancing, the performance can recover 60%
> of lose.
> 
> I will try to optimize wake up balancing in weekend.
> 

(BTW, the time for the runnable avg to accumulate to 100% is about 345ms;
to 50%, 32ms; a rough check of those numbers is sketched below.)
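
Those figures follow from the decay factor used by the tracking code,
y^32 = 0.5 per ~1ms (1024us) period: something continuously runnable from
an empty history reaches roughly 1 - 0.5^(t/32) of full utilization after
t ms. A throwaway user-space check of the numbers (illustrative only):

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
		/* decay factor chosen so that y^32 = 0.5, as in the kernel */
		double y = pow(0.5, 1.0 / 32.0);
		int t;

		for (t = 32; t <= 352; t += 32)
			printf("after %3d ms: ~%5.1f%% of full util\n",
			       t, (1.0 - pow(y, t)) * 100.0);

		return 0;
	}

	/* gives ~50% at 32ms and ~99.9% past ~320ms, i.e. the 32ms / ~345ms
	 * figures above */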

I have tried some tuning in both wakeup balancing and regular
balancing. Yes, when using the instant load weight (without the runnable
avg engaged), both in wakeup and in regular balance, the performance
recovered.

But with per-cpu nr_running tracking, it's hard to find an elegant way to
detect the burst, whether at wakeup or in regular balance.
At wakeup, all the cpus in the sd_llc domain are candidates, so just
checking this_cpu is not enough.
In regular balance, this_cpu is the migration destination cpu, so checking
for a burst on that cpu is not useful. Instead, we need to check the whole
domain's increase in task number.

So, I guess there are 2 solutions for this issue.
1, for quick wakeup, we use the instant load (same as the current
balancing) to do the balance; and for regular balance, we record both
instant load and runnable load data for the whole domain, then decide
which one to use according to how much the task number increased in the
domain after tracking the whole domain.

2, we keep the current instant load balancing as the performance balance
policy, and use runnable load balancing in the power-friendly policy.
So far none of us has found a performance benefit with runnable load
balancing on the hackbench/kbuild/aim9/tbench/specjbb etc. benchmarks.
I prefer the 2nd; a rough sketch of it follows below.

What's your opinion on this?
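
A sketch of what the 2nd option could look like (sched_policy and
SCHED_POLICY_PERFORMANCE are the knobs from this patchset; hooking the
choice into weighted_cpuload() here is only for illustration):

	/* choose the load metric according to the active sched policy */
	static unsigned long weighted_cpuload(const int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		if (sched_policy == SCHED_POLICY_PERFORMANCE)
			return rq->load.weight;		/* instant load */

		return rq->cfs.runnable_load_avg;	/* tracked runnable load */
	}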

Best regards
Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-21 14:47           ` Alex Shi
@ 2013-01-22  3:20             ` Alex Shi
  2013-01-22  6:55               ` Mike Galbraith
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-22  3:20 UTC (permalink / raw)
  To: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds
  Cc: Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Mike Galbraith, Vincent Guittot,
	Greg Kroah-Hartman, preeti, Linux Kernel Mailing List

>>>>
>>>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>>>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>>>> and run until all finished.
>>>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>>>> may goes to sleep till a regular balancing give it some new tasks. That
>>>> causes the performance dropping. cause more idle entering.
>>>
>>> Sounds like for AIM (and possibly for other really bursty loads), we
>>> might want to do some load-balancing at wakeup time by *just* looking
>>> at the number of running tasks, rather than at the load average. Hmm?
>>>
>>> The load average is fundamentally always going to run behind a bit,
>>> and while you want to use it for long-term balancing, a short-term you
>>> might want to do just a "if we have a huge amount of runnable
>>> processes, do a load balancing *now*". Where "huge amount" should
>>> probably be relative to the long-term load balancing (ie comparing the
>>> number of runnable processes on this CPU right *now* with the load
>>> average over the last second or so would show a clear spike, and a
>>> reason for quick action).
>>>
>>
>> Sorry for response late!
>>
>> Just written a patch following your suggestion, but no clear improvement for this case.
>> I also tried change the burst checking interval, also no clear help.
>>
>> If I totally give up runnable load in periodic balancing, the performance can recover 60%
>> of lose.
>>
>> I will try to optimize wake up balancing in weekend.
>>
> 
> (btw, the time for runnable avg to accumulate to 100%, needs 345ms; to
> 50% needs 32 ms)
> 
> I have tried some tuning in both wake up balancing and regular
> balancing. Yes, when using instant load weight (without runnable avg
> engage), both in waking up, and regular balance, the performance recovered.
> 
> But with per_cpu nr_running tracking, it's hard to find a elegant way to
> detect the burst whenever in waking up or in regular balance.
> In waking up, the whole sd_llc domain cpus are candidates, so just
> checking this_cpu is not enough.
> In regular balance, this_cpu is the migration destination cpu, checking
> if the burst on the cpu is not useful. Instead, we need to check whole
> domains' increased task number.
> 
> So, guess 2 solutions for this issue.
> 1, for quick waking up, we need use instant load(same as current
> balancing) to do balance; and for regular balance, we can record both
> instant load and runnable load data for whole domain, then decide which
> one to use according to task number increasing in the domain after
> tracking done the whole domain.
> 
> 2, we can keep current instant load balancing as performance balance
> policy, and using runnable load balancing in power friend policy.
> Since, none of us find performance benefit with runnable load balancing
> on benchmark hackbench/kbuild/aim9/tbench/specjbb etc.
> I prefer the 2nd.

3, On the other hand, considering that the aim9 testing scenario is
rare in real life (preparing thousands of tasks and then waking them
all up at the same time), and that the runnable load avg includes
useful running history info, a 5~7% performance drop on aim9 alone is
not unacceptable.
(kbuild/hackbench/tbench/specjbb show no clear performance change.)

So we could accept this drop and leave a reminder in the code, for
example (the wording below is only a suggestion):
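
/*
 * Note: cpu_load is built from the decayed runnable_load_avg, so a
 * burst of freshly woken tasks looks lighter than it really is for
 * the first few tens of milliseconds.  Measured cost: ~5-7% on aim9's
 * burst wakeup scenario; kbuild/hackbench/tbench/specjbb show no
 * clear change.
 */

Any comments?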


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-22  3:20             ` Alex Shi
@ 2013-01-22  6:55               ` Mike Galbraith
  2013-01-22  7:50                 ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Mike Galbraith @ 2013-01-22  6:55 UTC (permalink / raw)
  To: Alex Shi
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On Tue, 2013-01-22 at 11:20 +0800, Alex Shi wrote: 
> >>>>
> >>>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
> >>>> after all tasks ready, aim9 give a signal than all tasks burst waking up
> >>>> and run until all finished.
> >>>> Since each of tasks are finished very quickly, a imbalanced empty cpu
> >>>> may goes to sleep till a regular balancing give it some new tasks. That
> >>>> causes the performance dropping. cause more idle entering.
> >>>
> >>> Sounds like for AIM (and possibly for other really bursty loads), we
> >>> might want to do some load-balancing at wakeup time by *just* looking
> >>> at the number of running tasks, rather than at the load average. Hmm?
> >>>
> >>> The load average is fundamentally always going to run behind a bit,
> >>> and while you want to use it for long-term balancing, a short-term you
> >>> might want to do just a "if we have a huge amount of runnable
> >>> processes, do a load balancing *now*". Where "huge amount" should
> >>> probably be relative to the long-term load balancing (ie comparing the
> >>> number of runnable processes on this CPU right *now* with the load
> >>> average over the last second or so would show a clear spike, and a
> >>> reason for quick action).
> >>>
> >>
> >> Sorry for response late!
> >>
> >> Just written a patch following your suggestion, but no clear improvement for this case.
> >> I also tried change the burst checking interval, also no clear help.
> >>
> >> If I totally give up runnable load in periodic balancing, the performance can recover 60%
> >> of lose.
> >>
> >> I will try to optimize wake up balancing in weekend.
> >>
> > 
> > (btw, the time for runnable avg to accumulate to 100%, needs 345ms; to
> > 50% needs 32 ms)
> > 
> > I have tried some tuning in both wake up balancing and regular
> > balancing. Yes, when using instant load weight (without runnable avg
> > engage), both in waking up, and regular balance, the performance recovered.
> > 
> > But with per_cpu nr_running tracking, it's hard to find a elegant way to
> > detect the burst whenever in waking up or in regular balance.
> > In waking up, the whole sd_llc domain cpus are candidates, so just
> > checking this_cpu is not enough.
> > In regular balance, this_cpu is the migration destination cpu, checking
> > if the burst on the cpu is not useful. Instead, we need to check whole
> > domains' increased task number.
> > 
> > So, guess 2 solutions for this issue.
> > 1, for quick waking up, we need use instant load(same as current
> > balancing) to do balance; and for regular balance, we can record both
> > instant load and runnable load data for whole domain, then decide which
> > one to use according to task number increasing in the domain after
> > tracking done the whole domain.
> > 
> > 2, we can keep current instant load balancing as performance balance
> > policy, and using runnable load balancing in power friend policy.
> > Since, none of us find performance benefit with runnable load balancing
> > on benchmark hackbench/kbuild/aim9/tbench/specjbb etc.
> > I prefer the 2nd.
> 
> 3, On the other hand, Considering the aim9 testing scenario is rare in
> real life(prepare thousands tasks and then wake up them at the same
> time). And the runnable load avg includes useful running history info.
> Only aim9 5~7% performance dropping is not unacceptable.
> (kbuild/hackbench/tbench/specjbb have no clear performance change)
> 
> So we can let this drop be with a reminder in code. Any comments?

Hm.  A burst of thousands of tasks may be rare and perhaps even silly,
but what about few-task bursts?  History is useless for bursts, they
live or die now: a modest gaggle of worker threads (NR_CPUS of them)
for, say, a video encoding job wakes in parallel, and each is handed a
chunk of data to chew up in parallel.  Double the scheduler latency of
one worker (by stacking workers, because individuals don't
historically fill a cpu) and you double the latency of the entire job,
every time.

I think 2 is mandatory: keep both, and the user picks his poison.

If you want max burst performance, you care about the here and now
reality the burst is waking into.  If you're running a google freight
train farm otoh, you may want some hysteresis so trains don't over-rev
the electric meter on every microscopic spike.  Both policies make
sense, but you can't have both performance profiles with either metric,
so choosing one seems doomed to failure.

Case in point: tick skew.  It was removed because synchronized ticking
saves power.. and then promptly returned under user control because the
power saving gain also inflicted serious latency pain.

-Mike


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-22  6:55               ` Mike Galbraith
@ 2013-01-22  7:50                 ` Alex Shi
  2013-01-22  9:52                   ` Mike Galbraith
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-22  7:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On 01/22/2013 02:55 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 11:20 +0800, Alex Shi wrote: 
>>>>>>
>>>>>> I just looked into the aim9 benchmark, in this case it forks 2000 tasks,
>>>>>> after all tasks ready, aim9 give a signal than all tasks burst waking up
>>>>>> and run until all finished.
>>>>>> Since each of tasks are finished very quickly, a imbalanced empty cpu
>>>>>> may goes to sleep till a regular balancing give it some new tasks. That
>>>>>> causes the performance dropping. cause more idle entering.
>>>>>
>>>>> Sounds like for AIM (and possibly for other really bursty loads), we
>>>>> might want to do some load-balancing at wakeup time by *just* looking
>>>>> at the number of running tasks, rather than at the load average. Hmm?
>>>>>
>>>>> The load average is fundamentally always going to run behind a bit,
>>>>> and while you want to use it for long-term balancing, a short-term you
>>>>> might want to do just a "if we have a huge amount of runnable
>>>>> processes, do a load balancing *now*". Where "huge amount" should
>>>>> probably be relative to the long-term load balancing (ie comparing the
>>>>> number of runnable processes on this CPU right *now* with the load
>>>>> average over the last second or so would show a clear spike, and a
>>>>> reason for quick action).
>>>>>
>>>>
>>>> Sorry for response late!
>>>>
>>>> Just written a patch following your suggestion, but no clear improvement for this case.
>>>> I also tried change the burst checking interval, also no clear help.
>>>>
>>>> If I totally give up runnable load in periodic balancing, the performance can recover 60%
>>>> of lose.
>>>>
>>>> I will try to optimize wake up balancing in weekend.
>>>>
>>>
>>> (btw, the time for runnable avg to accumulate to 100%, needs 345ms; to
>>> 50% needs 32 ms)
>>>
>>> I have tried some tuning in both wake up balancing and regular
>>> balancing. Yes, when using instant load weight (without runnable avg
>>> engage), both in waking up, and regular balance, the performance recovered.
>>>
>>> But with per_cpu nr_running tracking, it's hard to find a elegant way to
>>> detect the burst whenever in waking up or in regular balance.
>>> In waking up, the whole sd_llc domain cpus are candidates, so just
>>> checking this_cpu is not enough.
>>> In regular balance, this_cpu is the migration destination cpu, checking
>>> if the burst on the cpu is not useful. Instead, we need to check whole
>>> domains' increased task number.
>>>
>>> So, guess 2 solutions for this issue.
>>> 1, for quick waking up, we need use instant load(same as current
>>> balancing) to do balance; and for regular balance, we can record both
>>> instant load and runnable load data for whole domain, then decide which
>>> one to use according to task number increasing in the domain after
>>> tracking done the whole domain.
>>>
>>> 2, we can keep current instant load balancing as performance balance
>>> policy, and using runnable load balancing in power friend policy.
>>> Since, none of us find performance benefit with runnable load balancing
>>> on benchmark hackbench/kbuild/aim9/tbench/specjbb etc.
>>> I prefer the 2nd.
>>
>> 3, On the other hand, Considering the aim9 testing scenario is rare in
>> real life(prepare thousands tasks and then wake up them at the same
>> time). And the runnable load avg includes useful running history info.
>> Only aim9 5~7% performance dropping is not unacceptable.
>> (kbuild/hackbench/tbench/specjbb have no clear performance change)
>>
>> So we can let this drop be with a reminder in code. Any comments?
> 
> Hm.  Burst of thousands of tasks may be rare and perhaps even silly, but
> what about few task bursts?   History is useless for bursts, they live
> or die now: modest gaggle of worker threads (NR_CPUS) for say video
> encoding job wake in parallel, each is handed a chunk of data to chew up
> in parallel.  Double scheduler latency of one worker (stack workers
> because individuals don't historically fill a cpu), you double latency
> for the entire job every time.
> 
> I think 2 is mandatory, keep both, and user picks his poison.
> 
> If you want max burst performance, you care about the here and now
> reality the burst is waking into.  If you're running a google freight
> train farm otoh, you may want some hysteresis so trains don't over-rev
> the electric meter on every microscopic spike.  Both policies make
> sense, but you can't have both performance profiles with either metric,
> so choosing one seems doomed to failure.
> 

Thanks for your suggestions and example, Mike!
Sorry, I just can't understand your last words here. What exactly is
your concern about 'both performance profiles with either metric'?
Would you like to give your preferred solution?

> Case in point: tick skew.  It was removed because synchronized ticking
> saves power.. and then promptly returned under user control because the
> power saving gain also inflicted serious latency pain.
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-22  7:50                 ` Alex Shi
@ 2013-01-22  9:52                   ` Mike Galbraith
  2013-01-23  0:36                     ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Mike Galbraith @ 2013-01-22  9:52 UTC (permalink / raw)
  To: Alex Shi
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On Tue, 2013-01-22 at 15:50 +0800, Alex Shi wrote:

> Thanks for your suggestions and example, Mike!
> I just can't understand the your last words here, Sorry. what the
> detailed concern of you on 'both performance profiles with either
> metric'? Could you like to give your preferred solutions?

Hm.. I'll try rephrasing.  Any power saving gain will of necessity be
paid for in latency currency.  I don't have a solution other than to
make a button and let the user decide whether history influences fast
path task placement or not.  Any other decision maker will get it
wrong.
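
To be concrete, the button could be as dumb as the below (names are
made up; the sched_policy sysfs file from patch 14/22 would be one
obvious place to wire it up): one user-owned flag, read in the fast
path.

static unsigned int sysctl_sched_burst_placement __read_mostly = 1;

static unsigned long wake_load(struct rq *rq)
{
	/* burst mode: trust the here and now, ignore history */
	if (sysctl_sched_burst_placement)
		return rq->load.weight;
	return (unsigned long)rq->cfs.runnable_load_avg;
}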

-Mike


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-22  9:52                   ` Mike Galbraith
@ 2013-01-23  0:36                     ` Alex Shi
  2013-01-23  1:47                       ` Mike Galbraith
  0 siblings, 1 reply; 91+ messages in thread
From: Alex Shi @ 2013-01-23  0:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On 01/22/2013 05:52 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 15:50 +0800, Alex Shi wrote:
> 
>> Thanks for your suggestions and example, Mike!
>> I just can't understand the your last words here, Sorry. what the
>> detailed concern of you on 'both performance profiles with either
>> metric'? Could you like to give your preferred solutions?
> 
> Hm.. I'll try rephrasing.  Any power saving gain will of necessity be
> paid for in latency currency.  I don't have a solution other than make a
> button, let the user decide whether history influences fast path task
> placement or not.  Any other decision maker will get it wrong.

Um, if there is no other objection, I'd like to make the runnable load
used only by the power-friendly policies -- for this patch set, those
are 'powersaving' and 'balance'. Can I?

> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-23  0:36                     ` Alex Shi
@ 2013-01-23  1:47                       ` Mike Galbraith
  2013-01-23  2:01                         ` Alex Shi
  0 siblings, 1 reply; 91+ messages in thread
From: Mike Galbraith @ 2013-01-23  1:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On Wed, 2013-01-23 at 08:36 +0800, Alex Shi wrote: 
> On 01/22/2013 05:52 PM, Mike Galbraith wrote:
> > On Tue, 2013-01-22 at 15:50 +0800, Alex Shi wrote:
> > 
> >> Thanks for your suggestions and example, Mike!
> >> I just can't understand the your last words here, Sorry. what the
> >> detailed concern of you on 'both performance profiles with either
> >> metric'? Could you like to give your preferred solutions?
> > 
> > Hm.. I'll try rephrasing.  Any power saving gain will of necessity be
> > paid for in latency currency.  I don't have a solution other than make a
> > button, let the user decide whether history influences fast path task
> > placement or not.  Any other decision maker will get it wrong.
> 
> Um, if no other objection, I'd like to move the runnable load only used
> for power friendly policy -- for this patchset, they are 'powersaving'
> and 'balance', Can I?

Yeah, that should work fine.

-Mike


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task
  2013-01-23  1:47                       ` Mike Galbraith
@ 2013-01-23  2:01                         ` Alex Shi
  0 siblings, 0 replies; 91+ messages in thread
From: Alex Shi @ 2013-01-23  2:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul Turner, Ingo Molnar, Peter Zijlstra, Linus Torvalds,
	Thomas Gleixner, Andrew Morton, Arjan van de Ven,
	Borislav Petkov, namhyung, Vincent Guittot, Greg Kroah-Hartman,
	preeti, Linux Kernel Mailing List

On 01/23/2013 09:47 AM, Mike Galbraith wrote:
> On Wed, 2013-01-23 at 08:36 +0800, Alex Shi wrote: 
>> On 01/22/2013 05:52 PM, Mike Galbraith wrote:
>>> On Tue, 2013-01-22 at 15:50 +0800, Alex Shi wrote:
>>>
>>>> Thanks for your suggestions and example, Mike!
>>>> I just can't understand the your last words here, Sorry. what the
>>>> detailed concern of you on 'both performance profiles with either
>>>> metric'? Could you like to give your preferred solutions?
>>>
>>> Hm.. I'll try rephrasing.  Any power saving gain will of necessity be
>>> paid for in latency currency.  I don't have a solution other than make a
>>> button, let the user decide whether history influences fast path task
>>> placement or not.  Any other decision maker will get it wrong.
>>
>> Um, if no other objection, I'd like to move the runnable load only used
>> for power friendly policy -- for this patchset, they are 'powersaving'
>> and 'balance', Can I?
> 
> Yeah, that should work be fine.

Thanks for the comments! :)
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2013-01-23  2:00 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-05  8:37 [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Alex Shi
2013-01-05  8:37 ` [PATCH v3 01/22] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
2013-01-05  8:37 ` [PATCH v3 02/22] sched: select_task_rq_fair clean up Alex Shi
2013-01-11  4:57   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 03/22] sched: fix find_idlest_group mess logical Alex Shi
2013-01-11  4:59   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 04/22] sched: don't need go to smaller sched domain Alex Shi
2013-01-09 17:38   ` Morten Rasmussen
2013-01-10  3:16     ` Mike Galbraith
2013-01-11  5:02   ` Preeti U Murthy
2013-01-05  8:37 ` [PATCH v3 05/22] sched: remove domain iterations in fork/exec/wake Alex Shi
2013-01-09 18:21   ` Morten Rasmussen
2013-01-11  2:46     ` Alex Shi
2013-01-11 10:07       ` Morten Rasmussen
2013-01-11 14:50         ` Alex Shi
2013-01-14  8:55         ` li guang
2013-01-14  9:18           ` Alex Shi
2013-01-11  4:56     ` Preeti U Murthy
2013-01-11  8:01       ` li guang
2013-01-11 14:56         ` Alex Shi
2013-01-14  9:03           ` li guang
2013-01-15  2:34             ` Alex Shi
2013-01-16  1:54               ` li guang
2013-01-11 10:54       ` Morten Rasmussen
2013-01-16  5:43       ` Alex Shi
2013-01-16  7:41         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 06/22] sched: load tracking bug fix Alex Shi
2013-01-05  8:37 ` [PATCH v3 07/22] sched: set initial load avg of new forked task Alex Shi
2013-01-11  5:10   ` Preeti U Murthy
2013-01-11  5:44     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 08/22] sched: update cpu load after task_tick Alex Shi
2013-01-05  8:37 ` [PATCH v3 09/22] sched: compute runnable load avg in cpu_load and cpu_avg_load_per_task Alex Shi
2013-01-05  8:56   ` Alex Shi
2013-01-06  7:54     ` Alex Shi
2013-01-06 18:31       ` Linus Torvalds
2013-01-07  7:00         ` Preeti U Murthy
2013-01-08 14:27         ` Alex Shi
2013-01-11  6:31         ` Alex Shi
2013-01-21 14:47           ` Alex Shi
2013-01-22  3:20             ` Alex Shi
2013-01-22  6:55               ` Mike Galbraith
2013-01-22  7:50                 ` Alex Shi
2013-01-22  9:52                   ` Mike Galbraith
2013-01-23  0:36                     ` Alex Shi
2013-01-23  1:47                       ` Mike Galbraith
2013-01-23  2:01                         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 10/22] sched: consider runnable load average in move_tasks Alex Shi
2013-01-05  8:37 ` [PATCH v3 11/22] sched: consider runnable load average in effective_load Alex Shi
2013-01-10 11:28   ` Morten Rasmussen
2013-01-11  3:26     ` Alex Shi
2013-01-14 12:01       ` Morten Rasmussen
2013-01-16  5:30         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 12/22] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2013-01-05  8:37 ` [PATCH v3 13/22] sched: add sched_policy in kernel Alex Shi
2013-01-05  8:37 ` [PATCH v3 14/22] sched: add sched_policy and it's sysfs interface Alex Shi
2013-01-14  6:53   ` Namhyung Kim
2013-01-14  8:11     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 15/22] sched: log the cpu utilization at rq Alex Shi
2013-01-10 11:40   ` Morten Rasmussen
2013-01-11  3:30     ` Alex Shi
2013-01-14 13:59       ` Morten Rasmussen
2013-01-16  5:53         ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 16/22] sched: add power aware scheduling in fork/exec/wake Alex Shi
2013-01-10 15:01   ` Morten Rasmussen
2013-01-11  7:08     ` Alex Shi
2013-01-14 16:09       ` Morten Rasmussen
2013-01-16  6:02         ` Alex Shi
2013-01-16 14:27           ` Morten Rasmussen
2013-01-17  5:47             ` Namhyung Kim
2013-01-18 13:41               ` Alex Shi
2013-01-14  7:03   ` Namhyung Kim
2013-01-14  8:30     ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 17/22] sched: packing small tasks in wake/exec balancing Alex Shi
2013-01-10 17:17   ` Morten Rasmussen
2013-01-11  3:47     ` Alex Shi
2013-01-14  7:13       ` Namhyung Kim
2013-01-16  6:11         ` Alex Shi
2013-01-16 12:52           ` Namhyung Kim
2013-01-14 17:00       ` Morten Rasmussen
2013-01-16  7:32         ` Alex Shi
2013-01-16 15:08           ` Morten Rasmussen
2013-01-18 14:06             ` Alex Shi
2013-01-05  8:37 ` [PATCH v3 18/22] sched: add power/performance balance allowed flag Alex Shi
2013-01-05  8:37 ` [PATCH v3 19/22] sched: pull all tasks from source group Alex Shi
2013-01-05  8:37 ` [PATCH v3 20/22] sched: don't care if the local group has capacity Alex Shi
2013-01-05  8:37 ` [PATCH v3 21/22] sched: power aware load balance, Alex Shi
2013-01-05  8:37 ` [PATCH v3 22/22] sched: lazy powersaving balance Alex Shi
2013-01-14  8:39   ` Namhyung Kim
2013-01-14  8:45     ` Alex Shi
2013-01-09 17:16 ` [PATCH V3 0/22] sched: simplified fork, enable load average into LB and power awareness scheduling Morten Rasmussen
2013-01-10  3:49   ` Alex Shi
