linux-kernel.vger.kernel.org archive mirror
* [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
@ 2013-01-24  3:06 Alex Shi
  2013-01-24  3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
                   ` (20 more replies)
  0 siblings, 21 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

Since the runnable load info needs 345ms to accumulate, balancing
doesn't handle bursts of waking tasks well. After talking with Mike
Galbraith, we agreed to use the runnable average only in power-friendly
scheduling and to keep the current instantaneous load in performance
scheduling for low latency.

So the biggest change in this version is that the runnable load average
is no longer used in regular balancing; the runnable data is used only
in power balancing.

The patchset is based on Linus' tree and consists of 3 parts:
** 1, bug fix and fork/wake balancing clean up, patches 1~5
----------------------
The first patch removes one domain level. Patches 2~5 simplify fork/wake
balancing, which increases hackbench performance by 10+% on our 4-socket
SNB EP machine.

V3 change:
a, added the first patch to remove one domain level on the x86 platform.
b, some small changes according to Namhyung Kim's comments, thanks!

** 2, load avg bug fixes and removal of the CONFIG_FAIR_GROUP_SCHED limit
----------------------
Patches 6~8 prepare the runnable average for use in load balancing, with
fixes for two uninitialized runnable variables.

V4 change:
a, removed the use of the runnable load avg in regular balancing.

V3 change:
a, use rq->cfs.runnable_load_avg as the cpu load instead of
rq->avg.load_avg_contrib, since the latter needs much time to accumulate
for a newly forked task.
b, fixed a build issue thanks to Namhyung Kim's reminder.

** 3, power aware scheduling, patches 9~18.
----------------------
This subset implements the rough power aware scheduling
proposal: https://lkml.org/lkml/2012/8/13/139.
It defines 2 new power aware policies, 'balance' and 'powersaving', and
then tries to spread or pack tasks at each sched group level according
to the chosen scheduler policy. That can save much power when the number
of tasks in the system is no more than the number of LCPUs.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, racing to idle is helpful for power saving
2, packing tasks onto fewer sched_groups reduces power consumption

The first assumption makes the performance policy take over scheduling
when the system is busy.
The second assumption makes power aware scheduling try to move
dispersed tasks into fewer groups until those groups are full of tasks.

Some power testing data is in the last 2 patches.

V4 change:
a, fixed a few bugs and cleaned up code according to comments from
Morten Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, took Morten's suggestion to use different criteria for different
policies in small task packing.
c, shorter latency in power aware scheduling.

V3 change:
a, factored nr_running into the maximum potential utilization
consideration in periodic power balancing.
b, try to exec/wake small tasks on a running cpu instead of an idle cpu.

V2 change:
a, added lazy power scheduling to deal with kbuild-like benchmarks.


Thanks Fengguang Wu for the build testing of this patchset!

Any comments are appreciated!

-- Thanks Alex

[patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce
[patch v4 02/18] sched: select_task_rq_fair clean up
[patch v4 03/18] sched: fix find_idlest_group mess logical
[patch v4 04/18] sched: don't need go to smaller sched domain
[patch v4 05/18] sched: quicker balancing on fork/exec/wake
[patch v4 06/18] sched: give initial value for runnable avg of sched
[patch v4 07/18] sched: set initial load avg of new forked task
[patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
[patch v4 09/18] sched: add sched_policies in kernel
[patch v4 10/18] sched: add sysfs interface for sched_policy
[patch v4 11/18] sched: log the cpu utilization at rq
[patch v4 12/18] sched: add power aware scheduling in fork/exec/wake
[patch v4 13/18] sched: packing small tasks in wake/exec balancing
[patch v4 14/18] sched: add power/performance balance allowed flag
[patch v4 15/18] sched: pull all tasks from source group
[patch v4 16/18] sched: don't care if the local group has capacity
[patch v4 17/18] sched: power aware load balance,
[patch v4 18/18] sched: lazy power balance


* [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:11   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

The domain flag SD_PREFER_SIBLING was originally set on both the MC and
CPU domains by commit b5d978e0c7e79a, and was removed carelessly when
the obsolete power scheduler was cleaned up. Commit 6956dc568 then
restored the flag on the CPU domain only. That works, but it introduces
an extra domain level since it makes the MC and CPU domains differ.

So, restore the flag on the MC domain too to remove a domain level on
the x86 platform.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/topology.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_PREFER_SIBLING			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
-- 
1.7.12



* [patch v4 02/18] sched: select_task_rq_fair clean up
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
  2013-01-24  3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:14   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

It is impossible to miss a cpu allowed for the task in an eligible
group.

And since find_idlest_group only returns a group that excludes the old
cpu, it is also impossible for the new cpu to be the same as the old
cpu.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..6d3a95d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,11 +3378,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
 
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
-- 
1.7.12



* [patch v4 03/18] sched: fix find_idlest_group mess logical
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
  2013-01-24  3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
  2013-01-24  3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:16   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 04/18] sched: don't need go to smaller sched domain Alex Shi
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

There are 4 situations in the function:
1, no group allows the task;
	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
2, only the local group allows the task;
	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
3, only non-local groups allow the task;
	so min_load assigned, this_load = 0, idlest != NULL
4, the local group plus another group allow the task;
	so min_load assigned, this_load assigned, idlest != NULL

The current logic returns NULL in the first 3 scenarios, and in the 4th
situation it still returns NULL if the idlest group is heavier than the
local group.

Actually, the groups in situations 2 and 3 are also eligible to host
the task. And in the 4th situation, biasing toward the local group is
fine. Hence this patch.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d3a95d..3c7b09a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3181,6 +3181,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int load_idx)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *this_group = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -3215,14 +3216,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
+			this_group = group;
+		}
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
-		return NULL;
+	if (this_group && idlest != this_group)
+		/* Bias toward our group again */
+		if (100*this_load < imbalance*min_load)
+			idlest = this_group;
+
 	return idlest;
 }
 
-- 
1.7.12



* [patch v4 04/18] sched: don't need go to smaller sched domain
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (2 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

If the parent sched domain has no cpu allowed for the task, then
neither does its child. So, bail out to avoid the useless checking.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3c7b09a..ecfbf8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3378,10 +3378,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			load_idx = sd->wake_idx;
 
 		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
+		if (!group)
+			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
 
-- 
1.7.12



* [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (3 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 04/18] sched: don't need go to smaller sched domain Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:22   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

The bottom-up cpu search in the domain tree seems to come from commit
3dbd5342074a1e ("sched: multilevel sbe sbf"); its purpose is to balance
tasks over domains at all levels.

This balancing costs too much when there are many domains/groups in a
large system.

If we remove this code, we get quick fork/exec/wake balancing with a
similar balancing result across the whole system.

This patch increases hackbench performance by 10+% on my 4-socket
SNB machines and by about 3% on 2-socket servers.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 20 +-------------------
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfbf8e..895a3f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3364,15 +3364,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		goto unlock;
 	}
 
-	while (sd) {
+	if (sd) {
 		int load_idx = sd->forkexec_idx;
 		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
 
 		if (sd_flag & SD_BALANCE_WAKE)
 			load_idx = sd->wake_idx;
@@ -3382,18 +3376,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			goto unlock;
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
 	}
 unlock:
 	rcu_read_unlock();
-- 
1.7.12



* [patch v4 06/18] sched: give initial value for runnable avg of sched entities.
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (4 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:23   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise random values in those variables make a mess when the new
task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1743746 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.12



* [patch v4 07/18] sched: set initial load avg of new forked task
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (5 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:26   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

A new task has no runnable sum at the time it first becomes runnable,
so its runnable load is zero. If we use the runnable load in balancing,
a burst of forks makes balancing assign the tasks to only a few idle
cpus.

Set the initial load avg of a newly forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..f283d3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..7292965 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,7 +1706,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 895a3f4..5c545e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * set the initial load avg of a new task to its load weight
+	 * so that a burst of forks does not make a few cpus too heavy
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
-- 
1.7.12



* [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (6 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:27   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

Remove the CONFIG_FAIR_GROUP_SCHED dependency that guards the runnable
load tracking info, so that we can use the runnable load variables
unconditionally on SMP.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  8 +-------
 kernel/sched/core.c   |  7 +------
 kernel/sched/fair.c   | 13 ++-----------
 kernel/sched/sched.h  |  9 +--------
 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f283d3d..66b05e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7292965..0bd9d5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1551,12 +1551,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 	p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c545e4..efeb65c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3391,12 +3390,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3419,7 +3412,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6111,9 +6103,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 
1.7.12



* [patch v4 09/18] sched: add sched_policies in kernel
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (7 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:36   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 10/18] sched: add sysfs interface for sched_policy selection Alex Shi
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

The current scheduler behavior only considers maximizing system
performance, so it tries to spread tasks over more cpu sockets and cpu
cores.

To add power awareness, this patchset introduces 2 new scheduler
policies: powersaving and balance. They use the runnable load
utilization in scheduler balancing. The current scheduling behaviour is
taken as the performance policy.

performance: the current scheduling behaviour, try to spread tasks
                onto more CPU sockets or cores. Performance oriented.
powersaving: pack tasks into few sched groups until all LCPUs in the
                group are busy. Power oriented.
balance    : pack tasks into few sched groups until group_capacity
                CPUs in the group are busy. A balance between
                performance and powersaving.

The following patches will enable powersaving/balance scheduling in CFS.
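
As a reference for how this knob is meant to be consumed, here is a
minimal standalone sketch (not the actual kernel code; simplified from
patch 12 of this series) of picking a packing threshold from the
policy:

	#define SCHED_POLICY_PERFORMANCE	(0x1)
	#define SCHED_POLICY_POWERSAVING	(0x2)
	#define SCHED_POLICY_BALANCE		(0x4)

	/* Threshold of tasks a group may take before it counts as full.
	 * The performance policy never packs, so it gets no threshold. */
	unsigned long pack_threshold(int policy, unsigned long group_weight,
				     unsigned long group_capacity)
	{
		if (policy == SCHED_POLICY_POWERSAVING)
			return group_weight;	/* fill every LCPU in the group */
		if (policy == SCHED_POLICY_BALANCE)
			return group_capacity;	/* fill only full-power CPUs */
		return 0;			/* performance: no packing */
	}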

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  | 2 ++
 kernel/sched/sched.h | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efeb65c..538f469 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6086,6 +6086,8 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae3511e..66b08a1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+#define SCHED_POLICY_BALANCE		(0x4)
+
+extern int __read_mostly sched_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.12



* [patch v4 10/18] sched: add sysfs interface for sched_policy selection
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (8 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

This patch adds the power aware scheduler knob into sysfs:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_policy/current_sched_policy
powersaving

This means the sched policy currently in use is 'powersaving'.

The user can change the policy with 'echo':
 echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 25 +++++++
 kernel/sched/fair.c                                | 76 ++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..0ca0727 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,31 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_policy/current_sched_policy
+		/sys/devices/system/cpu/sched_policy/available_sched_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS scheduler policy showing and setting interface.
+
+		available_sched_policy shows there are 3 kinds of policy now:
+		performance, balance and powersaving.
+		current_sched_policy shows current scheduler policy. User
+		can change the policy by writing it.
+
+		Policy decides the CFS scheduler how to distribute tasks onto
+		different CPU unit.
+
+		performance: try to spread tasks onto more CPU sockets,
+		more CPU cores. performance oriented.
+
+		powersaving: try to pack tasks onto same core or same CPU
+		until every LCPUs are busy in the core or CPU socket.
+		powersaving oriented.
+
+		balance:     try to pack tasks onto same core or same CPU
+		until full powered CPUs are busy.
+		balance between performance and powersaving.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 538f469..947542f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6088,6 +6088,82 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 
 /* The default scheduler policy is 'performance'. */
 int __read_mostly sched_policy = SCHED_POLICY_PERFORMANCE;
+
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_policy(struct device *dev,
+		struct device_attribute *attr,
+		char *buf)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	else if (sched_policy == SCHED_POLICY_BALANCE)
+		return sprintf(buf, "balance\n");
+	return 0;
+}
+
+static ssize_t set_sched_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_policy = SCHED_POLICY_POWERSAVING;
+	else if (!strcmp(str_policy, "balance"))
+		sched_policy = SCHED_POLICY_BALANCE;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_policy, 0644, show_current_sched_policy,
+						set_sched_policy);
+
+static DEVICE_ATTR(available_sched_policy, 0444,
+		show_available_sched_policy, NULL);
+
+static struct attribute *sched_policy_default_attrs[] = {
+	&dev_attr_current_sched_policy.attr,
+	&dev_attr_available_sched_policy.attr,
+	NULL
+};
+static struct attribute_group sched_policy_attr_group = {
+	.attrs = sched_policy_default_attrs,
+	.name = "sched_policy",
+};
+
+int __init create_sysfs_sched_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_policy_attr_group);
+}
+
+static int __init sched_policy_sysfs_init(void)
+{
+	return create_sysfs_sched_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
-- 
1.7.12



* [patch v4 11/18] sched: log the cpu utilization at rq
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (9 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 10/18] sched: add sysfs interface for sched_policy selection Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-02-12 10:39   ` Peter Zijlstra
  2013-01-24  3:06 ` [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

The cpu utilization measures how busy the cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since util is never more than 1, we use its percentage value in later
calculations, and define FULL_UTIL as 100%.

The later power aware scheduling is sensitive to how busy a cpu is, not
to how much load weight it carries. Power consumption is more closely
related to cpu busy time than to load weight.
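
For illustration, the same calculation as a standalone sketch with
made-up numbers (the tracking values are hypothetical, not measured
data):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical per-rq runnable tracking values */
		unsigned int runnable_avg_sum = 23500;
		unsigned int runnable_avg_period = 47000;

		/* guard against a zero period, as the patch does */
		unsigned int period = runnable_avg_period ? runnable_avg_period : 1;
		unsigned int util = runnable_avg_sum * 100 / period;

		printf("util = %u%%\n", util);	/* 50, i.e. the cpu is half busy */
		return 0;
	}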

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/debug.c | 1 +
 kernel/sched/fair.c  | 4 ++++
 kernel/sched/sched.h | 4 ++++
 3 files changed, 9 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2cd3c1b..e4035f7 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -318,6 +318,7 @@ do {									\
 
 	P(ttwu_count);
 	P(ttwu_local);
+	P(util);
 
 #undef P
 #undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 947542f..20363fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	u32 period;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+	rq->util = rq->avg.runnable_avg_sum * 100 / period;
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66b08a1..fa8bdb9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* the percentage full cpu utilization */
+#define FULL_UTIL	100
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +484,7 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+	unsigned int util;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.12



* [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (10 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 13/18] sched: packing small tasks in wake/exec balancing Alex Shi
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That saves power for the other groups.

The trade-off is an extra power aware statistics collection during
group seeking. But since the collection only happens when power
scheduling is eligible, the worst case in hackbench testing only drops
by about 2% with the powersaving/balance policies. There is no clear
change for the performance policy.

I had tried to use the rq utilization in this balancing, but the
utilization needs much time (345ms) to accumulate, which is bad for any
burst balancing. So I use the instant rq utilization -- nr_running.
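
To make the group selection concrete, here is a standalone sketch with
hypothetical numbers (it only models the group_leader choice done in
get_sd_power_stats() below, it is not the kernel code itself):

	#include <stdio.h>
	#include <limits.h>

	int main(void)
	{
		/* two groups of 4 LCPUs, running 3 and 1 tasks (made up) */
		int nr_running[2] = { 3, 1 };
		int threshold = 4;	/* group_weight under the powersaving policy */
		int min_delta = INT_MAX, leader = -1;

		for (int g = 0; g < 2; g++) {
			int delta = threshold - nr_running[g];
			/* the busiest group that can still take one more task */
			if (delta > 0 && delta < min_delta) {
				min_delta = delta;
				leader = g;
			}
		}
		printf("group_leader = %d\n", leader);	/* 0: pack into the busier group */
		return 0;
	}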

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 230 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 179 insertions(+), 51 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20363fb..7c7d9db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3318,25 +3318,189 @@ done:
 }
 
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned int  this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned int  busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
+
+	unsigned long sum_shared_running;	/* 0 on non-NUMA */
+};
+
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		sgs->group_utils += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	int sd_min_delta = INT_MAX;
+	int cpu = task_cpu(p);
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+			continue;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+
+		g_delta = threshold - sgs.group_utils;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_utils += sgs.group_utils;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	unsigned long threshold;
+
+	if (sched_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	memset(sds, 0, sizeof(*sds));
+	get_sd_power_stats(sd, p, sds);
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sd->span_weight;
+	else
+		threshold = sds->sd_capacity;
+
+	/* still can hold one more task in this domain */
+	if (sds->sd_utils < threshold)
+		return sched_policy;
+
+	return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+	return new_cpu;
+}
+
+/*
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3362,11 +3526,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
 
 	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
+
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
@@ -4167,51 +4340,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-};
-
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
-- 
1.7.12



* [patch v4 13/18] sched: packing small tasks in wake/exec balancing
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (11 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 14/18] sched: add power/performance balance allowed flag Alex Shi
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

If the woken/execed task is idle enough, it gets the chance to be
packed onto a cpu which is busy but still has enough spare time to take
care of it.

Morten Rasmussen caught a bug and suggested using different criteria
for different policies, thanks!
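
For reference, a standalone sketch of the packing criterion with
hypothetical numbers (the figures are invented for illustration; the
real check lives in find_leader_cpu() below):

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical task tracking values: a ~10% utilization task */
		unsigned int task_sum = 470, task_period = 4699;
		unsigned int putil = task_sum * 100 / (task_period + 1);

		unsigned int rq_util = 40;	/* hypothetical cpu utilization, in % */
		unsigned int nr_running = 1;
		int max_util = 100;		/* FULL_UTIL for powersaving, 60 for balance */

		/* the task's own util is weighted by 4, i.e. putil << 2 */
		int vacancy = max_util - (int)(rq_util * nr_running + (putil << 2));

		printf("putil=%u%% vacancy=%d: %s\n", putil, vacancy,
		       vacancy > 0 ? "light enough to pack here" : "cpu too busy");
		return 0;
	}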

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c7d9db..eede065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3466,19 +3466,72 @@ static inline int get_sd_sched_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	/* percentage of the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	struct rq *rq = cpu_rq(this_cpu);
+	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+	int vacancy, min_vacancy = INT_MAX, max_util;
+	int leader_cpu = -1;
+	int i;
+
+	if (policy == SCHED_POLICY_POWERSAVING)
+		max_util = FULL_UTIL;
+	else
+		/* maximum allowable util is 60% */
+		max_util = 60;
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+		max_util - (rq->util * nr_running + (putil << 2)) > 0)
+			return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		rq = cpu_rq(i);
+		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only light task allowed, like putil < 25% for powersaving */
+		vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int fork)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (!fork)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3529,14 +3582,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						flags & SD_BALANCE_FORK);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12



* [patch v4 14/18] sched: add power/performance balance allowed flag
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (12 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 13/18] sched: packing small tasks in wake/exec balancing Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 15/18] sched: pull all tasks from source group Alex Shi
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

If a sched domain is idle enough for power balance, power_lb
will be set and perf_lb will be cleared. If a sched domain is busy,
the values are set the other way around.

If the domain is suitable for power balance but the balancing should
not be done by this cpu, both perf_lb and power_lb are cleared so that
a more suitable cpu can do the power balance. That means no balancing
at all, neither power balance nor performance balance, is done on this
balance cpu.

The logic above will be implemented by the following patches; the three
resulting flag combinations are summarized below.
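
(A summary of the intended states, restated from the text above; not
part of the patch itself:)

	power_lb  perf_lb   meaning
	--------  -------   ----------------------------------------------
	1         0         domain idle enough: do power balance
	0         1         domain busy: do normal performance balance
	0         0         domain suits power balance, but this cpu is
	                    not the right one: do no balancing here at all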

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eede065..19624f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4039,6 +4039,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if power balance needed */
+	int			perf_lb;   /* if performance balance needed */
 };
 
 /*
@@ -5180,6 +5182,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.power_lb	= 0,
+		.perf_lb	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
-- 
1.7.12



* [patch v4 15/18] sched: pull all tasks from source group
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (13 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 14/18] sched: add power/performance balance allowed flag Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 16/18] sched: don't care if the local group has capacity Alex Shi
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

In power balance, we want some sched groups to become completely empty
so that their CPUs can save power. So, we want to be able to pull every
task from them, even the last one.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19624f4..a1ccb40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,7 +5110,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -5214,7 +5216,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.12



* [patch v4 16/18] sched: don't care if the local group has capacity
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (14 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 15/18] sched: pull all tasks from source group Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:06 ` [patch v4 17/18] sched: power aware load balance, Alex Shi
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

In power aware scheduling, we do not care about load weight, and we do
not want to pull tasks just because the local group has capacity. The
local group may have no tasks at the moment, which is exactly what the
power balance hopes for.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a1ccb40..94bd40b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4767,8 +4767,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
+		if (prefer_sibling && !local_group && sds->this_has_capacity
+				&& env->perf_lb)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.12



* [patch v4 17/18] sched: power aware load balance,
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (15 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 16/18] sched: don't care if the local group has capacity Alex Shi
@ 2013-01-24  3:06 ` Alex Shi
  2013-01-24  3:07 ` [patch v4 18/18] sched: lazy power balance Alex Shi
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:06 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

This patch enables the power aware consideration in load balance.

As mentioned in the power aware scheduler proposal, Power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, pack tasks on less sched_groups will reduce power consumption

The first assumption make performance policy take over scheduling when
system busy.
The second assumption make power aware scheduling try to move
disperse tasks into fewer groups until that groups are full of tasks.

This patch reuses some of Suresh's power saving load balance code.

A summary of the enabling logic (see the sketch after this list):
1, Collect power aware scheduler statistics during the performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balance, just do it
and skip the performance load balance. If the domain is suitable for
power balance but this cpu is not the appropriate one, stop both the
power and the performance balance; otherwise do the performance load
balance.
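
Condensed, that decision looks roughly like the sketch below (the helper
name power_balance_decision() is only illustrative; the real code is the
find_busiest_group() hunk in this patch):

/*
 * Sketch only, not the actual kernel code: the decision made in
 * find_busiest_group() after the statistics pass.
 *
 *  1 - do a power balance, pulling from the least loaded group
 *  0 - fall back to the normal performance balance
 * -1 - this cpu/domain should do neither right now
 */
static int power_balance_decision(int perf_lb, int power_lb,
                                  int this_is_group_leader,
                                  int leader_is_min_group)
{
        if (perf_lb)
                return 0;       /* performance balance takes over */
        if (!power_lb)
                return -1;      /* neither type is eligible here */
        if (this_is_group_leader && !leader_is_min_group)
                return 1;       /* local group is the designated leader group */
        return -1;              /* wrong cpu for power balance in this domain */
}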

A simple test shows the effect of the different policies; it spawns the
given number of pure busy-loop jobs:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (the data is Watts):
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and
68 Watts, since the HT capacity and the core capacity are both 1: four
tasks sit right at the packing criteria, so the policy can swing between
packing (57W) and spreading (68W).

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system has only a few continuously running tasks, using a power
policy can give a performance/power gain, e.g. in the sysbench fileio
randrw test with 16 threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 124 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94bd40b..a83ad90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3350,6 +3350,7 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
 	unsigned long min_load_per_task; /* load_per_task in group_min */
 	unsigned int  leader_util;	/* sum utilizations of group_leader */
 	unsigned int  min_util;		/* sum utilizations of group_min */
@@ -4396,6 +4397,106 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
+
+/**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (env->perf_lb)
+		return;
+
+	if (sched_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+		sgs->group_utils + FULL_UTIL > threshold_util))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->group_utils > threshold_util) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold_util)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -4635,6 +4736,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		/* accumulate the maximum potential util */
+		if (!nr_running)
+			nr_running = 1;
+		sgs->group_utils += rq->util * nr_running;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4743,6 +4850,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4794,6 +4902,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -5011,6 +5120,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5188,8 +5310,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -6267,7 +6389,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* [patch v4 18/18] sched: lazy power balance
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (16 preceding siblings ...)
  2013-01-24  3:06 ` [patch v4 17/18] sched: power aware load balance, Alex Shi
@ 2013-01-24  3:07 ` Alex Shi
  2013-01-24  9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-24  3:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel, alex.shi

When the number of active tasks in a sched domain hovers around the
power friendly scheduling criteria, scheduling will thrash between the
power friendly balance and the performance balance, causing unnecessary
task migrations. The typical benchmark is 'make -j x'.

To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. A powersaving LB is accepted only
if there has been no performance LB in the last 32 load balance attempts
and no more than 4 performance LBs in the last 64, or if no LB at all
ran for 8 * max_interval ms. Otherwise, give up this power aware LB
chance.
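
A minimal user-space sketch of that bookkeeping (the helper name
lazy_power_balance_ok() is only illustrative, and __builtin_popcountll()
stands in for the kernel's hweight64()):

#include <stdint.h>

#define PERF_LB_HH_MASK 0xffffffff00000000ULL  /* the 32 older balance attempts */
#define PERF_LB_LH_MASK 0xffffffffULL          /* the 32 most recent attempts */

/*
 * Sketch only: shift in one bit per load balance attempt, set when a
 * performance balance was done.  A powersaving balance is accepted
 * only when no performance balance happened in the last 32 attempts
 * and at most 4 happened in the 32 before that.  (The caller clears
 * the record when no balance at all ran for 8 * max_interval ms.)
 */
static int lazy_power_balance_ok(uint64_t *record, int did_perf_lb)
{
        *record <<= 1;
        if (did_perf_lb) {
                *record |= 0x1;
                return 0;
        }
        return __builtin_popcountll(*record & PERF_LB_HH_MASK) <= 4 &&
               !(*record & PERF_LB_LH_MASK);
}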

With this patch, the worst case for power scheduling -- kbuild -- gets
similar performance/power values across the different policies.

BTW, with the lazy balance the gain is still visible with j up to 32.

On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x' results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

How to read the data, e.g. 175.603 /417 13:
        175.603: average Watts
        417: seconds (compile time)
	13:  scaled performance/power = 1000000 / seconds / watts
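
For example, the first powersaving entry works out as
1000000 / 417 / 175.603 ≈ 13.7, reported in the table as 13.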

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 67 +++++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b05e1..5051990 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -941,6 +941,7 @@ struct sched_domain {
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
+	u64	perf_lb_record;	/* performance balance record */
 
 	u64 last_update;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a83ad90..262d7ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4497,6 +4497,58 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	}
 }
 
+#define PERF_LB_HH_MASK		0xffffffff00000000ULL
+#define PERF_LB_LH_MASK		0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	env->sd->perf_lb_record <<= 1;
+
+	if (env->perf_lb) {
+		env->sd->perf_lb_record |= 0x1;
+		return 1;
+	}
+
+	/*
+	 * The situation isn't eligible for performance balance. If this_cpu
+	 * is not eligible or the timing is not suitable for lazy powersaving
+	 * balance, we will stop both powersaving and performance balance.
+	 */
+	if (env->power_lb && sds->this == sds->group_leader
+			&& sds->group_leader != sds->group_min) {
+		int interval;
+
+		/* powersaving balance interval set as 8 * max_interval */
+		interval = msecs_to_jiffies(8 * env->sd->max_interval);
+		if (time_after(jiffies, env->sd->last_balance + interval))
+			env->sd->perf_lb_record = 0;
+
+		/*
+		 * A eligible timing is no performance balance in last 32
+		 * balance and performance balance is no more than 4 times
+		 * in last 64 balance, or no balance in powersaving interval
+		 * time.
+		 */
+		if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+			&& !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+			env->imbalance = sds->min_load_per_task;
+			return 0;
+		}
+
+	}
+	env->power_lb = 0;
+	sds->group_min = NULL;
+	return 0;
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -5087,7 +5139,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 }
 
 /******* find_busiest_group() helpers end here *********************/
-
 /**
  * find_busiest_group - Returns the busiest group within the sched_domain
  * if there is an imbalance. If there isn't an imbalance, and
@@ -5120,18 +5171,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
-	if (!env->perf_lb && !env->power_lb)
-		return  NULL;
-
-	if (env->power_lb) {
-		if (sds.this == sds.group_leader &&
-				sds.group_leader != sds.group_min) {
-			env->imbalance = sds.min_load_per_task;
-			return sds.group_min;
-		}
-		env->power_lb = 0;
-		return NULL;
-	}
+	if (!need_perf_balance(env, &sds))
+		return sds.group_min;
 
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (17 preceding siblings ...)
  2013-01-24  3:07 ` [patch v4 18/18] sched: lazy power balance Alex Shi
@ 2013-01-24  9:44 ` Borislav Petkov
  2013-01-24 15:07   ` Alex Shi
  2013-01-28  1:28 ` Alex Shi
  2013-02-04  1:35 ` Alex Shi
  20 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-24  9:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well for many tasks burst waking. After talking with Mike
> Galbraith, we are agree to just use runnable avg in power friendly 
> scheduling and keep current instant load in performance scheduling for 
> low latency.
> 
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
> 
> The patchset bases on Linus' tree, includes 3 parts,
> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> ----------------------
> the first patch remove one domain level. patch 2~5 simplified fork/wake
> balancing, it can increase 10+% hackbench performance on our 4 sockets
> SNB EP machine.

Ok, I see some benchmarking results here and there in the commit
messages but since this is touching the scheduler, you probably would
need to make sure it doesn't introduce performance regressions vs
mainline with a comprehensive set of benchmarks.

And, AFAICR, mainline does by default the 'performance' scheme by
spreading out tasks to idle cores, so have you tried comparing vanilla
mainline to your patchset in the 'performance' setting so that you can
make sure there are no problems there? And not only hackbench or a
microbenchmark but aim9 (I saw that in a commit message somewhere) and
whatever else multithreaded benchmark you can get your hands on.

Also, you might want to run it on other machines too, not only SNB :-)
And what about ARM, maybe someone there can run your patchset too?

So, it would be cool to see comprehensive results from all those runs
and see what the numbers say.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-24  9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
@ 2013-01-24 15:07   ` Alex Shi
  2013-01-27  2:41     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-24 15:07 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/24/2013 05:44 PM, Borislav Petkov wrote:
> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
>> Since the runnable info needs 345ms to accumulate, balancing
>> doesn't do well for many tasks burst waking. After talking with Mike
>> Galbraith, we are agree to just use runnable avg in power friendly 
>> scheduling and keep current instant load in performance scheduling for 
>> low latency.
>>
>> So the biggest change in this version is removing runnable load avg in
>> balance and just using runnable data in power balance.
>>
>> The patchset bases on Linus' tree, includes 3 parts,
>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
>> ----------------------
>> the first patch remove one domain level. patch 2~5 simplified fork/wake
>> balancing, it can increase 10+% hackbench performance on our 4 sockets
>> SNB EP machine.
> 
> Ok, I see some benchmarking results here and there in the commit
> messages but since this is touching the scheduler, you probably would
> need to make sure it doesn't introduce performance regressions vs
> mainline with a comprehensive set of benchmarks.
> 

Thanks a lot for your comments, Borislav! :)

For this patchset, the code just checks the current policy; if it is
performance, the code path falls back to the original performance code
at once. So there should be no performance change with the performance
policy.

I once tested the balance policy performance with the
kbuild/hackbench/aim9/dbench/tbench benchmarks on version 2; only
hackbench drops a bit, ~3%, the others show no clear change.

> And, AFAICR, mainline does by default the 'performance' scheme by
> spreading out tasks to idle cores, so have you tried comparing vanilla
> mainline to your patchset in the 'performance' setting so that you can
> make sure there are no problems there? And not only hackbench or a
> microbenchmark but aim9 (I saw that in a commit message somewhere) and
> whatever else multithreaded benchmark you can get your hands on.
> 
> Also, you might want to run it on other machines too, not only SNB :-)

Anyway I will redo the performance testing of this version on all
machines, but I don't expect anything to change. :)

> And what about ARM, maybe someone there can run your patchset too?
> 
> So, it would be cool to see comprehensive results from all those runs
> and see what the numbers say.
> 
> Thanks.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-24 15:07   ` Alex Shi
@ 2013-01-27  2:41     ` Alex Shi
  2013-01-27  4:36       ` Mike Galbraith
  2013-01-27 10:40       ` Borislav Petkov
  0 siblings, 2 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-27  2:41 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/24/2013 11:07 PM, Alex Shi wrote:
> On 01/24/2013 05:44 PM, Borislav Petkov wrote:
>> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
>>> Since the runnable info needs 345ms to accumulate, balancing
>>> doesn't do well for many tasks burst waking. After talking with Mike
>>> Galbraith, we are agree to just use runnable avg in power friendly 
>>> scheduling and keep current instant load in performance scheduling for 
>>> low latency.
>>>
>>> So the biggest change in this version is removing runnable load avg in
>>> balance and just using runnable data in power balance.
>>>
>>> The patchset bases on Linus' tree, includes 3 parts,
>>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
>>> ----------------------
>>> the first patch remove one domain level. patch 2~5 simplified fork/wake
>>> balancing, it can increase 10+% hackbench performance on our 4 sockets
>>> SNB EP machine.
>>
>> Ok, I see some benchmarking results here and there in the commit
>> messages but since this is touching the scheduler, you probably would
>> need to make sure it doesn't introduce performance regressions vs
>> mainline with a comprehensive set of benchmarks.
>>
> 
> Thanks a lot for your comments, Borislav! :)
> 
> For this patchset, the code will just check current policy, if it is
> performance, the code patch will back to original performance code at
> once. So there should no performance change on performance policy.
> 
> I once tested the balance policy performance with benchmark
> kbuild/hackbench/aim9/dbench/tbench on version 2, only hackbench has a
> bit drop ~3%. others have no clear change.
> 
>> And, AFAICR, mainline does by default the 'performance' scheme by
>> spreading out tasks to idle cores, so have you tried comparing vanilla
>> mainline to your patchset in the 'performance' setting so that you can
>> make sure there are no problems there? And not only hackbench or a
>> microbenchmark but aim9 (I saw that in a commit message somewhere) and
>> whatever else multithreaded benchmark you can get your hands on.
>>
>> Also, you might want to run it on other machines too, not only SNB :-)
> 
> Anyway I will redo the performance testing on this version again on all
> machine. but doesn't expect something change. :)

Just reran some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, multithreaded
loopback netperf, on my core2, nhm, wsm and snb platforms. No clear
performance change found.

I also tested the balance and powersaving policies with the above
benchmarks and found that specjbb2005 drops a lot, 30~50%, on both
policies, whether with openjdk or jrockit, and that hackbench drops a
lot with the powersaving policy on the snb 4 sockets platform. The
others show no clear change.

> 
>> And what about ARM, maybe someone there can run your patchset too?
>>
>> So, it would be cool to see comprehensive results from all those runs
>> and see what the numbers say.
>>
>> Thanks.
>>
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27  2:41     ` Alex Shi
@ 2013-01-27  4:36       ` Mike Galbraith
  2013-01-27 10:35         ` Borislav Petkov
  2013-01-27 10:40       ` Borislav Petkov
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-27  4:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Sun, 2013-01-27 at 10:41 +0800, Alex Shi wrote: 
> On 01/24/2013 11:07 PM, Alex Shi wrote:
> > On 01/24/2013 05:44 PM, Borislav Petkov wrote:
> >> On Thu, Jan 24, 2013 at 11:06:42AM +0800, Alex Shi wrote:
> >>> Since the runnable info needs 345ms to accumulate, balancing
> >>> doesn't do well for many tasks burst waking. After talking with Mike
> >>> Galbraith, we are agree to just use runnable avg in power friendly 
> >>> scheduling and keep current instant load in performance scheduling for 
> >>> low latency.
> >>>
> >>> So the biggest change in this version is removing runnable load avg in
> >>> balance and just using runnable data in power balance.
> >>>
> >>> The patchset bases on Linus' tree, includes 3 parts,
> >>> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> >>> ----------------------
> >>> the first patch remove one domain level. patch 2~5 simplified fork/wake
> >>> balancing, it can increase 10+% hackbench performance on our 4 sockets
> >>> SNB EP machine.
> >>
> >> Ok, I see some benchmarking results here and there in the commit
> >> messages but since this is touching the scheduler, you probably would
> >> need to make sure it doesn't introduce performance regressions vs
> >> mainline with a comprehensive set of benchmarks.
> >>
> > 
> > Thanks a lot for your comments, Borislav! :)
> > 
> > For this patchset, the code will just check current policy, if it is
> > performance, the code patch will back to original performance code at
> > once. So there should no performance change on performance policy.
> > 
> > I once tested the balance policy performance with benchmark
> > kbuild/hackbench/aim9/dbench/tbench on version 2, only hackbench has a
> > bit drop ~3%. others have no clear change.
> > 
> >> And, AFAICR, mainline does by default the 'performance' scheme by
> >> spreading out tasks to idle cores, so have you tried comparing vanilla
> >> mainline to your patchset in the 'performance' setting so that you can
> >> make sure there are no problems there? And not only hackbench or a
> >> microbenchmark but aim9 (I saw that in a commit message somewhere) and
> >> whatever else multithreaded benchmark you can get your hands on.
> >>
> >> Also, you might want to run it on other machines too, not only SNB :-)
> > 
> > Anyway I will redo the performance testing on this version again on all
> > machine. but doesn't expect something change. :)
> 
> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found.

With aim7 compute on 4 node 40 core box, I see stable throughput
improvement at tasks = nr_cores and below w. balance and powersaving. 

         3.8.0-performance                                  3.8.0-balance                                      3.8.0-powersaving
Tasks    jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu
    1      432.86  100       432.8571     14.00      3.99     433.48  100       433.4764     13.98      3.97     433.17  100       433.1665     13.99      3.98
    1      437.23  100       437.2294     13.86      3.85     436.60  100       436.5994     13.88      3.86     435.66  100       435.6578     13.91      3.90
    1      434.10  100       434.0974     13.96      3.95     436.29  100       436.2851     13.89      3.89     436.29  100       436.2851     13.89      3.87
    5     2400.95   99       480.1902     12.62     12.49    2554.81   98       510.9612     11.86      7.55    2487.68   98       497.5369     12.18      8.22
    5     2341.58   99       468.3153     12.94     13.95    2578.72   99       515.7447     11.75      7.25    2527.11   99       505.4212     11.99      7.90
    5     2350.66   99       470.1319     12.89     13.66    2600.86   99       520.1717     11.65      7.09    2508.28   98       501.6556     12.08      8.24
   10     4291.78   99       429.1785     14.12     40.14    5334.51   99       533.4507     11.36     11.13    5183.92   98       518.3918     11.69     12.15
   10     4334.76   99       433.4764     13.98     38.70    5311.13   99       531.1131     11.41     11.23    5215.15   99       521.5146     11.62     12.53
   10     4273.62   99       427.3625     14.18     40.29    5287.96   99       528.7958     11.46     11.46    5144.31   98       514.4312     11.78     12.32
   20     8487.39   94       424.3697     14.28     63.14   10594.41   99       529.7203     11.44     23.72   10575.92   99       528.7958     11.46     22.08
   20     8387.54   97       419.3772     14.45     77.01   10575.92   98       528.7958     11.46     23.41   10520.83   99       526.0417     11.52     21.88
   20     8713.16   95       435.6578     13.91     55.10   10659.63   99       532.9815     11.37     24.17   10539.13   99       526.9565     11.50     22.13
   40    16786.70   99       419.6676     14.44    170.08   19469.88   98       486.7470     12.45     60.78   19967.05   98       499.1763     12.14     51.40
   40    16728.78   99       418.2195     14.49    172.96   19627.53   98       490.6883     12.35     65.26   20386.88   98       509.6720     11.89     46.91
   40    16763.49   99       419.0871     14.46    171.42   20033.06   98       500.8264     12.10     51.44   20682.59   98       517.0648     11.72     42.45

No deltas after that.  There were also no deltas between patched kernel
using performance policy and virgin source.

-Mike




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27  4:36       ` Mike Galbraith
@ 2013-01-27 10:35         ` Borislav Petkov
  2013-01-27 13:25           ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-27 10:35 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving. 
> 
>          3.8.0-performance                                  3.8.0-balance                                      3.8.0-powersaving
> Tasks    jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu
>     1      432.86  100       432.8571     14.00      3.99     433.48  100       433.4764     13.98      3.97     433.17  100       433.1665     13.99      3.98
>     1      437.23  100       437.2294     13.86      3.85     436.60  100       436.5994     13.88      3.86     435.66  100       435.6578     13.91      3.90
>     1      434.10  100       434.0974     13.96      3.95     436.29  100       436.2851     13.89      3.89     436.29  100       436.2851     13.89      3.87
>     5     2400.95   99       480.1902     12.62     12.49    2554.81   98       510.9612     11.86      7.55    2487.68   98       497.5369     12.18      8.22
>     5     2341.58   99       468.3153     12.94     13.95    2578.72   99       515.7447     11.75      7.25    2527.11   99       505.4212     11.99      7.90
>     5     2350.66   99       470.1319     12.89     13.66    2600.86   99       520.1717     11.65      7.09    2508.28   98       501.6556     12.08      8.24
>    10     4291.78   99       429.1785     14.12     40.14    5334.51   99       533.4507     11.36     11.13    5183.92   98       518.3918     11.69     12.15
>    10     4334.76   99       433.4764     13.98     38.70    5311.13   99       531.1131     11.41     11.23    5215.15   99       521.5146     11.62     12.53
>    10     4273.62   99       427.3625     14.18     40.29    5287.96   99       528.7958     11.46     11.46    5144.31   98       514.4312     11.78     12.32
>    20     8487.39   94       424.3697     14.28     63.14   10594.41   99       529.7203     11.44     23.72   10575.92   99       528.7958     11.46     22.08
>    20     8387.54   97       419.3772     14.45     77.01   10575.92   98       528.7958     11.46     23.41   10520.83   99       526.0417     11.52     21.88
>    20     8713.16   95       435.6578     13.91     55.10   10659.63   99       532.9815     11.37     24.17   10539.13   99       526.9565     11.50     22.13
>    40    16786.70   99       419.6676     14.44    170.08   19469.88   98       486.7470     12.45     60.78   19967.05   98       499.1763     12.14     51.40
>    40    16728.78   99       418.2195     14.49    172.96   19627.53   98       490.6883     12.35     65.26   20386.88   98       509.6720     11.89     46.91
>    40    16763.49   99       419.0871     14.46    171.42   20033.06   98       500.8264     12.10     51.44   20682.59   98       517.0648     11.72     42.45

Ok, this is sick. How is balance and powersaving better than perf? Both
have much more jobs per minute than perf; is that because we do pack
much more tasks per cpu with balance and powersaving?

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27  2:41     ` Alex Shi
  2013-01-27  4:36       ` Mike Galbraith
@ 2013-01-27 10:40       ` Borislav Petkov
  2013-01-27 14:03         ` Alex Shi
  2013-01-28  5:19         ` Alex Shi
  1 sibling, 2 replies; 88+ messages in thread
From: Borislav Petkov @ 2013-01-27 10:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found.

Ok, good, You could put that in one of the commit messages so that it is
there and people know that this patchset doesn't cause perf regressions
with the bunch of benchmarks.

> I also tested balance policy/powersaving policy with above benchmark,
> found, the specjbb2005 drop much 30~50% on both of policy whenever
> with openjdk or jrockit. and hackbench drops a lots with powersaving
> policy on snb 4 sockets platforms. others has no clear change.

I guess this is expected because there has to be some performance hit
when saving power...

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27 10:35         ` Borislav Petkov
@ 2013-01-27 13:25           ` Alex Shi
  2013-01-27 15:51             ` Mike Galbraith
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-27 13:25 UTC (permalink / raw)
  To: Borislav Petkov, Mike Galbraith, torvalds, mingo, peterz, tglx,
	akpm, arjan, pjt, namhyung, vincent.guittot, gregkh, preeti,
	viresh.kumar, linux-kernel

On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
>> With aim7 compute on 4 node 40 core box, I see stable throughput
>> improvement at tasks = nr_cores and below w. balance and powersaving. 
>>
>>          3.8.0-performance                                  3.8.0-balance                                      3.8.0-powersaving
>> Tasks    jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu   jobs/min  jti  jobs/min/task      real       cpu
>>     1      432.86  100       432.8571     14.00      3.99     433.48  100       433.4764     13.98      3.97     433.17  100       433.1665     13.99      3.98
>>     1      437.23  100       437.2294     13.86      3.85     436.60  100       436.5994     13.88      3.86     435.66  100       435.6578     13.91      3.90
>>     1      434.10  100       434.0974     13.96      3.95     436.29  100       436.2851     13.89      3.89     436.29  100       436.2851     13.89      3.87
>>     5     2400.95   99       480.1902     12.62     12.49    2554.81   98       510.9612     11.86      7.55    2487.68   98       497.5369     12.18      8.22
>>     5     2341.58   99       468.3153     12.94     13.95    2578.72   99       515.7447     11.75      7.25    2527.11   99       505.4212     11.99      7.90
>>     5     2350.66   99       470.1319     12.89     13.66    2600.86   99       520.1717     11.65      7.09    2508.28   98       501.6556     12.08      8.24
>>    10     4291.78   99       429.1785     14.12     40.14    5334.51   99       533.4507     11.36     11.13    5183.92   98       518.3918     11.69     12.15
>>    10     4334.76   99       433.4764     13.98     38.70    5311.13   99       531.1131     11.41     11.23    5215.15   99       521.5146     11.62     12.53
>>    10     4273.62   99       427.3625     14.18     40.29    5287.96   99       528.7958     11.46     11.46    5144.31   98       514.4312     11.78     12.32
>>    20     8487.39   94       424.3697     14.28     63.14   10594.41   99       529.7203     11.44     23.72   10575.92   99       528.7958     11.46     22.08
>>    20     8387.54   97       419.3772     14.45     77.01   10575.92   98       528.7958     11.46     23.41   10520.83   99       526.0417     11.52     21.88
>>    20     8713.16   95       435.6578     13.91     55.10   10659.63   99       532.9815     11.37     24.17   10539.13   99       526.9565     11.50     22.13
>>    40    16786.70   99       419.6676     14.44    170.08   19469.88   98       486.7470     12.45     60.78   19967.05   98       499.1763     12.14     51.40
>>    40    16728.78   99       418.2195     14.49    172.96   19627.53   98       490.6883     12.35     65.26   20386.88   98       509.6720     11.89     46.91
>>    40    16763.49   99       419.0871     14.46    171.42   20033.06   98       500.8264     12.10     51.44   20682.59   98       517.0648     11.72     42.45
> 
> Ok, this is sick. How is balance and powersaving better than perf? Both
> have much more jobs per minute than perf; is that because we do pack
> much more tasks per cpu with balance and powersaving?

Maybe it is due to the lazy balancing on balance/powersaving. You can
check the CS times in /proc/pid/status.
> 
> Thanks.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27 10:40       ` Borislav Petkov
@ 2013-01-27 14:03         ` Alex Shi
  2013-01-28  5:19         ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-27 14:03 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>> performance change found.
> 
> Ok, good, You could put that in one of the commit messages so that it is
> there and people know that this patchset doesn't cause perf regressions
> with the bunch of benchmarks.

Thanks for the suggestion!

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27 13:25           ` Alex Shi
@ 2013-01-27 15:51             ` Mike Galbraith
  2013-01-28  5:17               ` Mike Galbraith
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-27 15:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote: 
> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> >> With aim7 compute on 4 node 40 core box, I see stable throughput
> >> improvement at tasks = nr_cores and below w. balance and powersaving. 
... 
> > Ok, this is sick. How is balance and powersaving better than perf? Both
> > have much more jobs per minute than perf; is that because we do pack
> > much more tasks per cpu with balance and powersaving?
> 
> Maybe it is due to the lazy balancing on balance/powersaving. You can
> check the CS times in /proc/pid/status.

Well, it's not wakeup path, limiting entry frequency per waker did zip
squat nada to any policy throughput.

-Mike


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (18 preceding siblings ...)
  2013-01-24  9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
@ 2013-01-28  1:28 ` Alex Shi
  2013-02-04  1:35 ` Alex Shi
  20 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-28  1:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/24/2013 11:06 AM, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well for many tasks burst waking. After talking with Mike
> Galbraith, we are agree to just use runnable avg in power friendly 
> scheduling and keep current instant load in performance scheduling for 
> low latency.
> 
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
> 
> The patchset bases on Linus' tree, includes 3 parts,

Would you like to give some comments, Ingo? :)

Best regards!



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27 15:51             ` Mike Galbraith
@ 2013-01-28  5:17               ` Mike Galbraith
  2013-01-28  5:51                 ` Alex Shi
                                   ` (2 more replies)
  0 siblings, 3 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  5:17 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote: 
> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote: 
> > On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > > On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> > >> With aim7 compute on 4 node 40 core box, I see stable throughput
> > >> improvement at tasks = nr_cores and below w. balance and powersaving. 
> ... 
> > > Ok, this is sick. How is balance and powersaving better than perf? Both
> > > have much more jobs per minute than perf; is that because we do pack
> > > much more tasks per cpu with balance and powersaving?
> > 
> > Maybe it is due to the lazy balancing on balance/powersaving. You can
> > check the CS times in /proc/pid/status.
> 
> Well, it's not wakeup path, limiting entry frequency per waker did zip
> squat nada to any policy throughput.

monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043321  00058616
043313  00058616
043318  00058968
043317  00058968
043316  00059184
043319  00059192
043320  00059048
043314  00059048
043312  00058176
043315  00058184
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043337  00053448
043333  00053456
043338  00052992
043331  00053448
043332  00053488
043335  00053496
043334  00053480
043329  00053288
043336  00053464
043330  00053496
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
043348  00052488
043344  00052488
043349  00052744
043343  00052504
043347  00052504
043352  00052888
043345  00052504
043351  00052496
043346  00052496
043350  00052304
monteverdi:/abuild/mike/:[0]#

Zzzt.  Wish I could turn turbo thingy off.

-Mike


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-27 10:40       ` Borislav Petkov
  2013-01-27 14:03         ` Alex Shi
@ 2013-01-28  5:19         ` Alex Shi
  2013-01-28  6:49           ` Mike Galbraith
  2013-01-29  6:02           ` Alex Shi
  1 sibling, 2 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-28  5:19 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>> performance change found.
> 
> Ok, good, You could put that in one of the commit messages so that it is
> there and people know that this patchset doesn't cause perf regressions
> with the bunch of benchmarks.
> 
>> I also tested balance policy/powersaving policy with above benchmark,
>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>> policy on snb 4 sockets platforms. others has no clear change.
> 
> I guess this is expected because there has to be some performance hit
> when saving power...
> 

BTW, I had tested the v3 version based on sched numa -- on tip/master.
There specjbb drops only about 5~7% with the balance/powersaving
policies. The power scheduling is done after the NUMA scheduling logic.




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:17               ` Mike Galbraith
@ 2013-01-28  5:51                 ` Alex Shi
  2013-01-28  6:15                   ` Mike Galbraith
  2013-01-28  9:55                 ` Borislav Petkov
  2013-01-28 15:47                 ` Mike Galbraith
  2 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-28  5:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote: 
>> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote: 
>>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
>>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
>>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
>>>>> improvement at tasks = nr_cores and below w. balance and powersaving. 
>> ... 
>>>> Ok, this is sick. How is balance and powersaving better than perf? Both
>>>> have much more jobs per minute than perf; is that because we do pack
>>>> much more tasks per cpu with balance and powersaving?
>>>
>>> Maybe it is due to the lazy balancing on balance/powersaving. You can
>>> check the CS times in /proc/pid/status.
>>
>> Well, it's not wakeup path, limiting entry frequency per waker did zip
>> squat nada to any policy throughput.
> 
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043321  00058616
> 043313  00058616
> 043318  00058968
> 043317  00058968
> 043316  00059184
> 043319  00059192
> 043320  00059048
> 043314  00059048
> 043312  00058176
> 043315  00058184
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043337  00053448
> 043333  00053456
> 043338  00052992
> 043331  00053448
> 043332  00053488
> 043335  00053496
> 043334  00053480
> 043329  00053288
> 043336  00053464
> 043330  00053496
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043348  00052488
> 043344  00052488
> 043349  00052744
> 043343  00052504
> 043347  00052504
> 043352  00052888
> 043345  00052504
> 043351  00052496
> 043346  00052496
> 043350  00052304
> monteverdi:/abuild/mike/:[0]#

Similar to the aim7 results. Thanks, Mike!

Would you like to collect vmstat info in the background?
> 
> Zzzt.  Wish I could turn turbo thingy off.

Do you mean the turbo mode of the cpu frequency? I remember some
machines can disable it in the BIOS.
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:51                 ` Alex Shi
@ 2013-01-28  6:15                   ` Mike Galbraith
  2013-01-28  6:42                     ` Mike Galbraith
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  6:15 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 13:51 +0800, Alex Shi wrote: 
> On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> > On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote: 
> >> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote: 
> >>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> >>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> >>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
> >>>>> improvement at tasks = nr_cores and below w. balance and powersaving. 
> >> ... 
> >>>> Ok, this is sick. How is balance and powersaving better than perf? Both
> >>>> have much more jobs per minute than perf; is that because we do pack
> >>>> much more tasks per cpu with balance and powersaving?
> >>>
> >>> Maybe it is due to the lazy balancing on balance/powersaving. You can
> >>> check the CS times in /proc/pid/status.
> >>
> >> Well, it's not wakeup path, limiting entry frequency per waker did zip
> >> squat nada to any policy throughput.
> > 
> > monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043321  00058616
> > 043313  00058616
> > 043318  00058968
> > 043317  00058968
> > 043316  00059184
> > 043319  00059192
> > 043320  00059048
> > 043314  00059048
> > 043312  00058176
> > 043315  00058184
> > monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043337  00053448
> > 043333  00053456
> > 043338  00052992
> > 043331  00053448
> > 043332  00053488
> > 043335  00053496
> > 043334  00053480
> > 043329  00053288
> > 043336  00053464
> > 043330  00053496
> > monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 043348  00052488
> > 043344  00052488
> > 043349  00052744
> > 043343  00052504
> > 043347  00052504
> > 043352  00052888
> > 043345  00052504
> > 043351  00052496
> > 043346  00052496
> > 043350  00052304
> > monteverdi:/abuild/mike/:[0]#
> 
> similar with aim7 results. Thanks, Mike!
> 
> Wold you like to collect vmstat info in background?
> > 
> > Zzzt.  Wish I could turn turbo thingy off.
> 
> Do you mean the turbo mode of cpu frequency? I remember some of machine
> can disable it in BIOS.

Yeah, I can do that in my local x3550 box.  I can't fiddle with BIOS
settings on the remote NUMA box.

This can't be anything but turbo gizmo mucking up the numbers I think,
not that the numbers are invalid or anything, better numbers are better
numbers no matter where/how they come about ;-)

The massive_intr load is dirt simple sleep/spin with bean counting.  It
sleeps 1ms spins 8ms.  Change that to sleep 8ms, grind away for 1ms...
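
(For reference, a minimal user-space sketch of that kind of sleep/spin
worker -- not the actual massive_intr source; the 1ms/8ms split is just
a parameter here:)

#include <stdio.h>
#include <time.h>

/* Sketch of a sleep/spin worker: sleep sleep_ms, spin spin_ms, count slices. */
static long long now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        const long long sleep_ms = 1, spin_ms = 8, run_s = 60;
        struct timespec nap = { 0, sleep_ms * 1000000L };
        long long deadline = now_ns() + run_s * 1000000000LL;
        unsigned long count = 0;

        while (now_ns() < deadline) {
                nanosleep(&nap, NULL);                  /* sleep phase */
                long long end = now_ns() + spin_ms * 1000000LL;
                while (now_ns() < end)
                        ;                               /* spin phase: burn cpu */
                count++;                                /* bean counting */
        }
        printf("%lu\n", count);
        return 0;
}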

monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045150  00006484
045157  00006427
045156  00006401
045152  00006428
045155  00006372
045154  00006370
045158  00006453
045149  00006372
045151  00006371
045153  00006371
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045170  00006380
045172  00006374
045169  00006376
045175  00006376
045171  00006334
045176  00006380
045168  00006374
045174  00006334
045177  00006375
045173  00006376
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
045198  00006408
045191  00006408
045197  00006408
045192  00006411
045194  00006409
045196  00006409
045195  00006336
045189  00006336
045193  00006411
045190  00006410


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  6:15                   ` Mike Galbraith
@ 2013-01-28  6:42                     ` Mike Galbraith
  2013-01-28  7:20                       ` Mike Galbraith
  2013-01-29  1:17                       ` Alex Shi
  0 siblings, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  6:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 07:15 +0100, Mike Galbraith wrote: 
> On Mon, 2013-01-28 at 13:51 +0800, Alex Shi wrote: 
> > On 01/28/2013 01:17 PM, Mike Galbraith wrote:
> > > On Sun, 2013-01-27 at 16:51 +0100, Mike Galbraith wrote: 
> > >> On Sun, 2013-01-27 at 21:25 +0800, Alex Shi wrote: 
> > >>> On 01/27/2013 06:35 PM, Borislav Petkov wrote:
> > >>>> On Sun, Jan 27, 2013 at 05:36:25AM +0100, Mike Galbraith wrote:
> > >>>>> With aim7 compute on 4 node 40 core box, I see stable throughput
> > >>>>> improvement at tasks = nr_cores and below w. balance and powersaving. 
> > >> ... 
> > >>>> Ok, this is sick. How is balance and powersaving better than perf? Both
> > >>>> have much more jobs per minute than perf; is that because we do pack
> > >>>> much more tasks per cpu with balance and powersaving?
> > >>>
> > >>> Maybe it is due to the lazy balancing on balance/powersaving. You can
> > >>> check the CS times in /proc/pid/status.
> > >>
> > >> Well, it's not wakeup path, limiting entry frequency per waker did zip
> > >> squat nada to any policy throughput.
> > > 
> > > monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043321  00058616
> > > 043313  00058616
> > > 043318  00058968
> > > 043317  00058968
> > > 043316  00059184
> > > 043319  00059192
> > > 043320  00059048
> > > 043314  00059048
> > > 043312  00058176
> > > 043315  00058184
> > > monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043337  00053448
> > > 043333  00053456
> > > 043338  00052992
> > > 043331  00053448
> > > 043332  00053488
> > > 043335  00053496
> > > 043334  00053480
> > > 043329  00053288
> > > 043336  00053464
> > > 043330  00053496
> > > monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> > > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > > 043348  00052488
> > > 043344  00052488
> > > 043349  00052744
> > > 043343  00052504
> > > 043347  00052504
> > > 043352  00052888
> > > 043345  00052504
> > > 043351  00052496
> > > 043346  00052496
> > > 043350  00052304
> > > monteverdi:/abuild/mike/:[0]#
> > 
> > similar with aim7 results. Thanks, Mike!
> > 
> > Wold you like to collect vmstat info in background?
> > > 
> > > Zzzt.  Wish I could turn turbo thingy off.
> > 
> > Do you mean the turbo mode of cpu frequency? I remember some of machine
> > can disable it in BIOS.
> 
> Yeah, I can do that in my local x3550 box.  I can't fiddle with BIOS
> settings on the remote NUMA box.
> 
> This can't be anything but turbo gizmo mucking up the numbers I think,
> not that the numbers are invalid or anything, better numbers are better
> numbers no matter where/how they come about ;-)
> 
> The massive_intr load is dirt simple sleep/spin with bean counting.  It
> sleeps 1ms spins 8ms.  Change that to sleep 8ms, grind away for 1ms...
> 
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045150  00006484
> 045157  00006427
> 045156  00006401
> 045152  00006428
> 045155  00006372
> 045154  00006370
> 045158  00006453
> 045149  00006372
> 045151  00006371
> 045153  00006371
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045170  00006380
> 045172  00006374
> 045169  00006376
> 045175  00006376
> 045171  00006334
> 045176  00006380
> 045168  00006374
> 045174  00006334
> 045177  00006375
> 045173  00006376
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# ./massive_intr 10 60
> 045198  00006408
> 045191  00006408
> 045197  00006408
> 045192  00006411
> 045194  00006409
> 045196  00006409
> 045195  00006336
> 045189  00006336
> 045193  00006411
> 045190  00006410

Back to original 1ms sleep, 8ms work, turning NUMA box into a single
node 10 core box with numactl.
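
For reference, a minimal, self-contained approximation of the
sleep/spin loop described above is sketched here.  It is not the actual
massive_intr.c source (the real tool forks several workers and its
counts are on a different scale), but the sleep 1ms / spin 8ms pattern
per worker is essentially this:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	const double sleep_ms = 1.0, work_ms = 8.0, runtime_s = 60.0;
	struct timespec st = { 0, (long)(sleep_ms * 1e6) };
	unsigned long periods = 0;
	double end = now_sec() + runtime_s;

	while (now_sec() < end) {
		double spin_end;

		nanosleep(&st, NULL);			/* idle phase */
		spin_end = now_sec() + work_ms / 1e3;
		while (now_sec() < spin_end)		/* busy phase */
			;				/* just burn cycles */
		periods++;				/* bean counting */
	}
	/* pid plus a simple throughput count (units differ from the real tool) */
	printf("%06d  %08lu\n", (int)getpid(), periods);
	return 0;
}

Build with e.g. "gcc -O2" (older glibc may also want -lrt for
clock_gettime) and run several concurrent copies to approximate
"massive_intr 10 60".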

monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045286  00043872
045289  00043464
045284  00043488
045287  00043440
045283  00043416
045281  00044456
045285  00043456
045288  00044312
045280  00043048
045282  00043240
monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045300  00052536
045307  00052472
045304  00052536
045299  00052536
045305  00052520
045306  00052528
045302  00052528
045303  00052528
045308  00052512
045301  00052520
monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
045339  00052600
045340  00052608
045338  00052600
045337  00052608
045343  00052600
045341  00052600
045336  00052608
045335  00052616
045334  00052576
045342  00052600


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:19         ` Alex Shi
@ 2013-01-28  6:49           ` Mike Galbraith
  2013-01-28  7:17             ` Alex Shi
  2013-01-29  6:02           ` Alex Shi
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  6:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote: 
> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> > On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> >> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> >> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> >> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> >> performance change found.
> > 
> > Ok, good, You could put that in one of the commit messages so that it is
> > there and people know that this patchset doesn't cause perf regressions
> > with the bunch of benchmarks.
> > 
> >> I also tested balance policy/powersaving policy with above benchmark,
> >> found, the specjbb2005 drop much 30~50% on both of policy whenever
> >> with openjdk or jrockit. and hackbench drops a lots with powersaving
> >> policy on snb 4 sockets platforms. others has no clear change.
> > 
> > I guess this is expected because there has to be some performance hit
> > when saving power...
> > 
> 
> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> > The power scheduling is done after the numa scheduling logic.

That makes sense.  How do the numa scheduling numbers compare to mainline?
Do you have all three available, mainline, and tip w. w/o powersaving
policy?

-Mike



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  6:49           ` Mike Galbraith
@ 2013-01-28  7:17             ` Alex Shi
  2013-01-28  7:33               ` Mike Galbraith
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-28  7:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 02:49 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote: 
>> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
>>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>>>> performance change found.
>>>
>>> Ok, good, You could put that in one of the commit messages so that it is
>>> there and people know that this patchset doesn't cause perf regressions
>>> with the bunch of benchmarks.
>>>
>>>> I also tested balance policy/powersaving policy with above benchmark,
>>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>>>> policy on snb 4 sockets platforms. others has no clear change.
>>>
>>> I guess this is expected because there has to be some performance hit
>>> when saving power...
>>>
>>
>> BTW, I had tested the v3 version based on sched numa -- on tip/master.
>> The specjbb just has about 5~7% dropping on balance/powersaving policy.
>> The power scheduling is done after the numa scheduling logic.
> 
> That makes sense.  How do the numa scheduling numbers compare to mainline?
> Do you have all three available, mainline, and tip w. w/o powersaving
> policy?
> 

I once saw a 20~40% performance increase with sched numa vs mainline
3.7-rc5, but I have no baseline to compare balance/powersaving performance
against, since lower numbers are acceptable for balance/powersaving and
tip/master changed too quickly to follow up at that time.
:)

> -Mike
> 
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  6:42                     ` Mike Galbraith
@ 2013-01-28  7:20                       ` Mike Galbraith
  2013-01-29  1:17                       ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  7:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 07:42 +0100, Mike Galbraith wrote:

> Back to original 1ms sleep, 8ms work, turning NUMA box into a single
> node 10 core box with numactl.

(aim7 in one 10 core node.. so spread, no delta.)

Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   powersaving     Jan 28 08:04:14 2013

Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       441.0           100     13.7    3.7     7.3508
5       2516.6          98      12.0    8.1     8.3887
10      5215.1          98      11.6    11.9    8.6919
20      10475.4         99      11.6    21.7    8.7295
40      20216.8         99      12.0    38.2    8.4237
80      35568.6         99      13.6    71.4    7.4101
160     57102.5         98      17.0    138.2   5.9482
320     82099.9         97      23.6    271.1   4.2760
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   balance Jan 28 08:06:49 2013

Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       439.4           100     13.8    3.8     7.3241
5       2583.1          98      11.7    7.2     8.6104
10      5325.1          99      11.4    11.0    8.8752
20      10687.8         99      11.3    23.6    8.9065
40      20200.0         99      12.0    38.7    8.4167
80      35464.5         98      13.7    71.4    7.3884
160     57203.5         98      16.9    137.9   5.9587
320     82065.2         98      23.6    271.1   4.2742
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   performance     Jan 28 08:09:20 2013

Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       438.8           100     13.8    3.8     7.3135
5       2634.8          99      11.5    7.2     8.7826
10      5396.3          99      11.2    11.4    8.9938
20      10725.7         99      11.3    24.0    8.9381
40      20183.2         99      12.0    38.5    8.4097
80      35620.9         99      13.6    71.4    7.4210
160     57203.5         98      16.9    137.8   5.9587
320     81995.8         98      23.7    271.3   4.2706


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  7:17             ` Alex Shi
@ 2013-01-28  7:33               ` Mike Galbraith
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28  7:33 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 15:17 +0800, Alex Shi wrote: 
> On 01/28/2013 02:49 PM, Mike Galbraith wrote:
> > On Mon, 2013-01-28 at 13:19 +0800, Alex Shi wrote: 
> >> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
> >>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
> >>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> >>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> >>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> >>>> performance change found.
> >>>
> >>> Ok, good, You could put that in one of the commit messages so that it is
> >>> there and people know that this patchset doesn't cause perf regressions
> >>> with the bunch of benchmarks.
> >>>
> >>>> I also tested balance policy/powersaving policy with above benchmark,
> >>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
> >>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
> >>>> policy on snb 4 sockets platforms. others has no clear change.
> >>>
> >>> I guess this is expected because there has to be some performance hit
> >>> when saving power...
> >>>
> >>
> >> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> >> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> >> The power scheduling is done after the numa scheduling logic.
> > 
> > That makes sense.  How do the numa scheduling numbers compare to mainline?
> > Do you have all three available, mainline, and tip w. w/o powersaving
> > policy?
> > 
> 
> I once saw a 20~40% performance increase with sched numa vs mainline
> 3.7-rc5, but I have no baseline to compare balance/powersaving performance
> against, since lower numbers are acceptable for balance/powersaving and
> tip/master changed too quickly to follow up at that time.
> :)

(wow.  dram sucks, dram+smp sucks more, dram+smp+numa _sucks rocks_;)


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:17               ` Mike Galbraith
  2013-01-28  5:51                 ` Alex Shi
@ 2013-01-28  9:55                 ` Borislav Petkov
  2013-01-28 10:44                   ` Mike Galbraith
  2013-01-28 15:47                 ` Mike Galbraith
  2 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28  9:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> Zzzt.  Wish I could turn turbo thingy off.

Try setting /sys/devices/system/cpu/cpufreq/boost to 0.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  9:55                 ` Borislav Petkov
@ 2013-01-28 10:44                   ` Mike Galbraith
  2013-01-28 11:29                     ` Borislav Petkov
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 10:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote: 
> On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > Zzzt.  Wish I could turn turbo thingy off.
> 
> Try setting /sys/devices/system/cpu/cpufreq/boost to 0.

How convenient (test) works too.

So much for turbo boost theory.  Nothing changed until I turned load
balancing off at NODE.  High end went to hell (gee), but low end... 
  
Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   performance-no-node-load_balance Jan 28 11:20:12 2013

Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       436.3           100     13.9    3.9     7.2714
5       2637.1          99      11.5    7.3     8.7903
10      5415.5          99      11.2    11.3    9.0259
20      10603.7         99      11.4    24.8    8.8364
40      20066.2         99      12.1    40.5    8.3609
80      35079.6         99      13.8    75.5    7.3082
160     55884.7         98      17.3    145.6   5.8213
320     79345.3         98      24.4    287.4   4.1326
640     100294.8        98      38.7    570.9   2.6118  
1280    115998.2        97      66.9    1132.8  1.5104  
2560    125820.0        97      123.3   2256.6  0.8191


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 10:44                   ` Mike Galbraith
@ 2013-01-28 11:29                     ` Borislav Petkov
  2013-01-28 11:32                       ` Mike Galbraith
  2013-01-29  1:36                       ` Alex Shi
  0 siblings, 2 replies; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28 11:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote: 
> > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > Zzzt.  Wish I could turn turbo thingy off.
> > 
> > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
> 
> How convenient (test) works too.
> 
> So much for turbo boost theory.  Nothing changed until I turned load
> balancing off at NODE.  High end went to hell (gee), but low end... 
>   
> Benchmark       Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII     "1.1"   performance-no-node-load_balance Jan 28 11:20:12 2013
> 
> Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> 1       436.3           100     13.9    3.9     7.2714
> 5       2637.1          99      11.5    7.3     8.7903
> 10      5415.5          99      11.2    11.3    9.0259
> 20      10603.7         99      11.4    24.8    8.8364
> 40      20066.2         99      12.1    40.5    8.3609
> 80      35079.6         99      13.8    75.5    7.3082
> 160     55884.7         98      17.3    145.6   5.8213
> 320     79345.3         98      24.4    287.4   4.1326

If you're talking about those results from earlier:

Benchmark       Version Machine Run Date
AIM Multiuser Benchmark - Suite VII     "1.1"   performance     Jan 28 08:09:20 2013

Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
1       438.8           100     13.8    3.8     7.3135
5       2634.8          99      11.5    7.2     8.7826
10      5396.3          99      11.2    11.4    8.9938
20      10725.7         99      11.3    24.0    8.9381
40      20183.2         99      12.0    38.5    8.4097
80      35620.9         99      13.6    71.4    7.4210
160     57203.5         98      16.9    137.8   5.9587
320     81995.8         98      23.7    271.3   4.2706

then the above no_node-load_balance thing suffers a small-ish dip at 320
tasks, yeah.

And AFAICR, the effect of disabling boosting will be visible in the
small count tasks cases anyway because if you saturate the cores with
tasks, the boosting algorithms tend to get the box out of boosting for
the simple reason that the power/perf headroom simply disappears due to
the SOC being busy.

> 640     100294.8        98      38.7    570.9   2.6118
> 1280    115998.2        97      66.9    1132.8  1.5104
> 2560    125820.0        97      123.3   2256.6  0.8191

I dunno about those. maybe this is expected with so many tasks or do we
want to optimize that case further?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 11:29                     ` Borislav Petkov
@ 2013-01-28 11:32                       ` Mike Galbraith
  2013-01-28 11:40                         ` Mike Galbraith
  2013-01-29  1:32                         ` Alex Shi
  2013-01-29  1:36                       ` Alex Shi
  1 sibling, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 11:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 12:29 +0100, Borislav Petkov wrote: 
> On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> > On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote: 
> > > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > > Zzzt.  Wish I could turn turbo thingy off.
> > > 
> > > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
> > 
> > How convenient (test) works too.
> > 
> > So much for turbo boost theory.  Nothing changed until I turned load
> > balancing off at NODE.  High end went to hell (gee), but low end... 
> >   
> > Benchmark       Version Machine Run Date
> > AIM Multiuser Benchmark - Suite VII     "1.1"   performance-no-node-load_balance Jan 28 11:20:12 2013
> > 
> > Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> > 1       436.3           100     13.9    3.9     7.2714
> > 5       2637.1          99      11.5    7.3     8.7903
> > 10      5415.5          99      11.2    11.3    9.0259
> > 20      10603.7         99      11.4    24.8    8.8364
> > 40      20066.2         99      12.1    40.5    8.3609
> > 80      35079.6         99      13.8    75.5    7.3082
> > 160     55884.7         98      17.3    145.6   5.8213
> > 320     79345.3         98      24.4    287.4   4.1326
> 
> If you're talking about those results from earlier:
> 
> Benchmark       Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII     "1.1"   performance     Jan 28 08:09:20 2013
> 
> Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> 1       438.8           100     13.8    3.8     7.3135
> 5       2634.8          99      11.5    7.2     8.7826
> 10      5396.3          99      11.2    11.4    8.9938
> 20      10725.7         99      11.3    24.0    8.9381
> 40      20183.2         99      12.0    38.5    8.4097
> 80      35620.9         99      13.6    71.4    7.4210
> 160     57203.5         98      16.9    137.8   5.9587
> 320     81995.8         98      23.7    271.3   4.2706
> 
> then the above no_node-load_balance thing suffers a small-ish dip at 320
> tasks, yeah.

No no, that's not restricted to one node.  It's just overloaded because
I turned balancing off at the NODE domain level.

> And AFAICR, the effect of disabling boosting will be visible in the
> small count tasks cases anyway because if you saturate the cores with
> tasks, the boosting algorithms tend to get the box out of boosting for
> the simple reason that the power/perf headroom simply disappears due to
> the SOC being busy.
> 
> > 640     100294.8        98      38.7    570.9   2.6118
> > 1280    115998.2        97      66.9    1132.8  1.5104
> > 2560    125820.0        97      123.3   2256.6  0.8191
> 
> I dunno about those. maybe this is expected with so many tasks or do we
> want to optimize that case further?

When using all 4 nodes properly, that's still scaling.  Here, I
intentionally screwed up balancing to watch the low end.  High end is
expected wreckage.

-Mike



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 11:32                       ` Mike Galbraith
@ 2013-01-28 11:40                         ` Mike Galbraith
  2013-01-28 15:22                           ` Borislav Petkov
  2013-01-29  1:32                         ` Alex Shi
  1 sibling, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 11:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 12:32 +0100, Mike Galbraith wrote: 
> On Mon, 2013-01-28 at 12:29 +0100, Borislav Petkov wrote: 
> > On Mon, Jan 28, 2013 at 11:44:44AM +0100, Mike Galbraith wrote:
> > > On Mon, 2013-01-28 at 10:55 +0100, Borislav Petkov wrote: 
> > > > On Mon, Jan 28, 2013 at 06:17:46AM +0100, Mike Galbraith wrote:
> > > > > Zzzt.  Wish I could turn turbo thingy off.
> > > > 
> > > > Try setting /sys/devices/system/cpu/cpufreq/boost to 0.
> > > 
> > > How convenient (test) works too.
> > > 
> > > So much for turbo boost theory.  Nothing changed until I turned load
> > > balancing off at NODE.  High end went to hell (gee), but low end... 
> > >   
> > > Benchmark       Version Machine Run Date
> > > AIM Multiuser Benchmark - Suite VII     "1.1"   performance-no-node-load_balance Jan 28 11:20:12 2013
> > > 
> > > Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> > > 1       436.3           100     13.9    3.9     7.2714
> > > 5       2637.1          99      11.5    7.3     8.7903
> > > 10      5415.5          99      11.2    11.3    9.0259
> > > 20      10603.7         99      11.4    24.8    8.8364
> > > 40      20066.2         99      12.1    40.5    8.3609
> > > 80      35079.6         99      13.8    75.5    7.3082
> > > 160     55884.7         98      17.3    145.6   5.8213
> > > 320     79345.3         98      24.4    287.4   4.1326
> > 
> > If you're talking about those results from earlier:
> > 
> > Benchmark       Version Machine Run Date
> > AIM Multiuser Benchmark - Suite VII     "1.1"   performance     Jan 28 08:09:20 2013
> > 
> > Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> > 1       438.8           100     13.8    3.8     7.3135
> > 5       2634.8          99      11.5    7.2     8.7826
> > 10      5396.3          99      11.2    11.4    8.9938
> > 20      10725.7         99      11.3    24.0    8.9381
> > 40      20183.2         99      12.0    38.5    8.4097
> > 80      35620.9         99      13.6    71.4    7.4210
> > 160     57203.5         98      16.9    137.8   5.9587
> > 320     81995.8         98      23.7    271.3   4.2706
> > 
> > then the above no_node-load_balance thing suffers a small-ish dip at 320
> > tasks, yeah.
> 
> No no, that's not restricted to one node.  It's just overloaded because
> I turned balancing off at the NODE domain level.

Which shows only that I was multitasking, and in a rush.  Boy was that
dumb.  Hohum.

-Mike


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 11:40                         ` Mike Galbraith
@ 2013-01-28 15:22                           ` Borislav Petkov
  2013-01-28 15:55                             ` Mike Galbraith
  0 siblings, 1 reply; 88+ messages in thread
From: Borislav Petkov @ 2013-01-28 15:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
> > No no, that's not restricted to one node.  It's just overloaded because
> > I turned balancing off at the NODE domain level.
> 
> Which shows only that I was multitasking, and in a rush.  Boy was that
> dumb.  Hohum.

Ok, let's take a step back and slow it down a bit so that people like me
can understand it: you want to try it with disabled load balancing on
the node level, AFAICT. But with that many tasks, perf will suck anyway,
no? Unless you want to benchmark the numa-aware aspect and see whether
load balancing on the node level feels differently, perf-wise?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:17               ` Mike Galbraith
  2013-01-28  5:51                 ` Alex Shi
  2013-01-28  9:55                 ` Borislav Petkov
@ 2013-01-28 15:47                 ` Mike Galbraith
  2013-01-29  1:45                   ` Alex Shi
  2013-01-29  2:27                   ` Alex Shi
  2 siblings, 2 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 15:47 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 06:17 +0100, Mike Galbraith wrote:

Ok damnit.

> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 043321  00058616
> 043313  00058616
> 043318  00058968
> 043317  00058968
> 043316  00059184
> 043319  00059192
> 043320  00059048
> 043314  00059048
> 043312  00058176
> 043315  00058184

That was boost if you like, and free to roam 4 nodes.

monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014618  00039616
014623  00039256
014617  00039256
014620  00039304
014621  00039304  (wait a minute, you said..)
014616  00039080
014625  00039064
014622  00039672
014624  00039624
014619  00039672
monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014635  00058160
014633  00058592
014638  00058592
014636  00058160
014632  00058200
014634  00058704
014639  00058704
014641  00058200
014640  00058560
014637  00058560
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014673  00059504
014676  00059504
014674  00059064
014672  00059064
014675  00058560
014671  00058560
014677  00059248
014668  00058864
014669  00059248
014670  00058864
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014686  00043472
014689  00043472
014685  00043760
014690  00043760
014687  00043528
014688  00043528  (hmm)
014683  00043216
014692  00043208
014684  00043336
014691  00043336
monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014701  00039344
014707  00039344
014709  00038976
014700  00038976
014708  00039256  (hmm)
014703  00039256
014705  00039400
014704  00039400
014706  00039320
014702  00039320
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014713  00058552
014716  00058664
014719  00058600
014715  00058600
014718  00058520
014722  00058400
014721  00058768
014717  00058768
014714  00058552
014720  00058560
monteverdi:/abuild/mike/:[0]# massive_intr 10 60
014732  00058736
014734  00058760
014729  00040872
014736  00059184
014728  00059184
014727  00058744
014733  00058760
014731  00059320
014730  00059280
014735  00041072
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014749  00040608
014748  00040616
014745  00039360
014750  00039360
014751  00039416
014747  00039416
014752  00039336
014746  00039336
014744  00039480
014753  00039480
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014757  00039272
014761  00039272
014765  00039528
014756  00039528
014759  00039352
014760  00039352
014764  00039248
014762  00039248
014758  00039352
014763  00039352
monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
014773  00059680
014769  00059680
014768  00059144
014777  00059144
014775  00059688
014774  00059688
014770  00059264
014771  00059264
014772  00059528
014776  00059528

Ok box, whatever blows your skirt up.  I'm done.

Non
Uniform
Mysterious
Artifacts


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 15:22                           ` Borislav Petkov
@ 2013-01-28 15:55                             ` Mike Galbraith
  2013-01-29  1:38                               ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Mike Galbraith @ 2013-01-28 15:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alex Shi, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Mon, 2013-01-28 at 16:22 +0100, Borislav Petkov wrote: 
> On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
> > > No no, that's not restricted to one node.  It's just overloaded because
> > > I turned balancing off at the NODE domain level.
> > 
> > Which shows only that I was multitasking, and in a rush.  Boy was that
> > dumb.  Hohum.
> 
> Ok, let's take a step back and slow it down a bit so that people like me
> can understand it: you want to try it with disabled load balancing on
> the node level, AFAICT. But with that many tasks, perf will suck anyway,
> no? Unless you want to benchmark the numa-aware aspect and see whether
> load balancing on the node level feels differently, perf-wise?

The broken thought was, since it's not wakeup path, stop node balance..
but killing all of it killed FORK/EXEC balance, oops.

I think I'm done with this thing though.  See mail I just sent.   There
are better things to do than letting box jerk my chain endlessly ;-)

-Mike


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  6:42                     ` Mike Galbraith
  2013-01-28  7:20                       ` Mike Galbraith
@ 2013-01-29  1:17                       ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  1:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 02:42 PM, Mike Galbraith wrote:
> Back to original 1ms sleep, 8ms work, turning NUMA box into a single
> node 10 core box with numactl.
> 
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045286  00043872
> 045289  00043464
> 045284  00043488
> 045287  00043440
> 045283  00043416
> 045281  00044456
> 045285  00043456
> 045288  00044312
> 045280  00043048
> 045282  00043240

Um, no idea why the powersaving data is so low.
> monteverdi:/abuild/mike/:[0]# echo balance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045300  00052536
> 045307  00052472
> 045304  00052536
> 045299  00052536
> 045305  00052520
> 045306  00052528
> 045302  00052528
> 045303  00052528
> 045308  00052512
> 045301  00052520
> monteverdi:/abuild/mike/:[0]# echo performance > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 045339  00052600
> 045340  00052608
> 045338  00052600
> 045337  00052608
> 045343  00052600
> 045341  00052600
> 045336  00052608
> 045335  00052616
> 045334  00052576


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 11:32                       ` Mike Galbraith
  2013-01-28 11:40                         ` Mike Galbraith
@ 2013-01-29  1:32                         ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  1:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel


>> then the above no_node-load_balance thing suffers a small-ish dip at 320
>> tasks, yeah.
> 
> No no, that's not restricted to one node.  It's just overloaded because
> I turned balancing off at the NODE domain level.
> 
>> And AFAICR, the effect of disabling boosting will be visible in the
>> small count tasks cases anyway because if you saturate the cores with
>> tasks, the boosting algorithms tend to get the box out of boosting for
>> the simple reason that the power/perf headroom simply disappears due to
>> the SOC being busy.
>>
>>> 640     100294.8        98      38.7    570.9   2.6118
>>> 1280    115998.2        97      66.9    1132.8  1.5104
>>> 2560    125820.0        97      123.3   2256.6  0.8191
>>
>> I dunno about those. maybe this is expected with so many tasks or do we
>> want to optimize that case further?
> 
> When using all 4 nodes properly, that's still scaling.  Here, I

Without regular node balancing, only wake balancing is left in
select_task_rq_fair for the aim7 testing (I assume you used the shared
workfile; most of the load is plain CPU work, with only a little
exec/fork).

Since wake balancing only happens within the same LLC domain, I guess
that is the reason for this.

> intentionally screwed up balancing to watch the low end.  High end is
> expected wreckage.



^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 11:29                     ` Borislav Petkov
  2013-01-28 11:32                       ` Mike Galbraith
@ 2013-01-29  1:36                       ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  1:36 UTC (permalink / raw)
  To: Borislav Petkov, Mike Galbraith, torvalds, mingo, peterz, tglx,
	akpm, arjan, pjt, namhyung, vincent.guittot, gregkh, preeti,
	viresh.kumar, linux-kernel


> Benchmark       Version Machine Run Date
> AIM Multiuser Benchmark - Suite VII     "1.1"   performance     Jan 28 08:09:20 2013
> 
> Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
> 1       438.8           100     13.8    3.8     7.3135
> 5       2634.8          99      11.5    7.2     8.7826
> 10      5396.3          99      11.2    11.4    8.9938
> 20      10725.7         99      11.3    24.0    8.9381
> 40      20183.2         99      12.0    38.5    8.4097
> 80      35620.9         99      13.6    71.4    7.4210
> 160     57203.5         98      16.9    137.8   5.9587
> 320     81995.8         98      23.7    271.3   4.2706
> 
> then the above no_node-load_balance thing suffers a small-ish dip at 320
> tasks, yeah.
> 
> And AFAICR, the effect of disabling boosting will be visible in the
> small count tasks cases anyway because if you saturate the cores with
> tasks, the boosting algorithms tend to get the box out of boosting for
> the simple reason that the power/perf headroom simply disappears due to
> the SOC being busy.

Sure. And according to the context of the email thread, I guess this
result has boosting enabled, right?


> 
>> 640     100294.8        98      38.7    570.9   2.6118
>> 1280    115998.2        97      66.9    1132.8  1.5104
>> 2560    125820.0        97      123.3   2256.6  0.8191
> 
> I dunno about those. maybe this is expected with so many tasks or do we
> want to optimize that case further?
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 15:55                             ` Mike Galbraith
@ 2013-01-29  1:38                               ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  1:38 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 11:55 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 16:22 +0100, Borislav Petkov wrote: 
>> On Mon, Jan 28, 2013 at 12:40:46PM +0100, Mike Galbraith wrote:
>>>> No no, that's not restricted to one node.  It's just overloaded because
>>>> I turned balancing off at the NODE domain level.
>>>
>>> Which shows only that I was multitasking, and in a rush.  Boy was that
>>> dumb.  Hohum.
>>
>> Ok, let's take a step back and slow it down a bit so that people like me
>> can understand it: you want to try it with disabled load balancing on
>> the node level, AFAICT. But with that many tasks, perf will suck anyway,
>> no? Unless you want to benchmark the numa-aware aspect and see whether
>> load balancing on the node level feels differently, perf-wise?
> 
> The broken thought was, since it's not wakeup path, stop node balance..
> but killing all of it killed FORK/EXEC balance, oops.

Um, sure. So I guess all of the tasks were just running on one node.
> 
> I think I'm done with this thing though.  See mail I just sent.   There
> are better things to do than letting box jerk my chain endlessly ;-)
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 15:47                 ` Mike Galbraith
@ 2013-01-29  1:45                   ` Alex Shi
  2013-01-29  4:03                     ` Mike Galbraith
  2013-01-29  2:27                   ` Alex Shi
  1 sibling, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-01-29  1:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 11:47 PM, Mike Galbraith wrote:
> On Mon, 2013-01-28 at 06:17 +0100, Mike Galbraith wrote:
> 
> Ok damnit.
> 
>> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
>> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
>> 043321  00058616
>> 043313  00058616
>> 043318  00058968
>> 043317  00058968
>> 043316  00059184
>> 043319  00059192
>> 043320  00059048
>> 043314  00059048
>> 043312  00058176
>> 043315  00058184
> 
> That was boost if you like, and free to roam 4 nodes.
> 
> monteverdi:/abuild/mike/:[0]# echo powersaving > /sys/devices/system/cpu/sched_policy/current_sched_policy
> monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014618  00039616
> 014623  00039256
> 014617  00039256
> 014620  00039304
> 014621  00039304  (wait a minute, you said..)
> 014616  00039080
> 014625  00039064
> 014622  00039672
> 014624  00039624
> 014619  00039672
> monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014635  00058160
> 014633  00058592
> 014638  00058592
> 014636  00058160
> 014632  00058200
> 014634  00058704
> 014639  00058704
> 014641  00058200
> 014640  00058560
> 014637  00058560
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014673  00059504
> 014676  00059504
> 014674  00059064
> 014672  00059064
> 014675  00058560
> 014671  00058560
> 014677  00059248
> 014668  00058864
> 014669  00059248
> 014670  00058864
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014686  00043472
> 014689  00043472
> 014685  00043760
> 014690  00043760
> 014687  00043528
> 014688  00043528  (hmm)
> 014683  00043216
> 014692  00043208
> 014684  00043336
> 014691  00043336

I am sorry, Mike. Do the above 3 runs all use the same sched policy? And
the same question for the following testing.

> monteverdi:/abuild/mike/:[0]# echo 0 > /sys/devices/system/cpu/cpufreq/boost
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014701  00039344
> 014707  00039344
> 014709  00038976
> 014700  00038976
> 014708  00039256  (hmm)
> 014703  00039256
> 014705  00039400
> 014704  00039400
> 014706  00039320
> 014702  00039320
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014713  00058552
> 014716  00058664
> 014719  00058600
> 014715  00058600
> 014718  00058520
> 014722  00058400
> 014721  00058768
> 014717  00058768
> 014714  00058552
> 014720  00058560
> monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> 014732  00058736
> 014734  00058760
> 014729  00040872
> 014736  00059184
> 014728  00059184
> 014727  00058744
> 014733  00058760
> 014731  00059320
> 014730  00059280
> 014735  00041072
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014749  00040608
> 014748  00040616
> 014745  00039360
> 014750  00039360
> 014751  00039416
> 014747  00039416
> 014752  00039336
> 014746  00039336
> 014744  00039480
> 014753  00039480
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014757  00039272
> 014761  00039272
> 014765  00039528
> 014756  00039528
> 014759  00039352
> 014760  00039352
> 014764  00039248
> 014762  00039248
> 014758  00039352
> 014763  00039352
> monteverdi:/abuild/mike/:[0]# numactl --cpunodebind=0 massive_intr 10 60
> 014773  00059680
> 014769  00059680
> 014768  00059144
> 014777  00059144
> 014775  00059688
> 014774  00059688
> 014770  00059264
> 014771  00059264
> 014772  00059528
> 014776  00059528
> 
> Ok box, whatever blows your skirt up.  I'm done.
> 
> Non
> Uniform
> Mysterious
> Artifacts
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28 15:47                 ` Mike Galbraith
  2013-01-29  1:45                   ` Alex Shi
@ 2013-01-29  2:27                   ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  2:27 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 11:47 PM, Mike Galbraith wrote:
> 014776  00059528
> 
> Ok box, whatever blows your skirt up.  I'm done.

Many thanks for so much fruitful testing! :D

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-29  1:45                   ` Alex Shi
@ 2013-01-29  4:03                     ` Mike Galbraith
  0 siblings, 0 replies; 88+ messages in thread
From: Mike Galbraith @ 2013-01-29  4:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Tue, 2013-01-29 at 09:45 +0800, Alex Shi wrote: 
> On 01/28/2013 11:47 PM, Mike Galbraith wrote:

> > monteverdi:/abuild/mike/:[0]# echo 1 > /sys/devices/system/cpu/cpufreq/boost
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014635  00058160
> > 014633  00058592
> > 014638  00058592
> > 014636  00058160
> > 014632  00058200
> > 014634  00058704
> > 014639  00058704
> > 014641  00058200
> > 014640  00058560
> > 014637  00058560
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014673  00059504
> > 014676  00059504
> > 014674  00059064
> > 014672  00059064
> > 014675  00058560
> > 014671  00058560
> > 014677  00059248
> > 014668  00058864
> > 014669  00059248
> > 014670  00058864
> > monteverdi:/abuild/mike/:[0]# massive_intr 10 60
> > 014686  00043472
> > 014689  00043472
> > 014685  00043760
> > 014690  00043760
> > 014687  00043528
> > 014688  00043528  (hmm)
> > 014683  00043216
> > 014692  00043208
> > 014684  00043336
> > 014691  00043336
> 
> I am sorry, Mike. Do the above 3 runs all use the same sched policy? And
> the same question for the following testing.

Yeah, they're back to back repeats.  Using dirt simple massive_intr
didn't help clarify aim7 oddity.

aim7 is fully repeatable, seems to be saying that consolidation of small
independent jobs is a win, that spreading before fully saturated has its
price, just as consolidation of large coordinated burst has its price.

Seems to cut both ways.. but why not, everything else does.

-Mike


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-28  5:19         ` Alex Shi
  2013-01-28  6:49           ` Mike Galbraith
@ 2013-01-29  6:02           ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-01-29  6:02 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/28/2013 01:19 PM, Alex Shi wrote:
> On 01/27/2013 06:40 PM, Borislav Petkov wrote:
>> On Sun, Jan 27, 2013 at 10:41:40AM +0800, Alex Shi wrote:
>>> Just rerun some benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
>>> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
>>> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
>>> performance change found.
>>
>> Ok, good, You could put that in one of the commit messages so that it is
>> there and people know that this patchset doesn't cause perf regressions
>> with the bunch of benchmarks.
>>
>>> I also tested balance policy/powersaving policy with above benchmark,
>>> found, the specjbb2005 drop much 30~50% on both of policy whenever
>>> with openjdk or jrockit. and hackbench drops a lots with powersaving
>>> policy on snb 4 sockets platforms. others has no clear change.

Sorry, the testing configuration was unfair for the specjbb2005 results
here. I had hard-pinned the JVM and used hugepages for peak performance.

With the hard pin removed and no hugepages, balance/powersaving both drop
about 5% vs the performance policy, and the performance policy result is
similar to 3.8-rc5.

>>
>> I guess this is expected because there has to be some performance hit
>> when saving power...
>>
> 
> BTW, I had tested the v3 version based on sched numa -- on tip/master.
> The specjbb just has about 5~7% dropping on balance/powersaving policy.
> The power scheduling is done after the numa scheduling logic.
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
                   ` (19 preceding siblings ...)
  2013-01-28  1:28 ` Alex Shi
@ 2013-02-04  1:35 ` Alex Shi
  2013-02-04 11:09   ` Ingo Molnar
  20 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-04  1:35 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On 01/24/2013 11:06 AM, Alex Shi wrote:
> Since the runnable info needs 345ms to accumulate, balancing
> doesn't do well for many tasks burst waking. After talking with Mike
> Galbraith, we are agree to just use runnable avg in power friendly 
> scheduling and keep current instant load in performance scheduling for 
> low latency.
> 
> So the biggest change in this version is removing runnable load avg in
> balance and just using runnable data in power balance.
> 
> The patchset bases on Linus' tree, includes 3 parts,
> ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> ----------------------
> the first patch remove one domain level. patch 2~5 simplified fork/wake
> balancing, it can increase 10+% hackbench performance on our 4 sockets
> SNB EP machine.
> 
> V3 change:
> a, added the first patch to remove one domain level on x86 platform.
> b, some small changes according to Namhyung Kim's comments, thanks!
> 
> ** 2, bug fix of load avg and remove the CONFIG_FAIR_GROUP_SCHED limit
> ----------------------
> patch 6~8, That using runnable avg in load balancing, with
> two initial runnable variables fix.
> 
> V4 change:
> a, remove runnable log avg using in balancing.
> 
> V3 change:
> a, use rq->cfs.runnable_load_avg as cpu load not
> rq->avg.load_avg_contrib, since the latter need much time to accumulate
> for new forked task,
> b, a build issue fixed with Namhyung Kim's reminder.
> 
> ** 3, power awareness scheduling, patch 9~18.
> ----------------------
> The subset implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
> It defines 2 new power aware policy 'balance' and 'powersaving' and then
> try to spread or pack tasks on each sched groups level according the
> different scheduler policy. That can save much power when task number in
> system is no more then LCPU number.
> 
> As mentioned in the power aware scheduler proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, pack tasks on less sched_groups will reduce power consumption
> 
> The first assumption make performance policy take over scheduling when
> system busy.
> The second assumption make power aware scheduling try to move
> disperse tasks into fewer groups until that groups are full of tasks.
> 
> Some power testing data is in the last 2 patches.
> 
> V4 change:
> a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> Galbraith and Namhyung Kim. Thanks!
> b, take Morten's suggestion to set different criteria for different
> policy in small task packing.
> c, shorter latency in power aware scheduling.
> 
> V3 change:
> a, engaged nr_running in max potential utils consideration in periodic
> power balancing.
> b, try exec/wake small tasks on running cpu not idle cpu.
> 
> V2 change:
> a, add lazy power scheduling to deal with kbuild like benchmark.
> 
> 
> Thanks Fengguang Wu for the build testing of this patchset!


Adding a summary of the testing reports that were posted:
Alex Shi tested the benchmarks: kbuild, specjbb2005, oltp, tbench, aim9, hackbench, fileio-cfq of sysbench, dbench, aiostress, multithreaded
loopback netperf, on core2, nhm, wsm, snb platforms:
	a, no clear performance change with the performance policy
	b, specjbb2005 drops 5~7% with the balance/powersaving policies on SNB/NHM platforms; hackbench drops 30~70% on the SNB EP 4S machine.
	c, no other performance change with the balance/powersaving policies.

test result from Mike Galbraith:
---------
With aim7 compute on 4 node 40 core box, I see stable throughput
improvement at tasks = nr_cores and below w. balance and powersaving. 

         3.8.0-performance                  3.8.0-balance          3.8.0-powersaving
Tasks    jobs/min/task       cpu   jobs/min/task       cpu    jobs/min/task         cpu
    1         432.8571      3.99        433.4764      3.97         433.1665        3.98
    5         480.1902     12.49        510.9612      7.55         497.5369        8.22
   10         429.1785     40.14        533.4507     11.13         518.3918       12.15
   20         424.3697     63.14        529.7203     23.72         528.7958       22.08
   40         419.0871    171.42        500.8264     51.44         517.0648       42.45

No deltas after that.  There were also no deltas between patched kernel
using performance policy and virgin source.
----------

Ingo, I'd appreciate any comments from you. :)


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-02-04  1:35 ` Alex Shi
@ 2013-02-04 11:09   ` Ingo Molnar
  2013-02-05  2:26     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Ingo Molnar @ 2013-02-04 11:09 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel


* Alex Shi <alex.shi@intel.com> wrote:

> On 01/24/2013 11:06 AM, Alex Shi wrote:
> > Since the runnable info needs 345ms to accumulate, balancing
> > doesn't do well for many tasks burst waking. After talking with Mike
> > Galbraith, we are agree to just use runnable avg in power friendly 
> > scheduling and keep current instant load in performance scheduling for 
> > low latency.
> > 
> > So the biggest change in this version is removing runnable load avg in
> > balance and just using runnable data in power balance.
> > 
> > The patchset bases on Linus' tree, includes 3 parts,
> > ** 1, bug fix and fork/wake balancing clean up. patch 1~5,
> > ----------------------
> > the first patch remove one domain level. patch 2~5 simplified fork/wake
> > balancing, it can increase 10+% hackbench performance on our 4 sockets
> > SNB EP machine.
> > 
> > V3 change:
> > a, added the first patch to remove one domain level on x86 platform.
> > b, some small changes according to Namhyung Kim's comments, thanks!
> > 
> > ** 2, bug fix of load avg and remove the CONFIG_FAIR_GROUP_SCHED limit
> > ----------------------
> > patch 6~8, That using runnable avg in load balancing, with
> > two initial runnable variables fix.
> > 
> > V4 change:
> > a, remove runnable log avg using in balancing.
> > 
> > V3 change:
> > a, use rq->cfs.runnable_load_avg as cpu load not
> > rq->avg.load_avg_contrib, since the latter need much time to accumulate
> > for new forked task,
> > b, a build issue fixed with Namhyung Kim's reminder.
> > 
> > ** 3, power awareness scheduling, patch 9~18.
> > ----------------------
> > The subset implement/consummate the rough power aware scheduling
> > proposal: https://lkml.org/lkml/2012/8/13/139.
> > It defines 2 new power aware policy 'balance' and 'powersaving' and then
> > try to spread or pack tasks on each sched groups level according the
> > different scheduler policy. That can save much power when task number in
> > system is no more then LCPU number.
> > 
> > As mentioned in the power aware scheduler proposal, Power aware
> > scheduling has 2 assumptions:
> > 1, race to idle is helpful for power saving
> > 2, pack tasks on less sched_groups will reduce power consumption
> > 
> > The first assumption make performance policy take over scheduling when
> > system busy.
> > The second assumption make power aware scheduling try to move
> > disperse tasks into fewer groups until that groups are full of tasks.
> > 
> > Some power testing data is in the last 2 patches.
> > 
> > V4 change:
> > a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> > Galbraith and Namhyung Kim. Thanks!
> > b, take Morten's suggestion to set different criteria for different
> > policy in small task packing.
> > c, shorter latency in power aware scheduling.
> > 
> > V3 change:
> > a, engaged nr_running in max potential utils consideration in periodic
> > power balancing.
> > b, try exec/wake small tasks on running cpu not idle cpu.
> > 
> > V2 change:
> > a, add lazy power scheduling to deal with kbuild like benchmark.
> > 
> > 
> > Thanks Fengguang Wu for the build testing of this patchset!
> 
> 
> Adding a summary of the testing reports that were posted:
> Alex Shi tested the benchmarks: kbuild, specjbb2005, oltp, tbench, aim9, hackbench, fileio-cfq of sysbench, dbench, aiostress, multithreaded
> loopback netperf, on core2, nhm, wsm, snb platforms:
> 	a, no clear performance change with the performance policy
> 	b, specjbb2005 drops 5~7% with the balance/powersaving policies on SNB/NHM platforms; hackbench drops 30~70% on the SNB EP 4S machine.
> 	c, no other performance change with the balance/powersaving policies.
> 
> test result from Mike Galbraith:
> ---------
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving. 
> 
>          3.8.0-performance                  3.8.0-balance          3.8.0-powersaving
> Tasks    jobs/min/task       cpu   jobs/min/task       cpu    jobs/min/task         cpu
>     1         432.8571      3.99        433.4764      3.97         433.1665        3.98
>     5         480.1902     12.49        510.9612      7.55         497.5369        8.22
>    10         429.1785     40.14        533.4507     11.13         518.3918       12.15
>    20         424.3697     63.14        529.7203     23.72         528.7958       22.08
>    40         419.0871    171.42        500.8264     51.44         517.0648       42.45
> 
> No deltas after that.  There were also no deltas between patched kernel
> using performance policy and virgin source.
> ----------
> 
> Ingo, I would appreciate any comments from you. :)

Have you tried to quantify the actual real or expected power 
savings with the knob enabled?

I'd also love to have an automatic policy here, with a knob that 
has 3 values:

   0: always disabled
   1: automatic
   2: always enabled

here enabled/disabled is your current knob's functionality, and 
those can also be used by user-space policy daemons/handlers.

The interesting thing would be '1' which should be the default: 
on laptops that are on battery it should result in a power 
saving policy, on laptops that are on AC or on battery-less 
systems it should mean 'performance' policy.

It should generally default to 'performance', switching to 
'power saving on' only if there's positive, reliable information 
somewhere in the kernel that we are operating on battery power. 
A callback or two would have to go into the ACPI battery driver 
I suspect.

So I'd like this feature to be a tangible improvement for laptop 
users (as long as the laptop hardware is passing us battery/AC 
events reliably).

Or something like that - with .config switches to influence 
these values as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-02-04 11:09   ` Ingo Molnar
@ 2013-02-05  2:26     ` Alex Shi
  2013-02-06  5:08       ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-05  2:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, Zhang, Rui


>> Ingo, I would appreciate any comments from you. :)
> 
> Have you tried to quantify the actual real or expected power 
> savings with the knob enabled?

Thanks a lot for your comments! :)

Yes, the following power data is copied from patch 17:
---
A test can show the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT (the data is in Watts):
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and 68 Watts,
since the HT capacity and the core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system has a few continuously running tasks, using a power policy can
give a performance/power gain, as in the sysbench fileio randrw test with 16
threads on the SNB EP box.
=====

and the following is from patch 18:
---
On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x vmlinux'
results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

data format is: 175.603 /417 13
	175.603: average Watts
	417: seconds (compile time)
	13:  scaled performance/power = 1000000 / seconds / watts
=====
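
As a sanity check on that metric: for the x = 1 powersaving column,
1000000 / 417 / 175.603 works out to about 13.7, which matches the
reported 13 once rounded down to an integer.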

Some data for parallel compression: https://lkml.org/lkml/2012/12/11/155
---
Another test of parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving and
balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On a NHM EP box
         powersaving               balance               performance
x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76

On a 2 sockets SNB EP box.
         powersaving               balance               performance
x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44

data format is: 166.516 /88 68
        166.516: average Watts
        88: seconds(compress time)
        68:  scaled performance/power = 1000000 / time / power
=====

BTW, bltk-game with openarena dropped 0.3/1.5 Watts with the powersaving
policy, or 0.2/0.5 Watts with the balance policy, on my wsm/snb laptops.
> 
> I'd also love to have an automatic policy here, with a knob that 
> has 3 values:
> 
>    0: always disabled
>    1: automatic
>    2: always enabled
> 
> here enabled/disabled is your current knob's functionality, and 
> those can also be used by user-space policy daemons/handlers.

Sure, this patchset has a knob for user-space policy selection:

$cat /sys/devices/system/cpu/sched_policy/available_sched_policy
performance powersaving balance

The user can change the policy with the 'echo' command:
 echo performance > /sys/devices/system/cpu/current_sched_policy

The 'performance' policy means power friendly scheduling is 'always disabled'.

The 'balance' and 'powersaving' policies are automatically power friendly:
the system bypasses power scheduling when the cpu utilisation in a sched
domain goes beyond the domain's cpu weight (powersaving) or beyond the
domain's capacity (balance).
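
A rough sketch of that bypass rule, as a hypothetical helper on plain
integers (the actual patches work on sched_group statistics, so this is
only an illustration):

	/*
	 * Keep power aware packing only while the domain still has head
	 * room; otherwise fall back to normal performance balancing.
	 */
	static int keep_power_scheduling(int balance_policy, int domain_util,
					 int domain_weight, int domain_capacity)
	{
		/* threshold: nr of LCPUs (powersaving) or group capacity (balance) */
		int threshold = balance_policy ? domain_capacity : domain_weight;

		/* domain_util is the summed utilisation, in whole-cpu units */
		return domain_util <= threshold;
	}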

There is no 'always enabled' power scheduling, since the patchset is based
on 'race to idle', but it is easy to add that if needed.

> 
> The interesting thing would be '1' which should be the default: 
> on laptops that are on battery it should result in a power 
> saving policy, on laptops that are on AC or on battery-less 
> systems it should mean 'performance' policy.

Yes, with the above sysfs interface that is easy to do. :)
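
For example, a minimal user-space switcher could look like the sketch
below. It is only an illustration, not part of the patchset, and it
assumes the AC adapter shows up as /sys/class/power_supply/AC/online
(the adapter name varies across machines):

	#include <stdio.h>

	int main(void)
	{
		char on_ac = '1';	/* assume AC if we cannot tell */
		FILE *f = fopen("/sys/class/power_supply/AC/online", "r");

		if (f) {
			if (fread(&on_ac, 1, 1, f) != 1)
				on_ac = '1';
			fclose(f);
		}

		f = fopen("/sys/devices/system/cpu/current_sched_policy", "w");
		if (!f)
			return 1;
		fputs(on_ac == '1' ? "performance" : "powersaving", f);
		fclose(f);
		return 0;
	}

A policy daemon could simply re-run something like this on AC/battery
events from user space.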
> 
> It should generally default to 'performance', switching to 
> 'power saving on' only if there's positive, reliable information 
> somewhere in the kernel that we are operating on battery power. 
> A callback or two would have to go into the ACPI battery driver 
> I suspect.
> 
> So I'd like this feature to be a tangible improvement for laptop 
> users (as long as the laptop hardware is passing us battery/AC 
> events reliably).

Maybe it is better to let the system admin change it from user space? I am
not sure anyone would like a callback added to the ACPI battery driver.

CC to Zhang Rui.
> 
> Or something like that - with .config switches to influence 
> these values as well.
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling
  2013-02-05  2:26     ` Alex Shi
@ 2013-02-06  5:08       ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-06  5:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, Zhang, Rui

BTW,

Since NUMA balance scheduling is also a kind of cpu locality policy, it
is naturally compatible with power aware scheduling.

The v2/v3 of this patchset were developed on tip/master, and testing showed
that the two scheduling policies work together well.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-01-24  3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
@ 2013-02-12 10:11   ` Peter Zijlstra
  2013-02-13 13:22     ` Alex Shi
  2013-02-13 14:17     ` Alex Shi
  0 siblings, 2 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:11 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
> in the original commit b5d978e0c7e79a, and was carelessly removed when
> cleaning up the obsolete power scheduler. Then commit 6956dc568 restored
> the flag on the CPU domain only. It works, but it introduces an extra
> domain level since it makes MC/CPU different.
> 
> So, restore the flag on the MC domain too, to remove a domain level on
> x86 platforms.

This fails to clearly state why it's desirable.. I'm guessing it's because
we should use sibling cache domains before sibling threads, right?

A clearly stated reason is always preferable over: it was this way, make
it so again; which leaves us wondering why. 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 02/18] sched: select_task_rq_fair clean up
  2013-01-24  3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
@ 2013-02-12 10:14   ` Peter Zijlstra
  2013-02-13 14:44     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> It is impossible to miss a task-allowed cpu in an eligible group.

I suppose your reasoning goes like: tsk->cpus_allowed is protected by
->pi_lock, we hold this, therefore it cannot change and
find_idlest_group() dtrt?

We can then state that this is due to adding proper serialization to
tsk->cpus_allowed.

> And since find_idlest_group only returns a different group which
> excludes the old cpu, it's also impossible to find a new cpu that is the
> same as the old cpu.

Sounds plausible, but I'm not convinced, do we have hard serialization
against hotplug?




^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 03/18] sched: fix find_idlest_group mess logical
  2013-01-24  3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
@ 2013-02-12 10:16   ` Peter Zijlstra
  2013-02-13 15:07     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:16 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> There are 4 situations in the function:
> 1, no group allows the task;
> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
> 2, only the local group allows the task;
> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
> 3, only a non-local group allows the task;
> 	so min_load assigned, this_load = 0, idlest != NULL
> 4, the local group plus another group allow the task.
> 	so min_load assigned, this_load assigned, idlest != NULL
> 
> The current logic will return NULL in the first 3 scenarios,
> and still returns NULL if the idlest group is heavier than the
> local group in the 4th situation.
> 
> Actually, I think the groups in situations 2 and 3 are also eligible to
> host the task. And in the 4th situation, I agree to bias toward the local group.

I'm not convinced this is actually a cleanup.. taken together with patch
4 (which is a direct consequence of this patch afaict) you replace one
conditional with 2.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-01-24  3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
@ 2013-02-12 10:22   ` Peter Zijlstra
  2013-02-14  3:13     ` Alex Shi
  2013-02-14  8:12     ` Preeti U Murthy
  0 siblings, 2 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:22 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> Guess the search cpu from bottom to up in domain tree come from
> commit 3dbd5342074a1e sched: multilevel sbe sbf, the purpose is
> balancing over tasks on all level domains.
> 
> This balancing cost too much if there has many domain/groups in a
> large system.
> 
> If we remove this code, we will get quick fork/exec/wake with a
> similar
> balancing result amony whole system.
> 
> This patch increases 10+% performance of hackbench on my 4 sockets
> SNB machines and about 3% increasing on 2 sockets servers.
> 
> 
Numbers be groovy.. still I'd like a little more on the behavioural
change. Expand on what exactly is lost by this change so that if we
later find a regression we have a better idea of what and how.

For instance, note how find_idlest_group() isn't symmetric wrt
local_group. So by not doing the domain iteration we change things.

Now, it might well be that all this is somewhat overkill as it is, but
should we then not replace all of it with a simple min search over all
eligible cpus; that would be a real clean up.
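
Something like this completely untested sketch, re-using the existing
weighted_cpuload() and tsk_cpus_allowed() helpers:

	/* untested sketch of the "simple min search" idea */
	static int find_min_load_cpu(struct sched_domain *sd, struct task_struct *p)
	{
		unsigned long load, min_load = ULONG_MAX;
		int i, min_cpu = -1;

		for_each_cpu_and(i, sched_domain_span(sd), tsk_cpus_allowed(p)) {
			load = weighted_cpuload(i);
			if (load < min_load) {
				min_load = load;
				min_cpu = i;
			}
		}
		return min_cpu;
	}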
 


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 06/18] sched: give initial value for runnable avg of sched entities.
  2013-01-24  3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
@ 2013-02-12 10:23   ` Peter Zijlstra
  0 siblings, 0 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:23 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
> after a new task is forked.
> Otherwise random values in the above variables cause a mess when doing
> new task enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1743746 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> +	p->se.avg.load_avg_contrib = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));

pjt?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-01-24  3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-12 10:26   ` Peter Zijlstra
  2013-02-13 15:14     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:26 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> +       /*
> +        * set the initial load avg of new task same as its load
> +        * in order to avoid brust fork make few cpu too heavier
> +        */
> +       if (flags & ENQUEUE_NEWTASK)
> +               se->avg.load_avg_contrib = se->load.weight; 

I seem to have vague recollections of a discussion with pjt where we
talk about the initial behaviour of tasks; from this haze I had the
impression that new tasks should behave like full weight..

PJT, is something more fundamental screwy?


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-01-24  3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-02-12 10:27   ` Peter Zijlstra
  2013-02-13 15:23     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
> we can use runnable load variables.
> 
It would be nice if we could quantify the performance hit of doing so.
Haven't yet looked at later patches to see if we remove anything to
off-set this.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 09/18] sched: add sched_policies in kernel
  2013-01-24  3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
@ 2013-02-12 10:36   ` Peter Zijlstra
  2013-02-13 15:41     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> The current scheduler behavior only considers the performance of the
> system, so it tries to spread tasks over more cpu sockets and cpu cores.
> 
> To add the consideration of power awareness, the patchset adds
> 2 kinds of scheduler policy: powersaving and balance. They will use
> the runnable load util in scheduler balancing. The current scheduling is
> taken as the performance policy.
> 
> performance: the current scheduling behaviour, try to spread tasks
>                 on more CPU sockets or cores. performance oriented.
> powersaving: will pack tasks into few sched group until all LCPU in the
>                 group is full, power oriented.
> balance    : will pack tasks into few sched group until group_capacity
>                 numbers CPU is full, balance between performance and
> 		powersaving.

_WHY_ do you start out with so much choice?

If your power policy is so abysmally poor on performance that you
already know you need a 3rd policy to keep people happy, maybe you're
doing something wrong?

> +#define SCHED_POLICY_PERFORMANCE	(0x1)
> +#define SCHED_POLICY_POWERSAVING	(0x2)
> +#define SCHED_POLICY_BALANCE		(0x4)
> +
> +extern int __read_mostly sched_policy;

I'd much prefer: sched_balance_policy. Scheduler policy is a concept
already well defined by posix and we don't need it to mean two
completely different things.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 11/18] sched: log the cpu utilization at rq
  2013-01-24  3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-12 10:39   ` Peter Zijlstra
  2013-02-14  3:10     ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-12 10:39 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
> 
> The cpu's utilization measures how busy the cpu is:
>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>                 / cpu_rq(cpu)->avg.runnable_avg_period;
> 
> Since the util is no more than 1, we use its percentage value in later
> calculations, and set FULL_UTIL as 100%.
> 
> In later power aware scheduling, we care about how busy the cpu is, not
> how much load weight it carries. As to power consumption, it is more
> related to cpu busy time than to load weight.

I think we can make that argument in general; that is irrespective of
the actual policy. We simply never had anything better to go with.

So please clarify why you think this only applies to power aware
scheduling.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-02-12 10:11   ` Peter Zijlstra
@ 2013-02-13 13:22     ` Alex Shi
  2013-02-15 12:38       ` Peter Zijlstra
  2013-02-13 14:17     ` Alex Shi
  1 sibling, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 13:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:11 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
>> in the original commit b5d978e0c7e79a, and was carelessly removed when
>> cleaning up the obsolete power scheduler. Then commit 6956dc568 restored
>> the flag on the CPU domain only. It works, but it introduces an extra
>> domain level since it makes MC/CPU different.
>>
>> So, restore the flag on the MC domain too, to remove a domain level on
>> x86 platforms.

Peter, I am very very happy to see you again! :)
> 
> This fails to clearly state why it's desirable.. I'm guessing it's because
> we should use sibling cache domains before sibling threads, right?

No, the flag is set on the MC/CPU domains, but it is checked in their
parents' balancing, like in the NUMA domain.
Without the flag, we get NUMA domain imbalance, like on my 2 socket
NHM EP: 3 of 4 tasks were assigned to socket 0 (lcpu 10, 12, 14).

In this case, update_sd_pick_busiest() needs a reduced group_capacity to
return true:
	if (sgs->sum_nr_running > sgs->group_capacity)
		return true;
then NUMA domain balancing gets a chance to start.

---------
05:00:28 AM  CPU    %usr   %nice      %idle
05:00:29 AM  all   25.00    0.00      74.94
05:00:29 AM    0    0.00    0.00      99.00
05:00:29 AM    1    0.00    0.00     100.00
05:00:29 AM    2    0.00    0.00     100.00
05:00:29 AM    3    0.00    0.00     100.00
05:00:29 AM    4    0.00    0.00     100.00
05:00:29 AM    5    0.00    0.00     100.00
05:00:29 AM    6    0.00    0.00     100.00
05:00:29 AM    7    0.00    0.00     100.00
05:00:29 AM    8    0.00    0.00     100.00
05:00:29 AM    9    0.00    0.00     100.00
05:00:29 AM   10  100.00    0.00       0.00
05:00:29 AM   11    0.00    0.00     100.00
05:00:29 AM   12  100.00    0.00       0.00
05:00:29 AM   13    0.00    0.00     100.00
05:00:29 AM   14  100.00    0.00       0.00
05:00:29 AM   15  100.00    0.00       0.00



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-02-12 10:11   ` Peter Zijlstra
  2013-02-13 13:22     ` Alex Shi
@ 2013-02-13 14:17     ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:11 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> The domain flag SD_PREFER_SIBLING was set on both the MC and CPU domains
>> in the original commit b5d978e0c7e79a, and was carelessly removed when
>> cleaning up the obsolete power scheduler. Then commit 6956dc568 restored
>> the flag on the CPU domain only. It works, but it introduces an extra
>> domain level since it makes MC/CPU different.
>>
>> So, restore the flag on the MC domain too, to remove a domain level on
>> x86 platforms.

Maybe I did not make the point of this patch clear enough:
Without this patch, the domain levels on my machines are:
SMT, MC, CPU, NUMA

with this patch, the domain levels are:
SMT, MC, NUMA

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 02/18] sched: select_task_rq_fair clean up
  2013-02-12 10:14   ` Peter Zijlstra
@ 2013-02-13 14:44     ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:14 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> It is impossible to miss a task-allowed cpu in an eligible group.
> 
> I suppose your reasoning goes like: tsk->cpus_allowed is protected by
> ->pi_lock, we hold this, therefore it cannot change and
> find_idlest_group() dtrt?

yes.
> 
> We can then state that this is due to adding proper serialization to
> tsk->cpus_allowed.
> 
>> And since find_idlest_group only returns a different group which
>> excludes the old cpu, it's also impossible to find a new cpu that is the
>> same as the old cpu.
> 
> Sounds plausible, but I'm not convinced, do we have hard serialization
> against hotplug?

Any caller of select_task_rq() checks whether the returned dst_cpu is
still usable, so there is nothing to worry about.
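
For reference, select_task_rq() in core.c already ends with a check
roughly like:

	if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
		     !cpu_online(cpu)))
		cpu = select_fallback_rq(task_cpu(p), p);

so an offlined or disallowed cpu falls back via select_fallback_rq().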
> 
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 03/18] sched: fix find_idlest_group mess logical
  2013-02-12 10:16   ` Peter Zijlstra
@ 2013-02-13 15:07     ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:16 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> There are 4 situations in the function:
>> 1, no group allows the task;
>> 	so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>> 2, only the local group allows the task;
>> 	so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>> 3, only a non-local group allows the task;
>> 	so min_load assigned, this_load = 0, idlest != NULL
>> 4, the local group plus another group allow the task.
>> 	so min_load assigned, this_load assigned, idlest != NULL
>>
>> The current logic will return NULL in the first 3 scenarios,
>> and still returns NULL if the idlest group is heavier than the
>> local group in the 4th situation.
>>
>> Actually, I think the groups in situations 2 and 3 are also eligible to
>> host the task. And in the 4th situation, I agree to bias toward the local group.
> 
> I'm not convinced this is actually a cleanup.. taken together with patch
> 4 (which is a direct consequence of this patch afaict) you replace one
> conditional with 2.
> 

The current logic will always miss the eligible CPUs in the 3rd
situation: 'sd = sd->child' still skips the eligible CPUs in the non-local
group.


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-12 10:26   ` Peter Zijlstra
@ 2013-02-13 15:14     ` Alex Shi
  2013-02-13 15:41       ` Paul Turner
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:26 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> +       /*
>> +        * set the initial load avg of new task same as its load
>> +        * in order to avoid brust fork make few cpu too heavier
>> +        */
>> +       if (flags & ENQUEUE_NEWTASK)
>> +               se->avg.load_avg_contrib = se->load.weight; 
> 
> I seem to have vague recollections of a discussion with pjt where we
> talk about the initial behaviour of tasks; from this haze I had the
> impression that new tasks should behave like full weight..
> 

Here we just make the new task have full weight..

> PJT, is something more fundamental screwy?
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-02-12 10:27   ` Peter Zijlstra
@ 2013-02-13 15:23     ` Alex Shi
  2013-02-13 15:45       ` Paul Turner
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>> we can use runnable load variables.
>>
> It would be nice if we could quantify the performance hit of doing so.
> Haven't yet looked at later patches to see if we remove anything to
> off-set this.
> 

In our rough testing, there were no clear performance changes.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 09/18] sched: add sched_policies in kernel
  2013-02-12 10:36   ` Peter Zijlstra
@ 2013-02-13 15:41     ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-13 15:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:36 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> The current scheduler behavior only considers the performance of the
>> system, so it tries to spread tasks over more cpu sockets and cpu cores.
>>
>> To add the consideration of power awareness, the patchset adds
>> 2 kinds of scheduler policy: powersaving and balance. They will use
>> the runnable load util in scheduler balancing. The current scheduling is
>> taken as the performance policy.
>>
>> performance: the current scheduling behaviour, try to spread tasks
>>                 on more CPU sockets or cores. performance oriented.
>> powersaving: will pack tasks into few sched group until all LCPU in the
>>                 group is full, power oriented.
>> balance    : will pack tasks into few sched group until group_capacity
>>                 numbers CPU is full, balance between performance and
>> 		powersaving.
> 
> _WHY_ do you start out with so much choice?
> 
> If your power policy is so abysmally poor on performance that you
> already know you need a 3rd policy to keep people happy, maybe you're
> doing something wrong?

Nope, not much performance is given up with either the powersaving or the
balance policy. Many of the testing results are in my replies to Ingo's
email on the '0/18' thread -- the cover letter thread:
https://lkml.org/lkml/2013/2/3/353
https://lkml.org/lkml/2013/2/4/735

I introduced the 'balance' policy just because an HT thread LCPU on Intel
CPUs has less than one full core's power. It is for someone who wants to
save power but still wants tasks to have a whole cpu core...
> 
>> +#define SCHED_POLICY_PERFORMANCE	(0x1)
>> +#define SCHED_POLICY_POWERSAVING	(0x2)
>> +#define SCHED_POLICY_BALANCE		(0x4)
>> +
>> +extern int __read_mostly sched_policy;
> 
> I'd much prefer: sched_balance_policy. Scheduler policy is a concept
> already well defined by posix and we don't need it to mean two
> completely different things.
> 

Got it.
-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-13 15:14     ` Alex Shi
@ 2013-02-13 15:41       ` Paul Turner
  2013-02-14 13:07         ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Paul Turner @ 2013-02-13 15:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Wed, Feb 13, 2013 at 7:14 AM, Alex Shi <alex.shi@intel.com> wrote:
> On 02/12/2013 06:26 PM, Peter Zijlstra wrote:
>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>> +       /*
>>> +        * set the initial load avg of new task same as its load
>>> +        * in order to avoid brust fork make few cpu too heavier
>>> +        */
>>> +       if (flags & ENQUEUE_NEWTASK)
>>> +               se->avg.load_avg_contrib = se->load.weight;
>>
>> I seem to have vague recollections of a discussion with pjt where we
>> talk about the initial behaviour of tasks; from this haze I had the
>> impression that new tasks should behave like full weight..
>>
>
> Here just make the new task has full weight..
>
>> PJT, is something more fundamental screwy?
>>

So tasks get the quotient of their runnability over the period.  Given
the period initially is equivalent to runnability it's definitely the
*intent* to start at full-weight and ramp-down.

Thinking on it, perhaps this is running afoul of amortization -- in
that we only recompute this quotient on each 1024us boundary; perhaps
in the fork-bomb case we're too slow to accumulate these.
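
(Concretely, the contribution is computed as roughly
load.weight * runnable_avg_sum / (runnable_avg_period + 1), so seeding
both sum and period at 1024 should start a new task near full weight.)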

Alex, does something like the following help?  This would force an
initial __update_entity_load_avg_contrib() update the first time we
see the task.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1dff78a..9d1c193 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
  * load-balance).
  */
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-       p->se.avg.runnable_avg_period = 0;
-       p->se.avg.runnable_avg_sum = 0;
+       p->se.avg.runnable_avg_period = 1024;
+       p->se.avg.runnable_avg_sum = 1024;
 #endif
 #ifdef CONFIG_SCHEDSTATS
        memset(&p->se.statistics, 0, sizeof(p->se.statistics));



>
>
> --
> Thanks
>     Alex

^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-02-13 15:23     ` Alex Shi
@ 2013-02-13 15:45       ` Paul Turner
  2013-02-14  3:07         ` Preeti U Murthy
  0 siblings, 1 reply; 88+ messages in thread
From: Paul Turner @ 2013-02-13 15:45 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi <alex.shi@intel.com> wrote:
> On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>>> we can use runnable load variables.
>>>
>> It would be nice if we could quantify the performance hit of doing so.
>> Haven't yet looked at later patches to see if we remove anything to
>> off-set this.
>>
>
> In our rough testing, there were no clear performance changes.
>

I'd personally like this to go with a series that actually does
something with it.

There's been a few proposals floating around on _how_ to do this; but
the challenge is in getting it stable enough that all of the wake-up
balancing does not totally perforate your stability gains into the
noise.  select_idle_sibling really is your nemesis here.

It's a small enough patch that it can go at the head of any such
series (and indeed; it was originally structured to make such a patch
rather explicit.)

> --
> Thanks
>     Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-02-13 15:45       ` Paul Turner
@ 2013-02-14  3:07         ` Preeti U Murthy
  0 siblings, 0 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-14  3:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alex Shi, Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel

Hi everyone,

On 02/13/2013 09:15 PM, Paul Turner wrote:
> On Wed, Feb 13, 2013 at 7:23 AM, Alex Shi <alex.shi@intel.com> wrote:
>> On 02/12/2013 06:27 PM, Peter Zijlstra wrote:
>>> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>>> Remove CONFIG_FAIR_GROUP_SCHED that covers the runnable info, then
>>>> we can use runnable load variables.
>>>>
>>> It would be nice if we could quantify the performance hit of doing so.
>>> Haven't yet looked at later patches to see if we remove anything to
>>> off-set this.
>>>
>>
>> In our rough testing, there were no clear performance changes.
>>
> 
> I'd personally like this to go with a series that actually does
> something with it.
> 
> There's been a few proposals floating around on _how_ to do this; but
> the challenge is in getting it stable enough that all of the wake-up
> balancing does not totally perforate your stability gains into the
> noise.  select_idle_sibling really is your nemesis here.
> 
> It's a small enough patch that it can go at the head of any such
> series (and indeed; it was originally structured to make such a patch
> rather explicit.)
> 
>> --
>> Thanks
>>     Alex
> 

Paul, what exactly do you mean by select_idle_sibling() is our nemesis
here? What we observed through our experiments was that:
1. With the per entity load tracking (runnable_load_avg) in load
balancing, the load is distributed appropriately across the cpus.
2. However, when a task sleeps and wakes up, select_idle_sibling() searches
for the idlest group top to bottom. If a suitable candidate is not
found, it wakes up the task on the prev_cpu/waker_cpu. This would increase
the runqueue size and load of the prev_cpu/waker_cpu respectively.
3. The load balancer would then come to the rescue and redistribute the load.

As a consequence,

*The primary observation was that there is no performance degradation
with the integration of per entity load tracking into the load balancer,
but there was a good increase in the number of migrations*. This, as I
see it, is due to points 2 and 3 above. Is this what you call the
nemesis? OR

select_idle_sibling() does a top to bottom search of the chosen domain
for an idlest group and is very likely to spread the waking task to a
far off group in case of underutilized systems. This would prove costly
for the software buddies in finding each other, due to the time taken for
the search and the possible spreading of the software buddy tasks. Is
this what you call the nemesis?

Another approach to remove the above two nemeses, if they are such, would
be to use blocked_load + runnable_load for balancing, but when waking up a
task, use select_idle_sibling() only to search the L2 cache domains for
an idlest group. If unsuccessful, return the prev_cpu, which has already
accounted for the task in the blocked_load, hence this move would not
increase its load. Would you recommend going in this direction?

Thank you

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 11/18] sched: log the cpu utilization at rq
  2013-02-12 10:39   ` Peter Zijlstra
@ 2013-02-14  3:10     ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14  3:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:39 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>>
>> The cpu's utilization measures how busy the cpu is:
>>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>>                 / cpu_rq(cpu)->avg.runnable_avg_period;
>>
>> Since the util is no more than 1, we use its percentage value in later
>> calculations, and set FULL_UTIL as 100%.
>>
>> In later power aware scheduling, we care about how busy the cpu is, not
>> how much load weight it carries. As to power consumption, it is more
>> related to cpu busy time than to load weight.
> 
> I think we can make that argument in general; that is irrespective of
> the actual policy. We simply never had anything better to go with.
> 
> So please clarify why you think this only applies to power aware
> scheduling.

Um, the rq->util is a general quantity. It can be used in other places
if needed; it is not specific to power aware scheduling.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-02-12 10:22   ` Peter Zijlstra
@ 2013-02-14  3:13     ` Alex Shi
  2013-02-14  8:12     ` Preeti U Murthy
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14  3:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/12/2013 06:22 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> I guess searching the cpu from bottom to top in the domain tree comes
>> from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose is
>> balancing tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we will get quick fork/exec/wake with a similar
>> balancing result among the whole system.
>>
>> This patch increases hackbench performance by 10+% on my 4 socket
>> SNB machines, and by about 3% on 2 socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
> 
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
> 
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.
>  

Um, I will think about this again..
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-02-12 10:22   ` Peter Zijlstra
  2013-02-14  3:13     ` Alex Shi
@ 2013-02-14  8:12     ` Preeti U Murthy
  2013-02-14 14:08       ` Alex Shi
  2013-02-15 13:00       ` Peter Zijlstra
  1 sibling, 2 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-14  8:12 UTC (permalink / raw)
  To: Peter Zijlstra, Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, viresh.kumar, linux-kernel

Hi everyone,

On 02/12/2013 03:52 PM, Peter Zijlstra wrote:
> On Thu, 2013-01-24 at 11:06 +0800, Alex Shi wrote:
>> I guess searching the cpu from bottom to top in the domain tree comes
>> from commit 3dbd5342074a1e (sched: multilevel sbe sbf); the purpose is
>> balancing tasks over all domain levels.
>>
>> This balancing costs too much if there are many domains/groups in a
>> large system.
>>
>> If we remove this code, we will get quick fork/exec/wake with a similar
>> balancing result among the whole system.
>>
>> This patch increases hackbench performance by 10+% on my 4 socket
>> SNB machines, and by about 3% on 2 socket servers.
>>
>>
> Numbers be groovy.. still I'd like a little more on the behavioural
> change. Expand on what exactly is lost by this change so that if we
> later find a regression we have a better idea of what and how.
> 
> For instance, note how find_idlest_group() isn't symmetric wrt
> local_group. So by not doing the domain iteration we change things.
> 
> Now, it might well be that all this is somewhat overkill as it is, but
> should we then not replace all of it with a simple min search over all
> eligible cpus; that would be a real clean up.

Hi Peter, Alex,
If the eligible cpus happen to be all the cpus, then iterating over all the
cpus for the idlest one would be much worse than iterating over sched domains, right?
I am also wondering how important it is to bias the balancing of a forked/woken-up
task onto the idlest cpu at every iteration.

If biasing towards the idlest_cpu at every iteration is not really the criterion,
then we could cut down on the iterations in fork/exec/wake balancing.
Then the problem boils down to the choice between biasing our search towards
the idlest_cpu or the idlest_group. If we are not really concerned about balancing
load across groups, but about ensuring we find the idlest cpu to run the task on, then
Alex's patch seems to have covered that criterion.

However, if the concern is to distribute the load uniformly across groups, then
I have the following patch, which might reduce the overhead of the search for an
eligible cpu for a forked/exec'd/woken-up task.

Alex, if your patch does not show favourable behavioural changes, you could try
the patch below and check the same.

**************************START PATCH*************************************

sched: Improve balancing for fork/exec/wake tasks

As I see it, currently we first find the idlest group, then the idlest cpu
within it. However, the current code does not seem convinced that the
selected cpu in the current iteration is in the correct child group, and does
an iteration over all the groups yet again,
*taking the selected cpu as the reference to point to the next level sched
domain.*

Why then find the idlest cpu at every iteration, if the concern is primarily
the idlest group at each iteration? Why not find the idlest group in
every iteration and the idlest cpu only in the final iteration? As a result:

1. We save the time spent going over all the cpus of a sched group in
find_idlest_cpu() at every iteration.
2. Functionality remains the same. We find the idlest group at every level, and
take a cpu of the idlest group as a reference to get the next lower level
sched domain, so as to find the idlest group there.
However, instead of taking the idlest cpu as the reference, take
the first cpu as the reference. The resulting next level sched domain remains
the same.

*However, this completely removes the bias towards the idlest cpu at every level.*

This patch therefore tries to bias its search towards the right group to put
the task on, and not the idlest cpu.
---
 kernel/sched/fair.c |   53 ++++++++++++++-------------------------------------
 1 file changed, 15 insertions(+), 38 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..90855dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3190,16 +3190,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
  */
 static struct sched_group *
 find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int load_idx)
+		  int this_cpu)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	struct sched_group *this_group = NULL;
-	u64 min_load = ~0ULL, this_load = 0;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	struct sched_group;
+	u64 min_load = ~0ULL;
 
 	do {
 		u64 load, avg_load;
-		int local_group;
 		int i;
 
 		/* Skip over this group if it has no CPUs allowed */
@@ -3207,18 +3205,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 					tsk_cpus_allowed(p)))
 			continue;
 
-		local_group = cpumask_test_cpu(this_cpu,
-					       sched_group_cpus(group));
-
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
-			/* Bias balancing toward cpus of our domain */
-			if (local_group)
-				load = source_load(i, load_idx);
-			else
-				load = target_load(i, load_idx);
+			load = weighted_cpuload(i);
 
 			avg_load += load;
 		}
@@ -3227,20 +3218,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		avg_load = (avg_load * SCHED_POWER_SCALE);
 		do_div(avg_load, group->sgp->power);
 
-		if (local_group) {
-			this_load = avg_load;
-			this_group = group;
-		} else if (avg_load < min_load) {
+		if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (this_group && idlest!= this_group) {
-		/* Bias towards our group again */
-		if (!idlest || 100*this_load < imbalance*min_load)
-			return this_group;
-	}
 	return idlest;
 }
 
@@ -3248,7 +3231,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
  * find_idlest_cpu - find the idlest cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_cpu(struct sched_group *group, struct task_struct *p)
 {
 	u64 load, min_load = ~0ULL;
 	int idlest = -1;
@@ -3258,7 +3241,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		load = weighted_cpuload(i);
 
-		if (load < min_load || (load == min_load && i == this_cpu)) {
+		if (load < min_load) {
 			min_load = load;
 			idlest = i;
 		}
@@ -3325,6 +3308,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
+	struct sched_group *group;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
@@ -3367,8 +3351,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	while (sd) {
-		int load_idx = sd->forkexec_idx;
-		struct sched_group *group;
 		int weight;
 
 		if (!(sd->flags & sd_flag)) {
@@ -3376,15 +3358,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			continue;
 		}
 
-		if (sd_flag & SD_BALANCE_WAKE)
-			load_idx = sd->wake_idx;
-
-		group = find_idlest_group(sd, p, cpu, load_idx);
-		if (!group) {
-			goto unlock;
-		}
+		group = find_idlest_group(sd, p, cpu);
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = group_first_cpu(group);
 
 		/* Now try balancing at a lower domain level of new_cpu */
 		cpu = new_cpu;
@@ -3398,6 +3374,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		}
 		/* while loop will break here if sd == NULL */
 	}
+	new_cpu = find_idlest_cpu(group, p);
 unlock:
 	rcu_read_unlock();
 
@@ -4280,15 +4257,15 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	if (sgs->sum_nr_running + 1 >  sgs->group_capacity)
 		return;
 	if (sgs->group_util > sds->leader_util ||
-		sgs->group_util == sds->leader_util &&
-		group_first_cpu(group) < group_first_cpu(sds->group_leader)) {
+		(sgs->group_util == sds->leader_util &&
+		group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
 		sds->group_leader = group;
 		sds->leader_util = sgs->group_util;
 	}
 	/* Calculate the group which is almost idle */
 	if (sgs->group_util < sds->min_util ||
-		sgs->group_util == sds->min_util &&
-		group_first_cpu(group) > group_first_cpu(sds->group_leader)) {
+		(sgs->group_util == sds->min_util &&
+		group_first_cpu(group) > group_first_cpu(sds->group_leader))) {
 		sds->group_min = group;
 		sds->min_util = sgs->group_util;
 		sds->min_load_per_task = sgs->sum_weighted_load;

> 
> 
Regards
Preeti U Murthy


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-13 15:41       ` Paul Turner
@ 2013-02-14 13:07         ` Alex Shi
  2013-02-19 11:34           ` Paul Turner
  0 siblings, 1 reply; 88+ messages in thread
From: Alex Shi @ 2013-02-14 13:07 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel


> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1dff78a..9d1c193 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
>   * load-balance).
>   */
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
> -       p->se.avg.runnable_avg_period = 0;
> -       p->se.avg.runnable_avg_sum = 0;
> +       p->se.avg.runnable_avg_period = 1024;
> +       p->se.avg.runnable_avg_sum = 1024;

It can't work.
avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(), and
then update_entity_load_avg() can't be called, so runnable_avg_period/sum
are unusable.

Even if we get a chance to call __update_entity_runnable_avg(),
avg.last_runnable_update needs to be set before that, usually to 'now';
that makes __update_entity_runnable_avg() return 0, so
update_entity_load_avg() still can not reach
__update_entity_load_avg_contrib().

If we spread a simple new-task load initialization across many functions,
it becomes too hard for a future reader to follow.

>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>         memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> 
> 
> 
>>
>>
>> --
>> Thanks
>>     Alex


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-02-14  8:12     ` Preeti U Murthy
@ 2013-02-14 14:08       ` Alex Shi
  2013-02-15 13:00       ` Peter Zijlstra
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-14 14:08 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel

On 02/14/2013 04:12 PM, Preeti U Murthy wrote:
> **************************START PATCH*************************************
> 
> sched:Improve balancing for fork/exec/wake tasks

Can you test the patch with hackbench, and with aim7 at 2000 threads, plus
the patch in the tip:core/locking tree:

  3a15e0e0cdda rwsem: Implement writer lock-stealing for better scalability

If it works you may find some improvement.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-02-13 13:22     ` Alex Shi
@ 2013-02-15 12:38       ` Peter Zijlstra
  2013-02-16  5:16         ` Alex Shi
  0 siblings, 1 reply; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-15 12:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On Wed, 2013-02-13 at 21:22 +0800, Alex Shi wrote:
> No, the flag is set on the MC/CPU domains, but it is checked in their
> parents' balancing, like in the NUMA domain.
> Without the flag, we get NUMA domain imbalance, like on my 2 socket
> NHM EP: 3 of 4 tasks were assigned to socket 0 (lcpu 10, 12, 14).
> 
> In this case, update_sd_pick_busiest() needs a reduced group_capacity to
> return true:
>         if (sgs->sum_nr_running > sgs->group_capacity)
>                 return true;
> then NUMA domain balancing gets a chance to start.

Ah, indeed. It's always better to include such 'obvious' problems in the
changelog :-)


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 05/18] sched: quicker balancing on fork/exec/wake
  2013-02-14  8:12     ` Preeti U Murthy
  2013-02-14 14:08       ` Alex Shi
@ 2013-02-15 13:00       ` Peter Zijlstra
  1 sibling, 0 replies; 88+ messages in thread
From: Peter Zijlstra @ 2013-02-15 13:00 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel

On Thu, 2013-02-14 at 13:42 +0530, Preeti U Murthy wrote:
> Hi Peter,Alex,
> If the eligible cpus happen to be all the cpus, then iterating over all
> the cpus for the idlest one would be much worse than iterating over
> sched domains, right?

Depends, doing a domain walk generally gets you 2n cpus visited --
geometric series and such. A simple scan of the top-most domain mask
that's eligible will limit that to n. 
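
A rough back-of-the-envelope of that, assuming domain spans that double
at each level (2, 4, ..., n):

	\sum_{k=1}^{\log_2 n} 2^k = 2n - 2 \approx 2n

i.e. a walk that scans every level's span visits about 2n cpus in total,
versus n for a single scan of the top-most eligible domain's span.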

> I am also wondering how important it is to bias the balancing of a
> forked/woken-up task onto the idlest cpu at every iteration.

Yeah, I don't know; it seems overkill to me. That code is from before my
time, and so far it has survived.

> If biasing towards the idlest_cpu at every iteration is not really the
> criterion, then we could cut down on the iterations in fork/exec/wake
> balancing. The problem then boils down to a choice between biasing our
> search towards the idlest_cpu or towards the idlest_group. If we are not
> really concerned with balancing load across groups, but only with
> finding the idlest cpu to run the task on, then Alex's patch seems to
> have covered that.
> 
> However, if the concern is to distribute the load uniformly across
> groups, then I have the following patch, which might reduce the overhead
> of the search for an eligible cpu for a forked/exec'd/woken-up task.

Nah, so I think the whole bias thing was mostly done to avoid
over-balancing and possibly to compensate for some approximations on the
whole weight/load measurement stuff.


^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level
  2013-02-15 12:38       ` Peter Zijlstra
@ 2013-02-16  5:16         ` Alex Shi
  0 siblings, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-16  5:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel

On 02/15/2013 08:38 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-13 at 21:22 +0800, Alex Shi wrote:
>> No, the flag is set on the MC/CPU domains, but it is checked in their
>> parents' balancing, e.g. in the NUMA domain.
>> Without the flag, the NUMA domain becomes imbalanced; e.g. on my
>> 2-socket NHM EP, 3 of 4 tasks were assigned to socket 0 (lcpu 10, 12, 14).
>>
>> In this case, update_sd_pick_busiest() needs a reduced group_capacity
>> to return true:
>>         if (sgs->sum_nr_running > sgs->group_capacity)
>>                 return true;
>> then the NUMA domain balancing gets a chance to start.
> 
> Ah, indeed. It's always better to include such 'obvious' problems in the
> changelog :-)
> 

Got it. :)
How about the following commit log and patch:

---

>From c97fceceaf9d68e73eaf015d5915474a9a94a2d1 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Fri, 28 Dec 2012 13:53:00 +0800
Subject: [PATCH] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain
 level

The domain flag SD_PREFER_SIBLING was originally set on both the MC and
CPU domains by commit b5d978e0c7e79a, and was removed carelessly when the
obsolete power scheduler was cleaned up. Commit 6956dc568 then restored
the flag on the CPU domain only. That works, but it introduces an extra
domain level, since it makes the MC and CPU domains differ.

So restore the flag on the MC domain too, in order to remove a domain
level on x86 platforms.

The flag itself cannot be dropped: it is needed for parent-domain
balancing. In the NUMA domain, for example, update_sd_pick_busiest()
needs a reduced group_capacity to return 'true' so that tasks are
re-balanced across groups.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/topology.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index d3cf0d6..386bcf4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -132,6 +132,7 @@ int arch_update_cpu_topology(void);
 				| 0*SD_SHARE_CPUPOWER			\
 				| 1*SD_SHARE_PKG_RESOURCES		\
 				| 0*SD_SERIALIZE			\
+				| 1*SD_PREFER_SIBLING			\
 				,					\
 	.last_balance		= jiffies,				\
 	.balance_interval	= 1,					\
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-14 13:07         ` Alex Shi
@ 2013-02-19 11:34           ` Paul Turner
  2013-02-20  4:18             ` Preeti U Murthy
  2013-02-20  5:13             ` Alex Shi
  0 siblings, 2 replies; 88+ messages in thread
From: Paul Turner @ 2013-02-19 11:34 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi <alex.shi@intel.com> wrote:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 1dff78a..9d1c193 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
>>   * load-balance).
>>   */
>>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>> -       p->se.avg.runnable_avg_period = 0;
>> -       p->se.avg.runnable_avg_sum = 0;
>> +       p->se.avg.runnable_avg_period = 1024;
>> +       p->se.avg.runnable_avg_sum = 1024;
>
> It can't work.
> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(),
> and then update_entity_load_avg() is never called from there, so
> runnable_avg_period/sum go unused.

Well, we _could_ also use a negative decay_count here and treat it like
a migration; but the larger problem is the visibility of p->on_rq,
which gates whether we account the time as runnable and is only set
after activate_task(), so that's out.
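
To make the "treat it like a migration" alternative concrete, it would
amount to something like this in __sched_fork() (a hypothetical
illustration of the idea being dismissed here, not a patch from the
thread):

	/*
	 * A negative decay_count makes enqueue_entity_load_avg() take the
	 * wake-up-migration branch and call update_entity_load_avg().
	 */
	p->se.avg.decay_count = -1;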

>
> Even when we do get a chance to call __update_entity_runnable_avg(),
> avg.last_runnable_update has to be set beforehand, usually to 'now';
> that makes __update_entity_runnable_avg() return 0, so
> update_entity_load_avg() still never reaches
> __update_entity_load_avg_contrib().
>
> And if we scatter the new-task load initialization across many
> functions, the code becomes too hard for future readers to follow.

This is my concern about making this a special case with the
introduction of an ENQUEUE_NEWTASK flag; enqueue jumps through enough
hoops as it is.

I still don't see why we can't resolve this at init time in
__sched_fork(); your patch above just moves an explicit initialization
of load_avg_contrib into the enqueue path.  Adding a call to
__update_task_entity_contrib() to the previous alternate suggestion
would similarly seem to resolve this?
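
A minimal sketch of that alternate __sched_fork() initialization, just to
make the suggestion concrete (illustration only, not a tested patch;
__update_task_entity_contrib() is static in fair.c, so it would also need
to be made visible to core.c):

	#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
		/* pretend the new task was runnable for one full period... */
		p->se.avg.runnable_avg_period = 1024;
		p->se.avg.runnable_avg_sum = 1024;
		/* ...and derive its load_avg_contrib from that right away */
		__update_task_entity_contrib(&p->se);
	#endif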

>
>>  #endif
>>  #ifdef CONFIG_SCHEDSTATS
>>         memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>>
>>
>>
>>>
>>>
>>> --
>>> Thanks
>>>     Alex
>
>
> --
> Thanks
>     Alex

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-19 11:34           ` Paul Turner
@ 2013-02-20  4:18             ` Preeti U Murthy
  2013-02-20  5:13             ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Preeti U Murthy @ 2013-02-20  4:18 UTC (permalink / raw)
  To: Paul Turner
  Cc: Alex Shi, Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel

Hi everyone,

On 02/19/2013 05:04 PM, Paul Turner wrote:
> On Fri, Feb 15, 2013 at 2:07 AM, Alex Shi <alex.shi@intel.com> wrote:
>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 1dff78a..9d1c193 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1557,8 +1557,8 @@ static void __sched_fork(struct task_struct *p)
>>>   * load-balance).
>>>   */
>>>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>>> -       p->se.avg.runnable_avg_period = 0;
>>> -       p->se.avg.runnable_avg_sum = 0;
>>> +       p->se.avg.runnable_avg_period = 1024;
>>> +       p->se.avg.runnable_avg_sum = 1024;
>>
>> It can't work.
>> avg.decay_count needs to be set to 0 before enqueue_entity_load_avg(),
>> and then update_entity_load_avg() is never called from there, so
>> runnable_avg_period/sum go unused.
> 
> Well, we _could_ also use a negative decay_count here and treat it like
> a migration; but the larger problem is the visibility of p->on_rq,
> which gates whether we account the time as runnable and is only set
> after activate_task(), so that's out.
> 
>>
>> Even when we do get a chance to call __update_entity_runnable_avg(),
>> avg.last_runnable_update has to be set beforehand, usually to 'now';
>> that makes __update_entity_runnable_avg() return 0, so
>> update_entity_load_avg() still never reaches
>> __update_entity_load_avg_contrib().
>>
>> And if we scatter the new-task load initialization across many
>> functions, the code becomes too hard for future readers to follow.
> 
> This is my concern about making this a special case with the
> introduction of an ENQUEUE_NEWTASK flag; enqueue jumps through enough
> hoops as it is.
> 
> I still don't see why we can't resolve this at init time in
> __sched_fork(); your patch above just moves an explicit initialization
> of load_avg_contrib into the enqueue path.  Adding a call to
> __update_task_entity_contrib() to the previous alternate suggestion
> would similarly seem to resolve this?

We could do this (adding a call to __update_task_entity_contrib()), but
cfs_rq->runnable_load_avg gets updated only if the task is on the
runqueue, and in the forked task's case the on_rq flag is not yet set.
Something like the below:

---
 kernel/sched/fair.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8691b0d..841e156 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,14 +1451,20 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
-		return;
-
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq)) {
+		if (!(flags & ENQUEUE_NEWTASK))
+			return;
+	}
 	contrib_delta = __update_entity_load_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
+	/*
+	 * cfs_rq->runnable_load_avg does not get updated for a forked task,
+	 * because se->on_rq == 0, although we did update the task's
+	 * load_avg_contrib above in __update_entity_load_avg_contrib().
+	 */
 	if (se->on_rq)
 		cfs_rq->runnable_load_avg += contrib_delta;
 	else
@@ -1538,12 +1544,6 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
 		update_entity_load_avg(se, 0);
 	}
-	/*
-	 * set the initial load avg of new task same as its load
-	 * in order to avoid brust fork make few cpu too heavier
-	 */
-	if (flags & ENQUEUE_NEWTASK)
-		se->avg.load_avg_contrib = se->load.weight;
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */

Thanks

Regards
Preeti U Murthy


^ permalink raw reply related	[flat|nested] 88+ messages in thread

* Re: [patch v4 07/18] sched: set initial load avg of new forked task
  2013-02-19 11:34           ` Paul Turner
  2013-02-20  4:18             ` Preeti U Murthy
@ 2013-02-20  5:13             ` Alex Shi
  1 sibling, 0 replies; 88+ messages in thread
From: Alex Shi @ 2013-02-20  5:13 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel

> This is my concern about making this a special case with the
> introduction of an ENQUEUE_NEWTASK flag; enqueue jumps through enough
> hoops as it is.
> 
> I still don't see why we can't resolve this at init time in
> __sched_fork(); your patch above just moves an explicit initialization
> of load_avg_contrib into the enqueue path.  Adding a call to
> __update_task_entity_contrib() to the previous alternate suggestion
> would similarly seem to resolve this?
> 

Without the ENQUEUE_NEWTASK flag, we can use the following patch, which
handles the newly forked task in an implicit way.

But since the newtask flag just follows the existing enqueue path, it
also looks natural and is the explicit way.

I am fine with either solution.

=======
>From 0f5dd6babe899e27cfb78ea49d337e4f0918591b Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Wed, 20 Feb 2013 12:51:28 +0800
Subject: [PATCH 02/15] sched: set initial load avg of new forked task

A new task has no runnable sum at the time it first becomes runnable,
so its runnable load is zero. If runnable load is used in balancing,
a burst of forks then piles the new tasks onto just a few idle cpus.

Set the initial load avg of a newly forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 ++
 kernel/sched/fair.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..93a7590 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1648,6 +1648,8 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
+	p->se.avg.load_avg_contrib = p->se.load.weight;
+
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
 	 * accumulated while sleeping.
+	 *
+	 * When enqueueing a new forked task, se->avg.decay_count == 0, so
+	 * we bypass update_entity_load_avg() and use the initial
+	 * avg.load_avg_contrib value: se->load.weight.
 	 */
 	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 88+ messages in thread

end of thread, other threads:[~2013-02-20  5:13 UTC | newest]

Thread overview: 88+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-01-24  3:06 [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Alex Shi
2013-01-24  3:06 ` [patch v4 01/18] sched: set SD_PREFER_SIBLING on MC domain to reduce a domain level Alex Shi
2013-02-12 10:11   ` Peter Zijlstra
2013-02-13 13:22     ` Alex Shi
2013-02-15 12:38       ` Peter Zijlstra
2013-02-16  5:16         ` Alex Shi
2013-02-13 14:17     ` Alex Shi
2013-01-24  3:06 ` [patch v4 02/18] sched: select_task_rq_fair clean up Alex Shi
2013-02-12 10:14   ` Peter Zijlstra
2013-02-13 14:44     ` Alex Shi
2013-01-24  3:06 ` [patch v4 03/18] sched: fix find_idlest_group mess logical Alex Shi
2013-02-12 10:16   ` Peter Zijlstra
2013-02-13 15:07     ` Alex Shi
2013-01-24  3:06 ` [patch v4 04/18] sched: don't need go to smaller sched domain Alex Shi
2013-01-24  3:06 ` [patch v4 05/18] sched: quicker balancing on fork/exec/wake Alex Shi
2013-02-12 10:22   ` Peter Zijlstra
2013-02-14  3:13     ` Alex Shi
2013-02-14  8:12     ` Preeti U Murthy
2013-02-14 14:08       ` Alex Shi
2013-02-15 13:00       ` Peter Zijlstra
2013-01-24  3:06 ` [patch v4 06/18] sched: give initial value for runnable avg of sched entities Alex Shi
2013-02-12 10:23   ` Peter Zijlstra
2013-01-24  3:06 ` [patch v4 07/18] sched: set initial load avg of new forked task Alex Shi
2013-02-12 10:26   ` Peter Zijlstra
2013-02-13 15:14     ` Alex Shi
2013-02-13 15:41       ` Paul Turner
2013-02-14 13:07         ` Alex Shi
2013-02-19 11:34           ` Paul Turner
2013-02-20  4:18             ` Preeti U Murthy
2013-02-20  5:13             ` Alex Shi
2013-01-24  3:06 ` [patch v4 08/18] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2013-02-12 10:27   ` Peter Zijlstra
2013-02-13 15:23     ` Alex Shi
2013-02-13 15:45       ` Paul Turner
2013-02-14  3:07         ` Preeti U Murthy
2013-01-24  3:06 ` [patch v4 09/18] sched: add sched_policies in kernel Alex Shi
2013-02-12 10:36   ` Peter Zijlstra
2013-02-13 15:41     ` Alex Shi
2013-01-24  3:06 ` [patch v4 10/18] sched: add sysfs interface for sched_policy selection Alex Shi
2013-01-24  3:06 ` [patch v4 11/18] sched: log the cpu utilization at rq Alex Shi
2013-02-12 10:39   ` Peter Zijlstra
2013-02-14  3:10     ` Alex Shi
2013-01-24  3:06 ` [patch v4 12/18] sched: add power aware scheduling in fork/exec/wake Alex Shi
2013-01-24  3:06 ` [patch v4 13/18] sched: packing small tasks in wake/exec balancing Alex Shi
2013-01-24  3:06 ` [patch v4 14/18] sched: add power/performance balance allowed flag Alex Shi
2013-01-24  3:06 ` [patch v4 15/18] sched: pull all tasks from source group Alex Shi
2013-01-24  3:06 ` [patch v4 16/18] sched: don't care if the local group has capacity Alex Shi
2013-01-24  3:06 ` [patch v4 17/18] sched: power aware load balance, Alex Shi
2013-01-24  3:07 ` [patch v4 18/18] sched: lazy power balance Alex Shi
2013-01-24  9:44 ` [patch v4 0/18] sched: simplified fork, release load avg and power awareness scheduling Borislav Petkov
2013-01-24 15:07   ` Alex Shi
2013-01-27  2:41     ` Alex Shi
2013-01-27  4:36       ` Mike Galbraith
2013-01-27 10:35         ` Borislav Petkov
2013-01-27 13:25           ` Alex Shi
2013-01-27 15:51             ` Mike Galbraith
2013-01-28  5:17               ` Mike Galbraith
2013-01-28  5:51                 ` Alex Shi
2013-01-28  6:15                   ` Mike Galbraith
2013-01-28  6:42                     ` Mike Galbraith
2013-01-28  7:20                       ` Mike Galbraith
2013-01-29  1:17                       ` Alex Shi
2013-01-28  9:55                 ` Borislav Petkov
2013-01-28 10:44                   ` Mike Galbraith
2013-01-28 11:29                     ` Borislav Petkov
2013-01-28 11:32                       ` Mike Galbraith
2013-01-28 11:40                         ` Mike Galbraith
2013-01-28 15:22                           ` Borislav Petkov
2013-01-28 15:55                             ` Mike Galbraith
2013-01-29  1:38                               ` Alex Shi
2013-01-29  1:32                         ` Alex Shi
2013-01-29  1:36                       ` Alex Shi
2013-01-28 15:47                 ` Mike Galbraith
2013-01-29  1:45                   ` Alex Shi
2013-01-29  4:03                     ` Mike Galbraith
2013-01-29  2:27                   ` Alex Shi
2013-01-27 10:40       ` Borislav Petkov
2013-01-27 14:03         ` Alex Shi
2013-01-28  5:19         ` Alex Shi
2013-01-28  6:49           ` Mike Galbraith
2013-01-28  7:17             ` Alex Shi
2013-01-28  7:33               ` Mike Galbraith
2013-01-29  6:02           ` Alex Shi
2013-01-28  1:28 ` Alex Shi
2013-02-04  1:35 ` Alex Shi
2013-02-04 11:09   ` Ingo Molnar
2013-02-05  2:26     ` Alex Shi
2013-02-06  5:08       ` Alex Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).