* [patch v5 0/15] power aware scheduling
@ 2013-02-18  5:07 Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
                   ` (16 more replies)
  0 siblings, 17 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Since the simplification of fork/exec/wake balancing drew a lot of debate,
I removed that part from the patch set.

This patch set implements the rough power aware scheduling
proposal: https://lkml.org/lkml/2012/8/13/139.
It defines 2 new power aware policies, 'balance' and 'powersaving', and
then tries to pack tasks at each sched group level according to the
chosen policy. That can save considerable power when the number of tasks
in the system is no more than the number of logical CPUs.

As mentioned in the power aware scheduling proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups will reduce cpu power consumption

The first assumption makes the performance policy take over scheduling
whenever any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.

Like sched numa, power aware scheduling is also a kind of cpu locality
oriented scheduling, so it is naturally compatible with sched numa.

Since the patch set packs tasks into fewer groups as intended, I just show
some performance/power testing data here:
=========================================
$for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT; the data is avg Watts
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and 68
Watts, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

bltk-game with openarena, the data is avg Watts
            powersaving     balance         performance
wsm laptop  22.9            23.8            24.4
snb laptop  20.2            20.5            20.7

A benchmark where the task count keeps fluctuating: 'make -j x vmlinux'
on my SNB EP 2-socket machine with 8 cores * HT:

         powersaving	          balance	         performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

data format is: 175.603 /417 13
	175.603: average Watts
	417: seconds (compile time)
	13: scaled performance/power = 1000000 / seconds / watts
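
For example, the x = 1 powersaving entry works out to
1000000 / 417 / 175.603 ~= 13.7, which matches the reported value of 13
(the table apparently uses integer truncation).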

Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving
and balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On an NHM EP box
         powersaving               balance               performance
x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76

On a 2-socket SNB EP box:
         powersaving               balance               performance
x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44

data format is: 166.516 /88 68
        166.516: average Watts
        88: seconds (compress time)
        68: scaled performance/power = 1000000 / time / power

Some performance testing results:
---------------------------------

Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, multi-threaded
loopback netperf, on my core2, nhm, wsm and snb platforms. No clear
performance change was found with the 'performance' policy.

Testing the balance/powersaving policies with the above benchmarks:
a, specjbb2005 drops 5~7% under both policies, with either openjdk or jrockit.
b, hackbench drops 30+% with the powersaving policy on snb 4-socket platforms.
The others show no clear change.

Test results from Mike Galbraith:
--------------------------------
With aim7 compute on a 4 node 40 core box, I see stable throughput
improvement at tasks = nr_cores and below with balance and powersaving.

         3.8.0-performance   3.8.0-balance      3.8.0-powersaving
Tasks    jobs/min/task       jobs/min/task      jobs/min/task
    1         432.8571       433.4764      	433.1665
    5         480.1902       510.9612      	497.5369
   10         429.1785       533.4507      	518.3918
   20         424.3697       529.7203      	528.7958
   40         419.0871       500.8264      	517.0648

No deltas after that.  There were also no deltas between a patched kernel
using the performance policy and virgin source.


Changelog:
V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups

V4 change:
a, fix a few bugs and clean up code according to feedback from Morten
Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policies in transitory task packing.
c, shorter latency in power aware scheduling.

V3 change:
a, engage nr_running and utilization in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu instead of an idle cpu.

V2 change:
a, add lazy power scheduling to deal with kbuild-like benchmarks.


Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew
Morton, Ingo, Arjan van de Ven, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen and others.

Thanks to Fengguang's 0-day kbuild system for testing this patch set.

Any more comments are appreciated!

-- Thanks Alex


[patch v5 01/15] sched: set initial value for runnable avg of sched entities
[patch v5 02/15] sched: set initial load avg of new forked task
[patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
[patch v5 04/15] sched: add sched balance policies in kernel
[patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection
[patch v5 06/15] sched: log the cpu utilization at rq
[patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
[patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
[patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
[patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
[patch v5 11/15] sched: add power/performance balance allow flag
[patch v5 12/15] sched: pull all tasks from source group
[patch v5 13/15] sched: no balance for prefer_sibling in power scheduling
[patch v5 14/15] sched: power aware load balance
[patch v5 15/15] sched: lazy power balance


* [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  8:28   ` Joonsoo Kim
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise the stale values left in those variables make a mess when the
new task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg
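
A minimal sketch of why the stale values matter (the actual statement is
in enqueue_entity_load_avg(), visible in the next patch's context; the
snippet below is illustrative only):

	/* a leftover load_avg_contrib pollutes the runqueue's runnable
	 * load at the very first enqueue of the new task:
	 */
	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;

and a stale decay_count can be misread as a pending migration decay.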

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1743746 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.12



* [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  6:20   ` Alex Shi
  2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

A new task has no runnable sum at the time it first becomes runnable,
so its runnable load is zero. If we use runnable load in balancing, that
makes burst forking pick only a few idle cpus to place the new tasks on.

Set the initial load avg of a newly forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..f283d3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..7292965 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,7 +1706,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..171790c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * Set the initial load avg of a new task to its load weight,
+	 * to avoid a burst of forks making a few cpus too heavy.
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
-- 
1.7.12



* [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Remove the CONFIG_FAIR_GROUP_SCHED guard around the runnable load
tracking code, so that the runnable load variables are available on SMP.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  8 +-------
 kernel/sched/core.c   |  7 +------
 kernel/sched/fair.c   | 13 ++-----------
 kernel/sched/sched.h  |  9 +--------
 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f283d3d..66b05e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7292965..0bd9d5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1551,12 +1551,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 	p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 171790c..350eb8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3410,12 +3409,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3438,7 +3431,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6130,9 +6122,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 
1.7.12



* [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (2 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:37   ` Ingo Molnar
  2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

The current scheduler behaviour only considers system performance, so it
tries to spread tasks over more cpu sockets and cpu cores.

To add power awareness, the patch set introduces 2 new kinds of scheduler
policy: powersaving and balance. They use the runnable load utilization
in scheduler balancing. The current scheduling behaviour is kept as the
performance policy.

performance: the current scheduling behaviour, try to spread tasks
                on more CPU sockets or cores. Performance oriented.
powersaving: pack tasks into few sched groups until all LCPUs in the
                group are busy. Power oriented.
balance    : pack tasks into few sched groups until group_capacity
                CPUs are busy. A balance between performance and
		powersaving.

The incoming patches will enable powersaving/balance scheduling in CFS.
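
A minimal illustration of how the later patches consult the knob (patch
09 adds the real check in get_sd_sched_balance_policy()); only the
performance short-circuit is shown:

	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
		return SCHED_POLICY_PERFORMANCE;	/* keep current behaviour */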

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  | 3 +++
 kernel/sched/sched.h | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350eb8d..2f98ffb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6105,6 +6105,9 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae3511e..7a19792 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+#define SCHED_POLICY_BALANCE		(0x4)
+
+extern int __read_mostly sched_balance_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.12



* [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (3 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
powersaving

This means the sched balance policy currently in use is 'powersaving'.

The user can change the policy with 'echo':
 echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 26 ++++++++
 kernel/sched/fair.c                                | 73 ++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..3283a86 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,32 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
+		/sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS scheduler policy showing and setting interface.
+
+		available_sched_balance_policy shows there are 3 kinds of
+		policies:
+			performance, balance and powersaving.
+		current_sched_balance_policy shows the current scheduler policy.
+		Users can change the policy by writing to it.
+
+		The policy decides how the CFS scheduler distributes tasks
+		onto the different CPU units.
+
+		performance: try to spread tasks onto more CPU sockets,
+		more CPU cores. Performance oriented.
+
+		powersaving: try to pack tasks onto the same core or same CPU
+		until all LCPUs in the core or CPU socket are busy.
+		Powersaving oriented.
+
+		balance:     try to pack tasks onto the same core or same CPU
+		until the fully powered CPUs are busy.
+		A balance between performance and powersaving.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f98ffb..fcdb21f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6108,6 +6108,79 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 /* The default scheduler policy is 'performance'. */
 int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
 
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	else if (sched_balance_policy == SCHED_POLICY_BALANCE)
+		return sprintf(buf, "balance\n");
+	return 0;
+}
+
+static ssize_t set_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_balance_policy = SCHED_POLICY_POWERSAVING;
+	else if (!strcmp(str_policy, "balance"))
+		sched_balance_policy = SCHED_POLICY_BALANCE;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_balance_policy, 0644,
+		show_current_sched_balance_policy, set_sched_balance_policy);
+
+static DEVICE_ATTR(available_sched_balance_policy, 0444,
+		show_available_sched_balance_policy, NULL);
+
+static struct attribute *sched_balance_policy_default_attrs[] = {
+	&dev_attr_current_sched_balance_policy.attr,
+	&dev_attr_available_sched_balance_policy.attr,
+	NULL
+};
+static struct attribute_group sched_balance_policy_attr_group = {
+	.attrs = sched_balance_policy_default_attrs,
+	.name = "sched_balance_policy",
+};
+
+int __init create_sysfs_sched_balance_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group);
+}
+
+static int __init sched_balance_policy_sysfs_init(void)
+{
+	return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_balance_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
-- 
1.7.12



* [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (4 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:19   ` Preeti U Murthy
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

The cpu utilization measures how busy a cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since util is never more than 1, we use its percentage value in later
calculations, and define FULL_UTIL as 100%.
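
A made-up example: a cpu whose rq has accumulated runnable_avg_sum =
23000 over a runnable_avg_period = 46000 gets util = 23000 * 100 / 46000
= 50, i.e. 50% busy; a fully busy cpu reports FULL_UTIL (100).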

In the later power aware scheduling, we care about how busy the cpu is,
not how much weight its load has. Power consumption is more closely
related to cpu busy time than to load weight.

BTW, rq->util can be used for other purposes as well if needed, not only
power scheduling.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/debug.c | 1 +
 kernel/sched/fair.c  | 4 ++++
 kernel/sched/sched.h | 4 ++++
 3 files changed, 9 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 7ae4c4c..d220354 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -318,6 +318,7 @@ do {									\
 
 	P(ttwu_count);
 	P(ttwu_local);
+	P(util);
 
 #undef P
 #undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcdb21f..b9a34ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	u32 period;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+	rq->util = rq->avg.runnable_avg_sum * 100 / period;
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a19792..ac1e107 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* the percentage full cpu utilization */
+#define FULL_UTIL	100
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +484,7 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+	unsigned int util;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.12



* [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (5 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:38   ` Peter Zijlstra
  2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

For power aware balancing, we care about the sched domain/group's
utilization more than its load weight. So add:
sd_lb_stats.sd_utils and sg_lb_stats.group_utils.

We also want to know the sd capacity, so add:
sd_lb_stats.sd_capacity.

And we want to know which group is busiest but still has free time to
handle more tasks, so add:
sd_lb_stats.group_leader.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9a34ab..32b8a7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4214,6 +4214,11 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
 };
 
 /*
@@ -4229,6 +4234,7 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
 };
 
 /**
-- 
1.7.12



* [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (6 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Power aware fork/exec/wake balancing needs both of these structs in the
incoming patches, so move them ahead of select_task_rq_fair().

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 101 ++++++++++++++++++++++++++--------------------------
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 32b8a7b..287582b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3312,6 +3312,57 @@ done:
 }
 
 /*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned int  this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned int  busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
+};
+
+/*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
@@ -4186,56 +4237,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-
-	/* Variables of power aware scheduling */
-	unsigned int  sd_utils;	/* sum utilizations of this domain */
-	unsigned long sd_capacity;	/* capacity of this domain */
-	struct sched_group *group_leader; /* Group which relieves group_min */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-	unsigned int group_utils;	/* sum utilizations of group */
-};
 
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
-- 
1.7.12



* [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (7 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:42   ` Peter Zijlstra
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That saves power, since it leaves more groups idle in the system.

The trade off is an extra power aware statistics collection during group
seeking. But since the collection only happens when the power policy is
eligible, the worst case in hackbench testing drops only about 2% with
the powersaving/balance policies. No clear change for the performance
policy.

The main function in this patch is get_cpu_for_power_policy(), which
tries to get the idlest cpu of the busiest group that still has spare
utilization, if the system is using a power aware policy and such a
group exists.

I had tried to use the rq utilization in this balancing, but the
utilization needs a long time (345ms) to accumulate, which is bad for
any burst balancing. So I use the instant rq utilization -- nr_running.
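
A small worked example of the instant utilization criterion (numbers are
illustrative): under the powersaving policy a 4-LCPU group with 3 running
tasks has group_utils = 3, which is below the threshold (group_weight =
4), so the group can still take one more task; under the balance policy
the threshold is group_capacity instead.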

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 123 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 287582b..b172678 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3362,26 +3362,134 @@ struct sg_lb_stats {
 	unsigned int group_utils;	/* sum utilizations of group */
 };
 
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		sgs->group_utils += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	int sd_min_delta = INT_MAX;
+	int cpu = task_cpu(p);
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+			continue;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+
+		g_delta = threshold - sgs.group_utils;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_utils += sgs.group_utils;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	unsigned long threshold;
+
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	memset(sds, 0, sizeof(*sds));
+	get_sd_power_stats(sd, p, sds);
+
+	if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sd->span_weight;
+	else
+		threshold = sds->sd_capacity;
+
+	/* still can hold one more task in this domain */
+	if (sds->sd_utils < threshold)
+		return sched_balance_policy;
+
+	return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+	return new_cpu;
+}
+
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3407,11 +3515,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
 
 	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
+
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-- 
1.7.12



* [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (8 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  8:44   ` Joonsoo Kim
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

If the woken/execed task is transitory enough, it has a chance to be
packed onto a cpu which is busy but still has time to care for it.

With the powersaving policy, only tasks with a history util < 25% have a
chance to be packed; with the balance policy, only a history util < 12.5%.
If no cpu is eligible to handle it, the idlest cpu in the leader group is
used instead.
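
To see where the powersaving cutoff comes from (a rough reading of the
find_leader_cpu() check below, with utils in the percentage units of
patch 06): on an otherwise idle target cpu (rq->util ~= 0, nr_running
clamped to 1) the vacancy test reduces to FULL_UTIL - (putil << 2) > 0,
i.e. putil < 25.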

Morten Rasmussen caught a type bug and suggested using different criteria
for different policies, thanks!

Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b172678..2e8131d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3455,19 +3455,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	/* percentage of the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	struct rq *rq = cpu_rq(this_cpu);
+	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+	int vacancy, min_vacancy = INT_MAX, max_util;
+	int leader_cpu = -1;
+	int i;
+
+	if (policy == SCHED_POLICY_POWERSAVING)
+		max_util = FULL_UTIL;
+	else
+		/* maximum allowable util is 60% */
+		max_util = 60;
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+		max_util - (rq->util * nr_running + (putil << 2)) > 0)
+			return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		rq = cpu_rq(i);
+		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only light task allowed, like putil < 25% for powersaving */
+		vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int fork)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (!fork)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3518,14 +3571,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						flags & SD_BALANCE_FORK);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12



* [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (9 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:48   ` Peter Zijlstra
  2013-02-20 12:12   ` Borislav Petkov
  2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

If a sched domain is idle enough for regular power balancing, power_lb
will be set and perf_lb cleared. If a sched domain is busy, the values
are set the other way around.

If the domain is suitable for power balancing, but the balancing should
not be done by this cpu (this cpu is already idle or full), both perf_lb
and power_lb are cleared, to wait for a suitable cpu to do the power
balancing. That means no balancing at all, neither power balancing nor
performance balancing, will be done on this cpu.

The above logic is implemented by the incoming patches.
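
In table form:

	power_lb  perf_lb
	   1         0     domain idle enough: do power balancing
	   0         1     domain busy: do regular performance balancing
	   0         0     domain fits power balancing, but this cpu is
	                   idle/full: do nothing, leave it to a better cpu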

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e8131d..0047856 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4053,6 +4053,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if power balance needed */
+	int			perf_lb;   /* if performance balance needed */
 };
 
 /*
@@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.power_lb	= 0,
+		.perf_lb	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
-- 
1.7.12



* [patch v5 12/15] sched: pull all tasks from source group
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (10 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

In power balancing, we want some sched groups to become completely empty
so that their CPU power can be saved. So we want to be able to move away
any task from them, even a single one.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0047856..f3abb83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5125,7 +5125,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -5229,7 +5231,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.12



* [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (11 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

In power aware scheduling, we don't want to balance 'prefer_sibling'
groups just because the local group has capacity.
If the local group has no tasks at the time, that is exactly what power
balancing hopes for.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3abb83..ffdf35d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4782,8 +4782,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
+		if (prefer_sibling && !local_group && sds->this_has_capacity
+				&& env->perf_lb)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.12



* [patch v5 14/15] sched: power aware load balance
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (12 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-03-20  4:57   ` Preeti U Murthy
  2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch enables power aware consideration in load balancing.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
whenever any scheduler group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.

The enabling logic in summary:
1, Collect power aware scheduler statistics during the performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balancing, just do it
and skip performance load balancing. If the domain is suitable for power
balancing but this cpu is inappropriate (idle or full), stop both power
and performance balancing in this domain. If the performance policy is in
use or any group is busy, do performance balancing.

The above logic is mainly implemented in update_sd_lb_power_stats(). It
decides whether a domain is suitable for power aware scheduling. If so,
it fills the destination group and source group accordingly.
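
A small worked example of the thresholds used there, for the powersaving
policy on a 4-LCPU group (so threshold_util = group_weight * FULL_UTIL =
400, utilization numbers made up): a group with group_utils = 420 exceeds
threshold_util, so perf_lb is set and power balancing is dropped; a group
with group_utils = 250 still satisfies
group_utils + FULL_UTIL <= threshold_util and is therefore a candidate
group_leader that can absorb more load.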

This patch reuses some of Suresh's power saving load balance code.

A test shows the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT; the data is Watts
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and 68
Watts, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system runs a few long-lived tasks, using the power policies can
give a performance/power gain, e.g. the sysbench fileio randrw test with
16 threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 126 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffdf35d..3b1e9a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3344,6 +3344,10 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
 };
 
 /*
@@ -4412,6 +4416,105 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 /**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (env->perf_lb)
+		return;
+
+	if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+		sgs->group_utils + FULL_UTIL > threshold_util))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->group_utils > threshold_util) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold_util)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
+/**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
  * @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		/* accumulate the maximum potential util */
+		if (!nr_running)
+			nr_running = 1;
+		sgs->group_utils += rq->util * nr_running;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4758,6 +4867,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4809,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -5026,6 +5137,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5203,8 +5327,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -6282,7 +6406,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [patch v5 15/15] sched: lazy power balance
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (13 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

When the number of active tasks in a sched domain waves around the power
friendly scheduling criteria, scheduling will thrash between the power
friendly balance and the performance balance, bringing unnecessary task
migrations. The typical benchmark is 'make -j x'.

To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. If there was no performance LB in
the last 32 consecutive load balances, or no LB at all for 8 times
max_interval ms, or no more than 4 performance LBs in the last 64 load
balances, then we accept a power friendly LB. Otherwise, give up this
power friendly LB chance and do nothing.
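
The bit-history part of that rule can be sketched in a few lines of
stand-alone user-space C (the masks mirror PERF_LB_HH_MASK/PERF_LB_LH_MASK
in the patch, __builtin_popcountll stands in for the kernel's hweight64,
the 8 * max_interval reset is left out, and the sample records are
invented):

#include <stdio.h>
#include <stdint.h>

#define PERF_LB_HH_MASK	0xffffffff00000000ULL	/* balances 33..64 ago */
#define PERF_LB_LH_MASK	0xffffffffULL		/* last 32 balances */

/* the record shifts left on every balance attempt; bit 0 gets set
 * whenever a performance balance was actually done */
static int lazy_power_balance_ok(uint64_t record)
{
	return __builtin_popcountll(record & PERF_LB_HH_MASK) <= 4 &&
	       !(record & PERF_LB_LH_MASK);
}

int main(void)
{
	uint64_t recent = 1ULL << 20;	/* perf LB 20 balances ago: too recent */
	uint64_t old = 1ULL << 40;	/* perf LB 40 balances ago: acceptable */

	/* prints "0 1" */
	printf("%d %d\n", lazy_power_balance_ok(recent),
			  lazy_power_balance_ok(old));
	return 0;
}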

With this patch, the worst case for power scheduling -- kbuild -- gets a
similar performance/power value among the different policies.

BTW, the lazy balance shows a performance gain when j goes up to 32.

On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x' results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

data explains: 175.603 /417 13
	175.603: average Watts
	417: seconds(compile time)
	13:  scaled performance/power = 1000000 / seconds / watts
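
(For example, the x = 1 powersaving entry works out as
1000000 / 417 / 175.603 ~= 13.7, which the table truncates to 13.)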

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 68 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b05e1..5051990 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -941,6 +941,7 @@ struct sched_domain {
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
+	u64	perf_lb_record;	/* performance balance record */
 
 	u64 last_update;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b1e9a6..f6ae655 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4514,6 +4514,60 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	}
 }
 
+#define PERF_LB_HH_MASK		0xffffffff00000000ULL
+#define PERF_LB_LH_MASK		0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	env->sd->perf_lb_record <<= 1;
+
+	if (env->perf_lb) {
+		env->sd->perf_lb_record |= 0x1;
+		return 1;
+	}
+
+	/*
+	 * The situation isn't eligible for performance balance. If this_cpu
+	 * is not eligible or the timing is not suitable for lazy powersaving
+	 * balance, we will stop both powersaving and performance balance.
+	 */
+	if (env->power_lb && sds->this == sds->group_leader
+			&& sds->group_leader != sds->group_min) {
+		int interval;
+
+		/* powersaving balance interval set as 8 * max_interval */
+		interval = msecs_to_jiffies(8 * env->sd->max_interval);
+		if (time_after(jiffies, env->sd->last_balance + interval))
+			env->sd->perf_lb_record = 0;
+
+		/*
+		 * A eligible timing is no performance balance in last 32
+		 * balance and performance balance is no more than 4 times
+		 * in last 64 balance, or no balance in powersaving interval
+		 * time.
+		 */
+		if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+			&& !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+			env->imbalance = sds->min_load_per_task;
+			return 0;
+		}
+
+	}
+
+	/* give up this time power balancing, do nothing */
+	env->power_lb = 0;
+	sds->group_min = NULL;
+	return 0;
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -5137,18 +5191,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
-	if (!env->perf_lb && !env->power_lb)
-		return  NULL;
-
-	if (env->power_lb) {
-		if (sds.this == sds.group_leader &&
-				sds.group_leader != sds.group_min) {
-			env->imbalance = sds.min_load_per_task;
-			return sds.group_min;
-		}
-		env->power_lb = 0;
-		return NULL;
-	}
+	if (!need_perf_balance(env, &sds))
+		return sds.group_min;
 
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 0/15] power aware scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (14 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
@ 2013-02-18  7:44 ` Alex Shi
  2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  7:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 01:07 PM, Alex Shi wrote:
> Since the simplification of fork/exec/wake balancing has much arguments,
> I removed that part in the patch set.
> 
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.

Just reviewed the great summary of the conversation on this proposal:
http://lwn.net/Articles/512487/

One advantage of this patchset is that it is based on the general sched
domain/groups architecture, so if another hardware platform, like ARM,
has a different hardware setting, it can be represented as specific domain
flags and then be treated specifically in scheduling. The scheduling is
natural and easy to extend to that.

And I guess the big.LITTLE arch can also be represented by cpu power; if
so, the 'balance' policy just fits it, since it judges whether the sched
domain/group is full by the domain/group's capacity.

Also adding an answer for the automatic policy change feature:
---
The 'balance/powersaving' policies are automatic power friendly scheduling,
since the system will automatically bypass power scheduling when the cpu
utilisation in a sched domain is beyond the domain's cpu weight
(powersaving) or beyond the domain's capacity (balance).
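
As a rough illustration with the SMT numbers from the test data above: an
HT core counts as 2 logical cpus but has a capacity of 1, so on such a
domain powersaving keeps packing until the summed utilisation goes past
200% (two cpus' worth), while balance already gives up at 100% (the
core's capacity).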

There is no always-enabled power scheduling, since the patchset is based on
'race to idle'.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
@ 2013-02-18  8:28   ` Joonsoo Kim
  2013-02-18  9:16     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Joonsoo Kim @ 2013-02-18  8:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

Hello, Alex.

On Mon, Feb 18, 2013 at 01:07:28PM +0800, Alex Shi wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
> after a new task forked.
> Otherwise random values of above variables cause mess when do new task

I think that these are not random values. In arch_dup_task_struct(),
we do '*dst = *src', so these values come from the parent process. If we use
these values appropriately, we can anticipate the child process' load easily.
So initializing load_avg_contrib to zero is not a good idea to me.

Thanks.

> enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1743746 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> +	p->se.avg.load_avg_contrib = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
@ 2013-02-18  8:44   ` Joonsoo Kim
  2013-02-18  8:56     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Joonsoo Kim @ 2013-02-18  8:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

Hello, Alex.

On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
> If the waked/execed task is transitory enough, it will has a chance to be
> packed into a cpu which is busy but still has time to care it.
> For powersaving policy, only the history util < 25% task has chance to
> be packed, and for balance policy, only histroy util < 12.5% has chance.
> If there is no cpu eligible to handle it, will use a idlest cpu in
> leader group.

After exec(), the task's behavior may be changed, and its history util may
be changed, too. So, IMHO, exec balancing by history util is not a good idea.
What do you think about it?

Thanks.

> Morten Rasmussen catch a type bug and suggest using different criteria
> for different policy, thanks!
> 
> Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b172678..2e8131d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3455,19 +3455,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
>  }
>  
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
> +		int policy)
> +{
> +	/* percentage of the task's util */
> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
> +				/ (p->se.avg.runnable_avg_period + 1);
> +
> +	struct rq *rq = cpu_rq(this_cpu);
> +	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +	int vacancy, min_vacancy = INT_MAX, max_util;
> +	int leader_cpu = -1;
> +	int i;
> +
> +	if (policy == SCHED_POLICY_POWERSAVING)
> +		max_util = FULL_UTIL;
> +	else
> +		/* maximum allowable util is 60% */
> +		max_util = 60;
> +
> +	/* bias toward local cpu */
> +	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
> +		max_util - (rq->util * nr_running + (putil << 2)) > 0)
> +			return this_cpu;
> +
> +	/* Traverse only the allowed CPUs */
> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> +		if (i == this_cpu)
> +			continue;
> +
> +		rq = cpu_rq(i);
> +		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> +		/* only light task allowed, like putil < 25% for powersaving */
> +		vacancy = max_util - (rq->util * nr_running + (putil << 2));
> +
> +		if (vacancy > 0 && vacancy < min_vacancy) {
> +			min_vacancy = vacancy;
> +			leader_cpu = i;
> +		}
> +	}
> +	return leader_cpu;
> +}
> +
> +/*
>   * If power policy is eligible for this domain, and it has task allowed cpu.
>   * we will select CPU from this domain.
>   */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> -		struct task_struct *p, struct sd_lb_stats *sds)
> +		struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>  	int policy;
>  	int new_cpu = -1;
>  
>  	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
> -	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> -		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> +	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> +		if (!fork)
> +			new_cpu = find_leader_cpu(sds->group_leader,
> +							p, cpu, policy);
> +		/* for fork balancing and a little busy task */
> +		if (new_cpu == -1)
> +			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> +	}
>  	return new_cpu;
>  }
>  
> @@ -3518,14 +3571,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
>  		if (tmp->flags & sd_flag) {
>  			sd = tmp;
>  
> -			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> +			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> +						flags & SD_BALANCE_FORK);
>  			if (new_cpu != -1)
>  				goto unlock;
>  		}
>  	}
>  
>  	if (affine_sd) {
> -		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> +		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>  		if (new_cpu != -1)
>  			goto unlock;
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  8:44   ` Joonsoo Kim
@ 2013-02-18  8:56     ` Alex Shi
  2013-02-20  5:55       ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  8:56 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:44 PM, Joonsoo Kim wrote:
> Hello, Alex.
> 
> On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
>> If the waked/execed task is transitory enough, it will has a chance to be
>> packed into a cpu which is busy but still has time to care it.
>> For powersaving policy, only the history util < 25% task has chance to
>> be packed, and for balance policy, only histroy util < 12.5% has chance.
>> If there is no cpu eligible to handle it, will use a idlest cpu in
>> leader group.
> 
> After exec(), task's behavior may be changed, and history util may be
> changed, too. So, IMHO, exec balancing by history util is not good idea.
> How do you think about it?
> 

Sounds like it makes sense. Are there any objections?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  8:28   ` Joonsoo Kim
@ 2013-02-18  9:16     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  9:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:28 PM, Joonsoo Kim wrote:
> On Mon, Feb 18, 2013 at 01:07:28PM +0800, Alex Shi wrote:
>> > We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
>> > after a new task forked.
>> > Otherwise random values of above variables cause mess when do new task
> I think that these are not random values. In arch_dup_task_struct(),
> we do '*dst = *src', so, these values come from parent process. If we use
> these value appropriately, we can anticipate child process' load easily.
> So to initialize the load_avg_contrib to zero is not good idea for me.

Um, for a new forked task, calling them random values is appropriate,
since an uncertain value of decay_count makes for random behaviour in
enqueue_entity_load_avg().
And many comments said a new forked task need not follow its parent's
utilisation, like what you claimed for exec.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 0/15] power aware scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (15 preceding siblings ...)
  2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
@ 2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Paul Turner @ 2013-02-19 12:08 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

FYI I'm currently out of the country in New Zealand and won't be able
to take a proper look at this until the beginning of March.

On Mon, Feb 18, 2013 at 6:07 PM, Alex Shi <alex.shi@intel.com> wrote:
> Since the simplification of fork/exec/wake balancing has much arguments,
> I removed that part in the patch set.
>
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
> It defines 2 new power aware policy 'balance' and 'powersaving', then
> try to pack tasks on each sched groups level according the different
> scheduler policy. That can save much power when task number in system
> is no more than LCPU number.
>
> As mentioned in the power aware scheduling proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, less active sched groups will reduce cpu power consumption
>
> The first assumption make performance policy take over scheduling when
> any group is busy.
> The second assumption make power aware scheduling try to pack disperse
> tasks into fewer groups.
>
> Like sched numa, power aware scheduling is also a kind of cpu locality
> oriented scheduling, so it is natural compatible with sched numa.
>
> Since the patch can perfect pack tasks into fewer groups, I just show
> some performance/power testing data here:
> =========================================
> $for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done
>
> On my SNB laptop with 4core* HT: the data is avg Watts
>         powersaving     balance         performance
> i = 2   40              54              54
> i = 4   57              64*             68
> i = 8   68              68              68
>
> Note:
> When i = 4 with balance policy, the power may change in 57~68Watt,
> since the HT capacity and core capacity are both 1.
>
> on SNB EP machine with 2 sockets * 8 cores * HT:
>         powersaving     balance         performance
> i = 4   190             201             238
> i = 8   205             241             268
> i = 16  271             348             376
>
> bltk-game with openarena, the data is avg Watts
>             powersaving     balance         performance
> wsm laptop  22.9             23.8           24.4
> snb laptop  20.2             20.5           20.7
>
> tasks number keep waving benchmark, 'make -j x vmlinux'
> on my SNB EP 2 sockets machine with 8 cores * HT:
>
>          powersaving              balance                performance
> x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
> x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
> x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
> x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
> x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
> x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92
>
> data explains: 175.603 /417 13
>         175.603: average Watts
>         417: seconds(compile time)
>         13:  scaled performance/power = 1000000 / seconds / watts
>
> Another testing of parallel compress with pigz on Linus' git tree.
> results show we get much better performance/power with powersaving and
> balance policy:
>
> testing command:
> #pigz -k -c  -p$x -r linux* &> /dev/null
>
> On a NHM EP box
>          powersaving               balance               performance
> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>
> On a 2 sockets SNB EP box.
>          powersaving               balance               performance
> x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
> x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
> x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44
>
> data format is: 166.516 /88 68
>         166.516: average Watts
>         88: seconds(compress time)
>         68:  scaled performance/power = 1000000 / time / power
>
> Some performance testing results:
> ---------------------------------
>
> Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found on 'performance' policy.
>
> Tested balance/powersaving policy with above benchmarks,
> a, specjbb2005 drop 5~7% on both of policy whenever with openjdk or jrockit.
> b, hackbench drops 30+% with powersaving policy on snb 4 sockets platforms.
> Others has no clear change.
>
> test result from Mike Galbraith:
> --------------------------------
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving.
>
>          3.8.0-performance   3.8.0-balance      3.8.0-powersaving
> Tasks    jobs/min/task       jobs/min/task      jobs/min/task
>     1         432.8571       433.4764           433.1665
>     5         480.1902       510.9612           497.5369
>    10         429.1785       533.4507           518.3918
>    20         424.3697       529.7203           528.7958
>    40         419.0871       500.8264           517.0648
>
> No deltas after that.  There were also no deltas between patched kernel
> using performance policy and virgin source.
>
>
> Changelog:
> V5 change:
> a, change sched_policy to sched_balance_policy
> b, split fork/exec/wake power balancing into 3 patches and refresh
> commit logs
> c, others minors clean up
>
> V4 change:
> a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> Galbraith and Namhyung Kim. Thanks!
> b, take Morten Rasmussen's suggestion to use different criteria for
> different policy in transitory task packing.
> c, shorter latency in power aware scheduling.
>
> V3 change:
> a, engaged nr_running and utils in periodic power balancing.
> b, try packing small exec/wake tasks on running cpu not idle cpu.
>
> V2 change:
> a, add lazy power scheduling to deal with kbuild like benchmark.
>
>
> Thanks comments/suggestions from PeterZ, Linus Torvalds, Andrew Morton,
> Ingo, Arjan van de Ven, Borislav Petkov, PJT, Namhyung Kim, Mike
> Galbraith, Greg, Preeti, Morten Rasmussen etc.
>
> Thanks fengguang's 0-day kbuild system for testing this patchset.
>
> Any more comments are appreciated!
>
> -- Thanks Alex
>
>
> [patch v5 01/15] sched: set initial value for runnable avg of sched
> [patch v5 02/15] sched: set initial load avg of new forked task
> [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [patch v5 04/15] sched: add sched balance policies in kernel
> [patch v5 05/15] sched: add sysfs interface for sched_balance_policy
> [patch v5 06/15] sched: log the cpu utilization at rq
> [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming
> [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
> [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
> [patch v5 10/15] sched: packing transitory tasks in wake/exec power
> [patch v5 11/15] sched: add power/performance balance allow flag
> [patch v5 12/15] sched: pull all tasks from source group
> [patch v5 13/15] sched: no balance for prefer_sibling in power
> [patch v5 14/15] sched: power aware load balance
> [patch v5 15/15] sched: lazy power balance

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  8:56     ` Alex Shi
@ 2013-02-20  5:55       ` Alex Shi
  2013-02-20  7:40         ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  5:55 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:56 PM, Alex Shi wrote:
> On 02/18/2013 04:44 PM, Joonsoo Kim wrote:
>> Hello, Alex.
>>
>> On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
>>> If the waked/execed task is transitory enough, it will has a chance to be
>>> packed into a cpu which is busy but still has time to care it.
>>> For powersaving policy, only the history util < 25% task has chance to
>>> be packed, and for balance policy, only histroy util < 12.5% has chance.
>>> If there is no cpu eligible to handle it, will use a idlest cpu in
>>> leader group.
>>
>> After exec(), task's behavior may be changed, and history util may be
>> changed, too. So, IMHO, exec balancing by history util is not good idea.
>> How do you think about it?
>>
> 
> sounds make sense. are there any objections?
> 

New patch without exec balance packing:

==============
>From 7ed6c68dbd34e40b70c1b4f2563a5af172e289c3 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Thu, 14 Feb 2013 22:46:02 +0800
Subject: [PATCH 09/14] sched: packing transitory tasks in wakeup power
 balancing

If the woken task is transitory enough, it has a chance to be packed
onto a cpu which is busy but still has time to care for it.

For the powersaving policy, only a task with history util < 25% has a
chance to be packed, and for the balance policy, only history util < 12.5%
has a chance. If there is no cpu eligible to handle it, the idlest cpu in
the leader group is used.
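
As a purely illustrative, stand-alone sketch of the vacancy test used in
find_leader_cpu() below (the cpu and task numbers are invented; only the
putil << 2 scaling and the 100/60 limits mirror the patch):

#include <stdio.h>

int main(void)
{
	int putil = 20;		/* task used ~20% of its runnable time */
	int rq_util = 15;	/* candidate cpu's current util in % */
	int nr_running = 1;
	int max_util = 100;	/* powersaving; balance would use 60 */

	int vacancy = max_util - (rq_util * nr_running + (putil << 2));

	/* vacancy = 100 - (15 + 80) = 5, so this cpu can still take the task */
	printf("vacancy = %d -> %s\n", vacancy,
	       vacancy > 0 ? "pack onto this cpu" : "too busy, keep looking");
	return 0;
}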

Morten Rasmussen caught a typo bug and suggested using different criteria
for the different policies, thanks!

Joonsoo Kim suggested not packing on exec, since the old task's util is
possibly unusable.

Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2a65f9..24a68af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3452,19 +3452,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	/* percentage of the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	struct rq *rq = cpu_rq(this_cpu);
+	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+	int vacancy, min_vacancy = INT_MAX, max_util;
+	int leader_cpu = -1;
+	int i;
+
+	if (policy == SCHED_POLICY_POWERSAVING)
+		max_util = FULL_UTIL;
+	else
+		/* maximum allowable util is 60% */
+		max_util = 60;
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+		max_util - (rq->util * nr_running + (putil << 2)) > 0)
+			return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		rq = cpu_rq(i);
+		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only light task allowed, like putil < 25% for powersaving */
+		vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (wakeup)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3515,14 +3568,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						sd_flag & SD_BALANCE_WAKE);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-20  6:20   ` Alex Shi
  2013-02-24 10:57     ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  6:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 01:07 PM, Alex Shi wrote:
> New task has no runnable sum at its first runnable time, so its
> runnable load is zero. That makes burst forking balancing just select
> few idle cpus to assign tasks if we engage runnable load in balancing.
> 
> Set initial load avg of new forked task as its load weight to resolve
> this issue.
> 

A patch answering PJT's update is here; it merges the 1st and 2nd patches
into one. The other patches in the series don't need to change.

=========
>From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
 forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise random values of the above variables cause a mess when doing the
new task enqueue:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalanced due to an incorrect load_avg_contrib.

Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve such issues.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 3 +++
 kernel/sched/fair.c | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1452e14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
+	p->se.avg.load_avg_contrib = p->se.load.weight;
+
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
 	 * accumulated while sleeping.
+	 *
+	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
+	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+	 * value: se->load.weight.
 	 */
 	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  5:55       ` Alex Shi
@ 2013-02-20  7:40         ` Mike Galbraith
  2013-02-20  8:11           ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-20  7:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: Joonsoo Kim, torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:

> Joonsoo Kim suggests not packing exec task, since the old task utils is
> possibly unuseable.

(I'm stumbling around in rtmutex PI land, all dazed and confused, so
forgive me if my peripheral following of this thread is off target;)

Hm, possibly.  Future behavior is always undefined, trying to predict
always a gamble... so it looks to me like not packing on exec places a
bet against the user, who chose to wager that powersaving will happen
and it won't cost him too much, if you don't always try to pack despite
any risks.  The user placed a bet on powersaving, not burst performance.

Same for the fork, if you spread to accommodate a potential burst, you
bin the power wager, so maybe it's not in his best interest.. fork/exec
is common, if it's happening frequently, you'll bin the potential power
win frequently, reducing effectiveness, and silently trading power for
performance when the user asked to trade performance for a lower
electric bill.

Dunno, just a thought, but I'd say for powersaving policy, you have to
go just for broke and hope it works out.  You can't know it won't, but
you'll toss potential winnings every time you don't roll the dice.

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  7:40         ` Mike Galbraith
@ 2013-02-20  8:11           ` Alex Shi
  2013-02-20  8:43             ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  8:11 UTC (permalink / raw)
  To: Mike Galbraith, peterz
  Cc: Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 03:40 PM, Mike Galbraith wrote:
> On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:
> 
>> Joonsoo Kim suggests not packing exec task, since the old task utils is
>> possibly unuseable.
> 
> (I'm stumbling around in rtmutex PI land, all dazed and confused, so
> forgive me if my peripheral following of this thread is off target;)
> 
> Hm, possibly.  Future behavior is always undefined, trying to predict
> always a gamble... so it looks to me like not packing on exec places a
> bet against the user, who chose to wager that powersaving will happen
> and it won't cost him too much, if you don't always try to pack despite
> any risks.  The user placed a bet on powersaving, not burst performance.
> 
> Same for the fork, if you spread to accommodate a potential burst, you
> bin the power wager, so maybe it's not in his best interest.. fork/exec
> is common, if it's happening frequently, you'll bin the potential power
> win frequently, reducing effectiveness, and silently trading power for
> performance when the user asked to trade performance for a lower
> electric bill.
> 
> Dunno, just a thought, but I'd say for powersaving policy, you have to
> go just for broke and hope it works out.  You can't know it won't, but
> you'll toss potential winnings every time you don't roll the dice.


Sounds reasonable too.

I have no idea of the decision now.
And I guess many guys dislike using a knob to let the user make the choice.

What's your opinions, Peter?
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  8:11           ` Alex Shi
@ 2013-02-20  8:43             ` Mike Galbraith
  2013-02-20  8:54               ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-20  8:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: peterz, Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, 2013-02-20 at 16:11 +0800, Alex Shi wrote: 
> On 02/20/2013 03:40 PM, Mike Galbraith wrote:
> > On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:
> > 
> >> Joonsoo Kim suggests not packing exec task, since the old task utils is
> >> possibly unuseable.
> > 
> > (I'm stumbling around in rtmutex PI land, all dazed and confused, so
> > forgive me if my peripheral following of this thread is off target;)
> > 
> > Hm, possibly.  Future behavior is always undefined, trying to predict
> > always a gamble... so it looks to me like not packing on exec places a
> > bet against the user, who chose to wager that powersaving will happen
> > and it won't cost him too much, if you don't always try to pack despite
> > any risks.  The user placed a bet on powersaving, not burst performance.
> > 
> > Same for the fork, if you spread to accommodate a potential burst, you
> > bin the power wager, so maybe it's not in his best interest.. fork/exec
> > is common, if it's happening frequently, you'll bin the potential power
> > win frequently, reducing effectiveness, and silently trading power for
> > performance when the user asked to trade performance for a lower
> > electric bill.
> > 
> > Dunno, just a thought, but I'd say for powersaving policy, you have to
> > go just for broke and hope it works out.  You can't know it won't, but
> > you'll toss potential winnings every time you don't roll the dice.
> 
> 
> Sounds reasonable too.
> 
> I have no idea of the of the decision now.
> And guess many guys dislike to use a knob to let user do the choice.

Nobody likes seeing yet more knobs much, automagical is preferred.
Trouble with automagical heuristics usage is that any heuristic will
inevitably get it wrong sometimes, so giving the user control over usage
is IMHO a good thing.. and once we give the user the choice, we must
honor it, else what was the point?

Anyway, fwiw, I liked what I saw test driving the patch set..

> What's your opinions, Peter?

..but maintainer opinions carry more weight than mine, even to me ;-)

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  8:43             ` Mike Galbraith
@ 2013-02-20  8:54               ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20  8:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: peterz, Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 04:43 PM, Mike Galbraith wrote:
>> > 
>> > Sounds reasonable too.
>> > 
>> > I have no idea of the of the decision now.
>> > And guess many guys dislike to use a knob to let user do the choice.
> Nobody likes seeing yet more knobs much, automagical is preferred.
> Trouble with automagical heuristics usage is that any heuristic will
> inevitably get it wrong sometimes, so giving the user control over usage
> is IMHO a good thing.. and once we give the user the choice, we must
> honor it, else what was the point?
> 
> Anyway, fwiw, I liked what I saw test driving the patch set..
> 
>> > What's your opinions, Peter?
> ..but maintainer opinions carry more weight than mine, even to me ;-)

:) Yes, maintainers usually have heard enough arguments and can balance them...


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:09     ` Preeti U Murthy
  2013-02-20 14:33     ` Alex Shi
  2013-02-20 12:19   ` Preeti U Murthy
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcdb21f..b9a34ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>  
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> +	u32 period;
>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
>  }
>  
>  /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7a19792..ac1e107 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
>  
>  #endif /* CONFIG_SMP */
>  
> +/* the percentage full cpu utilization */
> +#define FULL_UTIL	100

There's generally a better value than 100 when using computers.. seeing
how 100 is 64+32+4.

> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -481,6 +484,7 @@ struct rq {
>  #endif
>  
>  	struct sched_avg avg;
> +	unsigned int util;
>  };
>  
>  static inline int cpu_of(struct rq *rq)

You don't actually compute the rq utilization, you only compute the
utilization as per the fair class, so if there's significant RT activity
it'll think the cpu is under-utilized, which I think will result in the
wrong thing.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
@ 2013-02-20  9:37   ` Ingo Molnar
  2013-02-20 13:40     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2013-02-20  9:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


* Alex Shi <alex.shi@intel.com> wrote:

> Current scheduler behavior is just consider for larger 
> performance of system. So it try to spread tasks on more cpu 
> sockets and cpu cores
> 
> To adding the consideration of power awareness, the patchset 
> adds 2 kinds of scheduler policy: powersaving and balance. 
> They will use runnable load util in scheduler balancing. The 
> current scheduling is taken as performance policy.
> 
> performance: the current scheduling behaviour, try to spread tasks
>                 on more CPU sockets or cores. performance oriented.
> powersaving: will pack tasks into few sched group until all LCPU in the
>                 group is full, power oriented.
> balance    : will pack tasks into few sched group until group_capacity
>                 numbers CPU is full, balance between performance and
> 		powersaving.

Hm, so in a previous review I suggested keeping two main 
policies: power-saving and performance, plus a third, default 
policy, which automatically switches between these two if/when 
the kernel has information about whether a system is on battery 
or on AC - and picking 'performance' when it has no information.

Such an automatic policy would obviously be useful to users - 
and that is what makes such a feature really interesting and a 
step forward.

I think Peter expressed similar views: we don't want many knobs 
and states, we want two major goals plus an (optional but 
default enabled) automatism.

Is your 'balance' policy implementing that suggestion?
If not, why not?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
@ 2013-02-20  9:38   ` Peter Zijlstra
  2013-02-20 12:27     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> @@ -4214,6 +4214,11 @@ struct sd_lb_stats {
>         unsigned int  busiest_group_weight;
>  
>         int group_imb; /* Is there imbalance in this sd */
> +
> +       /* Varibles of power awaring scheduling */
> +       unsigned int  sd_utils; /* sum utilizations of this domain */
> +       unsigned long sd_capacity;      /* capacity of this domain */
> +       struct sched_group *group_leader; /* Group which relieves
> group_min */
>  };
>  
>  /*
> @@ -4229,6 +4234,7 @@ struct sg_lb_stats {
>         unsigned long group_weight;
>         int group_imb; /* Is there an imbalance in the group ? */
>         int group_has_capacity; /* Is there extra capacity in the
> group? */
> +       unsigned int group_utils;       /* sum utilizations of group
> */
>  };

So I have two problems with the _utils name, firstly its a single value
and therefore we shouldn't give it a name in plural, secondly, whenever
I read utils I read utilities, not utilizations.

As a non native speaker I'm not entirely sure, but utilizations sounds
iffy to me, is there even a plural of utilization?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-02-20  9:42   ` Peter Zijlstra
  2013-02-20 12:09     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> +/*
> + * Try to collect the task running number and capacity of the group.
> + */
> +static void get_sg_power_stats(struct sched_group *group,
> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
> +{
> +       int i;
> +
> +       for_each_cpu(i, sched_group_cpus(group)) {
> +               struct rq *rq = cpu_rq(i);
> +
> +               sgs->group_utils += rq->nr_running;
> +       }
> +
> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> +                                               SCHED_POWER_SCALE);
> +       if (!sgs->group_capacity)
> +               sgs->group_capacity = fix_small_capacity(sd, group);
> +       sgs->group_weight = group->group_weight;
> +}

So you're trying to compute the group utilization, but what does that
have to do with nr_running? In an earlier patch you introduced the
per-cpu utilization, so why not avg that to compute the group
utilization?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
@ 2013-02-20  9:48   ` Peter Zijlstra
  2013-02-20 12:04     ` Alex Shi
  2013-02-20 12:12   ` Borislav Petkov
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> @@ -4053,6 +4053,8 @@ struct lb_env {
>         unsigned int            loop;
>         unsigned int            loop_break;
>         unsigned int            loop_max;
> +       int                     power_lb;  /* if power balance needed
> */
> +       int                     perf_lb;   /* if performance balance
> needed */
>  };
>  
>  /*
> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> *this_rq,
>                 .idle           = idle,
>                 .loop_break     = sched_nr_migrate_break,
>                 .cpus           = cpus,
> +               .power_lb       = 0,
> +               .perf_lb        = 1,
>         };
>  
>         cpumask_copy(cpus, cpu_active_mask);

This construct allows for the possibility of power_lb=1,perf_lb=1, does
that make sense? Why not have a single balance_policy enumeration?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20  9:48   ` Peter Zijlstra
@ 2013-02-20 12:04     ` Alex Shi
  2013-02-20 13:37       ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:48 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> @@ -4053,6 +4053,8 @@ struct lb_env {
>>         unsigned int            loop;
>>         unsigned int            loop_break;
>>         unsigned int            loop_max;
>> +       int                     power_lb;  /* if power balance needed
>> */
>> +       int                     perf_lb;   /* if performance balance
>> needed */
>>  };
>>  
>>  /*
>> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
>> *this_rq,
>>                 .idle           = idle,
>>                 .loop_break     = sched_nr_migrate_break,
>>                 .cpus           = cpus,
>> +               .power_lb       = 0,
>> +               .perf_lb        = 1,
>>         };
>>  
>>         cpumask_copy(cpus, cpu_active_mask);
> 
> This construct allows for the possibility of power_lb=1,perf_lb=1, does
> that make sense? Why not have a single balance_policy enumeration?

(power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.

(power_lb == 0 && perf_lb == 0) is possible, and it means there is no
balancing at all on this cpu.

So an enumeration is not enough.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20  9:42   ` Peter Zijlstra
@ 2013-02-20 12:09     ` Alex Shi
  2013-02-20 13:36       ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> +/*
>> + * Try to collect the task running number and capacity of the group.
>> + */
>> +static void get_sg_power_stats(struct sched_group *group,
>> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
>> +{
>> +       int i;
>> +
>> +       for_each_cpu(i, sched_group_cpus(group)) {
>> +               struct rq *rq = cpu_rq(i);
>> +
>> +               sgs->group_utils += rq->nr_running;
>> +       }
>> +
>> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
>> +                                               SCHED_POWER_SCALE);
>> +       if (!sgs->group_capacity)
>> +               sgs->group_capacity = fix_small_capacity(sd, group);
>> +       sgs->group_weight = group->group_weight;
>> +}
> 
> So you're trying to compute the group utilization, but what does that
> have to do with nr_running? In an earlier patch you introduced the
> per-cpu utilization, so why not avg that to compute the group
> utilization?

I had tried to use the rq utilisation in this balancing, but the
utilisation needs much time to accumulate itself (345ms). That is bad for
any bursty balancing. So I use an instant metric instead -- nr_running.

> 



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20  9:30   ` Peter Zijlstra
@ 2013-02-20 12:09     ` Preeti U Murthy
  2013-02-20 13:34       ` Peter Zijlstra
  2013-02-20 14:33     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-20 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

>>  /*
>>   * This is the main, per-CPU runqueue data structure.
>>   *
>> @@ -481,6 +484,7 @@ struct rq {
>>  #endif
>>  
>>  	struct sched_avg avg;
>> +	unsigned int util;
>>  };
>>  
>>  static inline int cpu_of(struct rq *rq)
> 
> You don't actually compute the rq utilization, you only compute the
> utilization as per the fair class, so if there's significant RT activity
> it'll think the cpu is under-utilized, which I think will result in the
> wrong thing.

Correct me if I am wrong, but isn't the current load balancer also
disregarding the real time tasks when calculating the domain/group/cpu
level load?

What I mean is, if the answer to the above question is yes, then can we
safely assume that further optimizations to the load balancer, like the
power aware scheduler and the use of per entity load tracking, can be
done without considering the real time tasks?

Regards
Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
  2013-02-20  9:48   ` Peter Zijlstra
@ 2013-02-20 12:12   ` Borislav Petkov
  2013-02-20 14:20     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-20 12:12 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, Feb 18, 2013 at 01:07:38PM +0800, Alex Shi wrote:
> If a sched domain is idle enough for regular power balance, power_lb
> will be set and perf_lb will be cleared. If a sched domain is busy,
> their values will be set the opposite way.
> 
> If the domain is suitable for power balance, but the balancing should
> not be done by this cpu (this cpu is already idle or full), both perf_lb
> and power_lb are cleared to wait for a suitable cpu to do the power
> balance. That means no balancing at all, neither power balance nor
> performance balance, will be done on this cpu.
> 
> The above logic will be implemented by the incoming patches.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e8131d..0047856 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4053,6 +4053,8 @@ struct lb_env {
>  	unsigned int		loop;
>  	unsigned int		loop_break;
>  	unsigned int		loop_max;
> +	int			power_lb;  /* if power balance needed */
> +	int			perf_lb;   /* if performance balance needed */

Those look like they're used like simple boolean flags. Why not make
them such, i.e. bitfields? See struct perf_event_attr for an example.
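
(For illustration, a made-up fragment showing the bitfield form, in the
style of struct perf_event_attr; this is not code from the series:)

	/* Hypothetical: the two flags as single-bit bitfields. */
	struct lb_bool_flags {
		unsigned int	power_lb : 1,	/* power balance wanted */
				perf_lb  : 1;	/* performance balance wanted */
	};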

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
  2013-02-20  9:30   ` Peter Zijlstra
@ 2013-02-20 12:19   ` Preeti U Murthy
  2013-02-20 12:39     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-20 12:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi everyone,

On 02/18/2013 10:37 AM, Alex Shi wrote:
> The cpu's utilization measures how busy the cpu is.
>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>                 / cpu_rq(cpu)->avg.runnable_avg_period;

Why not cfs_rq->runnable_load_avg? I am concerned with what is the right
metric to use here.
Refer to this discussion: https://lkml.org/lkml/2012/10/29/448

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-20  9:38   ` Peter Zijlstra
@ 2013-02-20 12:27     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:38 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> @@ -4214,6 +4214,11 @@ struct sd_lb_stats {
>>         unsigned int  busiest_group_weight;
>>  
>>         int group_imb; /* Is there imbalance in this sd */
>> +
>> +       /* Varibles of power awaring scheduling */
>> +       unsigned int  sd_utils; /* sum utilizations of this domain */
>> +       unsigned long sd_capacity;      /* capacity of this domain */
>> +       struct sched_group *group_leader; /* Group which relieves
>> group_min */
>>  };
>>  
>>  /*
>> @@ -4229,6 +4234,7 @@ struct sg_lb_stats {
>>         unsigned long group_weight;
>>         int group_imb; /* Is there an imbalance in the group ? */
>>         int group_has_capacity; /* Is there extra capacity in the
>> group? */
>> +       unsigned int group_utils;       /* sum utilizations of group
>> */
>>  };
> 
> So I have two problems with the _utils name, firstly its a single value
> and therefore we shouldn't give it a name in plural, secondly, whenever
> I read utils I read utilities, not utilizations.

I think you are right. At least it doesn't need to be plural.
Sorry for my bad English. How about group_util here?
> 
> As a non native speaker I'm not entirely sure, but utilizations sounds
> iffy to me, is there even a plural of utilization?
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 12:19   ` Preeti U Murthy
@ 2013-02-20 12:39     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 08:19 PM, Preeti U Murthy wrote:
> Hi everyone,
> 
> On 02/18/2013 10:37 AM, Alex Shi wrote:
>> The cpu's utilization measures how busy the cpu is.
>>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>>                 / cpu_rq(cpu)->avg.runnable_avg_period;
> 
> Why not cfs_rq->runnable_load_avg? I am concerned with what is the right
> metric to use here.

Here we care about the utilization of the cpu, not the load avg. The load
avg may be quite big for a heavy task (with a big load weight) even if it
only uses 20% of the cpu time, while a light task with 100% cpu usage may
still have a smaller load avg.

From the power point of view, the light task with 100% usage needs to take
a whole cpu, while the heavy task can be packed onto one cpu together with
other tasks.
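
(A rough worked example of that point, with numbers taken from the
prio_to_weight table just for illustration: a nice -10 task has load
weight 9548, so at 20% usage its runnable load avg settles near
9548 * 0.2 ~= 1910, while a nice 0 task at 100% usage settles near
1024 * 1.0 = 1024. By load avg the first task looks almost twice as heavy,
yet it is the second one that really needs a whole cpu to itself.)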


> Refer to this discussion:https://lkml.org/lkml/2012/10/29/448
> 
> Regards
> Preeti U Murthy
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 12:09     ` Preeti U Murthy
@ 2013-02-20 13:34       ` Peter Zijlstra
  2013-02-20 14:36         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:34 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 17:39 +0530, Preeti U Murthy wrote:
> Hi,
> 
> >>  /*
> >>   * This is the main, per-CPU runqueue data structure.
> >>   *
> >> @@ -481,6 +484,7 @@ struct rq {
> >>  #endif
> >>  
> >>  	struct sched_avg avg;
> >> +	unsigned int util;
> >>  };
> >>  
> >>  static inline int cpu_of(struct rq *rq)
> > 
> > You don't actually compute the rq utilization, you only compute the
> > utilization as per the fair class, so if there's significant RT activity
> > it'll think the cpu is under-utilized, which I think will result in the
> > wrong thing.
> 
> Correct me if I am wrong,but isn't the current load balancer also
> disregarding the real time tasks to calculate the domain/group/cpu level
> load too?

Nope, the rt utilization affects the cpu_power, thereby correcting the
weight stuff.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 12:09     ` Alex Shi
@ 2013-02-20 13:36       ` Peter Zijlstra
  2013-02-20 14:23         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 20:09 +0800, Alex Shi wrote:
> On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
> > On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> >> +/*
> >> + * Try to collect the task running number and capacity of the group.
> >> + */
> >> +static void get_sg_power_stats(struct sched_group *group,
> >> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
> >> +{
> >> +       int i;
> >> +
> >> +       for_each_cpu(i, sched_group_cpus(group)) {
> >> +               struct rq *rq = cpu_rq(i);
> >> +
> >> +               sgs->group_utils += rq->nr_running;
> >> +       }
> >> +
> >> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> >> +                                               SCHED_POWER_SCALE);
> >> +       if (!sgs->group_capacity)
> >> +               sgs->group_capacity = fix_small_capacity(sd, group);
> >> +       sgs->group_weight = group->group_weight;
> >> +}
> > 
> > So you're trying to compute the group utilization, but what does that
> > have to do with nr_running? In an earlier patch you introduced the
> > per-cpu utilization, so why not avg that to compute the group
> > utilization?
> 
> I had tried to use rq utilisation in this balancing, but since the
> utilisation need much time to accumulate itself(345ms). It's bad for
> any burst balancing. So I use instant utilisation -- nr_running.

But but but,... nr_running is completely unrelated to utilization.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 12:04     ` Alex Shi
@ 2013-02-20 13:37       ` Peter Zijlstra
  2013-02-20 13:48         ` Peter Zijlstra
  2013-02-20 13:52         ` Alex Shi
  0 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:

> >> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> >> *this_rq,
> >>                 .idle           = idle,
> >>                 .loop_break     = sched_nr_migrate_break,
> >>                 .cpus           = cpus,
> >> +               .power_lb       = 0,
> >> +               .perf_lb        = 1,
> >>         };
> >>  
> >>         cpumask_copy(cpus, cpu_active_mask);
> > 
> > This construct allows for the possibility of power_lb=1,perf_lb=1, does
> > that make sense? Why not have a single balance_policy enumeration?
> 
> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
> 
> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
> balance on this cpu.
> 
> So, enumeration is not enough.

Huh.. both 0 doesn't make any sense either. If there's no balancing, we
shouldn't be here to begin with.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20  9:37   ` Ingo Molnar
@ 2013-02-20 13:40     ` Alex Shi
  2013-02-20 15:41       ` Ingo Molnar
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 05:37 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@intel.com> wrote:
> 
>> The current scheduler behavior only considers the larger performance
>> of the system, so it tries to spread tasks over more cpu sockets and
>> cpu cores.
>>
>> To add the consideration of power awareness, the patchset adds 2 kinds
>> of scheduler policy: powersaving and balance. They will use the
>> runnable load util in scheduler balancing. The current scheduling is
>> taken as the performance policy.
>>
>> performance: the current scheduling behaviour, tries to spread tasks
>>                 on more CPU sockets or cores. Performance oriented.
>> powersaving: will pack tasks into few sched groups until all LCPUs in
>>                 the group are full. Power oriented.
>> balance    : will pack tasks into few sched groups until group_capacity
>>                 number of CPUs is full. A balance between performance
>>                 and powersaving.
> 
> Hm, so in a previous review I suggested keeping two main 
> policies: power-saving and performance, plus a third, default 
> policy, which automatically switches between these two if/when 
> the kernel has information about whether a system is on battery 
> or on AC - and picking 'performance' when it has no information.

I will try to add a default policy according to your suggestion.
> 
> Such an automatic policy would obviously be useful to users - 
> and that is what makes such a feature really interesting and a 
> step forward.
> 
> I think Peter expressed similar views: we don't want many knobs 
> and states, we want two major goals plus an (optional but 
> default enabled) automatism.

I got the message, thanks for restating it again.

Now there are just 2 types of policy: performance and powersaving (with 2
degrees, powersaving and balance).

The powersaving policy will try to assign one task to each LCPU, whether
the LCPU is an SMT thread or a core.
The balance policy is also a kind of powersaving policy, just a bit less
aggressive. It will try to assign tasks according to group capacity, one
task per unit of capacity.
It was introduced because of the SMT LCPUs on the Intel arch. An SMT
thread is an independent LCPU in software, but its cpu power (smt_gain
1178 / 2 = 589) is smaller than a normal CPU's (1024). So the group
capacity is just 1 for a core with 2 SMT threads, and under this policy
normally just one task is assigned to one core.
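
(In numbers: the group power of such a core is about 589 * 2 = 1178, and
DIV_ROUND_CLOSEST(1178, 1024) = 1, so the whole 2-thread core ends up
counting as capacity for just one task.)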


> 
> Is your 'balance' policy implementing that suggestion?
> If not, why not?
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:37       ` Peter Zijlstra
@ 2013-02-20 13:48         ` Peter Zijlstra
  2013-02-20 14:08           ` Alex Shi
  2013-02-20 13:52         ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 14:37 +0100, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:
> 
> > >> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> > >> *this_rq,
> > >>                 .idle           = idle,
> > >>                 .loop_break     = sched_nr_migrate_break,
> > >>                 .cpus           = cpus,
> > >> +               .power_lb       = 0,
> > >> +               .perf_lb        = 1,
> > >>         };
> > >>  
> > >>         cpumask_copy(cpus, cpu_active_mask);
> > > 
> > > This construct allows for the possibility of power_lb=1,perf_lb=1, does
> > > that make sense? Why not have a single balance_policy enumeration?
> > 
> > (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
> > 
> > (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
> > balance on this cpu.
> > 
> > So, enumeration is not enough.
> 
> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
> shouldn't be here to begin with.

Also, why is this in the lb_env at all, shouldn't we simply use the
global sched_balance_policy all over the place? Its not like we want to
change power/perf on a finer granularity.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:37       ` Peter Zijlstra
  2013-02-20 13:48         ` Peter Zijlstra
@ 2013-02-20 13:52         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 09:37 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:
> 
>>>> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
>>>> *this_rq,
>>>>                 .idle           = idle,
>>>>                 .loop_break     = sched_nr_migrate_break,
>>>>                 .cpus           = cpus,
>>>> +               .power_lb       = 0,
>>>> +               .perf_lb        = 1,
>>>>         };
>>>>  
>>>>         cpumask_copy(cpus, cpu_active_mask);
>>>
>>> This construct allows for the possibility of power_lb=1,perf_lb=1, does
>>> that make sense? Why not have a single balance_policy enumeration?
>>
>> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
>>
>> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
>> balance on this cpu.
>>
>> So, enumeration is not enough.
> 
> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
> shouldn't be here to begin with.
> 

Um, both 0 means a balance did happen, and we think a power balance is
appropriate for this domain, but maybe this group is already empty, so
this cpu is inappropriate for pulling a task. Then we exit this round of
balancing and wait for another cpu, from a more appropriate group, to do
the balance and pull a task.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:48         ` Peter Zijlstra
@ 2013-02-20 14:08           ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen


>>> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
>>>
>>> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
>>> balance on this cpu.
>>>
>>> So, enumeration is not enough.
>>
>> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
>> shouldn't be here to begin with.
> 
> Also, why is this in the lb_env at all, shouldn't we simply use the
> global sched_balance_policy all over the place? Its not like we want to
> change power/perf on a finer granularity.

They are in lb_env since we need to set them according to each group's
status, mostly in update_sd_lb_power_stats().

Even when sched_balance_policy is powersaving, the domain may still need
a performance balance, since there may be too many tasks or too much
imbalance in the domain.

When we find the domain is not suitable for power balance, we will set
perf_lb = 1, and then we don't need to go through the other groups for
power info collection.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 12:12   ` Borislav Petkov
@ 2013-02-20 14:20     ` Alex Shi
  2013-02-20 15:22       ` Borislav Petkov
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:20 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2e8131d..0047856 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4053,6 +4053,8 @@ struct lb_env {
>>  	unsigned int		loop;
>>  	unsigned int		loop_break;
>>  	unsigned int		loop_max;
>> +	int			power_lb;  /* if power balance needed */
>> +	int			perf_lb;   /* if performance balance needed */
> 
> Those look like they're used like simple boolean flags. Why not make
> them such, i.e. bitfields? See struct perf_event_attr for an example.

There are 11 long words in struct lb_env now. Using booleans or bitfields
can't save much space, and they are not as convenient to use.
> 
> Thanks.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 13:36       ` Peter Zijlstra
@ 2013-02-20 14:23         ` Alex Shi
  2013-02-21 13:33           ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 09:36 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:09 +0800, Alex Shi wrote:
>> On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
>>> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>>>> +/*
>>>> + * Try to collect the task running number and capacity of the group.
>>>> + */
>>>> +static void get_sg_power_stats(struct sched_group *group,
>>>> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
>>>> +{
>>>> +       int i;
>>>> +
>>>> +       for_each_cpu(i, sched_group_cpus(group)) {
>>>> +               struct rq *rq = cpu_rq(i);
>>>> +
>>>> +               sgs->group_utils += rq->nr_running;
>>>> +       }
>>>> +
>>>> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
>>>> +                                               SCHED_POWER_SCALE);
>>>> +       if (!sgs->group_capacity)
>>>> +               sgs->group_capacity = fix_small_capacity(sd, group);
>>>> +       sgs->group_weight = group->group_weight;
>>>> +}
>>>
>>> So you're trying to compute the group utilization, but what does that
>>> have to do with nr_running? In an earlier patch you introduced the
>>> per-cpu utilization, so why not avg that to compute the group
>>> utilization?
>>
>> I had tried to use rq utilisation in this balancing, but since the
>> utilisation need much time to accumulate itself(345ms). It's bad for
>> any burst balancing. So I use instant utilisation -- nr_running.
> 
> But but but,... nr_running is completely unrelated to utilization.
> 

Actually, I also hesitated about the name. How about using nr_running to
replace group_util directly?



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:09     ` Preeti U Murthy
@ 2013-02-20 14:33     ` Alex Shi
  2013-02-20 15:20       ` Peter Zijlstra
  2013-02-20 15:22       ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:30 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fcdb21f..b9a34ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>>  
>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>  {
>> +	u32 period;
>>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
>> +
>> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
>> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
>>  }
>>  
>>  /* Add the load generated by se into cfs_rq's child load-average */
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 7a19792..ac1e107 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
>>  
>>  #endif /* CONFIG_SMP */
>>  
>> +/* the percentage full cpu utilization */
>> +#define FULL_UTIL	100
> 
> There's generally a better value than 100 when using computers.. seeing
> how 100 is 64+32+4.

I didn't find a good example of this, and I have no idea what you are
suggesting. Would you like to explain a bit more?

> 
>> +
>>  /*
>>   * This is the main, per-CPU runqueue data structure.
>>   *
>> @@ -481,6 +484,7 @@ struct rq {
>>  #endif
>>  
>>  	struct sched_avg avg;
>> +	unsigned int util;
>>  };
>>  
>>  static inline int cpu_of(struct rq *rq)
> 
> You don't actually compute the rq utilization, you only compute the
> utilization as per the fair class, so if there's significant RT activity
> it'll think the cpu is under-utilized, which I think will result in the
> wrong thing.

Yes. It is a bit complicated to resolve this. Any suggestions on this, guys?
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 13:34       ` Peter Zijlstra
@ 2013-02-20 14:36         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti U Murthy, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 09:34 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 17:39 +0530, Preeti U Murthy wrote:
>> Hi,
>>
>>>>  /*
>>>>   * This is the main, per-CPU runqueue data structure.
>>>>   *
>>>> @@ -481,6 +484,7 @@ struct rq {
>>>>  #endif
>>>>  
>>>>  	struct sched_avg avg;
>>>> +	unsigned int util;
>>>>  };
>>>>  
>>>>  static inline int cpu_of(struct rq *rq)
>>>
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT activity
>>> it'll think the cpu is under-utilized, which I think will result in the
>>> wrong thing.
>>
>> Correct me if I am wrong,but isn't the current load balancer also
>> disregarding the real time tasks to calculate the domain/group/cpu level
>> load too?
> 
> Nope, the rt utilization affects the cpu_power, thereby correcting the
> weight stuff.

The balance policy uses group capacity, which implies using cpu power,
but capacity seems to be a very rough metric.
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 14:33     ` Alex Shi
@ 2013-02-20 15:20       ` Peter Zijlstra
  2013-02-21  1:35         ` Alex Shi
  2013-02-20 15:22       ` Peter Zijlstra
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 15:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
> > There's generally a better value than 100 when using computers..
> seeing
> > how 100 is 64+32+4.
> 
> I didn't find a good example for this. and no idea of your suggestion,
> would you like to explain a bit more?

Basically what you're doing ends up being fixed point math, using 100 as
unit is inefficient, pick a power-of-2 and everything reduces to
bit-shifts.

http://en.wikipedia.org/wiki/Fixed-point_arithmetic

So use 128 or 1024 or whatever and you don't need mult and div
instructions to represent [0,1]
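
(A small hedged sketch of that idea; the constants and helper below are
invented for illustration and are not from the patches:)

	#define FULL_UTIL_SHIFT	10
	#define FULL_UTIL	(1 << FULL_UTIL_SHIFT)	/* 1024 == fully busy */

	/*
	 * Hypothetical replacement for the "* 100 / period" computation:
	 * the scaling multiply becomes a shift, only the divide by the
	 * accumulation period remains. The rq runnable sums stay far below
	 * 2^22, so the shift cannot overflow a 32-bit value here.
	 */
	static inline unsigned int calc_util(u32 runnable_sum, u32 period)
	{
		return (runnable_sum << FULL_UTIL_SHIFT) / (period ? period : 1);
	}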


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 14:33     ` Alex Shi
  2013-02-20 15:20       ` Peter Zijlstra
@ 2013-02-20 15:22       ` Peter Zijlstra
  2013-02-25  2:26         ` Alex Shi
  2013-03-22  8:49         ` Alex Shi
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 15:22 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
> > You don't actually compute the rq utilization, you only compute the
> > utilization as per the fair class, so if there's significant RT
> activity
> > it'll think the cpu is under-utilized, which I think will result in
> the
> > wrong thing.
> 
> Yes. It is a bit complicated to resolve this. Any suggestions on this, guys?

Shouldn't be too hard seeing as we already track cpu utilization for !
fair usage; see rq::rt_avg and scale_rt_power.
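
(Roughly, and only as a simplified sketch rather than the exact kernel
code: scale_rt_power() removes the time eaten by non-fair activity from
the period and scales the cpu's power by what is left, so a cpu busy with
RT work does not end up looking under-utilized:)

	/* Simplified illustration: scale cpu power by the non-RT fraction. */
	static unsigned long scaled_power_sketch(u64 period, u64 rt_time,
						 unsigned long power)
	{
		u64 available = period > rt_time ? period - rt_time : 0;

		return div64_u64(power * available, period ? period : 1);
	}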


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 14:20     ` Alex Shi
@ 2013-02-20 15:22       ` Borislav Petkov
  2013-02-21  1:32         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-20 15:22 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, Feb 20, 2013 at 10:20:19PM +0800, Alex Shi wrote:
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2e8131d..0047856 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4053,6 +4053,8 @@ struct lb_env {
> >>  	unsigned int		loop;
> >>  	unsigned int		loop_break;
> >>  	unsigned int		loop_max;
> >> +	int			power_lb;  /* if power balance needed */
> >> +	int			perf_lb;   /* if performance balance needed */
> > 
> > Those look like they're used like simple boolean flags. Why not make
> > them such, i.e. bitfields? See struct perf_event_attr for an example.
> 
> there are 11 long words in struct lb_env now. use boolean or bitfields
> can't save much space.

Now now maybe.

Btw, there's a ->flags variable there which simply cries to get another
LBF_* flag or two. This way you don't add any new members at all and
don't enlarge the struct.

> and not use conveniently.

Make yourself accessor functions or whatever.
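
(For illustration, the kind of LBF_* bits this points at; the names and
values are hypothetical, picked to sit next to the existing flags such as
LBF_ALL_PINNED and LBF_NEED_BREAK, and are not from the series:)

	#define LBF_POWER_LB	0x08	/* power balance wanted */
	#define LBF_PERF_LB	0x10	/* performance balance wanted */

	/* usage sketch: env->flags |= LBF_PERF_LB;
	 *               if (env->flags & LBF_POWER_LB) ... */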

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20 13:40     ` Alex Shi
@ 2013-02-20 15:41       ` Ingo Molnar
  2013-02-21  1:43         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2013-02-20 15:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


* Alex Shi <alex.shi@intel.com> wrote:

> Now there is just 2 types policy: performance and 
> powersaving(with 2 degrees, powersaving and balance).

I don't think we really want to have 'degrees' to the policies 
at this point - we want each policy to be extremely good at what 
it aims to do:

 - 'performance' should finish jobs in the least amount of 
    time possible. No ifs and whens.

 - 'power saving' should finish jobs with the least amount of 
    watts consumed. No ifs and whens.

> powersaving policy will try to assign one task to each LCPU, 
> whichever the LCPU is SMT thread or a core. The balance policy 
> is also a kind of powersaving policy, just a bit less 
> aggressive. It will try to assign tasks according group 
> capacity, one task to one capacity.

The thing is, 'a bit less aggressive' is an awfully vague 
concept to maintain on a long term basis - while the two 
definitions above are reasonably deterministic which can be 
measured and improved upon.

Those two policies and definitions are also much easier to 
communicate to user-space and to users - it's much easier to 
explain what each policy is supposed to do.

I'd be totally glad if we got so far that those two policies 
work really well. Any further nuance visible at the ABI level is 
I think many years down the road - if at all. Simple things 
first - those are complex enough already.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 15:22       ` Borislav Petkov
@ 2013-02-21  1:32         ` Alex Shi
  2013-02-21  9:42           ` Borislav Petkov
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:32 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 11:22 PM, Borislav Petkov wrote:
> On Wed, Feb 20, 2013 at 10:20:19PM +0800, Alex Shi wrote:
>>>> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> > >> index 2e8131d..0047856 100644
>>>> > >> --- a/kernel/sched/fair.c
>>>> > >> +++ b/kernel/sched/fair.c
>>>> > >> @@ -4053,6 +4053,8 @@ struct lb_env {
>>>> > >>  	unsigned int		loop;
>>>> > >>  	unsigned int		loop_break;
>>>> > >>  	unsigned int		loop_max;
>>>> > >> +	int			power_lb;  /* if power balance needed */
>>>> > >> +	int			perf_lb;   /* if performance balance needed */
>>> > > 
>>> > > Those look like they're used like simple boolean flags. Why not make
>>> > > them such, i.e. bitfields? See struct perf_event_attr for an example.
>> > 
>> > there are 11 long words in struct lb_env now. use boolean or bitfields
>> > can't save much space.
> Now now maybe.
> 
> Btw, there's a ->flags variable there which simply cries to get another
> LBF_* flag or two. This way you don't add any new members at all and
> don't enlarge the struct.
> 

Yes, using flags can save 2 int variables; I will change that.

Just curious: considering the lb_env size, the fact that it is only used
on the stack, the big cacheline size of modern cpus, and the gcc alignment
flags used for the kernel, it seems no arch needs more cache lines. Is
there any platform whose performance is impacted by these 2 int variables?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:20       ` Peter Zijlstra
@ 2013-02-21  1:35         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:20 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> There's generally a better value than 100 when using computers..
>> seeing
>>> how 100 is 64+32+4.
>>
>> I didn't find a good example for this. and no idea of your suggestion,
>> would you like to explain a bit more?
> 
> Basically what you're doing ends up being fixed point math, using 100 as
> unit is inefficient, pick a power-of-2 and everything reduces to
> bit-shifts.
> 
> http://en.wikipedia.org/wiki/Fixed-point_arithmetic
> 
> So use 128 or 1024 or whatever and you don't need mult and div
> instructions to represent [0,1]
> 

got it. will reconsider this.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20 15:41       ` Ingo Molnar
@ 2013-02-21  1:43         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 11:41 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@intel.com> wrote:
> 
>> Now there is just 2 types policy: performance and 
>> powersaving(with 2 degrees, powersaving and balance).
> 
> I don't think we really want to have 'degrees' to the policies 
> at this point - we want each policy to be extremely good at what 
> it aims to do:
> 
>  - 'performance' should finish jobs in the least amount of 
>     time possible. No ifs and whens.
> 
>  - 'power saving' should finish jobs with the least amount of 
>     watts consumed. No ifs and whens.
> 
>> powersaving policy will try to assign one task to each LCPU, 
>> whichever the LCPU is SMT thread or a core. The balance policy 
>> is also a kind of powersaving policy, just a bit less 
>> aggressive. It will try to assign tasks according group 
>> capacity, one task to one capacity.
> 
> The thing is, 'a bit less aggressive' is an awfully vague 
> concept to maintain on a long term basis - while the two 
> definitions above are reasonably deterministic which can be 
> measured and improved upon.
> 
> Those two policies and definitions are also much easier to 
> communicate to user-space and to users - it's much easier to 
> explain what each policy is supposed to do.
> 
> I'd be totally glad if we got so far that those two policies 
> work really well. Any further nuance visible at the ABI level is 
> I think many years down the road - if at all. Simple things 
> first - those are complex enough already.


Thanks for comments!
I will remove the 'balance' policy.

> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-21  1:32         ` Alex Shi
@ 2013-02-21  9:42           ` Borislav Petkov
  2013-02-21 14:52             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-21  9:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Thu, Feb 21, 2013 at 09:32:54AM +0800, Alex Shi wrote:
> Yes, use flags can save 2 int variable, I will change that.
>
> Just curious, consider the lb_env size and just used in stack, plus
> the big cacheline size of modern cpu, and the alignment of gcc flag on
> kernel, seems no arch needs more cache lines. Are there any platforms
> performance is impacted by this 2 int variables?

Not that I know of. But that's not the point: if we don't pay attention
and are not as economical as possible in the kernel, and especially in
heavily walked code as the scheduler, we'll become fat and bloated (if
we're not halfway there already, that is).

It might not impact processor bandwidth now because internal paths are
obviously adequate but you're not the only one adding features. What
happens if the next guy comes and adds another two integers just because
it is convenient in the code?

Btw, sizeof(lb_env) is currently something around 80 bytes AFAICT. Now
that doesn't fit in one cacheline anyway. So if you add your two ints,
they'll be trailing in the second cacheline which needs to go up to L1.

Now flags will still be at the beginning of the second cacheline but
it is still better to add two new bits there because this is exactly
what this variable is for.

And, just for the fun of it, if you push the flags variable higher in
the struct itself, it will land in the first cacheline and there's your
design with *absolutely* no overhead in that respect. I betcha if you
do this, you won't see any overhead in L1 utilization even with perf
counters because you get it practically for free.

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 14:23         ` Alex Shi
@ 2013-02-21 13:33           ` Peter Zijlstra
  2013-02-21 14:40             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-21 13:33 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:23 +0800, Alex Shi wrote:
> > But but but,... nr_running is completely unrelated to utilization.
> > 
> 
> Actually, I also hesitated on the name, how about using nr_running to
> replace group_util directly?


The name is a secondary issue, first you need to explain why you think
nr_running is a useful metric at all.

You can have a high nr_running and a low utilization (a burst of
wakeups, each waking a process that'll instantly go to sleep again), or
low nr_running and high utilization (a single process cpu bound
process).

There is absolutely no relation between utilization and nr_running,
building something on that assumption is just wrong and broken.




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-21 13:33           ` Peter Zijlstra
@ 2013-02-21 14:40             ` Alex Shi
  2013-02-22  8:54               ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-21 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/21/2013 09:33 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:23 +0800, Alex Shi wrote:
>>> But but but,... nr_running is completely unrelated to utilization.
>>>
>>
>> Actually, I also hesitated on the name, how about using nr_running to
>> replace group_util directly?
> 
> 
> The name is a secondary issue, first you need to explain why you think
> nr_running is a useful metric at all.
> 
> You can have a high nr_running and a low utilization (a burst of
> wakeups, each waking a process that'll instantly go to sleep again), or
> low nr_running and high utilization (a single process cpu bound
> process).

That is true for periodic balance. But at fork/exec/wake time, the
incoming processes usually need to do something before sleeping again.

I use nr_running to measure how busy the group is, for 3 reasons:
1, the current performance policy doesn't use utilization either.
2, the power policy doesn't care about load weight.
3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc., and some
benchmark results look clearly bad when utilization is used. If my memory
is right, hackbench and aim7 both look bad. I had tried many ways to bring
utilization into this balance, like using utilization only, or
utilization * nr_running, etc., but still could not find a way to recover
the loss. But with nr_running, the performance doesn't seem to lose much
with the power policy.

> 
> There is absolutely no relation between utilization and nr_running,
> building something on that assumption is just wrong and broken.

I have just tried all my benchmarks, dbench/loop netperf/specjbb/sysbench
etc., and the performance/power testing results all seem acceptable.

> 
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-21  9:42           ` Borislav Petkov
@ 2013-02-21 14:52             ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21 14:52 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/21/2013 05:42 PM, Borislav Petkov wrote:
> On Thu, Feb 21, 2013 at 09:32:54AM +0800, Alex Shi wrote:
>> Yes, use flags can save 2 int variable, I will change that.
>>
>> Just curious, consider the lb_env size and just used in stack, plus
>> the big cacheline size of modern cpu, and the alignment of gcc flag on
>> kernel, seems no arch needs more cache lines. Are there any platforms
>> performance is impacted by this 2 int variables?
> 
> Not that I know of. But that's not the point: if we don't pay attention
> and are not as economical as possible in the kernel, and especially in
> heavily walked code as the scheduler, we'll become fat and bloated (if
> we're not halfway there already, that is).
> 
> It might not impact processor bandwidth now because internal paths are
> obviously adequate but you're not the only one adding features. What
> happens if the next guy comes and adds another two integers just because
> it is convenient in the code?

Thanks for the detailed, nice explanation!

I get the point; as a performance sensitive guy, I was just curious which
platforms may be impacted. :)
> 
> Btw, sizeof(lb_env) is currently something around 80 bytes AFAICT. Now
> that doesn't fit in one cacheline anyway. So if you add your two ints,
> they'll be trailing in the second cacheline which needs to go up to L1.
> 
> Now flags will still be at the beginning of the second cacheline but
> it is still better to add two new bits there because this is exactly
> what this variable is for.
> 
> And, just for the fun of it, if you push the flags variable higher in
> the struct itself, it will land in the first cacheline and there's your
> design with *absolutely* no overhead in that respect. I betcha if you
> do this, you won't see any overhead in L1 utilization even with perf
> counters because you get it practically for free.

Thanks for the suggestion.
It looks like the members' order was already considered in lb_env. The
'flags' field looks less important and less frequently used than the
fields before it. :)
> 
> :-)
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-21 14:40             ` Alex Shi
@ 2013-02-22  8:54               ` Peter Zijlstra
  2013-02-24  9:27                 ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-22  8:54 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
> > The name is a secondary issue, first you need to explain why you
> think
> > nr_running is a useful metric at all.
> > 
> > You can have a high nr_running and a low utilization (a burst of
> > wakeups, each waking a process that'll instantly go to sleep again),
> or
> > low nr_running and high utilization (a single process cpu bound
> > process).
> 
> It is true in periodic balance. But in fork/exec/waking timing, the
> incoming processes usually need to do something before sleep again.

You'd be surprised, there's a fair number of workloads that have
negligible runtime on wakeup.

> I use nr_running to measure how the group busy, due to 3 reasons:
> 1, the current performance policy doesn't use utilization too.

We were planning to fix that now that its available.

> 2, the power policy don't care load weight.

Then its broken, it should very much still care about weight. 

> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
> benchmark results looks clear bad when use utilization. if my memory
> right, the hackbench/aim7 both looks bad. I had tried many ways to
> engage utilization into this balance, like use utilization only, or
> use
> utilization * nr_running etc. but still can not find a way to recover
> the lose. But with nr_running, the performance seems doesn't lose much
> with power policy.

You're failing to explain why utilization performs bad and you don't
explain why nr_running is better. That things work simply isn't good
enough, you have to have at least a general idea (but much preferable a
very good idea) _why_ things work.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-22  8:54               ` Peter Zijlstra
@ 2013-02-24  9:27                 ` Alex Shi
  2013-02-24  9:49                   ` Preeti U Murthy
  2013-02-24 17:51                   ` Preeti U Murthy
  0 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-24  9:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>> The name is a secondary issue, first you need to explain why you
>> think
>>> nr_running is a useful metric at all.
>>>
>>> You can have a high nr_running and a low utilization (a burst of
>>> wakeups, each waking a process that'll instantly go to sleep again),
>> or
>>> low nr_running and high utilization (a single process cpu bound
>>> process).
>>
>> It is true in periodic balance. But in fork/exec/waking timing, the
>> incoming processes usually need to do something before sleep again.
> 
> You'd be surprised, there's a fair number of workloads that have
> negligible runtime on wakeup.

I will appreciate it if you would like to introduce some such workloads. :)
BTW, do you have some idea of how to handle them?
Actually, if tasks are that transitory, it is also hard to catch them in
balancing; with 'cyclictest -t 100' on my 4 LCPU laptop, vmstat can only
catch 1 or 2 tasks every second.
> 
>> I use nr_running to measure how the group busy, due to 3 reasons:
>> 1, the current performance policy doesn't use utilization too.
> 
> We were planning to fix that now that its available.

I had tried, but failed on the aim9 benchmark. As a result I gave up on
using utilization in the performance balance.
Some of the attempts and discussion are in these threads:
https://lkml.org/lkml/2013/1/6/96
https://lkml.org/lkml/2013/1/22/662
> 
>> 2, the power policy don't care load weight.
> 
> Then its broken, it should very much still care about weight.

Here the power policy just uses nr_running as the criterion to check
whether it's eligible for power aware balance. When doing the balancing,
the load weight is still the key judgment.

> 
>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>> benchmark results looks clear bad when use utilization. if my memory
>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>> engage utilization into this balance, like use utilization only, or
>> use
>> utilization * nr_running etc. but still can not find a way to recover
>> the lose. But with nr_running, the performance seems doesn't lose much
>> with power policy.
> 
> You're failing to explain why utilization performs bad and you don't
> explain why nr_running is better. That things work simply isn't good

Um, let me try to explain again. The utilisation needs much time to
accumulate itself (345ms). With or without load weight, many bursting
tasks give only a minimum weight to the carrier CPU in the first few ms.
So it is too easy to do an incorrect distribution here and then need
migration in later periodic balancing.
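
(The 345ms figure comes from the per-entity load tracking constants: the
runnable average decays by y every 1ms period, with y^32 = 0.5, and the
accumulated sum only saturates after about LOAD_AVG_MAX_N = 345 such
periods, so a cpu that suddenly becomes busy needs roughly that long
before its utilization reads as fully busy.)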

> enough, you have to have at least a general idea (but much preferable a
> very good idea) _why_ things work.

Here nr_running is just a criterion for checking whether the power policy
is suitable; in the later task distribution judgment, load weight and util
are still used, as in the next patch: sched: packing transitory tasks in
wake/exec power balancing.

I will reconsider the criterion, but I would also appreciate any ideas.
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:27                 ` Alex Shi
@ 2013-02-24  9:49                   ` Preeti U Murthy
  2013-02-24 11:55                     ` Alex Shi
  2013-02-24 17:51                   ` Preeti U Murthy
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24  9:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

Hi Alex,

On 02/24/2013 02:57 PM, Alex Shi wrote:
> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>> The name is a secondary issue, first you need to explain why you
>>> think
>>>> nr_running is a useful metric at all.
>>>>
>>>> You can have a high nr_running and a low utilization (a burst of
>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>> or
>>>> low nr_running and high utilization (a single process cpu bound
>>>> process).
>>>
>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>> incoming processes usually need to do something before sleep again.
>>
>> You'd be surprised, there's a fair number of workloads that have
>> negligible runtime on wakeup.
> 
> will appreciate if you like introduce some workload. :)
> BTW, do you has some idea to handle them?
> Actually, if tasks is just like transitory, it is also hard to catch
> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
> just can catch 1 or 2 tasks very second.
>>
>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>> 1, the current performance policy doesn't use utilization too.
>>
>> We were planning to fix that now that its available.
> 
> I had tried, but failed on aim9 benchmark. As a result I give up to use
> utilization in performance balance.
> Some trying and talking in the thread.
> https://lkml.org/lkml/2013/1/6/96
> https://lkml.org/lkml/2013/1/22/662
>>
>>> 2, the power policy don't care load weight.
>>
>> Then its broken, it should very much still care about weight.
> 
> Here power policy just use nr_running as the criteria to check if it's
> eligible for power aware balance. when do balancing the load weight is
> still the key judgment.
> 
>>
>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>> benchmark results looks clear bad when use utilization. if my memory
>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use
>>> utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
> 
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

I don't understand why forked tasks take time to accumulate load. I
understand this if it were a woken-up task. The first time the forked
task gets a chance to update its load, it needs to reflect full
utilization. In __update_entity_runnable_avg() both runnable_avg_period
and runnable_avg_sum get incremented equally for a forked task since it
is runnable. So where is the chance for the load to get incremented in
steps?

For sleeping tasks, since runnable_avg_sum progresses much more slowly
than runnable_avg_period, those tasks take much time to accumulate load
when they wake up. That makes sense, of course. But how does this happen
for forked tasks?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-20  6:20   ` Alex Shi
@ 2013-02-24 10:57     ` Preeti U Murthy
  2013-02-25  6:00       ` Alex Shi
  2013-02-25  7:12       ` Alex Shi
  0 siblings, 2 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24 10:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 02/20/2013 11:50 AM, Alex Shi wrote:
> On 02/18/2013 01:07 PM, Alex Shi wrote:
>> New task has no runnable sum at its first runnable time, so its
>> runnable load is zero. That makes burst forking balancing just select
>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>
>> Set initial load avg of new forked task as its load weight to resolve
>> this issue.
>>
> 
> patch answering PJT's update here. that merged the 1st and 2nd patches 
> into one. other patches in serial don't need to change.
> 
> =========
> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
>  forked task
> 
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> and make forking balancing imbalance since incorrect load_avg_contrib.
> 
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 3 +++
>  kernel/sched/fair.c | 4 ++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1452e14 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>  		p->sched_reset_on_fork = 0;
>  	}
> 
I think the following comment will help here.
/* All forked tasks are assumed to have full utilization to begin with */
> +	p->se.avg.load_avg_contrib = p->se.load.weight;
> +
>  	if (!rt_prio(p->prio))
>  		p->sched_class = &fair_sched_class;
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fa536..cae5134 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>  	 * migration we use a negative decay count to track the remote decays
>  	 * accumulated while sleeping.
> +	 *
> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> +	 * value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for
all forked tasks, via enqueue_task_fair->update_entity_load_avg() during
the second iteration. But __update_entity_load_avg() inside
update_entity_load_avg(), where the actual load update happens, does not
get called. This is because, as below, the last_update of the forked
task is nearly equal to the clock_task of the runqueue; probably 1ms has
not yet passed for the load to get updated. That is why neither the load
of the task nor the load of the runqueue gets updated when the task
forks.

Also note that the reason we bypass update_entity_load_avg() below is
not that our decay_count is 0. It is that forked tasks have nothing to
update. Only woken-up tasks and migrated wakeups have load updates to
do. Forked tasks were just created; they have no load to "update", only
to "create". This, I feel, is rightly done in sched_fork by this patch.

So ideally I don't think we should have any comment here. It does not
seem relevant.

>  	 */
>  	if (unlikely(se->avg.decay_count <= 0)) {
>  		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
> 


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:49                   ` Preeti U Murthy
@ 2013-02-24 11:55                     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-24 11:55 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen


>> Um, let me try to explain again, The utilisation need much time to
>> accumulate itself(345ms). Whenever with or without load weight, many
>> bursting tasks just give a minimum weight to the carrier CPU at the
>> first few ms. So, it is too easy to do a incorrect distribution here and
>> need migration on later periodic balancing.
> 
> I dont understand why forked tasks are taking time to accumulate the
> load.I understand this if it were to be a woken up task.The first time

A newly forked task gets its load at once, but the CPU utilization
still needs time to accumulate; these are different concepts. The CPU
utilization describes how many ms this cpu has run over a past
period...



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:27                 ` Alex Shi
  2013-02-24  9:49                   ` Preeti U Murthy
@ 2013-02-24 17:51                   ` Preeti U Murthy
  2013-02-25  2:23                     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24 17:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

Hi,

On 02/24/2013 02:57 PM, Alex Shi wrote:
> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>> The name is a secondary issue, first you need to explain why you
>>> think
>>>> nr_running is a useful metric at all.
>>>>
>>>> You can have a high nr_running and a low utilization (a burst of
>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>> or
>>>> low nr_running and high utilization (a single process cpu bound
>>>> process).
>>>
>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>> incoming processes usually need to do something before sleep again.
>>
>> You'd be surprised, there's a fair number of workloads that have
>> negligible runtime on wakeup.
> 
> will appreciate if you like introduce some workload. :)
> BTW, do you has some idea to handle them?
> Actually, if tasks is just like transitory, it is also hard to catch
> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
> just can catch 1 or 2 tasks very second.
>>
>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>> 1, the current performance policy doesn't use utilization too.
>>
>> We were planning to fix that now that its available.
> 
> I had tried, but failed on aim9 benchmark. As a result I give up to use
> utilization in performance balance.
> Some trying and talking in the thread.
> https://lkml.org/lkml/2013/1/6/96
> https://lkml.org/lkml/2013/1/22/662
>>
>>> 2, the power policy don't care load weight.
>>
>> Then its broken, it should very much still care about weight.
> 
> Here power policy just use nr_running as the criteria to check if it's
> eligible for power aware balance. when do balancing the load weight is
> still the key judgment.
> 
>>
>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>> benchmark results looks clear bad when use utilization. if my memory
>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use
>>> utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
> 
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

Why can't this be attacked in *either* of the following ways:

1. Attack the problem at the source, by ensuring that the utilisation is
accumulated faster through a smaller update window.

2. Balance on nr_running only if you detect burst wakeups.
Alex, you had released a patch earlier which could detect this, right?
Instead of balancing on nr_running all the time, why not balance on it
only when burst wakeups are detected? By doing so you ensure that
nr_running as a metric for load balancing is used when it is right to do
so, and the reason for using it also gets well documented.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24 17:51                   ` Preeti U Murthy
@ 2013-02-25  2:23                     ` Alex Shi
  2013-02-25  3:23                       ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  2:23 UTC (permalink / raw)
  To: Preeti U Murthy, Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/25/2013 01:51 AM, Preeti U Murthy wrote:
> Hi,
> 
> On 02/24/2013 02:57 PM, Alex Shi wrote:
>> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>>> The name is a secondary issue, first you need to explain why you
>>>> think
>>>>> nr_running is a useful metric at all.
>>>>>
>>>>> You can have a high nr_running and a low utilization (a burst of
>>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>>> or
>>>>> low nr_running and high utilization (a single process cpu bound
>>>>> process).
>>>>
>>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>>> incoming processes usually need to do something before sleep again.
>>>
>>> You'd be surprised, there's a fair number of workloads that have
>>> negligible runtime on wakeup.
>>
>> will appreciate if you like introduce some workload. :)
>> BTW, do you has some idea to handle them?
>> Actually, if tasks is just like transitory, it is also hard to catch
>> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
>> just can catch 1 or 2 tasks very second.
>>>
>>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>>> 1, the current performance policy doesn't use utilization too.
>>>
>>> We were planning to fix that now that its available.
>>
>> I had tried, but failed on aim9 benchmark. As a result I give up to use
>> utilization in performance balance.
>> Some trying and talking in the thread.
>> https://lkml.org/lkml/2013/1/6/96
>> https://lkml.org/lkml/2013/1/22/662
>>>
>>>> 2, the power policy don't care load weight.
>>>
>>> Then its broken, it should very much still care about weight.
>>
>> Here power policy just use nr_running as the criteria to check if it's
>> eligible for power aware balance. when do balancing the load weight is
>> still the key judgment.
>>
>>>
>>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>>> benchmark results looks clear bad when use utilization. if my memory
>>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>>> engage utilization into this balance, like use utilization only, or
>>>> use
>>>> utilization * nr_running etc. but still can not find a way to recover
>>>> the lose. But with nr_running, the performance seems doesn't lose much
>>>> with power policy.
>>>
>>> You're failing to explain why utilization performs bad and you don't
>>> explain why nr_running is better. That things work simply isn't good
>>
>> Um, let me try to explain again, The utilisation need much time to
>> accumulate itself(345ms). Whenever with or without load weight, many
>> bursting tasks just give a minimum weight to the carrier CPU at the
>> first few ms. So, it is too easy to do a incorrect distribution here and
>> need migration on later periodic balancing.
> 
> Why can't this be attacked in *either* of the following ways:
> 
> 1.Attack this problem at the source, by ensuring that the utilisation is
> accumulated faster by making the update window smaller.

It is a double-edged sword. A small period responds quickly, but loses
a lot of history. An extremely short period is just the same as the
current instantaneous utilization.
> 
> 2.Balance on nr->running only if you detect burst wakeups.
> Alex, you had released a patch earlier which could detect this right?

Yes, the patch is here:
https://lkml.org/lkml/2013/1/11/45

One problem is how to decide the criteria for a burst: if we define 5
wakeups/ms as a burst, we will miss 4 wakeups/ms.
Another problem is the cost of burst detection; we need to track a
period of wakeup history, preferably for the whole group, and that adds
extra cost exactly during the burst.
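
Just to illustrate the two problems, a minimal sketch of per-cpu
wakeup-rate bookkeeping (all names and numbers here are hypothetical,
this is not the patch linked above): a hard threshold misses anything
just under it, and the counter has to be touched on every wakeup, i.e.
exactly during the burst:

#include <stdbool.h>
#include <stdint.h>

struct burst_detect {
	uint64_t window_start_ns;	/* start of current 1ms window */
	unsigned int wakeups;		/* wakeups seen in this window */
};

#define BURST_WINDOW_NS		1000000ULL	/* 1 ms           */
#define BURST_THRESHOLD		5		/* 5 wakeups / ms */

/* called on every wakeup: extra work paid precisely while bursting */
bool burst_detected(struct burst_detect *bd, uint64_t now_ns)
{
	if (now_ns - bd->window_start_ns > BURST_WINDOW_NS) {
		bd->window_start_ns = now_ns;
		bd->wakeups = 0;
	}
	/* a steady 4 wakeups/ms never trips this */
	return ++bd->wakeups >= BURST_THRESHOLD;
}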

solution candidates:
https://lkml.org/lkml/2013/1/21/316
After talking with MikeG, I removed the runnable load avg from the
performance load balance.

Using nr_running as an instant utilization may narrow the situations
where the power policy applies -- considering power consumption, a light
but cpu-intensive task costs much more power than a heavy but only
occasionally running task. And it fits all my benchmarks:
aim7/hackbench/kbuild/cyclictest/netperf etc.

> Instead of balancing on nr_running all the time, why not balance on it
> only if burst wakeups are detected. By doing so you ensure that
> nr_running as a metric for load balancing is used when it is right to do
> so and the reason to use it also gets well documented.
> 
> Regards
> Preeti U Murthy
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:22       ` Peter Zijlstra
@ 2013-02-25  2:26         ` Alex Shi
  2013-03-22  8:49         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-25  2:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:22 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT
>> activity
>>> it'll think the cpu is under-utilized, whihc I think will result in
>> the
>>> wrong thing.
>>
>> yes. A bit complicit to resolve this. Any suggestions on this, guys?
> 
> Shouldn't be too hard seeing as we already track cpu utilization for !
> fair usage; see rq::rt_avg and scale_rt_power.
> 

added them in periodic balancing, thanks!

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  2:23                     ` Alex Shi
@ 2013-02-25  3:23                       ` Mike Galbraith
  2013-02-25  9:53                         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-25  3:23 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:

> One of problem is the how to decide the criteria of the burst? If we set
> 5 waking up/ms is burst, we will lose 4 waking up/ms.
> another problem is the burst detection cost, we need tracking a period
> history info of the waking up, better on whole group. but that give the
> extra cost in burst.
> 
> solution candidates:
> https://lkml.org/lkml/2013/1/21/316
> After talk with MikeG, I remove the runnable load avg in performance
> load balance.

One thing you could try is to make criteria depend on avg_idle.  It will
slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
You could perhaps extend it to cover new task wakeup as well, and use
that transition to invalidate history, switch to instantaneous until
fresh history can accumulate.
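
For reference, this is roughly what the wakeup path does with
idle_stamp/avg_idle in kernels of this period (paraphrased from memory,
so treat it as a sketch rather than a verbatim quote):

	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2 * sysctl_sched_migration_cost;

		if (delta > max)		/* woke after a long idle */
			rq->avg_idle = max;	/* "slam" to 2*migration_cost */
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}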

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-24 10:57     ` Preeti U Murthy
@ 2013-02-25  6:00       ` Alex Shi
  2013-02-28  7:03         ` Preeti U Murthy
  2013-02-25  7:12       ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  6:00 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/24/2013 06:57 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 02/20/2013 11:50 AM, Alex Shi wrote:
>> On 02/18/2013 01:07 PM, Alex Shi wrote:
>>> New task has no runnable sum at its first runnable time, so its
>>> runnable load is zero. That makes burst forking balancing just select
>>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>>
>>> Set initial load avg of new forked task as its load weight to resolve
>>> this issue.
>>>
>>
>> patch answering PJT's update here. that merged the 1st and 2nd patches 
>> into one. other patches in serial don't need to change.
>>
>> =========
>> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@intel.com>
>> Date: Mon, 3 Dec 2012 17:30:39 +0800
>> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
>>  forked task
>>
>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> new forked task.
>> Otherwise random values of above variables cause mess when do new task
>> enqueue:
>>     enqueue_task_fair
>>         enqueue_entity
>>             enqueue_entity_load_avg
>>
>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>
>> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
>> resolve such issues.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/core.c | 3 +++
>>  kernel/sched/fair.c | 4 ++++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 26058d0..1452e14 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>>  	p->se.avg.runnable_avg_period = 0;
>>  	p->se.avg.runnable_avg_sum = 0;
>> +	p->se.avg.decay_count = 0;
>>  #endif
>>  #ifdef CONFIG_SCHEDSTATS
>>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>>  		p->sched_reset_on_fork = 0;
>>  	}
>>
> I think the following comment will help here.
> /* All forked tasks are assumed to have full utilization to begin with */
>> +	p->se.avg.load_avg_contrib = p->se.load.weight;
>> +
>>  	if (!rt_prio(p->prio))
>>  		p->sched_class = &fair_sched_class;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 81fa536..cae5134 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>>  	 * migration we use a negative decay count to track the remote decays
>>  	 * accumulated while sleeping.
>> +	 *
>> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
>> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
>> +	 * value: se->load.weight.
> 
> I disagree with the comment.update_entity_load_avg() gets called for all
> forked tasks.
> enqueue_task_fair->update_entity_load_avg() during the second
> iteration.But __update_entity_load_avg() in update_entity_load_avg()
> 

When 'enqueue_task_fair->update_entity_load_avg()' is reached during
the second iteration, the se has changed.
That is a different se.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-24 10:57     ` Preeti U Murthy
  2013-02-25  6:00       ` Alex Shi
@ 2013-02-25  7:12       ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-25  7:12 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/24/2013 06:57 PM, Preeti U Murthy wrote:
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 26058d0..1452e14 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>> >  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>> >  	p->se.avg.runnable_avg_period = 0;
>> >  	p->se.avg.runnable_avg_sum = 0;
>> > +	p->se.avg.decay_count = 0;
>> >  #endif
>> >  #ifdef CONFIG_SCHEDSTATS
>> >  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> > @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>> >  		p->sched_reset_on_fork = 0;
>> >  	}
>> > 
> I think the following comment will help here.
> /* All forked tasks are assumed to have full utilization to begin with */


looks fine.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  3:23                       ` Mike Galbraith
@ 2013-02-25  9:53                         ` Alex Shi
  2013-02-25 10:30                           ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  9:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/25/2013 11:23 AM, Mike Galbraith wrote:
> On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:
> 
>> One of problem is the how to decide the criteria of the burst? If we set
>> 5 waking up/ms is burst, we will lose 4 waking up/ms.
>> another problem is the burst detection cost, we need tracking a period
>> history info of the waking up, better on whole group. but that give the
>> extra cost in burst.
>>
>> solution candidates:
>> https://lkml.org/lkml/2013/1/21/316
>> After talk with MikeG, I remove the runnable load avg in performance
>> load balance.
> 
> One thing you could try is to make criteria depend on avg_idle.  It will
> slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
> You could perhaps extend it to cover new task wakeup as well, and use
> that transition to invalidate history, switch to instantaneous until
> fresh history can accumulate.

Sorry, I could not get your point; would you like to go into details?

I also still don't understand the idle_stamp setting. idle_stamp is set
in idle_balance(); if idle_balance() doesn't pull a task, the idle_stamp
value is kept. Then even if the cpu gets tasks from other balancing,
like periodic balance or fork/exec/wake balancing, the idle_stamp is
still kept.

So, when the cpu goes into the next idle_balance(), it is highly likely
to meet the avg_idle > migration_cost condition and start trying to pull
tasks, nearly unconditionally.

Was idle_balance() designed to work this way, or should we still reset
idle_stamp whenever we pull a task?

> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  9:53                         ` Alex Shi
@ 2013-02-25 10:30                           ` Mike Galbraith
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Galbraith @ 2013-02-25 10:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, 2013-02-25 at 17:53 +0800, Alex Shi wrote: 
> On 02/25/2013 11:23 AM, Mike Galbraith wrote:
> > On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:
> > 
> >> One of problem is the how to decide the criteria of the burst? If we set
> >> 5 waking up/ms is burst, we will lose 4 waking up/ms.
> >> another problem is the burst detection cost, we need tracking a period
> >> history info of the waking up, better on whole group. but that give the
> >> extra cost in burst.
> >>
> >> solution candidates:
> >> https://lkml.org/lkml/2013/1/21/316
> >> After talk with MikeG, I remove the runnable load avg in performance
> >> load balance.
> > 
> > One thing you could try is to make criteria depend on avg_idle.  It will
> > slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
> > You could perhaps extend it to cover new task wakeup as well, and use
> > that transition to invalidate history, switch to instantaneous until
> > fresh history can accumulate.
> 
> Sorry for can not get your points, would you like to goes to details?

If you've been idle for a bit, your history is stale.
> And still don't understand of the idle_stamp setting, idle_stamp was set
> in idle_balance(), if idle_balance doesn't pulled task, idle_stamp value
> kept. then even the cpu get tasks from another balancing, like periodic
> balance, fork/exec/wake balancing, the idle_stamp is still kept.

The thought is that you only care about the somewhat longish idles that
bursty loads exhibits.  The time when ttwu() detects that it should
trash idle history to kick idle_balance() back into action seems likely
to me to be the right time to trash load history, to accommodate bursty
loads in a dirt simple dirt cheap manner.  The idle balance throttle
methodology may not be perfect, but it works pretty well, and is dirt
cheap.

> So, when the cpu goes to next idle_balance(), it's highly possible to
> meet the avg_idle > migration_cost condition, and start try to pull
> tasks, nearly unconditionally.
> 
> Does the idle_balance was designed to this? or we still should reset
> idle_stamp whichever we pulled a task?

                if (pulled_task) {
                        this_rq->idle_stamp = 0;
                        break;
                }

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-25  6:00       ` Alex Shi
@ 2013-02-28  7:03         ` Preeti U Murthy
  0 siblings, 0 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-28  7:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 81fa536..cae5134 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>>>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>>>  	 * migration we use a negative decay count to track the remote decays
>>>  	 * accumulated while sleeping.
>>> +	 *
>>> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
>>> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
>>> +	 * value: se->load.weight.
>>
>> I disagree with the comment.update_entity_load_avg() gets called for all
>> forked tasks.
>> enqueue_task_fair->update_entity_load_avg() during the second
>> iteration.But __update_entity_load_avg() in update_entity_load_avg()
>>
> 
> When goes 'enqueue_task_fair->update_entity_load_avg()' during the
> second iteration. the se is changed.
> That is different se.
> 
> 
Correct, Alex. Sorry, I overlooked this.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
@ 2013-03-20  4:57   ` Preeti U Murthy
  2013-03-21  7:43     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-20  4:57 UTC (permalink / raw)
  To: Alex Shi, mingo, peterz, efault
  Cc: torvalds, tglx, akpm, arjan, bp, pjt, namhyung, vincent.guittot,
	gregkh, viresh.kumar, linux-kernel, morten.rasmussen

Hi Alex,

Please note one point below.

On 02/18/2013 10:37 AM, Alex Shi wrote:
> This patch enabled the power aware consideration in load balance.
> 
> As mentioned in the power aware scheduler proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, less active sched_groups will reduce power consumption
> 
> The first assumption make performance policy take over scheduling when
> any scheduler group is busy.
> The second assumption make power aware scheduling try to pack disperse
> tasks into fewer groups.
> 
> The enabling logical summary here:
> 1, Collect power aware scheduler statistics during performance load
> balance statistics collection.
> 2, If the balance cpu is eligible for power load balance, just do it
> and forget performance load balance. If the domain is suitable for
> power balance, but the cpu is inappropriate(idle or full), stop both
> power/performance balance in this domain. If using performance balance
> or any group is busy, do performance balance.
> 
> Above logical is mainly implemented in update_sd_lb_power_stats(). It
> decides if a domain is suitable for power aware scheduling. If so,
> it will fill the dst group and source group accordingly.
> 
> This patch reuse some of Suresh's power saving load balance code.
> 
> A test can show the effort on different policy:
> for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done
> 
> On my SNB laptop with 4core* HT: the data is Watts
>         powersaving     balance         performance
> i = 2   40              54              54
> i = 4   57              64*             68
> i = 8   68              68              68
> 
> Note:
> When i = 4 with balance policy, the power may change in 57~68Watt,
> since the HT capacity and core capacity are both 1.
> 
> on SNB EP machine with 2 sockets * 8 cores * HT:
>         powersaving     balance         performance
> i = 4   190             201             238
> i = 8   205             241             268
> i = 16  271             348             376
> 
> If system has few continued tasks, use power policy can get
> the performance/power gain. Like sysbench fileio randrw test with 16
> thread on the SNB EP box,
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 126 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffdf35d..3b1e9a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->group_load += load;
>  		sgs->sum_nr_running += nr_running;
>  		sgs->sum_weighted_load += weighted_cpuload(i);
> +
> +		/* accumulate the maximum potential util */
> +		if (!nr_running)
> +			nr_running = 1;
> +		sgs->group_utils += rq->util * nr_running;

You may have observed the following, but I thought it would be best to
bring it to notice.

The above will lead to situations where sched groups never fill to their
full capacity. This is explained with an example below:

Say the topology is two cores with hyperthreading, two logical threads
each. If we choose the powersaving policy and run two workloads at full
utilization, the load will get distributed one on each core; they will
not get consolidated on a single core.
The reason is that the condition
" if (sgs->group_utils + FULL_UTIL > threshold_util) " in
update_sd_lb_power_stats() will fail.

The situation goes thus:


w1                 w2
t1  t2             t3  t4
-------            -------
core1               core2

Above: t -> thread (logical cpu)
       w -> workload


Neither core will be able to pull the task from the other to
consolidate the load, because the rq->util of t2 and t4, on which no
process is running, continues to show some number (even though it decays
with time) and sgs->utils accounts for it. Therefore, for core1 and
core2, sgs->utils will be slightly above 100 and the above condition
will fail, disqualifying them as candidates for group_leader, since
threshold_util will be 200.

This phenomenon is seen with the balance policy and on wider topologies
as well. I think we would be better off not accounting the rq->util of
cpus which have no processes running on them in sgs->utils.
What do you think?
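
Relative to the hunk quoted above, the suggestion would look roughly
like this (a sketch of the idea only, not a tested patch):

		/* accumulate the maximum potential util, but only for
		 * cpus that are actually running something */
		if (nr_running)
			sgs->group_utils += rq->util * nr_running;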


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-20  4:57   ` Preeti U Murthy
@ 2013-03-21  7:43     ` Alex Shi
  2013-03-21  8:41       ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-21  7:43 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/20/2013 12:57 PM, Preeti U Murthy wrote:
> Neither core will be able to pull the task from the other to consolidate
> the load because the rq->util of t2 and t4, on which no process is
> running, continue to show some number even though they degrade with time
> and sgs->utils accounts for them. Therefore,
> for core1 and core2, the sgs->utils will be slightly above 100 and the
> above condition will fail, thus failing them as candidates for
> group_leader,since threshold_util will be 200.

Thanks for the note, Preeti!

Did you find a real issue on some platform?
In theory, a totally idle cpu has a zero rq->util at least after 3xx ms,
and in fact I find the code works fine on my machines.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  7:43     ` Alex Shi
@ 2013-03-21  8:41       ` Preeti U Murthy
  2013-03-21  9:27         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-21  8:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 03/21/2013 01:13 PM, Alex Shi wrote:
> On 03/20/2013 12:57 PM, Preeti U Murthy wrote:
>> Neither core will be able to pull the task from the other to consolidate
>> the load because the rq->util of t2 and t4, on which no process is
>> running, continue to show some number even though they degrade with time
>> and sgs->utils accounts for them. Therefore,
>> for core1 and core2, the sgs->utils will be slightly above 100 and the
>> above condition will fail, thus failing them as candidates for
>> group_leader,since threshold_util will be 200.
> 
> Thanks for note, Preeti!
> 
> Did you find some real issue in some platform?
> In theory, a totally idle cpu has a zero rq->util at least after 3xxms,
> and in fact, I find the code works fine on my machines.
> 

Yes, I did find this behaviour on a 2 socket, 8 core machine very
consistently.

rq->util cannot go to 0 after it has begun accumulating load, right?

Say a load was running on a runqueue whose rq->util was at 100%. After
the load finishes, the runqueue goes idle. With every scheduler tick its
utilisation decays, but it can never become 0.

rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period

This ratio will come close to 0, but will never become 0 once it has
picked up a value. So if a sched_group consists of two runqueues, one
with utilisation 100 running 1 load and the other with utilisation .001
but running no load, then in update_sd_lb_power_stats() the condition

"sgs->group_utils + FULL_UTIL > threshold_util" turns out to be

(100.001 + 100 > 200), and hence the group fails to act as the group
leader and take on more tasks.


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  8:41       ` Preeti U Murthy
@ 2013-03-21  9:27         ` Alex Shi
  2013-03-21 10:27           ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-21  9:27 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 04:41 PM, Preeti U Murthy wrote:
>> > 
> Yes, I did find this behaviour on a 2 socket, 8 core machine very
> consistently.
> 
> rq->util cannot go to 0, after it has begun accumulating load right?
> 
> Say a load was running on a runqueue which had its rq->util to be at
> 100%. After the load finishes, the runqueue goes idle. For every
> scheduler tick, its utilisation decays. But can never become 0.
> 
> rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period


Did you close all background system services?
In theory rq->avg.runnable_avg_sum should be zero if there has been no
task for a while; otherwise there is a bug in the kernel. Could you
check the value under /proc/sched_debug?


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  9:27         ` Alex Shi
@ 2013-03-21 10:27           ` Preeti U Murthy
  2013-03-22  1:30             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-21 10:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 02:57 PM, Alex Shi wrote:
> On 03/21/2013 04:41 PM, Preeti U Murthy wrote:
>>>>
>> Yes, I did find this behaviour on a 2 socket, 8 core machine very
>> consistently.
>>
>> rq->util cannot go to 0, after it has begun accumulating load right?
>>
>> Say a load was running on a runqueue which had its rq->util to be at
>> 100%. After the load finishes, the runqueue goes idle. For every
>> scheduler tick, its utilisation decays. But can never become 0.
>>
>> rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period
> 
> 
> did you close all of background system services?
> In theory the rq->avg.runnable_avg_sum should be zero if there is no
> task a bit long, otherwise there are some bugs in kernel.

Could you explain why rq->avg.runnable_avg_sum should be zero? What if
some kernel thread ran on this runqueue and is now finished? Its
utilisation would be, say, x. How would that ever drop to 0, even if
nothing ran on it later?

Regards
Preeti U Murthy

> Could you
> check the value under /proc/sched_debug?
> 
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21 10:27           ` Preeti U Murthy
@ 2013-03-22  1:30             ` Alex Shi
  2013-03-22  5:14               ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-22  1:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
>> > did you close all of background system services?
>> > In theory the rq->avg.runnable_avg_sum should be zero if there is no
>> > task a bit long, otherwise there are some bugs in kernel.
> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
> some kernel thread ran on this run queue and is now finished? Its
> utilisation would be say x.How would that ever drop to 0,even if nothing
> ran on it later?

The value comes from decay_load():
 sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
and in decay_load() it is possible for it to be set to zero.

And /proc/sched_debug also confirms this:

  .tg_runnable_contrib           : 0
  .tg->runnable_avg              : 50
  .avg->runnable_avg_sum         : 0
  .avg->runnable_avg_period      : 47507


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-22  1:30             ` Alex Shi
@ 2013-03-22  5:14               ` Preeti U Murthy
  2013-03-25  4:52                 ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-22  5:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

On 03/22/2013 07:00 AM, Alex Shi wrote:
> On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
>>>> did you close all of background system services?
>>>> In theory the rq->avg.runnable_avg_sum should be zero if there is no
>>>> task a bit long, otherwise there are some bugs in kernel.
>> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
>> some kernel thread ran on this run queue and is now finished? Its
>> utilisation would be say x.How would that ever drop to 0,even if nothing
>> ran on it later?
> 
> the value get from decay_load():
>  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> in decay_load it is possible to be set zero.

Yes, you are right, it is possible for it to be set to 0, but only
after a very long time -- to be more precise, nearly 2 seconds. If you
look at decay_load(), only if the period between the last update and now
has crossed (32*63) periods does the runnable_avg_sum become 0;
otherwise it simply decays.

This means that for nearly 2 seconds, consolidation of load may not be
possible even after the runqueues have finished executing the tasks
running on them.

The exact experiment that I performed was running ebizzy with just two
threads. My setup was 2 sockets, 2 cores each, 4 threads per core -- so
a 16 logical cpu machine. When I run ebizzy with the balance policy, the
2 ebizzy threads end up one on each socket, while I would expect them to
be on the same socket. All other cpus, except the ones running the
ebizzy threads, are idle and not running anything on either socket.
I am not running any other processes.

You could run a similar experiment and let me know if you see
otherwise. I am at a loss to understand why else such a spreading of
load would occur, if not because rq->util does not become 0 quickly when
the runqueue is not running anything. I have used trace_printks to track
the util of runqueues that run nothing after some initial load, and it
does not become 0 until the end of the run.

Regards
Preeti U Murthy


> 
> and /proc/sched_debug also approve this:
> 
>   .tg_runnable_contrib           : 0
>   .tg->runnable_avg              : 50
>   .avg->runnable_avg_sum         : 0
>   .avg->runnable_avg_period      : 47507
> 
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:22       ` Peter Zijlstra
  2013-02-25  2:26         ` Alex Shi
@ 2013-03-22  8:49         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-03-22  8:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:22 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT
>> activity
>>> it'll think the cpu is under-utilized, whihc I think will result in
>> the
>>> wrong thing.
>>
>> yes. A bit complicit to resolve this. Any suggestions on this, guys?
> 
> Shouldn't be too hard seeing as we already track cpu utilization for !
> fair usage; see rq::rt_avg and scale_rt_power.
> 

Hi Peter,

rt_avg will be accumulated the irq time and steal time in
update_rq_clock_task(), if CONFIG_IRQ_TIME_ACCOUNTING or
CONFIG_IRQ_TIME_ACCOUNTING defined. That cause irq/steal time was double
added into rq utilisation, since normal rq->util already include the irq
time. So we do wrongly judgement to think it is a overload cpu. but it
is not.

To resolve this issue, if is it possible to introduce another member in
rq to describe rt_avg non irq/steal beside the rt_avg? If so, what the
name do you like to use?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-22  5:14               ` Preeti U Murthy
@ 2013-03-25  4:52                 ` Alex Shi
  2013-03-29 12:42                   ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-25  4:52 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/22/2013 01:14 PM, Preeti U Murthy wrote:
>> > 
>> > the value get from decay_load():
>> >  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
>> > in decay_load it is possible to be set zero.
> Yes you are right, it is possible to be set to 0, but after a very long
> time, to be more precise, nearly 2 seconds. If you look at decay_load(),
> if the period between last update and now has crossed (32*63),only then
> will the runnable_avg_sum become 0, else it will simply decay.
> 
> This means that for nearly 2seconds,consolidation of loads may not be
> possible even after the runqueues have finished executing tasks running
> on them.

Look into decay_load(): since LOAD_AVG_MAX is about 47742 and
2^16 = 65536, the maximum avg sum decays to zero after about 16 * 32ms.
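
As a rough check of that figure, a standalone back-of-the-envelope
sketch (it counts whole half-lives only, ignoring the fractional y^n
scaling inside each 32-period half-life):

#include <stdio.h>

int main(void)
{
	unsigned long sum = 47742;	/* LOAD_AVG_MAX */
	int halvings = 0;

	while (sum) {		/* one halving per 32 periods (~32ms) */
		sum >>= 1;
		halvings++;
	}
	printf("%d halvings, i.e. about %d ms, until the sum hits 0\n",
	       halvings, halvings * 32);	/* prints 16 -> ~512 ms */
	return 0;
}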

Yes, compared with the 345ms accumulation time, the decay is not
symmetric and not precise; it seems there is room to tune it better. But
it is acceptable for now.
> 
> The exact experiment that I performed was running ebizzy, with just two
> threads. My setup was 2 socket,2 cores each,4 threads each core. So a 16
> logical cpu machine.When I begin running ebizzy with balance policy, the
> 2 threads of ebizzy are found one on each socket, while I would expect
> them to be on the same socket. All other cpus, except the ones running
> ebizzy threads are idle and not running anything on either socket.
> I am not running any other processes.

Did you try the simplest benchmark: while true; do :; done ?
I am writing the v6 version, which includes rt_util etc.; you may test
on it after I send it out. :)
> 
> You could run a similar experiment and let me know if you see otherwise.
> I am at a loss to understand why else would such a spreading of load
> occur, if not for the rq->util not becoming 0 quickly,when it is not
> running anything. I have used trace_printks to keep track of runqueue
> util of those runqueues not running anything after maybe some initial
> load and it does not become 0 till the end of the run.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-25  4:52                 ` Alex Shi
@ 2013-03-29 12:42                   ` Preeti U Murthy
  2013-03-29 13:39                     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-29 12:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 03/25/2013 10:22 AM, Alex Shi wrote:
> On 03/22/2013 01:14 PM, Preeti U Murthy wrote:
>>>>
>>>> the value get from decay_load():
>>>>  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
>>>> in decay_load it is possible to be set zero.
>> Yes you are right, it is possible to be set to 0, but after a very long
>> time, to be more precise, nearly 2 seconds. If you look at decay_load(),
>> if the period between last update and now has crossed (32*63),only then
>> will the runnable_avg_sum become 0, else it will simply decay.
>>
>> This means that for nearly 2seconds,consolidation of loads may not be
>> possible even after the runqueues have finished executing tasks running
>> on them.
> 
> Look into the decay_load(), since the LOAD_AVG_MAX is about 47742, so
> after 16 * 32ms, the maximum avg sum will be decay to zero. 2^16 = 65536
> 
> Yes, compare to accumulate time 345ms, the decay is not symmetry, and
> not precise, seems it has space to tune well. But it is acceptable now.
>>
>> The exact experiment that I performed was running ebizzy, with just two
>> threads. My setup was 2 socket,2 cores each,4 threads each core. So a 16
>> logical cpu machine.When I begin running ebizzy with balance policy, the
>> 2 threads of ebizzy are found one on each socket, while I would expect
>> them to be on the same socket. All other cpus, except the ones running
>> ebizzy threads are idle and not running anything on either socket.
>> I am not running any other processes.
> 
> did you try the simplest benchmark: while true; do :; done

Yeah, I tried out this while true; do :; done benchmark on a vm which
emulated 2 sockets, 2 cores per socket and 2 threads per core.
I ran two instances of this loop with the balance policy on, and it was
found that one instance was running on each socket, rather than both
instances getting consolidated on one socket.

But when I apply the change where we do not consider rq->util if the rq
has no nr_running, the two instances of the above benchmark get
consolidated onto one socket.


> I am writing the v6 version which include rt_util etc. you may test on
> it after I send out. :)

Sure will do so.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-29 12:42                   ` Preeti U Murthy
@ 2013-03-29 13:39                     ` Alex Shi
  2013-03-30 11:25                       ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-29 13:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/29/2013 08:42 PM, Preeti U Murthy wrote:
>> > did you try the simplest benchmark: while true; do :; done
> Yeah I tried out this while true; do :; done benchmark on a vm which ran

Thanks a lot for trying!

What do you mean by 'vm'? A virtual machine?

> on 2 socket, 2 cores each socket and 2 threads each core emulation.
> I ran two instances of this loop with balance policy on, and it was
> found that there was one instance running on each socket, rather than
> both instances getting consolidated on one socket.
> 
> But when I apply the change where we do not consider rq->util if it has
> no nr_running on the rq,the two instances of the above benchmark get
> consolidated onto one socket.
> 
> 

I don't know much about virtual machines; I guess unstable VCPU-to-CPU
pinning keeps rq->util large? Did you try pinning the VCPUs to physical
CPUs?

I still give rq->util weight even when nr_running is 0, because some
transitory tasks may have been active on the cpu but just missed at the
balancing point.

I am just wondering whether ignoring rq->util when nr_running = 0 is
really the root cause, if your finding is only on a VM without fixed
VCPU-to-CPU pinning.


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-29 13:39                     ` Alex Shi
@ 2013-03-30 11:25                       ` Preeti U Murthy
  2013-03-30 14:04                         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-30 11:25 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/29/2013 07:09 PM, Alex Shi wrote:
> On 03/29/2013 08:42 PM, Preeti U Murthy wrote:
>>>> did you try the simplest benchmark: while true; do :; done
>> Yeah I tried out this while true; do :; done benchmark on a vm which ran
> 
> Thanks a lot for trying!
> 
> What's do you mean 'vm'? Virtual machine?

Yes.

> 
>> on 2 socket, 2 cores each socket and 2 threads each core emulation.
>> I ran two instances of this loop with balance policy on, and it was
>> found that there was one instance running on each socket, rather than
>> both instances getting consolidated on one socket.
>>
>> But when I apply the change where we do not consider rq->util if it has
>> no nr_running on the rq,the two instances of the above benchmark get
>> consolidated onto one socket.
>>
>>
> 
> I don't know much about virtual machines; could the unstable
> VCPU-to-CPU mapping be keeping rq->util large? Did you try pinning the
> VCPUs to physical CPUs?

No, I hadn't done any VCPU-to-CPU pinning, but then why did the
situation change so drastically towards consolidating the load once the
rq->util of the runqueues with 0 tasks on them was no longer counted as
part of sgs->utils?

> 
> I still give rq->util weight even when nr_running is 0, because some
> transitory tasks may have been active on the cpu but just missed at the
> balancing point.
> 
> I am just wondering whether forgetting rq->util when nr_running = 0 is
> the real root cause, if your finding is only on a VM without fixed
> VCPU-to-CPU pinning.

I see the same situation on a physical machine too, a 2 socket, 4 core
machine. In fact, using trace_printks in the load balancing path, I
found that the reason the load was not getting consolidated onto one
socket was that the rq->util of a runqueue with no processes on it had
not decayed to 0, which made the balancer consider the socket
overloaded and rule out power aware balancing.
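
The instrumentation was roughly of the following form (only a sketch,
not the exact lines I used, and it assumes the rq->util field added by
your "log the cpu utilization at rq" patch is an unsigned int):

/*
 * Called from the power-aware balance path to dump each cpu's
 * nr_running together with its possibly stale rq->util.
 */
static void dump_group_util(struct sched_group *group)
{
	int cpu;

	for_each_cpu(cpu, sched_group_cpus(group)) {
		struct rq *rq = cpu_rq(cpu);

		trace_printk("cpu%d: nr_running=%u util=%u\n",
			     cpu, rq->nr_running, rq->util);
	}
}

It was this output that showed a runqueue with nr_running=0 still
carrying a non-zero util at the balancing point.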

Regards
Preeti U Murthy


> 
> 



* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-30 11:25                       ` Preeti U Murthy
@ 2013-03-30 14:04                         ` Alex Shi
  2013-03-30 15:31                           ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-30 14:04 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/30/2013 07:25 PM, Preeti U Murthy wrote:
>> > I still give rq->util weight even when nr_running is 0, because some
>> > transitory tasks may have been active on the cpu but just missed at the
>> > balancing point.
>> > 
>> > I am just wondering whether forgetting rq->util when nr_running = 0 is
>> > the real root cause, if your finding is only on a VM without fixed
>> > VCPU-to-CPU pinning.
> I see the same situation on a physical machine too, a 2 socket, 4 core
> machine. In fact, using trace_printks in the load balancing path, I
> found that the reason the load was not getting consolidated onto one
> socket was that the rq->util of a runqueue with no processes on it had
> not decayed to 0, which made the balancer consider the socket
> overloaded and rule out power aware balancing.

Considering this situation, we may stop accounting rq->util when
nr_running is zero. Tasks will be packed a bit more compactly, but
anyway, that is what the powersaving policy is for.

-- 
Thanks
    Alex


* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-30 14:04                         ` Alex Shi
@ 2013-03-30 15:31                           ` Preeti U Murthy
  0 siblings, 0 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-30 15:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

On 03/30/2013 07:34 PM, Alex Shi wrote:
> On 03/30/2013 07:25 PM, Preeti U Murthy wrote:
>>>> I still give rq->util weight even when nr_running is 0, because some
>>>> transitory tasks may have been active on the cpu but just missed at the
>>>> balancing point.
>>>>
>>>> I am just wondering whether forgetting rq->util when nr_running = 0 is
>>>> the real root cause, if your finding is only on a VM without fixed
>>>> VCPU-to-CPU pinning.
>> I see the same situation on a physical machine too, a 2 socket, 4 core
>> machine. In fact, using trace_printks in the load balancing path, I
>> found that the reason the load was not getting consolidated onto one
>> socket was that the rq->util of a runqueue with no processes on it had
>> not decayed to 0, which made the balancer consider the socket
>> overloaded and rule out power aware balancing.
> 
> Considering this situation, we may stop accounting rq->util when
> nr_running is zero. Tasks will be packed a bit more compactly, but
> anyway, that is what the powersaving policy is for.
> 
True, the tasks will be packed a bit more compactly, but we can expect
the behaviour of your patchset of *defaulting to the performance policy
when overloaded* to come to the rescue in such a situation.
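
What I am counting on is something along the lines of the sketch below:
if any group in the domain looks overloaded, give up on power-aware
packing and let the normal performance balancing take over. This is
only the idea; group_is_overloaded() is a hypothetical helper here, not
your actual code:

/*
 * Sketch of the idea only: walk the domain's groups and fall back to
 * performance balancing as soon as one of them looks overloaded.
 */
static bool want_power_balance(struct sched_domain *sd)
{
	struct sched_group *group = sd->groups;

	do {
		if (group_is_overloaded(group))
			return false;	/* busy: balance for performance */
		group = group->next;
	} while (group != sd->groups);

	return true;			/* idle enough: pack for power */
}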

Regards
Preeti U Murthy



Thread overview: 90+ messages
2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
2013-02-18  8:28   ` Joonsoo Kim
2013-02-18  9:16     ` Alex Shi
2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
2013-02-20  6:20   ` Alex Shi
2013-02-24 10:57     ` Preeti U Murthy
2013-02-25  6:00       ` Alex Shi
2013-02-28  7:03         ` Preeti U Murthy
2013-02-25  7:12       ` Alex Shi
2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
2013-02-20  9:37   ` Ingo Molnar
2013-02-20 13:40     ` Alex Shi
2013-02-20 15:41       ` Ingo Molnar
2013-02-21  1:43         ` Alex Shi
2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
2013-02-20  9:30   ` Peter Zijlstra
2013-02-20 12:09     ` Preeti U Murthy
2013-02-20 13:34       ` Peter Zijlstra
2013-02-20 14:36         ` Alex Shi
2013-02-20 14:33     ` Alex Shi
2013-02-20 15:20       ` Peter Zijlstra
2013-02-21  1:35         ` Alex Shi
2013-02-20 15:22       ` Peter Zijlstra
2013-02-25  2:26         ` Alex Shi
2013-03-22  8:49         ` Alex Shi
2013-02-20 12:19   ` Preeti U Murthy
2013-02-20 12:39     ` Alex Shi
2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
2013-02-20  9:38   ` Peter Zijlstra
2013-02-20 12:27     ` Alex Shi
2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
2013-02-20  9:42   ` Peter Zijlstra
2013-02-20 12:09     ` Alex Shi
2013-02-20 13:36       ` Peter Zijlstra
2013-02-20 14:23         ` Alex Shi
2013-02-21 13:33           ` Peter Zijlstra
2013-02-21 14:40             ` Alex Shi
2013-02-22  8:54               ` Peter Zijlstra
2013-02-24  9:27                 ` Alex Shi
2013-02-24  9:49                   ` Preeti U Murthy
2013-02-24 11:55                     ` Alex Shi
2013-02-24 17:51                   ` Preeti U Murthy
2013-02-25  2:23                     ` Alex Shi
2013-02-25  3:23                       ` Mike Galbraith
2013-02-25  9:53                         ` Alex Shi
2013-02-25 10:30                           ` Mike Galbraith
2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
2013-02-18  8:44   ` Joonsoo Kim
2013-02-18  8:56     ` Alex Shi
2013-02-20  5:55       ` Alex Shi
2013-02-20  7:40         ` Mike Galbraith
2013-02-20  8:11           ` Alex Shi
2013-02-20  8:43             ` Mike Galbraith
2013-02-20  8:54               ` Alex Shi
2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
2013-02-20  9:48   ` Peter Zijlstra
2013-02-20 12:04     ` Alex Shi
2013-02-20 13:37       ` Peter Zijlstra
2013-02-20 13:48         ` Peter Zijlstra
2013-02-20 14:08           ` Alex Shi
2013-02-20 13:52         ` Alex Shi
2013-02-20 12:12   ` Borislav Petkov
2013-02-20 14:20     ` Alex Shi
2013-02-20 15:22       ` Borislav Petkov
2013-02-21  1:32         ` Alex Shi
2013-02-21  9:42           ` Borislav Petkov
2013-02-21 14:52             ` Alex Shi
2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
2013-03-20  4:57   ` Preeti U Murthy
2013-03-21  7:43     ` Alex Shi
2013-03-21  8:41       ` Preeti U Murthy
2013-03-21  9:27         ` Alex Shi
2013-03-21 10:27           ` Preeti U Murthy
2013-03-22  1:30             ` Alex Shi
2013-03-22  5:14               ` Preeti U Murthy
2013-03-25  4:52                 ` Alex Shi
2013-03-29 12:42                   ` Preeti U Murthy
2013-03-29 13:39                     ` Alex Shi
2013-03-30 11:25                       ` Preeti U Murthy
2013-03-30 14:04                         ` Alex Shi
2013-03-30 15:31                           ` Preeti U Murthy
2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
2013-02-19 12:08 ` Paul Turner
