* [patch v5 0/15] power aware scheduling
@ 2013-02-18  5:07 Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
                   ` (16 more replies)
  0 siblings, 17 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Since the simplification of fork/exec/wake balancing drew a lot of debate,
I removed that part from the patch set.

This patch set implements the rough power aware scheduling
proposal: https://lkml.org/lkml/2012/8/13/139.
It defines 2 new power aware policies, 'balance' and 'powersaving', and
then tries to pack tasks at each sched group level according to the
chosen policy. That can save considerable power when the number of tasks
in the system is no more than the number of logical CPUs.

As mentioned in the power aware scheduling proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched groups will reduce cpu power consumption

The first assumption makes the performance policy take over scheduling
whenever any group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.

Like sched numa, power aware scheduling is also a kind of cpu locality
oriented scheduling, so it is naturally compatible with sched numa.

Since the patch set packs tasks into fewer groups as intended, I just show
some performance/power testing data here:
=========================================
$for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT; the data is avg Watts
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and 68
Watts, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

bltk-game with openarena, the data is avg Watts
            powersaving     balance         performance
wsm laptop  22.9            23.8            24.4
snb laptop  20.2            20.5            20.7

A benchmark where the task count keeps fluctuating: 'make -j x vmlinux'
on my SNB EP 2-socket machine with 8 cores * HT:

         powersaving	          balance	         performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

data format is: 175.603 /417 13
	175.603: average Watts
	417: seconds (compile time)
	13: scaled performance/power = 1000000 / seconds / watts
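
For example, the x = 1 powersaving entry works out to
1000000 / 417 / 175.603 ~= 13.7, which matches the reported value of 13
(the table apparently uses integer truncation).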

Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving
and balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On an NHM EP box
         powersaving               balance               performance
x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76

On a 2-socket SNB EP box:
         powersaving               balance               performance
x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44

data format is: 166.516 /88 68
        166.516: average Watts
        88: seconds (compress time)
        68: scaled performance/power = 1000000 / time / power

Some performance testing results:
---------------------------------

Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
hackbench, fileio-cfq of sysbench, dbench, aiostress, multi-threaded
loopback netperf, on my core2, nhm, wsm and snb platforms. No clear
performance change was found with the 'performance' policy.

Testing the balance/powersaving policies with the above benchmarks:
a, specjbb2005 drops 5~7% under both policies, with either openjdk or jrockit.
b, hackbench drops 30+% with the powersaving policy on snb 4-socket platforms.
The others show no clear change.

Test results from Mike Galbraith:
--------------------------------
With aim7 compute on a 4 node 40 core box, I see stable throughput
improvement at tasks = nr_cores and below with balance and powersaving.

         3.8.0-performance   3.8.0-balance      3.8.0-powersaving
Tasks    jobs/min/task       jobs/min/task      jobs/min/task
    1         432.8571       433.4764      	433.1665
    5         480.1902       510.9612      	497.5369
   10         429.1785       533.4507      	518.3918
   20         424.3697       529.7203      	528.7958
   40         419.0871       500.8264      	517.0648

No deltas after that.  There were also no deltas between a patched kernel
using the performance policy and virgin source.


Changelog:
V5 change:
a, change sched_policy to sched_balance_policy
b, split fork/exec/wake power balancing into 3 patches and refresh
commit logs
c, other minor cleanups

V4 change:
a, fix a few bugs and clean up code according to feedback from Morten
Rasmussen, Mike Galbraith and Namhyung Kim. Thanks!
b, take Morten Rasmussen's suggestion to use different criteria for
different policies in transitory task packing.
c, shorter latency in power aware scheduling.

V3 change:
a, engage nr_running and utilization in periodic power balancing.
b, try packing small exec/wake tasks on a running cpu instead of an idle cpu.

V2 change:
a, add lazy power scheduling to deal with kbuild-like benchmarks.


Thanks for the comments/suggestions from PeterZ, Linus Torvalds, Andrew
Morton, Ingo, Arjan van de Ven, Borislav Petkov, PJT, Namhyung Kim, Mike
Galbraith, Greg, Preeti, Morten Rasmussen and others.

Thanks to Fengguang's 0-day kbuild system for testing this patch set.

Any more comments are appreciated!

-- Thanks Alex


[patch v5 01/15] sched: set initial value for runnable avg of sched entities
[patch v5 02/15] sched: set initial load avg of new forked task
[patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
[patch v5 04/15] sched: add sched balance policies in kernel
[patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection
[patch v5 06/15] sched: log the cpu utilization at rq
[patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
[patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
[patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
[patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
[patch v5 11/15] sched: add power/performance balance allow flag
[patch v5 12/15] sched: pull all tasks from source group
[patch v5 13/15] sched: no balance for prefer_sibling in power scheduling
[patch v5 14/15] sched: power aware load balance
[patch v5 15/15] sched: lazy power balance


* [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  8:28   ` Joonsoo Kim
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

We need to initialize se.avg.{decay_count, load_avg_contrib} to zero
after a new task is forked.
Otherwise the stale values left in those variables make a mess when the
new task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg
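
A minimal sketch of why the stale values matter (the actual statement is
in enqueue_entity_load_avg(), visible in the next patch's context; the
snippet below is illustrative only):

	/* a leftover load_avg_contrib pollutes the runqueue's runnable
	 * load at the very first enqueue of the new task:
	 */
	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;

and a stale decay_count can be misread as a pending migration decay.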

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1743746 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
+	p->se.avg.load_avg_contrib = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
-- 
1.7.12



* [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  6:20   ` Alex Shi
  2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

A new task has no runnable sum at the time it first becomes runnable,
so its runnable load is zero. If we use runnable load in balancing, that
makes burst forking pick only a few idle cpus to place the new tasks on.

Set the initial load avg of a newly forked task to its load weight to
resolve this issue.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  2 +-
 kernel/sched/fair.c   | 11 +++++++++--
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d211247..f283d3d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1069,6 +1069,7 @@ struct sched_domain;
 #else
 #define ENQUEUE_WAKING		0
 #endif
+#define ENQUEUE_NEWTASK		8
 
 #define DEQUEUE_SLEEP		1
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1743746..7292965 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1706,7 +1706,7 @@ void wake_up_new_task(struct task_struct *p)
 #endif
 
 	rq = __task_rq_lock(p);
-	activate_task(rq, p, 0);
+	activate_task(rq, p, ENQUEUE_NEWTASK);
 	p->on_rq = 1;
 	trace_sched_wakeup_new(p, true);
 	check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..171790c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1503,8 +1503,9 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
-						  int wakeup)
+						  int flags)
 {
+	int wakeup = flags & ENQUEUE_WAKEUP;
 	/*
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
@@ -1538,6 +1539,12 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 		update_entity_load_avg(se, 0);
 	}
 
+	/*
+	 * Set the initial load avg of a new task to its load weight,
+	 * to avoid a burst of forks making a few cpus too heavy.
+	 */
+	if (flags & ENQUEUE_NEWTASK)
+		se->avg.load_avg_contrib = se->load.weight;
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
@@ -1701,7 +1708,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	enqueue_entity_load_avg(cfs_rq, se, flags & ENQUEUE_WAKEUP);
+	enqueue_entity_load_avg(cfs_rq, se, flags);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
 
-- 
1.7.12



* [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking"
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Remove the CONFIG_FAIR_GROUP_SCHED guard around the runnable load
tracking code, so that the runnable load variables are available on SMP.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  8 +-------
 kernel/sched/core.c   |  7 +------
 kernel/sched/fair.c   | 13 ++-----------
 kernel/sched/sched.h  |  9 +--------
 4 files changed, 5 insertions(+), 32 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f283d3d..66b05e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1195,13 +1195,7 @@ struct sched_entity {
 	/* rq "owned" by this entity/group: */
 	struct cfs_rq		*my_q;
 #endif
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
-	/* Per-entity load-tracking */
+#ifdef CONFIG_SMP
 	struct sched_avg	avg;
 #endif
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7292965..0bd9d5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1551,12 +1551,7 @@ static void __sched_fork(struct task_struct *p)
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
 	p->se.avg.decay_count = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 171790c..350eb8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1109,8 +1109,7 @@ static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-/* Only depends on SMP, FAIR_GROUP_SCHED may be removed when useful in lb */
-#if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
+#ifdef CONFIG_SMP
 /*
  * We choose a half-life close to 1 scheduling period.
  * Note: The tables below are dependent on this value.
@@ -3410,12 +3409,6 @@ unlock:
 }
 
 /*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous cpu.  However, the caller only guarantees p->pi_lock is held; no
@@ -3438,7 +3431,6 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
 	}
 }
-#endif
 #endif /* CONFIG_SMP */
 
 static unsigned long
@@ -6130,9 +6122,8 @@ const struct sched_class fair_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	.migrate_task_rq	= migrate_task_rq_fair,
-#endif
+
 	.rq_online		= rq_online_fair,
 	.rq_offline		= rq_offline_fair,
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..ae3511e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -225,12 +225,6 @@ struct cfs_rq {
 #endif
 
 #ifdef CONFIG_SMP
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
-#ifdef CONFIG_FAIR_GROUP_SCHED
 	/*
 	 * CFS Load tracking
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
@@ -240,8 +234,7 @@ struct cfs_rq {
 	u64 runnable_load_avg, blocked_load_avg;
 	atomic64_t decay_counter, removed_load;
 	u64 last_decay;
-#endif /* CONFIG_FAIR_GROUP_SCHED */
-/* These always depend on CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	u32 tg_runnable_contrib;
 	u64 tg_load_contrib;
-- 
1.7.12



* [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (2 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:37   ` Ingo Molnar
  2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

The current scheduler behaviour only considers system performance, so it
tries to spread tasks over more cpu sockets and cpu cores.

To add power awareness, the patch set introduces 2 new kinds of scheduler
policy: powersaving and balance. They use the runnable load utilization
in scheduler balancing. The current scheduling behaviour is kept as the
performance policy.

performance: the current scheduling behaviour, try to spread tasks
                on more CPU sockets or cores. Performance oriented.
powersaving: pack tasks into few sched groups until all LCPUs in the
                group are busy. Power oriented.
balance    : pack tasks into few sched groups until group_capacity
                CPUs are busy. A balance between performance and
		powersaving.

The incoming patches will enable powersaving/balance scheduling in CFS.
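
A minimal illustration of how the later patches consult the knob (patch
09 adds the real check in get_sd_sched_balance_policy()); only the
performance short-circuit is shown:

	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
		return SCHED_POLICY_PERFORMANCE;	/* keep current behaviour */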

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c  | 3 +++
 kernel/sched/sched.h | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350eb8d..2f98ffb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6105,6 +6105,9 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 	return rr_interval;
 }
 
+/* The default scheduler policy is 'performance'. */
+int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+
 /*
  * All the scheduling class methods:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae3511e..7a19792 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -8,6 +8,12 @@
 
 extern __read_mostly int scheduler_running;
 
+#define SCHED_POLICY_PERFORMANCE	(0x1)
+#define SCHED_POLICY_POWERSAVING	(0x2)
+#define SCHED_POLICY_BALANCE		(0x4)
+
+extern int __read_mostly sched_balance_policy;
+
 /*
  * Convert user-nice values [ -20 ... 0 ... 19 ]
  * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
-- 
1.7.12



* [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (3 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch adds the power aware scheduler knob to sysfs:

$cat /sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
performance powersaving balance
$cat /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
powersaving

This means the sched balance policy currently in use is 'powersaving'.

The user can change the policy with 'echo':
 echo performance > /sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 26 ++++++++
 kernel/sched/fair.c                                | 73 ++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 6943133..3283a86 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -53,6 +53,32 @@ Description:	Dynamic addition and removal of CPU's.  This is not hotplug
 		the system.  Information writtento the file to remove CPU's
 		is architecture specific.
 
+What:		/sys/devices/system/cpu/sched_balance_policy/current_sched_balance_policy
+		/sys/devices/system/cpu/sched_balance_policy/available_sched_balance_policy
+Date:		Oct 2012
+Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:	CFS scheduler policy showing and setting interface.
+
+		available_sched_balance_policy shows there are 3 kinds of
+		policies:
+			performance, balance and powersaving.
+		current_sched_balance_policy shows the current scheduler policy.
+		Users can change the policy by writing to it.
+
+		The policy decides how the CFS scheduler distributes tasks
+		onto the different CPU units.
+
+		performance: try to spread tasks onto more CPU sockets,
+		more CPU cores. Performance oriented.
+
+		powersaving: try to pack tasks onto the same core or same CPU
+		until all LCPUs in the core or CPU socket are busy.
+		Powersaving oriented.
+
+		balance:     try to pack tasks onto the same core or same CPU
+		until the fully powered CPUs are busy.
+		A balance between performance and powersaving.
+
 What:		/sys/devices/system/cpu/cpu#/node
 Date:		October 2009
 Contact:	Linux memory management mailing list <linux-mm@kvack.org>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f98ffb..fcdb21f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6108,6 +6108,79 @@ static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task
 /* The default scheduler policy is 'performance'. */
 int __read_mostly sched_balance_policy = SCHED_POLICY_PERFORMANCE;
 
+#ifdef CONFIG_SYSFS
+static ssize_t show_available_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "performance balance powersaving\n");
+}
+
+static ssize_t show_current_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return sprintf(buf, "performance\n");
+	else if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		return sprintf(buf, "powersaving\n");
+	else if (sched_balance_policy == SCHED_POLICY_BALANCE)
+		return sprintf(buf, "balance\n");
+	return 0;
+}
+
+static ssize_t set_sched_balance_policy(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	unsigned int ret = -EINVAL;
+	char    str_policy[16];
+
+	ret = sscanf(buf, "%15s", str_policy);
+	if (ret != 1)
+		return -EINVAL;
+
+	if (!strcmp(str_policy, "performance"))
+		sched_balance_policy = SCHED_POLICY_PERFORMANCE;
+	else if (!strcmp(str_policy, "powersaving"))
+		sched_balance_policy = SCHED_POLICY_POWERSAVING;
+	else if (!strcmp(str_policy, "balance"))
+		sched_balance_policy = SCHED_POLICY_BALANCE;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Sysfs setup bits:
+ */
+static DEVICE_ATTR(current_sched_balance_policy, 0644,
+		show_current_sched_balance_policy, set_sched_balance_policy);
+
+static DEVICE_ATTR(available_sched_balance_policy, 0444,
+		show_available_sched_balance_policy, NULL);
+
+static struct attribute *sched_balance_policy_default_attrs[] = {
+	&dev_attr_current_sched_balance_policy.attr,
+	&dev_attr_available_sched_balance_policy.attr,
+	NULL
+};
+static struct attribute_group sched_balance_policy_attr_group = {
+	.attrs = sched_balance_policy_default_attrs,
+	.name = "sched_balance_policy",
+};
+
+int __init create_sysfs_sched_balance_policy_group(struct device *dev)
+{
+	return sysfs_create_group(&dev->kobj, &sched_balance_policy_attr_group);
+}
+
+static int __init sched_balance_policy_sysfs_init(void)
+{
+	return create_sysfs_sched_balance_policy_group(cpu_subsys.dev_root);
+}
+
+core_initcall(sched_balance_policy_sysfs_init);
+#endif /* CONFIG_SYSFS */
+
 /*
  * All the scheduling class methods:
  */
-- 
1.7.12



* [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (4 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:19   ` Preeti U Murthy
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

The cpu utilization measures how busy a cpu is:
        util = cpu_rq(cpu)->avg.runnable_avg_sum
                / cpu_rq(cpu)->avg.runnable_avg_period;

Since util is never more than 1, we use its percentage value in later
calculations, and define FULL_UTIL as 100%.
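
A made-up example: a cpu whose rq has accumulated runnable_avg_sum =
23000 over a runnable_avg_period = 46000 gets util = 23000 * 100 / 46000
= 50, i.e. 50% busy; a fully busy cpu reports FULL_UTIL (100).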

In the later power aware scheduling, we care about how busy the cpu is,
not how much weight its load has. Power consumption is more closely
related to cpu busy time than to load weight.

BTW, rq->util can be used for other purposes as well if needed, not only
power scheduling.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/debug.c | 1 +
 kernel/sched/fair.c  | 4 ++++
 kernel/sched/sched.h | 4 ++++
 3 files changed, 9 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 7ae4c4c..d220354 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -318,6 +318,7 @@ do {									\
 
 	P(ttwu_count);
 	P(ttwu_local);
+	P(util);
 
 #undef P
 #undef P64
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcdb21f..b9a34ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
+	u32 period;
 	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+
+	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
+	rq->util = rq->avg.runnable_avg_sum * 100 / period;
 }
 
 /* Add the load generated by se into cfs_rq's child load-average */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a19792..ac1e107 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
 
 #endif /* CONFIG_SMP */
 
+/* the percentage full cpu utilization */
+#define FULL_UTIL	100
+
 /*
  * This is the main, per-CPU runqueue data structure.
  *
@@ -481,6 +484,7 @@ struct rq {
 #endif
 
 	struct sched_avg avg;
+	unsigned int util;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.7.12



* [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (5 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:38   ` Peter Zijlstra
  2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

For power aware balancing, we care about the sched domain/group's
utilization more than its load weight. So add:
sd_lb_stats.sd_utils and sg_lb_stats.group_utils.

We also want to know the sd capacity, so add:
sd_lb_stats.sd_capacity.

And we want to know which group is busiest but still has free time to
handle more tasks, so add:
sd_lb_stats.group_leader.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9a34ab..32b8a7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4214,6 +4214,11 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
 };
 
 /*
@@ -4229,6 +4234,7 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
 };
 
 /**
-- 
1.7.12



* [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (6 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

Power aware fork/exec/wake balancing needs both of these structs in the
incoming patches, so move them ahead of select_task_rq_fair().

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 101 ++++++++++++++++++++++++++--------------------------
 1 file changed, 51 insertions(+), 50 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 32b8a7b..287582b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3312,6 +3312,57 @@ done:
 }
 
 /*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ *		during load balancing.
+ */
+struct sd_lb_stats {
+	struct sched_group *busiest; /* Busiest group in this sd */
+	struct sched_group *this;  /* Local group in this sd */
+	unsigned long total_load;  /* Total load of all groups in sd */
+	unsigned long total_pwr;   /*	Total power of all groups in sd */
+	unsigned long avg_load;	   /* Average load across all groups in sd */
+
+	/** Statistics of this group */
+	unsigned long this_load;
+	unsigned long this_load_per_task;
+	unsigned long this_nr_running;
+	unsigned int  this_has_capacity;
+	unsigned int  this_idle_cpus;
+
+	/* Statistics of the busiest group */
+	unsigned int  busiest_idle_cpus;
+	unsigned long max_load;
+	unsigned long busiest_load_per_task;
+	unsigned long busiest_nr_running;
+	unsigned long busiest_group_capacity;
+	unsigned int  busiest_has_capacity;
+	unsigned int  busiest_group_weight;
+
+	int group_imb; /* Is there imbalance in this sd */
+
+	/* Variables of power aware scheduling */
+	unsigned int  sd_utils;	/* sum utilizations of this domain */
+	unsigned long sd_capacity;	/* capacity of this domain */
+	struct sched_group *group_leader; /* Group which relieves group_min */
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load_balancing
+ */
+struct sg_lb_stats {
+	unsigned long avg_load; /*Avg load across the CPUs of the group */
+	unsigned long group_load; /* Total load over the CPUs of the group */
+	unsigned long sum_nr_running; /* Nr tasks running in the group */
+	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
+	unsigned long group_capacity;
+	unsigned long idle_cpus;
+	unsigned long group_weight;
+	int group_imb; /* Is there an imbalance in the group ? */
+	int group_has_capacity; /* Is there extra capacity in the group? */
+	unsigned int group_utils;	/* sum utilizations of group */
+};
+
+/*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
@@ -4186,56 +4237,6 @@ static unsigned long task_h_load(struct task_struct *p)
 #endif
 
 /********** Helpers for find_busiest_group ************************/
-/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * 		during load balancing.
- */
-struct sd_lb_stats {
-	struct sched_group *busiest; /* Busiest group in this sd */
-	struct sched_group *this;  /* Local group in this sd */
-	unsigned long total_load;  /* Total load of all groups in sd */
-	unsigned long total_pwr;   /*	Total power of all groups in sd */
-	unsigned long avg_load;	   /* Average load across all groups in sd */
-
-	/** Statistics of this group */
-	unsigned long this_load;
-	unsigned long this_load_per_task;
-	unsigned long this_nr_running;
-	unsigned long this_has_capacity;
-	unsigned int  this_idle_cpus;
-
-	/* Statistics of the busiest group */
-	unsigned int  busiest_idle_cpus;
-	unsigned long max_load;
-	unsigned long busiest_load_per_task;
-	unsigned long busiest_nr_running;
-	unsigned long busiest_group_capacity;
-	unsigned long busiest_has_capacity;
-	unsigned int  busiest_group_weight;
-
-	int group_imb; /* Is there imbalance in this sd */
-
-	/* Variables of power aware scheduling */
-	unsigned int  sd_utils;	/* sum utilizations of this domain */
-	unsigned long sd_capacity;	/* capacity of this domain */
-	struct sched_group *group_leader; /* Group which relieves group_min */
-};
-
-/*
- * sg_lb_stats - stats of a sched_group required for load_balancing
- */
-struct sg_lb_stats {
-	unsigned long avg_load; /*Avg load across the CPUs of the group */
-	unsigned long group_load; /* Total load over the CPUs of the group */
-	unsigned long sum_nr_running; /* Nr tasks running in the group */
-	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
-	unsigned long group_capacity;
-	unsigned long idle_cpus;
-	unsigned long group_weight;
-	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
-	unsigned int group_utils;	/* sum utilizations of group */
-};
 
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
-- 
1.7.12



* [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (7 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:42   ` Peter Zijlstra
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch adds power aware scheduling in fork/exec/wake. It tries to
select a cpu from the busiest group that still has spare utilization.
That saves power, since it leaves more groups idle in the system.

The trade off is an extra power aware statistics collection during group
seeking. But since the collection only happens when the power policy is
eligible, the worst case in hackbench testing drops only about 2% with
the powersaving/balance policies. No clear change for the performance
policy.

The main function in this patch is get_cpu_for_power_policy(), which
tries to get the idlest cpu of the busiest group that still has spare
utilization, if the system is using a power aware policy and such a
group exists.

I had tried to use the rq utilization in this balancing, but the
utilization needs a long time (345ms) to accumulate, which is bad for
any burst balancing. So I use the instant rq utilization -- nr_running.
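
A small worked example of the instant utilization criterion (numbers are
illustrative): under the powersaving policy a 4-LCPU group with 3 running
tasks has group_utils = 3, which is below the threshold (group_weight =
4), so the group can still take one more task; under the balance policy
the threshold is group_capacity instead.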

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 123 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 287582b..b172678 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3362,26 +3362,134 @@ struct sg_lb_stats {
 	unsigned int group_utils;	/* sum utilizations of group */
 };
 
+static inline int
+fix_small_capacity(struct sched_domain *sd, struct sched_group *group);
+
+/*
+ * Try to collect the task running number and capacity of the group.
+ */
+static void get_sg_power_stats(struct sched_group *group,
+	struct sched_domain *sd, struct sg_lb_stats *sgs)
+{
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(group)) {
+		struct rq *rq = cpu_rq(i);
+
+		sgs->group_utils += rq->nr_running;
+	}
+
+	sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
+						SCHED_POWER_SCALE);
+	if (!sgs->group_capacity)
+		sgs->group_capacity = fix_small_capacity(sd, group);
+	sgs->group_weight = group->group_weight;
+}
+
+/*
+ * Try to collect the task running number and capacity of the domain.
+ */
+static void get_sd_power_stats(struct sched_domain *sd,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	struct sched_group *group;
+	struct sg_lb_stats sgs;
+	int sd_min_delta = INT_MAX;
+	int cpu = task_cpu(p);
+
+	group = sd->groups;
+	do {
+		long g_delta;
+		unsigned long threshold;
+
+		if (!cpumask_test_cpu(cpu, sched_group_mask(group)))
+			continue;
+
+		memset(&sgs, 0, sizeof(sgs));
+		get_sg_power_stats(group, sd, &sgs);
+
+		if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+			threshold = sgs.group_weight;
+		else
+			threshold = sgs.group_capacity;
+
+		g_delta = threshold - sgs.group_utils;
+
+		if (g_delta > 0 && g_delta < sd_min_delta) {
+			sd_min_delta = g_delta;
+			sds->group_leader = group;
+		}
+
+		sds->sd_utils += sgs.group_utils;
+		sds->total_pwr += group->sgp->power;
+	} while  (group = group->next, group != sd->groups);
+
+	sds->sd_capacity = DIV_ROUND_CLOSEST(sds->total_pwr,
+						SCHED_POWER_SCALE);
+}
+
+/*
+ * Execute power policy if this domain is not full.
+ */
+static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
+	int cpu, struct task_struct *p, struct sd_lb_stats *sds)
+{
+	unsigned long threshold;
+
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE)
+		return SCHED_POLICY_PERFORMANCE;
+
+	memset(sds, 0, sizeof(*sds));
+	get_sd_power_stats(sd, p, sds);
+
+	if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sd->span_weight;
+	else
+		threshold = sds->sd_capacity;
+
+	/* still can hold one more task in this domain */
+	if (sds->sd_utils < threshold)
+		return sched_balance_policy;
+
+	return SCHED_POLICY_PERFORMANCE;
+}
+
+/*
+ * If power policy is eligible for this domain, and it has task allowed cpu.
+ * we will select CPU from this domain.
+ */
+static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
+		struct task_struct *p, struct sd_lb_stats *sds)
+{
+	int policy;
+	int new_cpu = -1;
+
+	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
+		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+
+	return new_cpu;
+}
+
 /*
- * sched_balance_self: balance the current task (running on cpu) in domains
+ * select_task_rq_fair: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
  * SD_BALANCE_EXEC.
  *
- * Balance, ie. select the least loaded group.
- *
  * Returns the target CPU number, or the same CPU if no balancing is needed.
  *
  * preempt must be disabled.
  */
 static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
 	int prev_cpu = task_cpu(p);
 	int new_cpu = cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = flags & WF_SYNC;
+	struct sd_lb_stats sds;
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -3407,11 +3515,20 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 			break;
 		}
 
-		if (tmp->flags & sd_flag)
+		if (tmp->flags & sd_flag) {
 			sd = tmp;
+
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			if (new_cpu != -1)
+				goto unlock;
+		}
 	}
 
 	if (affine_sd) {
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		if (new_cpu != -1)
+			goto unlock;
+
 		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
 			prev_cpu = cpu;
 
-- 
1.7.12



* [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (8 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  8:44   ` Joonsoo Kim
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

If the woken/execed task is transitory enough, it has a chance to be
packed onto a cpu which is busy but still has time to care for it.

With the powersaving policy, only tasks with a history util < 25% have a
chance to be packed; with the balance policy, only a history util < 12.5%.
If no cpu is eligible to handle it, the idlest cpu in the leader group is
used instead.
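
To see where the powersaving cutoff comes from (a rough reading of the
find_leader_cpu() check below, with utils in the percentage units of
patch 06): on an otherwise idle target cpu (rq->util ~= 0, nr_running
clamped to 1) the vacancy test reduces to FULL_UTIL - (putil << 2) > 0,
i.e. putil < 25.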

Morten Rasmussen caught a type bug and suggested using different criteria
for different policies, thanks!

Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b172678..2e8131d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3455,19 +3455,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	/* percentage of the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	struct rq *rq = cpu_rq(this_cpu);
+	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+	int vacancy, min_vacancy = INT_MAX, max_util;
+	int leader_cpu = -1;
+	int i;
+
+	if (policy == SCHED_POLICY_POWERSAVING)
+		max_util = FULL_UTIL;
+	else
+		/* maximum allowable util is 60% */
+		max_util = 60;
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+		max_util - (rq->util * nr_running + (putil << 2)) > 0)
+			return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		rq = cpu_rq(i);
+		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only light task allowed, like putil < 25% for powersaving */
+		vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int fork)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (!fork)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3518,14 +3571,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						flags & SD_BALANCE_FORK);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12



* [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (9 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-20  9:48   ` Peter Zijlstra
  2013-02-20 12:12   ` Borislav Petkov
  2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

If a sched domain is idle enough for regular power balancing, power_lb
will be set and perf_lb cleared. If a sched domain is busy, the values
are set the other way around.

If the domain is suitable for power balancing, but the balancing should
not be done by this cpu (this cpu is already idle or full), both perf_lb
and power_lb are cleared, to wait for a suitable cpu to do the power
balancing. That means no balancing at all, neither power balancing nor
performance balancing, will be done on this cpu.

The above logic is implemented by the incoming patches.
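
In table form:

	power_lb  perf_lb
	   1         0     domain idle enough: do power balancing
	   0         1     domain busy: do regular performance balancing
	   0         0     domain fits power balancing, but this cpu is
	                   idle/full: do nothing, leave it to a better cpu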

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e8131d..0047856 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4053,6 +4053,8 @@ struct lb_env {
 	unsigned int		loop;
 	unsigned int		loop_break;
 	unsigned int		loop_max;
+	int			power_lb;  /* if power balance needed */
+	int			perf_lb;   /* if performance balance needed */
 };
 
 /*
@@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
+		.power_lb	= 0,
+		.perf_lb	= 1,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
-- 
1.7.12



* [patch v5 12/15] sched: pull all tasks from source group
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (10 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

In power balancing, we want some sched groups to become completely empty
so that their CPU power can be saved. So we want to be able to move away
any task from them, even a single one.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0047856..f3abb83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5125,7 +5125,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+		if (rq->nr_running == 0 ||
+			(!env->power_lb && capacity &&
+				rq->nr_running == 1 && wl > env->imbalance))
 			continue;
 
 		/*
@@ -5229,7 +5231,8 @@ redo:
 
 	ld_moved = 0;
 	lb_iterations = 1;
-	if (busiest->nr_running > 1) {
+	if (busiest->nr_running > 1 ||
+		(busiest->nr_running == 1 && env.power_lb)) {
 		/*
 		 * Attempt to move tasks. If find_busiest_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
-- 
1.7.12



* [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (11 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

In power aware scheduling, we don't want to balance 'prefer_sibling'
groups just because the local group has capacity.
If the local group has no tasks at the time, that is exactly what power
balancing hopes for.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3abb83..ffdf35d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4782,8 +4782,12 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
+		 *
+		 * In power aware scheduling, we don't care load weight and
+		 * want not to pull tasks just because local group has capacity.
 		 */
-		if (prefer_sibling && !local_group && sds->this_has_capacity)
+		if (prefer_sibling && !local_group && sds->this_has_capacity
+				&& env->perf_lb)
 			sgs.group_capacity = min(sgs.group_capacity, 1UL);
 
 		if (local_group) {
-- 
1.7.12



* [patch v5 14/15] sched: power aware load balance
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (12 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-03-20  4:57   ` Preeti U Murthy
  2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

This patch enables power aware consideration in load balancing.

As mentioned in the power aware scheduler proposal, power aware
scheduling has 2 assumptions:
1, race to idle is helpful for power saving
2, fewer active sched_groups will reduce power consumption

The first assumption makes the performance policy take over scheduling
whenever any scheduler group is busy.
The second assumption makes power aware scheduling try to pack dispersed
tasks into fewer groups.

The enabling logic in summary:
1, Collect power aware scheduler statistics during the performance load
balance statistics collection.
2, If the balancing cpu is eligible for power load balancing, just do it
and skip performance load balancing. If the domain is suitable for power
balancing but this cpu is inappropriate (idle or full), stop both power
and performance balancing in this domain. If the performance policy is in
use or any group is busy, do performance balancing.

The above logic is mainly implemented in update_sd_lb_power_stats(). It
decides whether a domain is suitable for power aware scheduling. If so,
it fills the destination group and source group accordingly.
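
A small worked example of the thresholds used there, for the powersaving
policy on a 4-LCPU group (so threshold_util = group_weight * FULL_UTIL =
400, utilization numbers made up): a group with group_utils = 420 exceeds
threshold_util, so perf_lb is set and power balancing is dropped; a group
with group_utils = 250 still satisfies
group_utils + FULL_UTIL <= threshold_util and is therefore a candidate
group_leader that can absorb more load.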

This patch reuses some of Suresh's power saving load balance code.

A test shows the effect of the different policies:
for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done

On my SNB laptop with 4 cores * HT; the data is Watts
        powersaving     balance         performance
i = 2   40              54              54
i = 4   57              64*             68
i = 8   68              68              68

Note:
When i = 4 with the balance policy, the power may vary between 57 and 68
Watts, since the HT capacity and core capacity are both 1.

On an SNB EP machine with 2 sockets * 8 cores * HT:
        powersaving     balance         performance
i = 4   190             201             238
i = 8   205             241             268
i = 16  271             348             376

If the system runs a few long-lived tasks, using the power policies can
give a performance/power gain, e.g. the sysbench fileio randrw test with
16 threads on the SNB EP box.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 126 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ffdf35d..3b1e9a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3344,6 +3344,10 @@ struct sd_lb_stats {
 	unsigned int  sd_utils;	/* sum utilizations of this domain */
 	unsigned long sd_capacity;	/* capacity of this domain */
 	struct sched_group *group_leader; /* Group which relieves group_min */
+	struct sched_group *group_min;	/* Least loaded group in sd */
+	unsigned long min_load_per_task; /* load_per_task in group_min */
+	unsigned int  leader_util;	/* sum utilizations of group_leader */
+	unsigned int  min_util;		/* sum utilizations of group_min */
 };
 
 /*
@@ -4412,6 +4416,105 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 /**
+ * init_sd_lb_power_stats - Initialize power savings statistics for
+ * the given sched_domain, during load balancing.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics for sd.
+ */
+static inline void init_sd_lb_power_stats(struct lb_env *env,
+						struct sd_lb_stats *sds)
+{
+	if (sched_balance_policy == SCHED_POLICY_PERFORMANCE ||
+				env->idle == CPU_NOT_IDLE) {
+		env->power_lb = 0;
+		env->perf_lb = 1;
+		return;
+	}
+	env->perf_lb = 0;
+	env->power_lb = 1;
+	sds->min_util = UINT_MAX;
+	sds->leader_util = 0;
+}
+
+/**
+ * update_sd_lb_power_stats - Update the power saving stats for a
+ * sched_domain while performing load balancing.
+ *
+ * @env: The load balancing environment.
+ * @group: sched_group belonging to the sched_domain under consideration.
+ * @sds: Variable containing the statistics of the sched_domain
+ * @local_group: Does group contain the CPU for which we're performing
+ * load balancing?
+ * @sgs: Variable containing the statistics of the group.
+ */
+static inline void update_sd_lb_power_stats(struct lb_env *env,
+			struct sched_group *group, struct sd_lb_stats *sds,
+			int local_group, struct sg_lb_stats *sgs)
+{
+	unsigned long threshold, threshold_util;
+
+	if (env->perf_lb)
+		return;
+
+	if (sched_balance_policy == SCHED_POLICY_POWERSAVING)
+		threshold = sgs->group_weight;
+	else
+		threshold = sgs->group_capacity;
+	threshold_util = threshold * FULL_UTIL;
+
+	/*
+	 * If the local group is idle or full loaded
+	 * no need to do power savings balance at this domain
+	 */
+	if (local_group && (!sgs->sum_nr_running ||
+		sgs->group_utils + FULL_UTIL > threshold_util))
+		env->power_lb = 0;
+
+	/* Do performance load balance if any group overload */
+	if (sgs->group_utils > threshold_util) {
+		env->perf_lb = 1;
+		env->power_lb = 0;
+	}
+
+	/*
+	 * If a group is idle,
+	 * don't include that group in power savings calculations
+	 */
+	if (!env->power_lb || !sgs->sum_nr_running)
+		return;
+
+	/*
+	 * Calculate the group which has the least non-idle load.
+	 * This is the group from where we need to pick up the load
+	 * for saving power
+	 */
+	if ((sgs->group_utils < sds->min_util) ||
+	    (sgs->group_utils == sds->min_util &&
+	     group_first_cpu(group) > group_first_cpu(sds->group_min))) {
+		sds->group_min = group;
+		sds->min_util = sgs->group_utils;
+		sds->min_load_per_task = sgs->sum_weighted_load /
+						sgs->sum_nr_running;
+	}
+
+	/*
+	 * Calculate the group which is almost near its
+	 * capacity but still has some space to pick up some load
+	 * from other group and save more power
+	 */
+	if (sgs->group_utils + FULL_UTIL > threshold_util)
+		return;
+
+	if (sgs->group_utils > sds->leader_util ||
+	    (sgs->group_utils == sds->leader_util && sds->group_leader &&
+	     group_first_cpu(group) < group_first_cpu(sds->group_leader))) {
+		sds->group_leader = group;
+		sds->leader_util = sgs->group_utils;
+	}
+}
+
+/**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
  * @idle: The Idle status of the CPU for whose sd load_icx is obtained.
@@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_load += load;
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
+
+		/* accumulate the maximum potential util */
+		if (!nr_running)
+			nr_running = 1;
+		sgs->group_utils += rq->util * nr_running;
+
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -4758,6 +4867,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
 
+	init_sd_lb_power_stats(env, sds);
 	load_idx = get_sd_load_idx(env->sd, env->idle);
 
 	do {
@@ -4809,6 +4919,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->group_imb = sgs.group_imb;
 		}
 
+		update_sd_lb_power_stats(env, sg, sds, local_group, &sgs);
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -5026,6 +5137,19 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
+	if (!env->perf_lb && !env->power_lb)
+		return  NULL;
+
+	if (env->power_lb) {
+		if (sds.this == sds.group_leader &&
+				sds.group_leader != sds.group_min) {
+			env->imbalance = sds.min_load_per_task;
+			return sds.group_min;
+		}
+		env->power_lb = 0;
+		return NULL;
+	}
+
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
 	 * this level.
@@ -5203,8 +5327,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.idle		= idle,
 		.loop_break	= sched_nr_migrate_break,
 		.cpus		= cpus,
-		.power_lb	= 0,
-		.perf_lb	= 1,
+		.power_lb	= 1,
+		.perf_lb	= 0,
 	};
 
 	cpumask_copy(cpus, cpu_active_mask);
@@ -6282,7 +6406,6 @@ void unregister_fair_sched_group(struct task_group *tg, int cpu) { }
 
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
-
 static unsigned int get_rr_interval_fair(struct rq *rq, struct task_struct *task)
 {
 	struct sched_entity *se = &task->se;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [patch v5 15/15] sched: lazy power balance
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (13 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
@ 2013-02-18  5:07 ` Alex Shi
  2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
  2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  5:07 UTC (permalink / raw)
  To: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung, efault
  Cc: vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	alex.shi, morten.rasmussen

When the number of active tasks in a sched domain waves around the power
friendly scheduling criteria, scheduling will thrash between the power
friendly balance and the performance balance, bringing unnecessary task
migrations. The typical benchmark is 'make -j x'.

To remove this issue, introduce a u64 perf_lb_record variable to record
the performance load balance history. If there was no performance LB in
the last 32 consecutive load balances, or no LB at all for 8 times
max_interval ms, or no more than 4 performance LBs in the last 64 load
balances, then we accept a power friendly LB. Otherwise, give up this
power friendly LB chance and do nothing.
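
The bit-history part of that rule can be sketched in a few lines of
stand-alone user-space C (the masks mirror PERF_LB_HH_MASK/PERF_LB_LH_MASK
in the patch, __builtin_popcountll stands in for the kernel's hweight64,
the 8 * max_interval reset is left out, and the sample records are
invented):

#include <stdio.h>
#include <stdint.h>

#define PERF_LB_HH_MASK	0xffffffff00000000ULL	/* balances 33..64 ago */
#define PERF_LB_LH_MASK	0xffffffffULL		/* last 32 balances */

/* the record shifts left on every balance attempt; bit 0 gets set
 * whenever a performance balance was actually done */
static int lazy_power_balance_ok(uint64_t record)
{
	return __builtin_popcountll(record & PERF_LB_HH_MASK) <= 4 &&
	       !(record & PERF_LB_LH_MASK);
}

int main(void)
{
	uint64_t recent = 1ULL << 20;	/* perf LB 20 balances ago: too recent */
	uint64_t old = 1ULL << 40;	/* perf LB 40 balances ago: acceptable */

	/* prints "0 1" */
	printf("%d %d\n", lazy_power_balance_ok(recent),
			  lazy_power_balance_ok(old));
	return 0;
}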

With this patch, the worst case for power scheduling -- kbuild -- gets a
similar performance/power value among the different policies.

BTW, the lazy balance shows a performance gain when j goes up to 32.

On my SNB EP 2 sockets machine with 8 cores * HT: 'make -j x' results:

		powersaving		balance		performance
x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92

data explains: 175.603 /417 13
	175.603: average Watts
	417: seconds(compile time)
	13:  scaled performance/power = 1000000 / seconds / watts
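
(For example, the x = 1 powersaving entry works out as
1000000 / 417 / 175.603 ~= 13.7, which the table truncates to 13.)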

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 68 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 66b05e1..5051990 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -941,6 +941,7 @@ struct sched_domain {
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
+	u64	perf_lb_record;	/* performance balance record */
 
 	u64 last_update;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b1e9a6..f6ae655 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4514,6 +4514,60 @@ static inline void update_sd_lb_power_stats(struct lb_env *env,
 	}
 }
 
+#define PERF_LB_HH_MASK		0xffffffff00000000ULL
+#define PERF_LB_LH_MASK		0xffffffffULL
+
+/**
+ * need_perf_balance - Check if the performance load balance needed
+ * in the sched_domain.
+ *
+ * @env: The load balancing environment.
+ * @sds: Variable containing the statistics of the sched_domain
+ */
+static int need_perf_balance(struct lb_env *env, struct sd_lb_stats *sds)
+{
+	env->sd->perf_lb_record <<= 1;
+
+	if (env->perf_lb) {
+		env->sd->perf_lb_record |= 0x1;
+		return 1;
+	}
+
+	/*
+	 * The situation isn't eligible for performance balance. If this_cpu
+	 * is not eligible or the timing is not suitable for lazy powersaving
+	 * balance, we will stop both powersaving and performance balance.
+	 */
+	if (env->power_lb && sds->this == sds->group_leader
+			&& sds->group_leader != sds->group_min) {
+		int interval;
+
+		/* powersaving balance interval set as 8 * max_interval */
+		interval = msecs_to_jiffies(8 * env->sd->max_interval);
+		if (time_after(jiffies, env->sd->last_balance + interval))
+			env->sd->perf_lb_record = 0;
+
+		/*
+		 * A eligible timing is no performance balance in last 32
+		 * balance and performance balance is no more than 4 times
+		 * in last 64 balance, or no balance in powersaving interval
+		 * time.
+		 */
+		if ((hweight64(env->sd->perf_lb_record & PERF_LB_HH_MASK) <= 4)
+			&& !(env->sd->perf_lb_record & PERF_LB_LH_MASK)) {
+
+			env->imbalance = sds->min_load_per_task;
+			return 0;
+		}
+
+	}
+
+	/* give up this time power balancing, do nothing */
+	env->power_lb = 0;
+	sds->group_min = NULL;
+	return 0;
+}
+
 /**
  * get_sd_load_idx - Obtain the load index for a given sched domain.
  * @sd: The sched_domain whose load_idx is to be obtained.
@@ -5137,18 +5191,8 @@ find_busiest_group(struct lb_env *env, int *balance)
 	 */
 	update_sd_lb_stats(env, balance, &sds);
 
-	if (!env->perf_lb && !env->power_lb)
-		return  NULL;
-
-	if (env->power_lb) {
-		if (sds.this == sds.group_leader &&
-				sds.group_leader != sds.group_min) {
-			env->imbalance = sds.min_load_per_task;
-			return sds.group_min;
-		}
-		env->power_lb = 0;
-		return NULL;
-	}
+	if (!need_perf_balance(env, &sds))
+		return sds.group_min;
 
 	/*
 	 * this_cpu is not the appropriate cpu to perform load balancing at
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 0/15] power aware scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (14 preceding siblings ...)
  2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
@ 2013-02-18  7:44 ` Alex Shi
  2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  7:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 01:07 PM, Alex Shi wrote:
> Since the simplification of fork/exec/wake balancing has much arguments,
> I removed that part in the patch set.
> 
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.

Just reviewed the great summary of the conversation on this proposal:
http://lwn.net/Articles/512487/

One advantage of this patchset is that it is based on the general sched
domain/groups architecture, so if another hardware platform, like ARM,
has a different hardware setting, it can be represented as specific domain
flags and then be treated specifically in scheduling. The scheduling is
natural and easy to extend to that.

And I guess the big.LITTLE arch can also be represented by cpu power; if
so, the 'balance' policy just fits it, since it judges whether the sched
domain/group is full by the domain/group's capacity.

Also adding an answer for the automatic policy change feature:
---
The 'balance/powersaving' policies are automatic power friendly scheduling,
since the system will automatically bypass power scheduling when the cpu
utilisation in a sched domain is beyond the domain's cpu weight
(powersaving) or beyond the domain's capacity (balance).
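
As a rough illustration with the SMT numbers from the test data above: an
HT core counts as 2 logical cpus but has a capacity of 1, so on such a
domain powersaving keeps packing until the summed utilisation goes past
200% (two cpus' worth), while balance already gives up at 100% (the
core's capacity).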

There is no always-enabled power scheduling, since the patchset is based on
'race to idle'.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
@ 2013-02-18  8:28   ` Joonsoo Kim
  2013-02-18  9:16     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Joonsoo Kim @ 2013-02-18  8:28 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

Hello, Alex.

On Mon, Feb 18, 2013 at 01:07:28PM +0800, Alex Shi wrote:
> We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
> after a new task forked.
> Otherwise random values of above variables cause mess when do new task

I think that these are not random values. In arch_dup_task_struct(),
we do '*dst = *src', so these values come from the parent process. If we use
these values appropriately, we can anticipate the child process' load easily.
So initializing load_avg_contrib to zero is not a good idea to me.

Thanks.

> enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1743746 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,8 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
> +	p->se.avg.load_avg_contrib = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
@ 2013-02-18  8:44   ` Joonsoo Kim
  2013-02-18  8:56     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Joonsoo Kim @ 2013-02-18  8:44 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

Hello, Alex.

On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
> If the waked/execed task is transitory enough, it will has a chance to be
> packed into a cpu which is busy but still has time to care it.
> For powersaving policy, only the history util < 25% task has chance to
> be packed, and for balance policy, only histroy util < 12.5% has chance.
> If there is no cpu eligible to handle it, will use a idlest cpu in
> leader group.

After exec(), the task's behavior may be changed, and its history util may
be changed, too. So, IMHO, exec balancing by history util is not a good idea.
What do you think about it?

Thanks.

> Morten Rasmussen catch a type bug and suggest using different criteria
> for different policy, thanks!
> 
> Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 60 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b172678..2e8131d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3455,19 +3455,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
>  }
>  
>  /*
> + * find_leader_cpu - find the busiest but still has enough leisure time cpu
> + * among the cpus in group.
> + */
> +static int
> +find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
> +		int policy)
> +{
> +	/* percentage of the task's util */
> +	unsigned putil = p->se.avg.runnable_avg_sum * 100
> +				/ (p->se.avg.runnable_avg_period + 1);
> +
> +	struct rq *rq = cpu_rq(this_cpu);
> +	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +	int vacancy, min_vacancy = INT_MAX, max_util;
> +	int leader_cpu = -1;
> +	int i;
> +
> +	if (policy == SCHED_POLICY_POWERSAVING)
> +		max_util = FULL_UTIL;
> +	else
> +		/* maximum allowable util is 60% */
> +		max_util = 60;
> +
> +	/* bias toward local cpu */
> +	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
> +		max_util - (rq->util * nr_running + (putil << 2)) > 0)
> +			return this_cpu;
> +
> +	/* Traverse only the allowed CPUs */
> +	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
> +		if (i == this_cpu)
> +			continue;
> +
> +		rq = cpu_rq(i);
> +		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
> +
> +		/* only light task allowed, like putil < 25% for powersaving */
> +		vacancy = max_util - (rq->util * nr_running + (putil << 2));
> +
> +		if (vacancy > 0 && vacancy < min_vacancy) {
> +			min_vacancy = vacancy;
> +			leader_cpu = i;
> +		}
> +	}
> +	return leader_cpu;
> +}
> +
> +/*
>   * If power policy is eligible for this domain, and it has task allowed cpu.
>   * we will select CPU from this domain.
>   */
>  static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
> -		struct task_struct *p, struct sd_lb_stats *sds)
> +		struct task_struct *p, struct sd_lb_stats *sds, int fork)
>  {
>  	int policy;
>  	int new_cpu = -1;
>  
>  	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
> -	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
> -		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> -
> +	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
> +		if (!fork)
> +			new_cpu = find_leader_cpu(sds->group_leader,
> +							p, cpu, policy);
> +		/* for fork balancing and a little busy task */
> +		if (new_cpu == -1)
> +			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
> +	}
>  	return new_cpu;
>  }
>  
> @@ -3518,14 +3571,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
>  		if (tmp->flags & sd_flag) {
>  			sd = tmp;
>  
> -			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
> +			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
> +						flags & SD_BALANCE_FORK);
>  			if (new_cpu != -1)
>  				goto unlock;
>  		}
>  	}
>  
>  	if (affine_sd) {
> -		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
> +		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 0);
>  		if (new_cpu != -1)
>  			goto unlock;
>  
> -- 
> 1.7.12
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  8:44   ` Joonsoo Kim
@ 2013-02-18  8:56     ` Alex Shi
  2013-02-20  5:55       ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-18  8:56 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:44 PM, Joonsoo Kim wrote:
> Hello, Alex.
> 
> On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
>> If the waked/execed task is transitory enough, it will has a chance to be
>> packed into a cpu which is busy but still has time to care it.
>> For powersaving policy, only the history util < 25% task has chance to
>> be packed, and for balance policy, only histroy util < 12.5% has chance.
>> If there is no cpu eligible to handle it, will use a idlest cpu in
>> leader group.
> 
> After exec(), task's behavior may be changed, and history util may be
> changed, too. So, IMHO, exec balancing by history util is not good idea.
> How do you think about it?
> 

Sounds like it makes sense. Are there any objections?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 01/15] sched: set initial value for runnable avg of sched entities.
  2013-02-18  8:28   ` Joonsoo Kim
@ 2013-02-18  9:16     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-18  9:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:28 PM, Joonsoo Kim wrote:
> On Mon, Feb 18, 2013 at 01:07:28PM +0800, Alex Shi wrote:
>> > We need initialize the se.avg.{decay_count, load_avg_contrib} to zero
>> > after a new task forked.
>> > Otherwise random values of above variables cause mess when do new task
> I think that these are not random values. In arch_dup_task_struct(),
> we do '*dst = *src', so, these values come from parent process. If we use
> these value appropriately, we can anticipate child process' load easily.
> So to initialize the load_avg_contrib to zero is not good idea for me.

Um, for a new forked task, calling them random values is appropriate,
since an uncertain value of decay_count makes for random behaviour in
enqueue_entity_load_avg().
And many comments said a new forked task need not follow its parent's
utilisation, like what you claimed for exec.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 0/15] power aware scheduling
  2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
                   ` (15 preceding siblings ...)
  2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
@ 2013-02-19 12:08 ` Paul Turner
  16 siblings, 0 replies; 90+ messages in thread
From: Paul Turner @ 2013-02-19 12:08 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

FYI I'm currently out of the country in New Zealand and won't be able
to take a proper look at this until the beginning of March.

On Mon, Feb 18, 2013 at 6:07 PM, Alex Shi <alex.shi@intel.com> wrote:
> Since the simplification of fork/exec/wake balancing has much arguments,
> I removed that part in the patch set.
>
> This patch set implement/consummate the rough power aware scheduling
> proposal: https://lkml.org/lkml/2012/8/13/139.
> It defines 2 new power aware policy 'balance' and 'powersaving', then
> try to pack tasks on each sched groups level according the different
> scheduler policy. That can save much power when task number in system
> is no more than LCPU number.
>
> As mentioned in the power aware scheduling proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, less active sched groups will reduce cpu power consumption
>
> The first assumption make performance policy take over scheduling when
> any group is busy.
> The second assumption make power aware scheduling try to pack disperse
> tasks into fewer groups.
>
> Like sched numa, power aware scheduling is also a kind of cpu locality
> oriented scheduling, so it is natural compatible with sched numa.
>
> Since the patch can perfect pack tasks into fewer groups, I just show
> some performance/power testing data here:
> =========================================
> $for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done
>
> On my SNB laptop with 4core* HT: the data is avg Watts
>         powersaving     balance         performance
> i = 2   40              54              54
> i = 4   57              64*             68
> i = 8   68              68              68
>
> Note:
> When i = 4 with balance policy, the power may change in 57~68Watt,
> since the HT capacity and core capacity are both 1.
>
> on SNB EP machine with 2 sockets * 8 cores * HT:
>         powersaving     balance         performance
> i = 4   190             201             238
> i = 8   205             241             268
> i = 16  271             348             376
>
> bltk-game with openarena, the data is avg Watts
>             powersaving     balance         performance
> wsm laptop  22.9             23.8           24.4
> snb laptop  20.2             20.5           20.7
>
> tasks number keep waving benchmark, 'make -j x vmlinux'
> on my SNB EP 2 sockets machine with 8 cores * HT:
>
>          powersaving              balance                performance
> x = 1    175.603 /417 13          175.220 /416 13        176.073 /407 13
> x = 2    192.215 /218 23          194.522 /202 25        217.393 /200 23
> x = 4    205.226 /124 39          208.823 /114 42        230.425 /105 41
> x = 8    236.369 /71 59           249.005 /65 61         257.661 /62 62
> x = 16   283.842 /48 73           307.465 /40 81         309.336 /39 82
> x = 32   325.197 /32 96           333.503 /32 93         336.138 /32 92
>
> data explains: 175.603 /417 13
>         175.603: average Watts
>         417: seconds(compile time)
>         13:  scaled performance/power = 1000000 / seconds / watts
>
> Another testing of parallel compress with pigz on Linus' git tree.
> results show we get much better performance/power with powersaving and
> balance policy:
>
> testing command:
> #pigz -k -c  -p$x -r linux* &> /dev/null
>
> On a NHM EP box
>          powersaving               balance               performance
> x = 4    166.516 /88 68           170.515 /82 71         165.283 /103 58
> x = 8    173.654 /61 94           177.693 /60 93         172.31 /76 76
>
> On a 2 sockets SNB EP box.
>          powersaving               balance               performance
> x = 4    190.995 /149 35          200.6 /129 38          208.561 /135 35
> x = 8    197.969 /108 46          208.885 /103 46        213.96 /108 43
> x = 16   205.163 /76 64           212.144 /91 51         229.287 /97 44
>
> data format is: 166.516 /88 68
>         166.516: average Watts
>         88: seconds(compress time)
>         68:  scaled performance/power = 1000000 / time / power
>
> Some performance testing results:
> ---------------------------------
>
> Tested benchmarks: kbuild, specjbb2005, oltp, tbench, aim9,
> hackbench, fileio-cfq of sysbench, dbench, aiostress, multhreads
> loopback netperf. on my core2, nhm, wsm, snb, platforms. no clear
> performance change found on 'performance' policy.
>
> Tested balance/powersaving policy with above benchmarks,
> a, specjbb2005 drop 5~7% on both of policy whenever with openjdk or jrockit.
> b, hackbench drops 30+% with powersaving policy on snb 4 sockets platforms.
> Others has no clear change.
>
> test result from Mike Galbraith:
> --------------------------------
> With aim7 compute on 4 node 40 core box, I see stable throughput
> improvement at tasks = nr_cores and below w. balance and powersaving.
>
>          3.8.0-performance   3.8.0-balance      3.8.0-powersaving
> Tasks    jobs/min/task       jobs/min/task      jobs/min/task
>     1         432.8571       433.4764           433.1665
>     5         480.1902       510.9612           497.5369
>    10         429.1785       533.4507           518.3918
>    20         424.3697       529.7203           528.7958
>    40         419.0871       500.8264           517.0648
>
> No deltas after that.  There were also no deltas between patched kernel
> using performance policy and virgin source.
>
>
> Changelog:
> V5 change:
> a, change sched_policy to sched_balance_policy
> b, split fork/exec/wake power balancing into 3 patches and refresh
> commit logs
> c, others minors clean up
>
> V4 change:
> a, fix few bugs and clean up code according to Morten Rasmussen, Mike
> Galbraith and Namhyung Kim. Thanks!
> b, take Morten Rasmussen's suggestion to use different criteria for
> different policy in transitory task packing.
> c, shorter latency in power aware scheduling.
>
> V3 change:
> a, engaged nr_running and utils in periodic power balancing.
> b, try packing small exec/wake tasks on running cpu not idle cpu.
>
> V2 change:
> a, add lazy power scheduling to deal with kbuild like benchmark.
>
>
> Thanks comments/suggestions from PeterZ, Linus Torvalds, Andrew Morton,
> Ingo, Arjan van de Ven, Borislav Petkov, PJT, Namhyung Kim, Mike
> Galbraith, Greg, Preeti, Morten Rasmussen etc.
>
> Thanks fengguang's 0-day kbuild system for testing this patchset.
>
> Any more comments are appreciated!
>
> -- Thanks Alex
>
>
> [patch v5 01/15] sched: set initial value for runnable avg of sched
> [patch v5 02/15] sched: set initial load avg of new forked task
> [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED
> [patch v5 04/15] sched: add sched balance policies in kernel
> [patch v5 05/15] sched: add sysfs interface for sched_balance_policy
> [patch v5 06/15] sched: log the cpu utilization at rq
> [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming
> [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead
> [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
> [patch v5 10/15] sched: packing transitory tasks in wake/exec power
> [patch v5 11/15] sched: add power/performance balance allow flag
> [patch v5 12/15] sched: pull all tasks from source group
> [patch v5 13/15] sched: no balance for prefer_sibling in power
> [patch v5 14/15] sched: power aware load balance
> [patch v5 15/15] sched: lazy power balance

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-18  8:56     ` Alex Shi
@ 2013-02-20  5:55       ` Alex Shi
  2013-02-20  7:40         ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  5:55 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 04:56 PM, Alex Shi wrote:
> On 02/18/2013 04:44 PM, Joonsoo Kim wrote:
>> Hello, Alex.
>>
>> On Mon, Feb 18, 2013 at 01:07:37PM +0800, Alex Shi wrote:
>>> If the waked/execed task is transitory enough, it will has a chance to be
>>> packed into a cpu which is busy but still has time to care it.
>>> For powersaving policy, only the history util < 25% task has chance to
>>> be packed, and for balance policy, only histroy util < 12.5% has chance.
>>> If there is no cpu eligible to handle it, will use a idlest cpu in
>>> leader group.
>>
>> After exec(), task's behavior may be changed, and history util may be
>> changed, too. So, IMHO, exec balancing by history util is not good idea.
>> How do you think about it?
>>
> 
> sounds make sense. are there any objections?
> 

New patch without exec balance packing:

==============
>From 7ed6c68dbd34e40b70c1b4f2563a5af172e289c3 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Thu, 14 Feb 2013 22:46:02 +0800
Subject: [PATCH 09/14] sched: packing transitory tasks in wakeup power
 balancing

If the woken task is transitory enough, it has a chance to be packed
onto a cpu which is busy but still has time to care for it.

For the powersaving policy, only a task with history util < 25% has a
chance to be packed, and for the balance policy, only history util < 12.5%
has a chance. If there is no cpu eligible to handle it, the idlest cpu in
the leader group is used.
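
As a purely illustrative, stand-alone sketch of the vacancy test used in
find_leader_cpu() below (the cpu and task numbers are invented; only the
putil << 2 scaling and the 100/60 limits mirror the patch):

#include <stdio.h>

int main(void)
{
	int putil = 20;		/* task used ~20% of its runnable time */
	int rq_util = 15;	/* candidate cpu's current util in % */
	int nr_running = 1;
	int max_util = 100;	/* powersaving; balance would use 60 */

	int vacancy = max_util - (rq_util * nr_running + (putil << 2));

	/* vacancy = 100 - (15 + 80) = 5, so this cpu can still take the task */
	printf("vacancy = %d -> %s\n", vacancy,
	       vacancy > 0 ? "pack onto this cpu" : "too busy, keep looking");
	return 0;
}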

Morten Rasmussen caught a typo bug and suggested using different criteria
for the different policies, thanks!

Joonsoo Kim suggested not packing on exec, since the old task's util is
possibly unusable.

Inspired-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 60 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2a65f9..24a68af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3452,19 +3452,72 @@ static inline int get_sd_sched_balance_policy(struct sched_domain *sd,
 }
 
 /*
+ * find_leader_cpu - find the busiest but still has enough leisure time cpu
+ * among the cpus in group.
+ */
+static int
+find_leader_cpu(struct sched_group *group, struct task_struct *p, int this_cpu,
+		int policy)
+{
+	/* percentage of the task's util */
+	unsigned putil = p->se.avg.runnable_avg_sum * 100
+				/ (p->se.avg.runnable_avg_period + 1);
+
+	struct rq *rq = cpu_rq(this_cpu);
+	int nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+	int vacancy, min_vacancy = INT_MAX, max_util;
+	int leader_cpu = -1;
+	int i;
+
+	if (policy == SCHED_POLICY_POWERSAVING)
+		max_util = FULL_UTIL;
+	else
+		/* maximum allowable util is 60% */
+		max_util = 60;
+
+	/* bias toward local cpu */
+	if (cpumask_test_cpu(this_cpu, tsk_cpus_allowed(p)) &&
+		max_util - (rq->util * nr_running + (putil << 2)) > 0)
+			return this_cpu;
+
+	/* Traverse only the allowed CPUs */
+	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
+		if (i == this_cpu)
+			continue;
+
+		rq = cpu_rq(i);
+		nr_running = rq->nr_running > 0 ? rq->nr_running : 1;
+
+		/* only light task allowed, like putil < 25% for powersaving */
+		vacancy = max_util - (rq->util * nr_running + (putil << 2));
+
+		if (vacancy > 0 && vacancy < min_vacancy) {
+			min_vacancy = vacancy;
+			leader_cpu = i;
+		}
+	}
+	return leader_cpu;
+}
+
+/*
  * If power policy is eligible for this domain, and it has task allowed cpu.
  * we will select CPU from this domain.
  */
 static int get_cpu_for_power_policy(struct sched_domain *sd, int cpu,
-		struct task_struct *p, struct sd_lb_stats *sds)
+		struct task_struct *p, struct sd_lb_stats *sds, int wakeup)
 {
 	int policy;
 	int new_cpu = -1;
 
 	policy = get_sd_sched_balance_policy(sd, cpu, p, sds);
-	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader)
-		new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
-
+	if (policy != SCHED_POLICY_PERFORMANCE && sds->group_leader) {
+		if (wakeup)
+			new_cpu = find_leader_cpu(sds->group_leader,
+							p, cpu, policy);
+		/* for fork balancing and a little busy task */
+		if (new_cpu == -1)
+			new_cpu = find_idlest_cpu(sds->group_leader, p, cpu);
+	}
 	return new_cpu;
 }
 
@@ -3515,14 +3568,15 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int flags)
 		if (tmp->flags & sd_flag) {
 			sd = tmp;
 
-			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds);
+			new_cpu = get_cpu_for_power_policy(sd, cpu, p, &sds,
+						sd_flag & SD_BALANCE_WAKE);
 			if (new_cpu != -1)
 				goto unlock;
 		}
 	}
 
 	if (affine_sd) {
-		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds);
+		new_cpu = get_cpu_for_power_policy(affine_sd, cpu, p, &sds, 1);
 		if (new_cpu != -1)
 			goto unlock;
 
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
@ 2013-02-20  6:20   ` Alex Shi
  2013-02-24 10:57     ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  6:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/18/2013 01:07 PM, Alex Shi wrote:
> New task has no runnable sum at its first runnable time, so its
> runnable load is zero. That makes burst forking balancing just select
> few idle cpus to assign tasks if we engage runnable load in balancing.
> 
> Set initial load avg of new forked task as its load weight to resolve
> this issue.
> 

A patch answering PJT's update is here; it merges the 1st and 2nd patches
into one. The other patches in the series don't need to change.

=========
>From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
From: Alex Shi <alex.shi@intel.com>
Date: Mon, 3 Dec 2012 17:30:39 +0800
Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
 forked task

We need to initialize se.avg.{decay_count, load_avg_contrib} for a
newly forked task.
Otherwise random values of the above variables cause a mess when doing the
new task enqueue:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalanced due to an incorrect load_avg_contrib.

Set avg.decay_count = 0 and avg.load_avg_contrib = se->load.weight to
resolve such issues.

Signed-off-by: Alex Shi <alex.shi@intel.com>
---
 kernel/sched/core.c | 3 +++
 kernel/sched/fair.c | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..1452e14 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
 #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
 	p->se.avg.runnable_avg_period = 0;
 	p->se.avg.runnable_avg_sum = 0;
+	p->se.avg.decay_count = 0;
 #endif
 #ifdef CONFIG_SCHEDSTATS
 	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
@@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
 		p->sched_reset_on_fork = 0;
 	}
 
+	p->se.avg.load_avg_contrib = p->se.load.weight;
+
 	if (!rt_prio(p->prio))
 		p->sched_class = &fair_sched_class;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81fa536..cae5134 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	 * We track migrations using entity decay_count <= 0, on a wake-up
 	 * migration we use a negative decay count to track the remote decays
 	 * accumulated while sleeping.
+	 *
+	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
+	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
+	 * value: se->load.weight.
 	 */
 	if (unlikely(se->avg.decay_count <= 0)) {
 		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
-- 
1.7.12


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  5:55       ` Alex Shi
@ 2013-02-20  7:40         ` Mike Galbraith
  2013-02-20  8:11           ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-20  7:40 UTC (permalink / raw)
  To: Alex Shi
  Cc: Joonsoo Kim, torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:

> Joonsoo Kim suggests not packing exec task, since the old task utils is
> possibly unuseable.

(I'm stumbling around in rtmutex PI land, all dazed and confused, so
forgive me if my peripheral following of this thread is off target;)

Hm, possibly.  Future behavior is always undefined, trying to predict
always a gamble... so it looks to me like not packing on exec places a
bet against the user, who chose to wager that powersaving will happen
and it won't cost him too much, if you don't always try to pack despite
any risks.  The user placed a bet on powersaving, not burst performance.

Same for the fork, if you spread to accommodate a potential burst, you
bin the power wager, so maybe it's not in his best interest.. fork/exec
is common, if it's happening frequently, you'll bin the potential power
win frequently, reducing effectiveness, and silently trading power for
performance when the user asked to trade performance for a lower
electric bill.

Dunno, just a thought, but I'd say for powersaving policy, you have to
go just for broke and hope it works out.  You can't know it won't, but
you'll toss potential winnings every time you don't roll the dice.

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  7:40         ` Mike Galbraith
@ 2013-02-20  8:11           ` Alex Shi
  2013-02-20  8:43             ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20  8:11 UTC (permalink / raw)
  To: Mike Galbraith, peterz
  Cc: Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 03:40 PM, Mike Galbraith wrote:
> On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:
> 
>> Joonsoo Kim suggests not packing exec task, since the old task utils is
>> possibly unuseable.
> 
> (I'm stumbling around in rtmutex PI land, all dazed and confused, so
> forgive me if my peripheral following of this thread is off target;)
> 
> Hm, possibly.  Future behavior is always undefined, trying to predict
> always a gamble... so it looks to me like not packing on exec places a
> bet against the user, who chose to wager that powersaving will happen
> and it won't cost him too much, if you don't always try to pack despite
> any risks.  The user placed a bet on powersaving, not burst performance.
> 
> Same for the fork, if you spread to accommodate a potential burst, you
> bin the power wager, so maybe it's not in his best interest.. fork/exec
> is common, if it's happening frequently, you'll bin the potential power
> win frequently, reducing effectiveness, and silently trading power for
> performance when the user asked to trade performance for a lower
> electric bill.
> 
> Dunno, just a thought, but I'd say for powersaving policy, you have to
> go just for broke and hope it works out.  You can't know it won't, but
> you'll toss potential winnings every time you don't roll the dice.


Sounds reasonable too.

I have no idea of the decision now.
And I guess many guys dislike using a knob to let the user make the choice.

What's your opinions, Peter?
> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  8:11           ` Alex Shi
@ 2013-02-20  8:43             ` Mike Galbraith
  2013-02-20  8:54               ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-20  8:43 UTC (permalink / raw)
  To: Alex Shi
  Cc: peterz, Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, 2013-02-20 at 16:11 +0800, Alex Shi wrote: 
> On 02/20/2013 03:40 PM, Mike Galbraith wrote:
> > On Wed, 2013-02-20 at 13:55 +0800, Alex Shi wrote:
> > 
> >> Joonsoo Kim suggests not packing exec task, since the old task utils is
> >> possibly unuseable.
> > 
> > (I'm stumbling around in rtmutex PI land, all dazed and confused, so
> > forgive me if my peripheral following of this thread is off target;)
> > 
> > Hm, possibly.  Future behavior is always undefined, trying to predict
> > always a gamble... so it looks to me like not packing on exec places a
> > bet against the user, who chose to wager that powersaving will happen
> > and it won't cost him too much, if you don't always try to pack despite
> > any risks.  The user placed a bet on powersaving, not burst performance.
> > 
> > Same for the fork, if you spread to accommodate a potential burst, you
> > bin the power wager, so maybe it's not in his best interest.. fork/exec
> > is common, if it's happening frequently, you'll bin the potential power
> > win frequently, reducing effectiveness, and silently trading power for
> > performance when the user asked to trade performance for a lower
> > electric bill.
> > 
> > Dunno, just a thought, but I'd say for powersaving policy, you have to
> > go just for broke and hope it works out.  You can't know it won't, but
> > you'll toss potential winnings every time you don't roll the dice.
> 
> 
> Sounds reasonable too.
> 
> I have no idea of the of the decision now.
> And guess many guys dislike to use a knob to let user do the choice.

Nobody likes seeing yet more knobs much, automagical is preferred.
Trouble with automagical heuristics usage is that any heuristic will
inevitably get it wrong sometimes, so giving the user control over usage
is IMHO a good thing.. and once we give the user the choice, we must
honor it, else what was the point?

Anyway, fwiw, I liked what I saw test driving the patch set..

> What's your opinions, Peter?

..but maintainer opinions carry more weight than mine, even to me ;-)

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing
  2013-02-20  8:43             ` Mike Galbraith
@ 2013-02-20  8:54               ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20  8:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: peterz, Joonsoo Kim, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 04:43 PM, Mike Galbraith wrote:
>> > 
>> > Sounds reasonable too.
>> > 
>> > I have no idea of the of the decision now.
>> > And guess many guys dislike to use a knob to let user do the choice.
> Nobody likes seeing yet more knobs much, automagical is preferred.
> Trouble with automagical heuristics usage is that any heuristic will
> inevitably get it wrong sometimes, so giving the user control over usage
> is IMHO a good thing.. and once we give the user the choice, we must
> honor it, else what was the point?
> 
> Anyway, fwiw, I liked what I saw test driving the patch set..
> 
>> > What's your opinions, Peter?
> ..but maintainer opinions carry more weight than mine, even to me ;-)

:) Yes, maintainers usually have heard enough arguments and can balance them...


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
@ 2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:09     ` Preeti U Murthy
  2013-02-20 14:33     ` Alex Shi
  2013-02-20 12:19   ` Preeti U Murthy
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fcdb21f..b9a34ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>  
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> +	u32 period;
>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
> +
> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
>  }
>  
>  /* Add the load generated by se into cfs_rq's child load-average */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7a19792..ac1e107 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
>  
>  #endif /* CONFIG_SMP */
>  
> +/* the percentage full cpu utilization */
> +#define FULL_UTIL	100

There's generally a better value than 100 when using computers.. seeing
how 100 is 64+32+4.

> +
>  /*
>   * This is the main, per-CPU runqueue data structure.
>   *
> @@ -481,6 +484,7 @@ struct rq {
>  #endif
>  
>  	struct sched_avg avg;
> +	unsigned int util;
>  };
>  
>  static inline int cpu_of(struct rq *rq)

You don't actually compute the rq utilization, you only compute the
utilization as per the fair class, so if there's significant RT activity
it'll think the cpu is under-utilized, which I think will result in the
wrong thing.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
@ 2013-02-20  9:37   ` Ingo Molnar
  2013-02-20 13:40     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2013-02-20  9:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


* Alex Shi <alex.shi@intel.com> wrote:

> Current scheduler behavior is just consider for larger 
> performance of system. So it try to spread tasks on more cpu 
> sockets and cpu cores
> 
> To adding the consideration of power awareness, the patchset 
> adds 2 kinds of scheduler policy: powersaving and balance. 
> They will use runnable load util in scheduler balancing. The 
> current scheduling is taken as performance policy.
> 
> performance: the current scheduling behaviour, try to spread tasks
>                 on more CPU sockets or cores. performance oriented.
> powersaving: will pack tasks into few sched group until all LCPU in the
>                 group is full, power oriented.
> balance    : will pack tasks into few sched group until group_capacity
>                 numbers CPU is full, balance between performance and
> 		powersaving.

Hm, so in a previous review I suggested keeping two main 
policies: power-saving and performance, plus a third, default 
policy, which automatically switches between these two if/when 
the kernel has information about whether a system is on battery 
or on AC - and picking 'performance' when it has no information.

Such an automatic policy would obviously be useful to users - 
and that is what makes such a feature really interesting and a 
step forward.

I think Peter expressed similar views: we don't want many knobs 
and states, we want two major goals plus an (optional but 
default enabled) automatism.

Is your 'balance' policy implementing that suggestion?
If not, why not?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
@ 2013-02-20  9:38   ` Peter Zijlstra
  2013-02-20 12:27     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:38 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> @@ -4214,6 +4214,11 @@ struct sd_lb_stats {
>         unsigned int  busiest_group_weight;
>  
>         int group_imb; /* Is there imbalance in this sd */
> +
> +       /* Varibles of power awaring scheduling */
> +       unsigned int  sd_utils; /* sum utilizations of this domain */
> +       unsigned long sd_capacity;      /* capacity of this domain */
> +       struct sched_group *group_leader; /* Group which relieves
> group_min */
>  };
>  
>  /*
> @@ -4229,6 +4234,7 @@ struct sg_lb_stats {
>         unsigned long group_weight;
>         int group_imb; /* Is there an imbalance in the group ? */
>         int group_has_capacity; /* Is there extra capacity in the
> group? */
> +       unsigned int group_utils;       /* sum utilizations of group
> */
>  };

So I have two problems with the _utils name, firstly its a single value
and therefore we shouldn't give it a name in plural, secondly, whenever
I read utils I read utilities, not utilizations.

As a non native speaker I'm not entirely sure, but utilizations sounds
iffy to me, is there even a plural of utilization?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
@ 2013-02-20  9:42   ` Peter Zijlstra
  2013-02-20 12:09     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> +/*
> + * Try to collect the task running number and capacity of the group.
> + */
> +static void get_sg_power_stats(struct sched_group *group,
> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
> +{
> +       int i;
> +
> +       for_each_cpu(i, sched_group_cpus(group)) {
> +               struct rq *rq = cpu_rq(i);
> +
> +               sgs->group_utils += rq->nr_running;
> +       }
> +
> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> +                                               SCHED_POWER_SCALE);
> +       if (!sgs->group_capacity)
> +               sgs->group_capacity = fix_small_capacity(sd, group);
> +       sgs->group_weight = group->group_weight;
> +}

So you're trying to compute the group utilization, but what does that
have to do with nr_running? In an earlier patch you introduced the
per-cpu utilization, so why not avg that to compute the group
utilization?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
@ 2013-02-20  9:48   ` Peter Zijlstra
  2013-02-20 12:04     ` Alex Shi
  2013-02-20 12:12   ` Borislav Petkov
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20  9:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> @@ -4053,6 +4053,8 @@ struct lb_env {
>         unsigned int            loop;
>         unsigned int            loop_break;
>         unsigned int            loop_max;
> +       int                     power_lb;  /* if power balance needed
> */
> +       int                     perf_lb;   /* if performance balance
> needed */
>  };
>  
>  /*
> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> *this_rq,
>                 .idle           = idle,
>                 .loop_break     = sched_nr_migrate_break,
>                 .cpus           = cpus,
> +               .power_lb       = 0,
> +               .perf_lb        = 1,
>         };
>  
>         cpumask_copy(cpus, cpu_active_mask);

This construct allows for the possibility of power_lb=1,perf_lb=1, does
that make sense? Why not have a single balance_policy enumeration?


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20  9:48   ` Peter Zijlstra
@ 2013-02-20 12:04     ` Alex Shi
  2013-02-20 13:37       ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:48 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> @@ -4053,6 +4053,8 @@ struct lb_env {
>>         unsigned int            loop;
>>         unsigned int            loop_break;
>>         unsigned int            loop_max;
>> +       int                     power_lb;  /* if power balance needed
>> */
>> +       int                     perf_lb;   /* if performance balance
>> needed */
>>  };
>>  
>>  /*
>> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
>> *this_rq,
>>                 .idle           = idle,
>>                 .loop_break     = sched_nr_migrate_break,
>>                 .cpus           = cpus,
>> +               .power_lb       = 0,
>> +               .perf_lb        = 1,
>>         };
>>  
>>         cpumask_copy(cpus, cpu_active_mask);
> 
> This construct allows for the possibility of power_lb=1,perf_lb=1, does
> that make sense? Why not have a single balance_policy enumeration?

(power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.

(power_lb == 0 && perf_lb == 0) is possible, and it means there is no
balancing at all on this cpu.

So an enumeration is not enough.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20  9:42   ` Peter Zijlstra
@ 2013-02-20 12:09     ` Alex Shi
  2013-02-20 13:36       ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> +/*
>> + * Try to collect the task running number and capacity of the group.
>> + */
>> +static void get_sg_power_stats(struct sched_group *group,
>> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
>> +{
>> +       int i;
>> +
>> +       for_each_cpu(i, sched_group_cpus(group)) {
>> +               struct rq *rq = cpu_rq(i);
>> +
>> +               sgs->group_utils += rq->nr_running;
>> +       }
>> +
>> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
>> +                                               SCHED_POWER_SCALE);
>> +       if (!sgs->group_capacity)
>> +               sgs->group_capacity = fix_small_capacity(sd, group);
>> +       sgs->group_weight = group->group_weight;
>> +}
> 
> So you're trying to compute the group utilization, but what does that
> have to do with nr_running? In an earlier patch you introduced the
> per-cpu utilization, so why not avg that to compute the group
> utilization?

I had tried to use the rq utilisation in this balancing, but the
utilisation needs much time to accumulate itself (345ms). That is bad for
any bursty balancing. So I use an instant metric instead -- nr_running.

> 



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20  9:30   ` Peter Zijlstra
@ 2013-02-20 12:09     ` Preeti U Murthy
  2013-02-20 13:34       ` Peter Zijlstra
  2013-02-20 14:33     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-20 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

>>  /*
>>   * This is the main, per-CPU runqueue data structure.
>>   *
>> @@ -481,6 +484,7 @@ struct rq {
>>  #endif
>>  
>>  	struct sched_avg avg;
>> +	unsigned int util;
>>  };
>>  
>>  static inline int cpu_of(struct rq *rq)
> 
> You don't actually compute the rq utilization, you only compute the
> utilization as per the fair class, so if there's significant RT activity
> it'll think the cpu is under-utilized, which I think will result in the
> wrong thing.

Correct me if I am wrong, but isn't the current load balancer also
disregarding the real time tasks when calculating the domain/group/cpu
level load?

What I mean is, if the answer to the above question is yes, then can we
safely assume that further optimizations to the load balancer, like the
power aware scheduler and the use of per entity load tracking, can be
done without considering the real time tasks?

Regards
Preeti U Murthy
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
  2013-02-20  9:48   ` Peter Zijlstra
@ 2013-02-20 12:12   ` Borislav Petkov
  2013-02-20 14:20     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-20 12:12 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, Feb 18, 2013 at 01:07:38PM +0800, Alex Shi wrote:
> If a sched domain is idle enough for regular power balance, power_lb
> will be set and perf_lb will be cleared. If a sched domain is busy,
> their values will be set the opposite way.
> 
> If the domain is suitable for power balance, but the balancing should
> not be done by this cpu (this cpu is already idle or full), both perf_lb
> and power_lb are cleared to wait for a suitable cpu to do the power
> balance. That means no balancing at all, neither power balance nor
> performance balance, will be done on this cpu.
> 
> The above logic will be implemented by the incoming patches.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2e8131d..0047856 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4053,6 +4053,8 @@ struct lb_env {
>  	unsigned int		loop;
>  	unsigned int		loop_break;
>  	unsigned int		loop_max;
> +	int			power_lb;  /* if power balance needed */
> +	int			perf_lb;   /* if performance balance needed */

Those look like they're used like simple boolean flags. Why not make
them such, i.e. bitfields? See struct perf_event_attr for an example.
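
(For illustration, a made-up fragment showing the bitfield form, in the
style of struct perf_event_attr; this is not code from the series:)

	/* Hypothetical: the two flags as single-bit bitfields. */
	struct lb_bool_flags {
		unsigned int	power_lb : 1,	/* power balance wanted */
				perf_lb  : 1;	/* performance balance wanted */
	};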

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
  2013-02-20  9:30   ` Peter Zijlstra
@ 2013-02-20 12:19   ` Preeti U Murthy
  2013-02-20 12:39     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-20 12:19 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi everyone,

On 02/18/2013 10:37 AM, Alex Shi wrote:
> The cpu's utilization measures how busy the cpu is.
>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>                 / cpu_rq(cpu)->avg.runnable_avg_period;

Why not cfs_rq->runnable_load_avg? I am concerned with what is the right
metric to use here.
Refer to this discussion: https://lkml.org/lkml/2012/10/29/448

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing
  2013-02-20  9:38   ` Peter Zijlstra
@ 2013-02-20 12:27     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:38 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>> @@ -4214,6 +4214,11 @@ struct sd_lb_stats {
>>         unsigned int  busiest_group_weight;
>>  
>>         int group_imb; /* Is there imbalance in this sd */
>> +
>> +       /* Varibles of power awaring scheduling */
>> +       unsigned int  sd_utils; /* sum utilizations of this domain */
>> +       unsigned long sd_capacity;      /* capacity of this domain */
>> +       struct sched_group *group_leader; /* Group which relieves
>> group_min */
>>  };
>>  
>>  /*
>> @@ -4229,6 +4234,7 @@ struct sg_lb_stats {
>>         unsigned long group_weight;
>>         int group_imb; /* Is there an imbalance in the group ? */
>>         int group_has_capacity; /* Is there extra capacity in the
>> group? */
>> +       unsigned int group_utils;       /* sum utilizations of group
>> */
>>  };
> 
> So I have two problems with the _utils name, firstly its a single value
> and therefore we shouldn't give it a name in plural, secondly, whenever
> I read utils I read utilities, not utilizations.

I think you are right. At least it doesn't need to be plural.
Sorry for my bad English. How about group_util here?
> 
> As a non native speaker I'm not entirely sure, but utilizations sounds
> iffy to me, is there even a plural of utilization?
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 12:19   ` Preeti U Murthy
@ 2013-02-20 12:39     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 12:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 08:19 PM, Preeti U Murthy wrote:
> Hi everyone,
> 
> On 02/18/2013 10:37 AM, Alex Shi wrote:
>> The cpu's utilization measures how busy the cpu is.
>>         util = cpu_rq(cpu)->avg.runnable_avg_sum
>>                 / cpu_rq(cpu)->avg.runnable_avg_period;
> 
> Why not cfs_rq->runnable_load_avg? I am concerned with what is the right
> metric to use here.

Here we care about the utilization of the cpu, not the load avg. The load
avg may be quite big for a heavy task (with a big load weight) even if it
only uses 20% of the cpu time, while a light task with 100% cpu usage may
still have a smaller load avg.

From the power point of view, the light task with 100% usage needs to take
a whole cpu, while the heavy task can be packed onto one cpu together with
other tasks.
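
(A rough worked example of that point, with numbers taken from the
prio_to_weight table just for illustration: a nice -10 task has load
weight 9548, so at 20% usage its runnable load avg settles near
9548 * 0.2 ~= 1910, while a nice 0 task at 100% usage settles near
1024 * 1.0 = 1024. By load avg the first task looks almost twice as heavy,
yet it is the second one that really needs a whole cpu to itself.)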


> Refer to this discussion:https://lkml.org/lkml/2012/10/29/448
> 
> Regards
> Preeti U Murthy
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 12:09     ` Preeti U Murthy
@ 2013-02-20 13:34       ` Peter Zijlstra
  2013-02-20 14:36         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:34 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Alex Shi, torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 17:39 +0530, Preeti U Murthy wrote:
> Hi,
> 
> >>  /*
> >>   * This is the main, per-CPU runqueue data structure.
> >>   *
> >> @@ -481,6 +484,7 @@ struct rq {
> >>  #endif
> >>  
> >>  	struct sched_avg avg;
> >> +	unsigned int util;
> >>  };
> >>  
> >>  static inline int cpu_of(struct rq *rq)
> > 
> > You don't actually compute the rq utilization, you only compute the
> > utilization as per the fair class, so if there's significant RT activity
> > it'll think the cpu is under-utilized, which I think will result in the
> > wrong thing.
> 
> Correct me if I am wrong,but isn't the current load balancer also
> disregarding the real time tasks to calculate the domain/group/cpu level
> load too?

Nope, the rt utilization affects the cpu_power, thereby correcting the
weight stuff.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 12:09     ` Alex Shi
@ 2013-02-20 13:36       ` Peter Zijlstra
  2013-02-20 14:23         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:36 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 20:09 +0800, Alex Shi wrote:
> On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
> > On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> >> +/*
> >> + * Try to collect the task running number and capacity of the group.
> >> + */
> >> +static void get_sg_power_stats(struct sched_group *group,
> >> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
> >> +{
> >> +       int i;
> >> +
> >> +       for_each_cpu(i, sched_group_cpus(group)) {
> >> +               struct rq *rq = cpu_rq(i);
> >> +
> >> +               sgs->group_utils += rq->nr_running;
> >> +       }
> >> +
> >> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
> >> +                                               SCHED_POWER_SCALE);
> >> +       if (!sgs->group_capacity)
> >> +               sgs->group_capacity = fix_small_capacity(sd, group);
> >> +       sgs->group_weight = group->group_weight;
> >> +}
> > 
> > So you're trying to compute the group utilization, but what does that
> > have to do with nr_running? In an earlier patch you introduced the
> > per-cpu utilization, so why not avg that to compute the group
> > utilization?
> 
> I had tried to use rq utilisation in this balancing, but since the
> utilisation need much time to accumulate itself(345ms). It's bad for
> any burst balancing. So I use instant utilisation -- nr_running.

But but but,... nr_running is completely unrelated to utilization.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 12:04     ` Alex Shi
@ 2013-02-20 13:37       ` Peter Zijlstra
  2013-02-20 13:48         ` Peter Zijlstra
  2013-02-20 13:52         ` Alex Shi
  0 siblings, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:37 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:

> >> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> >> *this_rq,
> >>                 .idle           = idle,
> >>                 .loop_break     = sched_nr_migrate_break,
> >>                 .cpus           = cpus,
> >> +               .power_lb       = 0,
> >> +               .perf_lb        = 1,
> >>         };
> >>  
> >>         cpumask_copy(cpus, cpu_active_mask);
> > 
> > This construct allows for the possibility of power_lb=1,perf_lb=1, does
> > that make sense? Why not have a single balance_policy enumeration?
> 
> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
> 
> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
> balance on this cpu.
> 
> So, enumeration is not enough.

Huh.. both 0 doesn't make any sense either. If there's no balancing, we
shouldn't be here to begin with.


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20  9:37   ` Ingo Molnar
@ 2013-02-20 13:40     ` Alex Shi
  2013-02-20 15:41       ` Ingo Molnar
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 05:37 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@intel.com> wrote:
> 
>> The current scheduler behavior only considers the larger performance
>> of the system, so it tries to spread tasks over more cpu sockets and
>> cpu cores.
>>
>> To add the consideration of power awareness, the patchset adds 2 kinds
>> of scheduler policy: powersaving and balance. They will use the
>> runnable load util in scheduler balancing. The current scheduling is
>> taken as the performance policy.
>>
>> performance: the current scheduling behaviour, tries to spread tasks
>>                 on more CPU sockets or cores. Performance oriented.
>> powersaving: will pack tasks into few sched groups until all LCPUs in
>>                 the group are full. Power oriented.
>> balance    : will pack tasks into few sched groups until group_capacity
>>                 number of CPUs is full. A balance between performance
>>                 and powersaving.
> 
> Hm, so in a previous review I suggested keeping two main 
> policies: power-saving and performance, plus a third, default 
> policy, which automatically switches between these two if/when 
> the kernel has information about whether a system is on battery 
> or on AC - and picking 'performance' when it has no information.

I will try to add a default policy according to your suggestion.
> 
> Such an automatic policy would obviously be useful to users - 
> and that is what makes such a feature really interesting and a 
> step forward.
> 
> I think Peter expressed similar views: we don't want many knobs 
> and states, we want two major goals plus an (optional but 
> default enabled) automatism.

I got the message, thanks for restating it again.

Now there are just 2 types of policy: performance and powersaving (with 2
degrees, powersaving and balance).

The powersaving policy will try to assign one task to each LCPU, whether
the LCPU is an SMT thread or a core.
The balance policy is also a kind of powersaving policy, just a bit less
aggressive. It will try to assign tasks according to group capacity, one
task per unit of capacity.
It was introduced because of the SMT LCPUs on the Intel arch. An SMT
thread is an independent LCPU in software, but its cpu power (smt_gain
1178 / 2 = 589) is smaller than a normal CPU's (1024). So the group
capacity is just 1 for a core with 2 SMT threads, and under this policy
normally just one task is assigned to one core.
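
(In numbers: the group power of such a core is about 589 * 2 = 1178, and
DIV_ROUND_CLOSEST(1178, 1024) = 1, so the whole 2-thread core ends up
counting as capacity for just one task.)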


> 
> Is your 'balance' policy implementing that suggestion?
> If not, why not?
> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:37       ` Peter Zijlstra
@ 2013-02-20 13:48         ` Peter Zijlstra
  2013-02-20 14:08           ` Alex Shi
  2013-02-20 13:52         ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 13:48 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 14:37 +0100, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:
> 
> > >> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
> > >> *this_rq,
> > >>                 .idle           = idle,
> > >>                 .loop_break     = sched_nr_migrate_break,
> > >>                 .cpus           = cpus,
> > >> +               .power_lb       = 0,
> > >> +               .perf_lb        = 1,
> > >>         };
> > >>  
> > >>         cpumask_copy(cpus, cpu_active_mask);
> > > 
> > > This construct allows for the possibility of power_lb=1,perf_lb=1, does
> > > that make sense? Why not have a single balance_policy enumeration?
> > 
> > (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
> > 
> > (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
> > balance on this cpu.
> > 
> > So, enumeration is not enough.
> 
> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
> shouldn't be here to begin with.

Also, why is this in the lb_env at all, shouldn't we simply use the
global sched_balance_policy all over the place? Its not like we want to
change power/perf on a finer granularity.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:37       ` Peter Zijlstra
  2013-02-20 13:48         ` Peter Zijlstra
@ 2013-02-20 13:52         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 09:37 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:04 +0800, Alex Shi wrote:
> 
>>>> @@ -5195,6 +5197,8 @@ static int load_balance(int this_cpu, struct rq
>>>> *this_rq,
>>>>                 .idle           = idle,
>>>>                 .loop_break     = sched_nr_migrate_break,
>>>>                 .cpus           = cpus,
>>>> +               .power_lb       = 0,
>>>> +               .perf_lb        = 1,
>>>>         };
>>>>  
>>>>         cpumask_copy(cpus, cpu_active_mask);
>>>
>>> This construct allows for the possibility of power_lb=1,perf_lb=1, does
>>> that make sense? Why not have a single balance_policy enumeration?
>>
>> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
>>
>> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
>> balance on this cpu.
>>
>> So, enumeration is not enough.
> 
> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
> shouldn't be here to begin with.
> 

Um, both 0 means a balance did happen, and we think a power balance is
appropriate for this domain, but maybe this group is already empty, so
this cpu is inappropriate for pulling a task. Then we exit this round of
balancing and wait for another cpu, from a more appropriate group, to do
the balance and pull a task.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 13:48         ` Peter Zijlstra
@ 2013-02-20 14:08           ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen


>>> (power_lb == 1 && perf_lb == 1) is incorrect and impossible to have.
>>>
>>> (power_lb == 0 && perf_lb == 0) is possible and it means there is no any
>>> balance on this cpu.
>>>
>>> So, enumeration is not enough.
>>
>> Huh.. both 0 doesn't make any sense either. If there's no balancing, we
>> shouldn't be here to begin with.
> 
> Also, why is this in the lb_env at all, shouldn't we simply use the
> global sched_balance_policy all over the place? Its not like we want to
> change power/perf on a finer granularity.

They are in lb_env since we need to set them according to each group's
status, mostly in update_sd_lb_power_stats().

Even when sched_balance_policy is powersaving, the domain may still need
a performance balance, since there may be too many tasks or too much
imbalance in the domain.

When we find the domain is not suitable for power balance, we will set
perf_lb = 1, and then we don't need to go through the other groups for
power info collection.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 12:12   ` Borislav Petkov
@ 2013-02-20 14:20     ` Alex Shi
  2013-02-20 15:22       ` Borislav Petkov
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:20 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2e8131d..0047856 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4053,6 +4053,8 @@ struct lb_env {
>>  	unsigned int		loop;
>>  	unsigned int		loop_break;
>>  	unsigned int		loop_max;
>> +	int			power_lb;  /* if power balance needed */
>> +	int			perf_lb;   /* if performance balance needed */
> 
> Those look like they're used like simple boolean flags. Why not make
> them such, i.e. bitfields? See struct perf_event_attr for an example.

There are 11 long words in struct lb_env now. Using booleans or bitfields
can't save much space, and they are not as convenient to use.
> 
> Thanks.
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 13:36       ` Peter Zijlstra
@ 2013-02-20 14:23         ` Alex Shi
  2013-02-21 13:33           ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 09:36 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 20:09 +0800, Alex Shi wrote:
>> On 02/20/2013 05:42 PM, Peter Zijlstra wrote:
>>> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
>>>> +/*
>>>> + * Try to collect the task running number and capacity of the group.
>>>> + */
>>>> +static void get_sg_power_stats(struct sched_group *group,
>>>> +       struct sched_domain *sd, struct sg_lb_stats *sgs)
>>>> +{
>>>> +       int i;
>>>> +
>>>> +       for_each_cpu(i, sched_group_cpus(group)) {
>>>> +               struct rq *rq = cpu_rq(i);
>>>> +
>>>> +               sgs->group_utils += rq->nr_running;
>>>> +       }
>>>> +
>>>> +       sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
>>>> +                                               SCHED_POWER_SCALE);
>>>> +       if (!sgs->group_capacity)
>>>> +               sgs->group_capacity = fix_small_capacity(sd, group);
>>>> +       sgs->group_weight = group->group_weight;
>>>> +}
>>>
>>> So you're trying to compute the group utilization, but what does that
>>> have to do with nr_running? In an earlier patch you introduced the
>>> per-cpu utilization, so why not avg that to compute the group
>>> utilization?
>>
>> I had tried to use rq utilisation in this balancing, but since the
>> utilisation need much time to accumulate itself(345ms). It's bad for
>> any burst balancing. So I use instant utilisation -- nr_running.
> 
> But but but,... nr_running is completely unrelated to utilization.
> 

Actually, I also hesitated about the name. How about using nr_running to
replace group_util directly?



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20  9:30   ` Peter Zijlstra
  2013-02-20 12:09     ` Preeti U Murthy
@ 2013-02-20 14:33     ` Alex Shi
  2013-02-20 15:20       ` Peter Zijlstra
  2013-02-20 15:22       ` Peter Zijlstra
  1 sibling, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 05:30 PM, Peter Zijlstra wrote:
> On Mon, 2013-02-18 at 13:07 +0800, Alex Shi wrote:
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fcdb21f..b9a34ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1495,8 +1495,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
>>  
>>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>>  {
>> +	u32 period;
>>  	__update_entity_runnable_avg(rq->clock_task, &rq->avg, runnable);
>>  	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
>> +
>> +	period = rq->avg.runnable_avg_period ? rq->avg.runnable_avg_period : 1;
>> +	rq->util = rq->avg.runnable_avg_sum * 100 / period;
>>  }
>>  
>>  /* Add the load generated by se into cfs_rq's child load-average */
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 7a19792..ac1e107 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -350,6 +350,9 @@ extern struct root_domain def_root_domain;
>>  
>>  #endif /* CONFIG_SMP */
>>  
>> +/* the percentage full cpu utilization */
>> +#define FULL_UTIL	100
> 
> There's generally a better value than 100 when using computers.. seeing
> how 100 is 64+32+4.

I didn't find a good example of this, and I have no idea what you are
suggesting. Would you like to explain a bit more?

> 
>> +
>>  /*
>>   * This is the main, per-CPU runqueue data structure.
>>   *
>> @@ -481,6 +484,7 @@ struct rq {
>>  #endif
>>  
>>  	struct sched_avg avg;
>> +	unsigned int util;
>>  };
>>  
>>  static inline int cpu_of(struct rq *rq)
> 
> You don't actually compute the rq utilization, you only compute the
> utilization as per the fair class, so if there's significant RT activity
> it'll think the cpu is under-utilized, which I think will result in the
> wrong thing.

Yes. It is a bit complicated to resolve this. Any suggestions on this, guys?
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 13:34       ` Peter Zijlstra
@ 2013-02-20 14:36         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-20 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti U Murthy, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 09:34 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 17:39 +0530, Preeti U Murthy wrote:
>> Hi,
>>
>>>>  /*
>>>>   * This is the main, per-CPU runqueue data structure.
>>>>   *
>>>> @@ -481,6 +484,7 @@ struct rq {
>>>>  #endif
>>>>  
>>>>  	struct sched_avg avg;
>>>> +	unsigned int util;
>>>>  };
>>>>  
>>>>  static inline int cpu_of(struct rq *rq)
>>>
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT activity
>>> it'll think the cpu is under-utilized, which I think will result in the
>>> wrong thing.
>>
>> Correct me if I am wrong,but isn't the current load balancer also
>> disregarding the real time tasks to calculate the domain/group/cpu level
>> load too?
> 
> Nope, the rt utilization affects the cpu_power, thereby correcting the
> weight stuff.

The balance policy uses group capacity, which implies using cpu power,
but capacity seems to be a very rough metric.
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 14:33     ` Alex Shi
@ 2013-02-20 15:20       ` Peter Zijlstra
  2013-02-21  1:35         ` Alex Shi
  2013-02-20 15:22       ` Peter Zijlstra
  1 sibling, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 15:20 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
> > There's generally a better value than 100 when using computers..
> seeing
> > how 100 is 64+32+4.
> 
> I didn't find a good example for this. and no idea of your suggestion,
> would you like to explain a bit more?

Basically what you're doing ends up being fixed point math, using 100 as
unit is inefficient, pick a power-of-2 and everything reduces to
bit-shifts.

http://en.wikipedia.org/wiki/Fixed-point_arithmetic

So use 128 or 1024 or whatever and you don't need mult and div
instructions to represent [0,1]
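
(A small hedged sketch of that idea; the constants and helper below are
invented for illustration and are not from the patches:)

	#define FULL_UTIL_SHIFT	10
	#define FULL_UTIL	(1 << FULL_UTIL_SHIFT)	/* 1024 == fully busy */

	/*
	 * Hypothetical replacement for the "* 100 / period" computation:
	 * the scaling multiply becomes a shift, only the divide by the
	 * accumulation period remains. The rq runnable sums stay far below
	 * 2^22, so the shift cannot overflow a 32-bit value here.
	 */
	static inline unsigned int calc_util(u32 runnable_sum, u32 period)
	{
		return (runnable_sum << FULL_UTIL_SHIFT) / (period ? period : 1);
	}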


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 14:33     ` Alex Shi
  2013-02-20 15:20       ` Peter Zijlstra
@ 2013-02-20 15:22       ` Peter Zijlstra
  2013-02-25  2:26         ` Alex Shi
  2013-03-22  8:49         ` Alex Shi
  1 sibling, 2 replies; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-20 15:22 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
> > You don't actually compute the rq utilization, you only compute the
> > utilization as per the fair class, so if there's significant RT
> activity
> > it'll think the cpu is under-utilized, which I think will result in
> the
> > wrong thing.
> 
> Yes. It is a bit complicated to resolve this. Any suggestions on this, guys?

Shouldn't be too hard seeing as we already track cpu utilization for !
fair usage; see rq::rt_avg and scale_rt_power.
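
(Roughly, and only as a simplified sketch rather than the exact kernel
code: scale_rt_power() removes the time eaten by non-fair activity from
the period and scales the cpu's power by what is left, so a cpu busy with
RT work does not end up looking under-utilized:)

	/* Simplified illustration: scale cpu power by the non-RT fraction. */
	static unsigned long scaled_power_sketch(u64 period, u64 rt_time,
						 unsigned long power)
	{
		u64 available = period > rt_time ? period - rt_time : 0;

		return div64_u64(power * available, period ? period : 1);
	}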


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 14:20     ` Alex Shi
@ 2013-02-20 15:22       ` Borislav Petkov
  2013-02-21  1:32         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-20 15:22 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Wed, Feb 20, 2013 at 10:20:19PM +0800, Alex Shi wrote:
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2e8131d..0047856 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4053,6 +4053,8 @@ struct lb_env {
> >>  	unsigned int		loop;
> >>  	unsigned int		loop_break;
> >>  	unsigned int		loop_max;
> >> +	int			power_lb;  /* if power balance needed */
> >> +	int			perf_lb;   /* if performance balance needed */
> > 
> > Those look like they're used like simple boolean flags. Why not make
> > them such, i.e. bitfields? See struct perf_event_attr for an example.
> 
> there are 11 long words in struct lb_env now. use boolean or bitfields
> can't save much space.

Now now maybe.

Btw, there's a ->flags variable there which simply cries to get another
LBF_* flag or two. This way you don't add any new members at all and
don't enlarge the struct.

> and not use conveniently.

Make yourself accessor functions or whatever.
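
(For illustration, the kind of LBF_* bits this points at; the names and
values are hypothetical, picked to sit next to the existing flags such as
LBF_ALL_PINNED and LBF_NEED_BREAK, and are not from the series:)

	#define LBF_POWER_LB	0x08	/* power balance wanted */
	#define LBF_PERF_LB	0x10	/* performance balance wanted */

	/* usage sketch: env->flags |= LBF_PERF_LB;
	 *               if (env->flags & LBF_POWER_LB) ... */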

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20 13:40     ` Alex Shi
@ 2013-02-20 15:41       ` Ingo Molnar
  2013-02-21  1:43         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Ingo Molnar @ 2013-02-20 15:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen


* Alex Shi <alex.shi@intel.com> wrote:

> Now there is just 2 types policy: performance and 
> powersaving(with 2 degrees, powersaving and balance).

I don't think we really want to have 'degrees' to the policies 
at this point - we want each policy to be extremely good at what 
it aims to do:

 - 'performance' should finish jobs in the least amount of 
    time possible. No ifs and whens.

 - 'power saving' should finish jobs with the least amount of 
    watts consumed. No ifs and whens.

> powersaving policy will try to assign one task to each LCPU, 
> whichever the LCPU is SMT thread or a core. The balance policy 
> is also a kind of powersaving policy, just a bit less 
> aggressive. It will try to assign tasks according group 
> capacity, one task to one capacity.

The thing is, 'a bit less aggressive' is an awfully vague 
concept to maintain on a long term basis - while the two 
definitions above are reasonably deterministic which can be 
measured and improved upon.

Those two policies and definitions are also much easier to 
communicate to user-space and to users - it's much easier to 
explain what each policy is supposed to do.

I'd be totally glad if we got so far that those two policies 
work really well. Any further nuance visible at the ABI level is 
I think many years down the road - if at all. Simple things 
first - those are complex enough already.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-20 15:22       ` Borislav Petkov
@ 2013-02-21  1:32         ` Alex Shi
  2013-02-21  9:42           ` Borislav Petkov
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:32 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 11:22 PM, Borislav Petkov wrote:
> On Wed, Feb 20, 2013 at 10:20:19PM +0800, Alex Shi wrote:
>>>> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> > >> index 2e8131d..0047856 100644
>>>> > >> --- a/kernel/sched/fair.c
>>>> > >> +++ b/kernel/sched/fair.c
>>>> > >> @@ -4053,6 +4053,8 @@ struct lb_env {
>>>> > >>  	unsigned int		loop;
>>>> > >>  	unsigned int		loop_break;
>>>> > >>  	unsigned int		loop_max;
>>>> > >> +	int			power_lb;  /* if power balance needed */
>>>> > >> +	int			perf_lb;   /* if performance balance needed */
>>> > > 
>>> > > Those look like they're used like simple boolean flags. Why not make
>>> > > them such, i.e. bitfields? See struct perf_event_attr for an example.
>> > 
>> > there are 11 long words in struct lb_env now. use boolean or bitfields
>> > can't save much space.
> Now now maybe.
> 
> Btw, there's a ->flags variable there which simply cries to get another
> LBF_* flag or two. This way you don't add any new members at all and
> don't enlarge the struct.
> 

Yes, using flags can save 2 int variables; I will change that.

Just curious: considering the lb_env size, the fact that it is only used
on the stack, the big cacheline size of modern cpus, and the gcc alignment
flags used for the kernel, it seems no arch needs more cache lines. Is
there any platform whose performance is impacted by these 2 int variables?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:20       ` Peter Zijlstra
@ 2013-02-21  1:35         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:20 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> There's generally a better value than 100 when using computers..
>> seeing
>>> how 100 is 64+32+4.
>>
>> I didn't find a good example for this. and no idea of your suggestion,
>> would you like to explain a bit more?
> 
> Basically what you're doing ends up being fixed point math, using 100 as
> unit is inefficient, pick a power-of-2 and everything reduces to
> bit-shifts.
> 
> http://en.wikipedia.org/wiki/Fixed-point_arithmetic
> 
> So use 128 or 1024 or whatever and you don't need mult and div
> instructions to represent [0,1]
> 

got it. will reconsider this.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 04/15] sched: add sched balance policies in kernel
  2013-02-20 15:41       ` Ingo Molnar
@ 2013-02-21  1:43         ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21  1:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/20/2013 11:41 PM, Ingo Molnar wrote:
> 
> * Alex Shi <alex.shi@intel.com> wrote:
> 
>> Now there is just 2 types policy: performance and 
>> powersaving(with 2 degrees, powersaving and balance).
> 
> I don't think we really want to have 'degrees' to the policies 
> at this point - we want each policy to be extremely good at what 
> it aims to do:
> 
>  - 'performance' should finish jobs in the least amount of 
>     time possible. No ifs and whens.
> 
>  - 'power saving' should finish jobs with the least amount of 
>     watts consumed. No ifs and whens.
> 
>> powersaving policy will try to assign one task to each LCPU, 
>> whichever the LCPU is SMT thread or a core. The balance policy 
>> is also a kind of powersaving policy, just a bit less 
>> aggressive. It will try to assign tasks according group 
>> capacity, one task to one capacity.
> 
> The thing is, 'a bit less aggressive' is an awfully vague 
> concept to maintain on a long term basis - while the two 
> definitions above are reasonably deterministic which can be 
> measured and improved upon.
> 
> Those two policies and definitions are also much easier to 
> communicate to user-space and to users - it's much easier to 
> explain what each policy is supposed to do.
> 
> I'd be totally glad if we got so far that those two policies 
> work really well. Any further nuance visible at the ABI level is 
> I think many years down the road - if at all. Simple things 
> first - those are complex enough already.


Thanks for comments!
I will remove the 'balance' policy.

> 
> Thanks,
> 
> 	Ingo
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-21  1:32         ` Alex Shi
@ 2013-02-21  9:42           ` Borislav Petkov
  2013-02-21 14:52             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Borislav Petkov @ 2013-02-21  9:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, pjt, namhyung,
	efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On Thu, Feb 21, 2013 at 09:32:54AM +0800, Alex Shi wrote:
> Yes, use flags can save 2 int variable, I will change that.
>
> Just curious, consider the lb_env size and just used in stack, plus
> the big cacheline size of modern cpu, and the alignment of gcc flag on
> kernel, seems no arch needs more cache lines. Are there any platforms
> performance is impacted by this 2 int variables?

Not that I know of. But that's not the point: if we don't pay attention
and are not as economical as possible in the kernel, and especially in
heavily walked code as the scheduler, we'll become fat and bloated (if
we're not halfway there already, that is).

It might not impact processor bandwidth now because internal paths are
obviously adequate but you're not the only one adding features. What
happens if the next guy comes and adds another two integers just because
it is convenient in the code?

Btw, sizeof(lb_env) is currently something around 80 bytes AFAICT. Now
that doesn't fit in one cacheline anyway. So if you add your two ints,
they'll be trailing in the second cacheline which needs to go up to L1.

Now flags will still be at the beginning of the second cacheline but
it is still better to add two new bits there because this is exactly
what this variable is for.

And, just for the fun of it, if you push the flags variable higher in
the struct itself, it will land in the first cacheline and there's your
design with *absolutely* no overhead in that respect. I betcha if you
do this, you won't see any overhead in L1 utilization even with perf
counters because you get it practically for free.

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-20 14:23         ` Alex Shi
@ 2013-02-21 13:33           ` Peter Zijlstra
  2013-02-21 14:40             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-21 13:33 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Wed, 2013-02-20 at 22:23 +0800, Alex Shi wrote:
> > But but but,... nr_running is completely unrelated to utilization.
> > 
> 
> Actually, I also hesitated on the name, how about using nr_running to
> replace group_util directly?


The name is a secondary issue, first you need to explain why you think
nr_running is a useful metric at all.

You can have a high nr_running and a low utilization (a burst of
wakeups, each waking a process that'll instantly go to sleep again), or
low nr_running and high utilization (a single process cpu bound
process).

There is absolutely no relation between utilization and nr_running,
building something on that assumption is just wrong and broken.




^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-21 13:33           ` Peter Zijlstra
@ 2013-02-21 14:40             ` Alex Shi
  2013-02-22  8:54               ` Peter Zijlstra
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-21 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/21/2013 09:33 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:23 +0800, Alex Shi wrote:
>>> But but but,... nr_running is completely unrelated to utilization.
>>>
>>
>> Actually, I also hesitated on the name, how about using nr_running to
>> replace group_util directly?
> 
> 
> The name is a secondary issue, first you need to explain why you think
> nr_running is a useful metric at all.
> 
> You can have a high nr_running and a low utilization (a burst of
> wakeups, each waking a process that'll instantly go to sleep again), or
> low nr_running and high utilization (a single process cpu bound
> process).

That is true for periodic balance. But at fork/exec/wake time, the
incoming processes usually need to do something before sleeping again.

I use nr_running to measure how busy the group is, for 3 reasons:
1, the current performance policy doesn't use utilization either.
2, the power policy doesn't care about load weight.
3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc., and some
benchmark results look clearly bad when utilization is used. If my memory
is right, hackbench and aim7 both look bad. I had tried many ways to bring
utilization into this balance, like using utilization only, or
utilization * nr_running, etc., but still could not find a way to recover
the loss. But with nr_running, the performance doesn't seem to lose much
with the power policy.

> 
> There is absolutely no relation between utilization and nr_running,
> building something on that assumption is just wrong and broken.

I have just tried all my benchmarks, dbench/loop netperf/specjbb/sysbench
etc., and the performance/power testing results all seem acceptable.

> 
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 11/15] sched: add power/performance balance allow flag
  2013-02-21  9:42           ` Borislav Petkov
@ 2013-02-21 14:52             ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-21 14:52 UTC (permalink / raw)
  To: Borislav Petkov, torvalds, mingo, peterz, tglx, akpm, arjan, pjt,
	namhyung, efault, vincent.guittot, gregkh, preeti, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/21/2013 05:42 PM, Borislav Petkov wrote:
> On Thu, Feb 21, 2013 at 09:32:54AM +0800, Alex Shi wrote:
>> Yes, use flags can save 2 int variable, I will change that.
>>
>> Just curious, consider the lb_env size and just used in stack, plus
>> the big cacheline size of modern cpu, and the alignment of gcc flag on
>> kernel, seems no arch needs more cache lines. Are there any platforms
>> performance is impacted by this 2 int variables?
> 
> Not that I know of. But that's not the point: if we don't pay attention
> and are not as economical as possible in the kernel, and especially in
> heavily walked code as the scheduler, we'll become fat and bloated (if
> we're not halfway there already, that is).
> 
> It might not impact processor bandwidth now because internal paths are
> obviously adequate but you're not the only one adding features. What
> happens if the next guy comes and adds another two integers just because
> it is convenient in the code?

Thanks for the detailed, nice explanation!

I get the point; as a performance sensitive guy, I was just curious which
platforms may be impacted. :)
> 
> Btw, sizeof(lb_env) is currently something around 80 bytes AFAICT. Now
> that doesn't fit in one cacheline anyway. So if you add your two ints,
> they'll be trailing in the second cacheline which needs to go up to L1.
> 
> Now flags will still be at the beginning of the second cacheline but
> it is still better to add two new bits there because this is exactly
> what this variable is for.
> 
> And, just for the fun of it, if you push the flags variable higher in
> the struct itself, it will land in the first cacheline and there's your
> design with *absolutely* no overhead in that respect. I betcha if you
> do this, you won't see any overhead in L1 utilization even with perf
> counters because you get it practically for free.

Thanks for the suggestion.
It looks like the members' order was already considered in lb_env. The
'flags' field looks less important and less frequently used than the
fields before it. :)
> 
> :-)
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-21 14:40             ` Alex Shi
@ 2013-02-22  8:54               ` Peter Zijlstra
  2013-02-24  9:27                 ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Peter Zijlstra @ 2013-02-22  8:54 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
> > The name is a secondary issue, first you need to explain why you
> think
> > nr_running is a useful metric at all.
> > 
> > You can have a high nr_running and a low utilization (a burst of
> > wakeups, each waking a process that'll instantly go to sleep again),
> or
> > low nr_running and high utilization (a single process cpu bound
> > process).
> 
> It is true in periodic balance. But in fork/exec/waking timing, the
> incoming processes usually need to do something before sleep again.

You'd be surprised, there's a fair number of workloads that have
negligible runtime on wakeup.

> I use nr_running to measure how the group busy, due to 3 reasons:
> 1, the current performance policy doesn't use utilization too.

We were planning to fix that now that its available.

> 2, the power policy don't care load weight.

Then its broken, it should very much still care about weight. 

> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
> benchmark results looks clear bad when use utilization. if my memory
> right, the hackbench/aim7 both looks bad. I had tried many ways to
> engage utilization into this balance, like use utilization only, or
> use
> utilization * nr_running etc. but still can not find a way to recover
> the lose. But with nr_running, the performance seems doesn't lose much
> with power policy.

You're failing to explain why utilization performs bad and you don't
explain why nr_running is better. That things work simply isn't good
enough, you have to have at least a general idea (but much preferable a
very good idea) _why_ things work.



^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-22  8:54               ` Peter Zijlstra
@ 2013-02-24  9:27                 ` Alex Shi
  2013-02-24  9:49                   ` Preeti U Murthy
  2013-02-24 17:51                   ` Preeti U Murthy
  0 siblings, 2 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-24  9:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>> The name is a secondary issue, first you need to explain why you
>> think
>>> nr_running is a useful metric at all.
>>>
>>> You can have a high nr_running and a low utilization (a burst of
>>> wakeups, each waking a process that'll instantly go to sleep again),
>> or
>>> low nr_running and high utilization (a single process cpu bound
>>> process).
>>
>> It is true in periodic balance. But in fork/exec/waking timing, the
>> incoming processes usually need to do something before sleep again.
> 
> You'd be surprised, there's a fair number of workloads that have
> negligible runtime on wakeup.

I will appreciate it if you would like to introduce some such workloads. :)
BTW, do you have some idea of how to handle them?
Actually, if tasks are that transitory, it is also hard to catch them in
balancing; with 'cyclictest -t 100' on my 4 LCPU laptop, vmstat can only
catch 1 or 2 tasks every second.
> 
>> I use nr_running to measure how the group busy, due to 3 reasons:
>> 1, the current performance policy doesn't use utilization too.
> 
> We were planning to fix that now that its available.

I had tried, but failed on the aim9 benchmark. As a result I gave up on
using utilization in the performance balance.
Some of the attempts and discussion are in these threads:
https://lkml.org/lkml/2013/1/6/96
https://lkml.org/lkml/2013/1/22/662
> 
>> 2, the power policy don't care load weight.
> 
> Then its broken, it should very much still care about weight.

Here the power policy just uses nr_running as the criterion to check
whether it's eligible for power aware balance. When doing the balancing,
the load weight is still the key judgment.

> 
>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>> benchmark results looks clear bad when use utilization. if my memory
>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>> engage utilization into this balance, like use utilization only, or
>> use
>> utilization * nr_running etc. but still can not find a way to recover
>> the lose. But with nr_running, the performance seems doesn't lose much
>> with power policy.
> 
> You're failing to explain why utilization performs bad and you don't
> explain why nr_running is better. That things work simply isn't good

Um, let me try to explain again. The utilisation needs much time to
accumulate itself (345ms). With or without load weight, many bursting
tasks give only a minimum weight to the carrier CPU in the first few ms.
So it is too easy to do an incorrect distribution here and then need
migration in later periodic balancing.
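
(The 345ms figure comes from the per-entity load tracking constants: the
runnable average decays by y every 1ms period, with y^32 = 0.5, and the
accumulated sum only saturates after about LOAD_AVG_MAX_N = 345 such
periods, so a cpu that suddenly becomes busy needs roughly that long
before its utilization reads as fully busy.)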

> enough, you have to have at least a general idea (but much preferable a
> very good idea) _why_ things work.

Here nr_running is just a criterion for checking whether the power policy
is suitable; in the later task distribution judgment, load weight and util
are still used, as in the next patch: sched: packing transitory tasks in
wake/exec power balancing.

I will reconsider the criterion, but I would also appreciate any ideas.
> 
> 


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:27                 ` Alex Shi
@ 2013-02-24  9:49                   ` Preeti U Murthy
  2013-02-24 11:55                     ` Alex Shi
  2013-02-24 17:51                   ` Preeti U Murthy
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24  9:49 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

Hi Alex,

On 02/24/2013 02:57 PM, Alex Shi wrote:
> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>> The name is a secondary issue, first you need to explain why you
>>> think
>>>> nr_running is a useful metric at all.
>>>>
>>>> You can have a high nr_running and a low utilization (a burst of
>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>> or
>>>> low nr_running and high utilization (a single process cpu bound
>>>> process).
>>>
>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>> incoming processes usually need to do something before sleep again.
>>
>> You'd be surprised, there's a fair number of workloads that have
>> negligible runtime on wakeup.
> 
> will appreciate if you like introduce some workload. :)
> BTW, do you has some idea to handle them?
> Actually, if tasks is just like transitory, it is also hard to catch
> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
> just can catch 1 or 2 tasks very second.
>>
>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>> 1, the current performance policy doesn't use utilization too.
>>
>> We were planning to fix that now that its available.
> 
> I had tried, but failed on aim9 benchmark. As a result I give up to use
> utilization in performance balance.
> Some trying and talking in the thread.
> https://lkml.org/lkml/2013/1/6/96
> https://lkml.org/lkml/2013/1/22/662
>>
>>> 2, the power policy don't care load weight.
>>
>> Then its broken, it should very much still care about weight.
> 
> Here power policy just use nr_running as the criteria to check if it's
> eligible for power aware balance. when do balancing the load weight is
> still the key judgment.
> 
>>
>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>> benchmark results looks clear bad when use utilization. if my memory
>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use
>>> utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
> 
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

I don't understand why forked tasks take time to accumulate load. I
understand this if it were a woken-up task. The first time the forked
task gets a chance to update its load, it needs to reflect full
utilization. In __update_entity_runnable_avg() both runnable_avg_period
and runnable_avg_sum get incremented equally for a forked task since it
is runnable. So where is the chance for the load to get incremented in
steps?

For sleeping tasks, since runnable_avg_sum progresses much more slowly
than runnable_avg_period, those tasks take much time to accumulate load
when they wake up. That makes sense, of course. But how does this happen
for forked tasks?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-20  6:20   ` Alex Shi
@ 2013-02-24 10:57     ` Preeti U Murthy
  2013-02-25  6:00       ` Alex Shi
  2013-02-25  7:12       ` Alex Shi
  0 siblings, 2 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24 10:57 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 02/20/2013 11:50 AM, Alex Shi wrote:
> On 02/18/2013 01:07 PM, Alex Shi wrote:
>> New task has no runnable sum at its first runnable time, so its
>> runnable load is zero. That makes burst forking balancing just select
>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>
>> Set initial load avg of new forked task as its load weight to resolve
>> this issue.
>>
> 
> patch answering PJT's update here. that merged the 1st and 2nd patches 
> into one. other patches in serial don't need to change.
> 
> =========
> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
> From: Alex Shi <alex.shi@intel.com>
> Date: Mon, 3 Dec 2012 17:30:39 +0800
> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
>  forked task
> 
> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
> new forked task.
> Otherwise random values of above variables cause mess when do new task
> enqueue:
>     enqueue_task_fair
>         enqueue_entity
>             enqueue_entity_load_avg
> 
> and make forking balancing imbalance since incorrect load_avg_contrib.
> 
> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
> resolve such issues.
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/core.c | 3 +++
>  kernel/sched/fair.c | 4 ++++
>  2 files changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 26058d0..1452e14 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>  	p->se.avg.runnable_avg_period = 0;
>  	p->se.avg.runnable_avg_sum = 0;
> +	p->se.avg.decay_count = 0;
>  #endif
>  #ifdef CONFIG_SCHEDSTATS
>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>  		p->sched_reset_on_fork = 0;
>  	}
> 
I think the following comment will help here.
/* All forked tasks are assumed to have full utilization to begin with */
> +	p->se.avg.load_avg_contrib = p->se.load.weight;
> +
>  	if (!rt_prio(p->prio))
>  		p->sched_class = &fair_sched_class;
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 81fa536..cae5134 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>  	 * migration we use a negative decay count to track the remote decays
>  	 * accumulated while sleeping.
> +	 *
> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
> +	 * value: se->load.weight.

I disagree with the comment. update_entity_load_avg() gets called for
all forked tasks, via enqueue_task_fair->update_entity_load_avg() during
the second iteration. But __update_entity_load_avg() inside
update_entity_load_avg(), where the actual load update happens, does not
get called. This is because, as below, the last_update of the forked
task is nearly equal to the clock_task of the runqueue; probably 1ms has
not yet passed for the load to get updated. That is why neither the load
of the task nor the load of the runqueue gets updated when the task
forks.

Also note that the reason we bypass update_entity_load_avg() below is
not that our decay_count is 0. It is that forked tasks have nothing to
update. Only woken-up tasks and migrated wakeups have load updates to
do. Forked tasks were just created; they have no load to "update", only
to "create". This, I feel, is rightly done in sched_fork by this patch.

So ideally I don't think we should have any comment here. It does not
seem relevant.

>  	 */
>  	if (unlikely(se->avg.decay_count <= 0)) {
>  		se->avg.last_runnable_update = rq_of(cfs_rq)->clock_task;
> 


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:49                   ` Preeti U Murthy
@ 2013-02-24 11:55                     ` Alex Shi
  0 siblings, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-24 11:55 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen


>> Um, let me try to explain again, The utilisation need much time to
>> accumulate itself(345ms). Whenever with or without load weight, many
>> bursting tasks just give a minimum weight to the carrier CPU at the
>> first few ms. So, it is too easy to do a incorrect distribution here and
>> need migration on later periodic balancing.
> 
> I dont understand why forked tasks are taking time to accumulate the
> load.I understand this if it were to be a woken up task.The first time

A newly forked task gets its load at once, but the CPU utilization
still needs time to accumulate; these are different concepts. The CPU
utilization describes how many ms this cpu has run over a past
period...



-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24  9:27                 ` Alex Shi
  2013-02-24  9:49                   ` Preeti U Murthy
@ 2013-02-24 17:51                   ` Preeti U Murthy
  2013-02-25  2:23                     ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-24 17:51 UTC (permalink / raw)
  To: Alex Shi
  Cc: Peter Zijlstra, torvalds, mingo, tglx, akpm, arjan, bp, pjt,
	namhyung, efault, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

Hi,

On 02/24/2013 02:57 PM, Alex Shi wrote:
> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>> The name is a secondary issue, first you need to explain why you
>>> think
>>>> nr_running is a useful metric at all.
>>>>
>>>> You can have a high nr_running and a low utilization (a burst of
>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>> or
>>>> low nr_running and high utilization (a single process cpu bound
>>>> process).
>>>
>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>> incoming processes usually need to do something before sleep again.
>>
>> You'd be surprised, there's a fair number of workloads that have
>> negligible runtime on wakeup.
> 
> will appreciate if you like introduce some workload. :)
> BTW, do you has some idea to handle them?
> Actually, if tasks is just like transitory, it is also hard to catch
> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
> just can catch 1 or 2 tasks very second.
>>
>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>> 1, the current performance policy doesn't use utilization too.
>>
>> We were planning to fix that now that its available.
> 
> I had tried, but failed on aim9 benchmark. As a result I give up to use
> utilization in performance balance.
> Some trying and talking in the thread.
> https://lkml.org/lkml/2013/1/6/96
> https://lkml.org/lkml/2013/1/22/662
>>
>>> 2, the power policy don't care load weight.
>>
>> Then its broken, it should very much still care about weight.
> 
> Here power policy just use nr_running as the criteria to check if it's
> eligible for power aware balance. when do balancing the load weight is
> still the key judgment.
> 
>>
>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>> benchmark results looks clear bad when use utilization. if my memory
>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>> engage utilization into this balance, like use utilization only, or
>>> use
>>> utilization * nr_running etc. but still can not find a way to recover
>>> the lose. But with nr_running, the performance seems doesn't lose much
>>> with power policy.
>>
>> You're failing to explain why utilization performs bad and you don't
>> explain why nr_running is better. That things work simply isn't good
> 
> Um, let me try to explain again, The utilisation need much time to
> accumulate itself(345ms). Whenever with or without load weight, many
> bursting tasks just give a minimum weight to the carrier CPU at the
> first few ms. So, it is too easy to do a incorrect distribution here and
> need migration on later periodic balancing.

Why can't this be attacked in *either* of the following ways:

1. Attack the problem at the source, by ensuring that the utilisation is
accumulated faster through a smaller update window.

2. Balance on nr_running only if you detect burst wakeups.
Alex, you had released a patch earlier which could detect this, right?
Instead of balancing on nr_running all the time, why not balance on it
only when burst wakeups are detected? By doing so you ensure that
nr_running as a metric for load balancing is used when it is right to do
so, and the reason for using it also gets well documented.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-24 17:51                   ` Preeti U Murthy
@ 2013-02-25  2:23                     ` Alex Shi
  2013-02-25  3:23                       ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  2:23 UTC (permalink / raw)
  To: Preeti U Murthy, Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/25/2013 01:51 AM, Preeti U Murthy wrote:
> Hi,
> 
> On 02/24/2013 02:57 PM, Alex Shi wrote:
>> On 02/22/2013 04:54 PM, Peter Zijlstra wrote:
>>> On Thu, 2013-02-21 at 22:40 +0800, Alex Shi wrote:
>>>>> The name is a secondary issue, first you need to explain why you
>>>> think
>>>>> nr_running is a useful metric at all.
>>>>>
>>>>> You can have a high nr_running and a low utilization (a burst of
>>>>> wakeups, each waking a process that'll instantly go to sleep again),
>>>> or
>>>>> low nr_running and high utilization (a single process cpu bound
>>>>> process).
>>>>
>>>> It is true in periodic balance. But in fork/exec/waking timing, the
>>>> incoming processes usually need to do something before sleep again.
>>>
>>> You'd be surprised, there's a fair number of workloads that have
>>> negligible runtime on wakeup.
>>
>> will appreciate if you like introduce some workload. :)
>> BTW, do you has some idea to handle them?
>> Actually, if tasks is just like transitory, it is also hard to catch
>> them in balance, like 'cyclitest -t 100' on my 4 LCPU laptop, vmstat
>> just can catch 1 or 2 tasks very second.
>>>
>>>> I use nr_running to measure how the group busy, due to 3 reasons:
>>>> 1, the current performance policy doesn't use utilization too.
>>>
>>> We were planning to fix that now that its available.
>>
>> I had tried, but failed on aim9 benchmark. As a result I give up to use
>> utilization in performance balance.
>> Some trying and talking in the thread.
>> https://lkml.org/lkml/2013/1/6/96
>> https://lkml.org/lkml/2013/1/22/662
>>>
>>>> 2, the power policy don't care load weight.
>>>
>>> Then its broken, it should very much still care about weight.
>>
>> Here power policy just use nr_running as the criteria to check if it's
>> eligible for power aware balance. when do balancing the load weight is
>> still the key judgment.
>>
>>>
>>>> 3, I tested some benchmarks, kbuild/tbench/hackbench/aim7 etc, some
>>>> benchmark results looks clear bad when use utilization. if my memory
>>>> right, the hackbench/aim7 both looks bad. I had tried many ways to
>>>> engage utilization into this balance, like use utilization only, or
>>>> use
>>>> utilization * nr_running etc. but still can not find a way to recover
>>>> the lose. But with nr_running, the performance seems doesn't lose much
>>>> with power policy.
>>>
>>> You're failing to explain why utilization performs bad and you don't
>>> explain why nr_running is better. That things work simply isn't good
>>
>> Um, let me try to explain again, The utilisation need much time to
>> accumulate itself(345ms). Whenever with or without load weight, many
>> bursting tasks just give a minimum weight to the carrier CPU at the
>> first few ms. So, it is too easy to do a incorrect distribution here and
>> need migration on later periodic balancing.
> 
> Why can't this be attacked in *either* of the following ways:
> 
> 1.Attack this problem at the source, by ensuring that the utilisation is
> accumulated faster by making the update window smaller.

It is a double-edged sword. A small period responds quickly, but loses
a lot of history. An extremely short period is just the same as the
current instantaneous utilization.
> 
> 2.Balance on nr->running only if you detect burst wakeups.
> Alex, you had released a patch earlier which could detect this right?

Yes, the patch is here:
https://lkml.org/lkml/2013/1/11/45

One problem is how to decide the criteria for a burst: if we define 5
wakeups/ms as a burst, we will miss 4 wakeups/ms.
Another problem is the cost of burst detection; we need to track a
period of wakeup history, preferably for the whole group, and that adds
extra cost exactly during the burst.
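
Just to illustrate the two problems, a minimal sketch of per-cpu
wakeup-rate bookkeeping (all names and numbers here are hypothetical,
this is not the patch linked above): a hard threshold misses anything
just under it, and the counter has to be touched on every wakeup, i.e.
exactly during the burst:

#include <stdbool.h>
#include <stdint.h>

struct burst_detect {
	uint64_t window_start_ns;	/* start of current 1ms window */
	unsigned int wakeups;		/* wakeups seen in this window */
};

#define BURST_WINDOW_NS		1000000ULL	/* 1 ms           */
#define BURST_THRESHOLD		5		/* 5 wakeups / ms */

/* called on every wakeup: extra work paid precisely while bursting */
bool burst_detected(struct burst_detect *bd, uint64_t now_ns)
{
	if (now_ns - bd->window_start_ns > BURST_WINDOW_NS) {
		bd->window_start_ns = now_ns;
		bd->wakeups = 0;
	}
	/* a steady 4 wakeups/ms never trips this */
	return ++bd->wakeups >= BURST_THRESHOLD;
}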

solution candidates:
https://lkml.org/lkml/2013/1/21/316
After talking with MikeG, I removed the runnable load avg from the
performance load balance.

Using nr_running as an instant utilization may narrow the situations
where the power policy applies -- considering power consumption, a light
but cpu-intensive task costs much more power than a heavy but only
occasionally running task. And it fits all my benchmarks:
aim7/hackbench/kbuild/cyclictest/netperf etc.

> Instead of balancing on nr_running all the time, why not balance on it
> only if burst wakeups are detected. By doing so you ensure that
> nr_running as a metric for load balancing is used when it is right to do
> so and the reason to use it also gets well documented.
> 
> Regards
> Preeti U Murthy
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:22       ` Peter Zijlstra
@ 2013-02-25  2:26         ` Alex Shi
  2013-03-22  8:49         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-25  2:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:22 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT
>> activity
>>> it'll think the cpu is under-utilized, whihc I think will result in
>> the
>>> wrong thing.
>>
>> yes. A bit complicit to resolve this. Any suggestions on this, guys?
> 
> Shouldn't be too hard seeing as we already track cpu utilization for !
> fair usage; see rq::rt_avg and scale_rt_power.
> 

added them in periodic balancing, thanks!

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  2:23                     ` Alex Shi
@ 2013-02-25  3:23                       ` Mike Galbraith
  2013-02-25  9:53                         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Galbraith @ 2013-02-25  3:23 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:

> One of problem is the how to decide the criteria of the burst? If we set
> 5 waking up/ms is burst, we will lose 4 waking up/ms.
> another problem is the burst detection cost, we need tracking a period
> history info of the waking up, better on whole group. but that give the
> extra cost in burst.
> 
> solution candidates:
> https://lkml.org/lkml/2013/1/21/316
> After talk with MikeG, I remove the runnable load avg in performance
> load balance.

One thing you could try is to make criteria depend on avg_idle.  It will
slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
You could perhaps extend it to cover new task wakeup as well, and use
that transition to invalidate history, switch to instantaneous until
fresh history can accumulate.
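
For reference, this is roughly what the wakeup path does with
idle_stamp/avg_idle in kernels of this period (paraphrased from memory,
so treat it as a sketch rather than a verbatim quote):

	if (rq->idle_stamp) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2 * sysctl_sched_migration_cost;

		if (delta > max)		/* woke after a long idle */
			rq->avg_idle = max;	/* "slam" to 2*migration_cost */
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}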

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-24 10:57     ` Preeti U Murthy
@ 2013-02-25  6:00       ` Alex Shi
  2013-02-28  7:03         ` Preeti U Murthy
  2013-02-25  7:12       ` Alex Shi
  1 sibling, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  6:00 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/24/2013 06:57 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 02/20/2013 11:50 AM, Alex Shi wrote:
>> On 02/18/2013 01:07 PM, Alex Shi wrote:
>>> New task has no runnable sum at its first runnable time, so its
>>> runnable load is zero. That makes burst forking balancing just select
>>> few idle cpus to assign tasks if we engage runnable load in balancing.
>>>
>>> Set initial load avg of new forked task as its load weight to resolve
>>> this issue.
>>>
>>
>> patch answering PJT's update here. that merged the 1st and 2nd patches 
>> into one. other patches in serial don't need to change.
>>
>> =========
>> From 89b56f2e5a323a0cb91c98be15c94d34e8904098 Mon Sep 17 00:00:00 2001
>> From: Alex Shi <alex.shi@intel.com>
>> Date: Mon, 3 Dec 2012 17:30:39 +0800
>> Subject: [PATCH 01/14] sched: set initial value of runnable avg for new
>>  forked task
>>
>> We need initialize the se.avg.{decay_count, load_avg_contrib} for a
>> new forked task.
>> Otherwise random values of above variables cause mess when do new task
>> enqueue:
>>     enqueue_task_fair
>>         enqueue_entity
>>             enqueue_entity_load_avg
>>
>> and make forking balancing imbalance since incorrect load_avg_contrib.
>>
>> set avg.decay_count = 0, and avg.load_avg_contrib = se->load.weight to
>> resolve such issues.
>>
>> Signed-off-by: Alex Shi <alex.shi@intel.com>
>> ---
>>  kernel/sched/core.c | 3 +++
>>  kernel/sched/fair.c | 4 ++++
>>  2 files changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 26058d0..1452e14 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>>  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>>  	p->se.avg.runnable_avg_period = 0;
>>  	p->se.avg.runnable_avg_sum = 0;
>> +	p->se.avg.decay_count = 0;
>>  #endif
>>  #ifdef CONFIG_SCHEDSTATS
>>  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>>  		p->sched_reset_on_fork = 0;
>>  	}
>>
> I think the following comment will help here.
> /* All forked tasks are assumed to have full utilization to begin with */
>> +	p->se.avg.load_avg_contrib = p->se.load.weight;
>> +
>>  	if (!rt_prio(p->prio))
>>  		p->sched_class = &fair_sched_class;
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 81fa536..cae5134 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>>  	 * migration we use a negative decay count to track the remote decays
>>  	 * accumulated while sleeping.
>> +	 *
>> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
>> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
>> +	 * value: se->load.weight.
> 
> I disagree with the comment.update_entity_load_avg() gets called for all
> forked tasks.
> enqueue_task_fair->update_entity_load_avg() during the second
> iteration.But __update_entity_load_avg() in update_entity_load_avg()
> 

When 'enqueue_task_fair->update_entity_load_avg()' is reached during
the second iteration, the se has changed.
That is a different se.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-24 10:57     ` Preeti U Murthy
  2013-02-25  6:00       ` Alex Shi
@ 2013-02-25  7:12       ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-02-25  7:12 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/24/2013 06:57 PM, Preeti U Murthy wrote:
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 26058d0..1452e14 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1559,6 +1559,7 @@ static void __sched_fork(struct task_struct *p)
>> >  #if defined(CONFIG_SMP) && defined(CONFIG_FAIR_GROUP_SCHED)
>> >  	p->se.avg.runnable_avg_period = 0;
>> >  	p->se.avg.runnable_avg_sum = 0;
>> > +	p->se.avg.decay_count = 0;
>> >  #endif
>> >  #ifdef CONFIG_SCHEDSTATS
>> >  	memset(&p->se.statistics, 0, sizeof(p->se.statistics));
>> > @@ -1646,6 +1647,8 @@ void sched_fork(struct task_struct *p)
>> >  		p->sched_reset_on_fork = 0;
>> >  	}
>> > 
> I think the following comment will help here.
> /* All forked tasks are assumed to have full utilization to begin with */


looks fine.

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  3:23                       ` Mike Galbraith
@ 2013-02-25  9:53                         ` Alex Shi
  2013-02-25 10:30                           ` Mike Galbraith
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-02-25  9:53 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On 02/25/2013 11:23 AM, Mike Galbraith wrote:
> On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:
> 
>> One of problem is the how to decide the criteria of the burst? If we set
>> 5 waking up/ms is burst, we will lose 4 waking up/ms.
>> another problem is the burst detection cost, we need tracking a period
>> history info of the waking up, better on whole group. but that give the
>> extra cost in burst.
>>
>> solution candidates:
>> https://lkml.org/lkml/2013/1/21/316
>> After talk with MikeG, I remove the runnable load avg in performance
>> load balance.
> 
> One thing you could try is to make criteria depend on avg_idle.  It will
> slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
> You could perhaps extend it to cover new task wakeup as well, and use
> that transition to invalidate history, switch to instantaneous until
> fresh history can accumulate.

Sorry, I could not get your point; would you like to go into details?

I also still don't understand the idle_stamp setting. idle_stamp is set
in idle_balance(); if idle_balance() doesn't pull a task, the idle_stamp
value is kept. Then even if the cpu gets tasks from other balancing,
like periodic balance or fork/exec/wake balancing, the idle_stamp is
still kept.

So, when the cpu goes into the next idle_balance(), it is highly likely
to meet the avg_idle > migration_cost condition and start trying to pull
tasks, nearly unconditionally.

Was idle_balance() designed to work this way, or should we still reset
idle_stamp whenever we pull a task?

> 
> -Mike
> 


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake
  2013-02-25  9:53                         ` Alex Shi
@ 2013-02-25 10:30                           ` Mike Galbraith
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Galbraith @ 2013-02-25 10:30 UTC (permalink / raw)
  To: Alex Shi
  Cc: Preeti U Murthy, Peter Zijlstra, torvalds, mingo, tglx, akpm,
	arjan, bp, pjt, namhyung, vincent.guittot, gregkh, viresh.kumar,
	linux-kernel, morten.rasmussen

On Mon, 2013-02-25 at 17:53 +0800, Alex Shi wrote: 
> On 02/25/2013 11:23 AM, Mike Galbraith wrote:
> > On Mon, 2013-02-25 at 10:23 +0800, Alex Shi wrote:
> > 
> >> One of problem is the how to decide the criteria of the burst? If we set
> >> 5 waking up/ms is burst, we will lose 4 waking up/ms.
> >> another problem is the burst detection cost, we need tracking a period
> >> history info of the waking up, better on whole group. but that give the
> >> extra cost in burst.
> >>
> >> solution candidates:
> >> https://lkml.org/lkml/2013/1/21/316
> >> After talk with MikeG, I remove the runnable load avg in performance
> >> load balance.
> > 
> > One thing you could try is to make criteria depend on avg_idle.  It will
> > slam to 2*migration_cost when a wakeup arrives after an ~extended idle.
> > You could perhaps extend it to cover new task wakeup as well, and use
> > that transition to invalidate history, switch to instantaneous until
> > fresh history can accumulate.
> 
> Sorry for can not get your points, would you like to goes to details?

If you've been idle for a bit, your history is stale.
> And still don't understand of the idle_stamp setting, idle_stamp was set
> in idle_balance(), if idle_balance doesn't pulled task, idle_stamp value
> kept. then even the cpu get tasks from another balancing, like periodic
> balance, fork/exec/wake balancing, the idle_stamp is still kept.

The thought is that you only care about the somewhat longish idles that
bursty loads exhibits.  The time when ttwu() detects that it should
trash idle history to kick idle_balance() back into action seems likely
to me to be the right time to trash load history, to accommodate bursty
loads in a dirt simple dirt cheap manner.  The idle balance throttle
methodology may not be perfect, but it works pretty well, and is dirt
cheap.

> So, when the cpu goes to next idle_balance(), it's highly possible to
> meet the avg_idle > migration_cost condition, and start try to pull
> tasks, nearly unconditionally.
> 
> Does the idle_balance was designed to this? or we still should reset
> idle_stamp whichever we pulled a task?

                if (pulled_task) {
                        this_rq->idle_stamp = 0;
                        break;
                }

-Mike


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 02/15] sched: set initial load avg of new forked task
  2013-02-25  6:00       ` Alex Shi
@ 2013-02-28  7:03         ` Preeti U Murthy
  0 siblings, 0 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-02-28  7:03 UTC (permalink / raw)
  To: Alex Shi
  Cc: torvalds, mingo, peterz, tglx, akpm, arjan, bp, pjt, namhyung,
	efault, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 81fa536..cae5134 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1509,6 +1509,10 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
>>>  	 * We track migrations using entity decay_count <= 0, on a wake-up
>>>  	 * migration we use a negative decay count to track the remote decays
>>>  	 * accumulated while sleeping.
>>> +	 *
>>> +	 * When enqueue a new forked task, the se->avg.decay_count == 0, so
>>> +	 * we bypass update_entity_load_avg(), use avg.load_avg_contrib initial
>>> +	 * value: se->load.weight.
>>
>> I disagree with the comment.update_entity_load_avg() gets called for all
>> forked tasks.
>> enqueue_task_fair->update_entity_load_avg() during the second
>> iteration.But __update_entity_load_avg() in update_entity_load_avg()
>>
> 
> When goes 'enqueue_task_fair->update_entity_load_avg()' during the
> second iteration. the se is changed.
> That is different se.
> 
> 
Correct, Alex. Sorry, I overlooked this.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
@ 2013-03-20  4:57   ` Preeti U Murthy
  2013-03-21  7:43     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-20  4:57 UTC (permalink / raw)
  To: Alex Shi, mingo, peterz, efault
  Cc: torvalds, tglx, akpm, arjan, bp, pjt, namhyung, vincent.guittot,
	gregkh, viresh.kumar, linux-kernel, morten.rasmussen

Hi Alex,

Please note one point below.

On 02/18/2013 10:37 AM, Alex Shi wrote:
> This patch enabled the power aware consideration in load balance.
> 
> As mentioned in the power aware scheduler proposal, Power aware
> scheduling has 2 assumptions:
> 1, race to idle is helpful for power saving
> 2, less active sched_groups will reduce power consumption
> 
> The first assumption make performance policy take over scheduling when
> any scheduler group is busy.
> The second assumption make power aware scheduling try to pack disperse
> tasks into fewer groups.
> 
> The enabling logical summary here:
> 1, Collect power aware scheduler statistics during performance load
> balance statistics collection.
> 2, If the balance cpu is eligible for power load balance, just do it
> and forget performance load balance. If the domain is suitable for
> power balance, but the cpu is inappropriate(idle or full), stop both
> power/performance balance in this domain. If using performance balance
> or any group is busy, do performance balance.
> 
> Above logical is mainly implemented in update_sd_lb_power_stats(). It
> decides if a domain is suitable for power aware scheduling. If so,
> it will fill the dst group and source group accordingly.
> 
> This patch reuse some of Suresh's power saving load balance code.
> 
> A test can show the effort on different policy:
> for ((i = 0; i < I; i++)) ; do while true; do :; done  &   done
> 
> On my SNB laptop with 4core* HT: the data is Watts
>         powersaving     balance         performance
> i = 2   40              54              54
> i = 4   57              64*             68
> i = 8   68              68              68
> 
> Note:
> When i = 4 with balance policy, the power may change in 57~68Watt,
> since the HT capacity and core capacity are both 1.
> 
> on SNB EP machine with 2 sockets * 8 cores * HT:
>         powersaving     balance         performance
> i = 4   190             201             238
> i = 8   205             241             268
> i = 16  271             348             376
> 
> If system has few continued tasks, use power policy can get
> the performance/power gain. Like sysbench fileio randrw test with 16
> thread on the SNB EP box,
> 
> Signed-off-by: Alex Shi <alex.shi@intel.com>
> ---
>  kernel/sched/fair.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 126 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ffdf35d..3b1e9a6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4650,6 +4753,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->group_load += load;
>  		sgs->sum_nr_running += nr_running;
>  		sgs->sum_weighted_load += weighted_cpuload(i);
> +
> +		/* accumulate the maximum potential util */
> +		if (!nr_running)
> +			nr_running = 1;
> +		sgs->group_utils += rq->util * nr_running;

You may have observed the following, but I thought it would be best to
bring it to notice.

The above will lead to situations where sched groups never fill to their
full capacity. This is explained with an example below:

Say the topology is two cores with hyperthreading, two logical threads
each. If we choose the powersaving policy and run two workloads at full
utilization, the load will get distributed one on each core; they will
not get consolidated on a single core.
The reason is that the condition
" if (sgs->group_utils + FULL_UTIL > threshold_util) " in
update_sd_lb_power_stats() will fail.

The situation goes thus:


w1                 w2
t1  t2             t3  t4
-------            -------
core1               core2

Above: t -> thread (logical cpu)
       w -> workload


Neither core will be able to pull the task from the other to
consolidate the load, because the rq->util of t2 and t4, on which no
process is running, continues to show some number (even though it decays
with time) and sgs->utils accounts for it. Therefore, for core1 and
core2, sgs->utils will be slightly above 100 and the above condition
will fail, disqualifying them as candidates for group_leader, since
threshold_util will be 200.

This phenomenon is seen with the balance policy and on wider topologies
as well. I think we would be better off not accounting the rq->util of
cpus which have no processes running on them in sgs->utils.
What do you think?
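
Relative to the hunk quoted above, the suggestion would look roughly
like this (a sketch of the idea only, not a tested patch):

		/* accumulate the maximum potential util, but only for
		 * cpus that are actually running something */
		if (nr_running)
			sgs->group_utils += rq->util * nr_running;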


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-20  4:57   ` Preeti U Murthy
@ 2013-03-21  7:43     ` Alex Shi
  2013-03-21  8:41       ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-21  7:43 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/20/2013 12:57 PM, Preeti U Murthy wrote:
> Neither core will be able to pull the task from the other to consolidate
> the load because the rq->util of t2 and t4, on which no process is
> running, continue to show some number even though they degrade with time
> and sgs->utils accounts for them. Therefore,
> for core1 and core2, the sgs->utils will be slightly above 100 and the
> above condition will fail, thus failing them as candidates for
> group_leader,since threshold_util will be 200.

Thanks for the note, Preeti!

Did you find a real issue on some platform?
In theory, a totally idle cpu has a zero rq->util at least after 3xx ms,
and in fact I find the code works fine on my machines.

-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  7:43     ` Alex Shi
@ 2013-03-21  8:41       ` Preeti U Murthy
  2013-03-21  9:27         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-21  8:41 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 03/21/2013 01:13 PM, Alex Shi wrote:
> On 03/20/2013 12:57 PM, Preeti U Murthy wrote:
>> Neither core will be able to pull the task from the other to consolidate
>> the load because the rq->util of t2 and t4, on which no process is
>> running, continue to show some number even though they degrade with time
>> and sgs->utils accounts for them. Therefore,
>> for core1 and core2, the sgs->utils will be slightly above 100 and the
>> above condition will fail, thus failing them as candidates for
>> group_leader,since threshold_util will be 200.
> 
> Thanks for note, Preeti!
> 
> Did you find some real issue in some platform?
> In theory, a totally idle cpu has a zero rq->util at least after 3xxms,
> and in fact, I find the code works fine on my machines.
> 

Yes, I did find this behaviour on a 2 socket, 8 core machine very
consistently.

rq->util cannot go to 0 after it has begun accumulating load, right?

Say a load was running on a runqueue whose rq->util was at 100%. After
the load finishes, the runqueue goes idle. With every scheduler tick its
utilisation decays, but it can never become 0.

rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period

This ratio will come close to 0, but will never become 0 once it has
picked up a value. So if a sched_group consists of two runqueues, one
with utilisation 100 running 1 load and the other with utilisation .001
but running no load, then in update_sd_lb_power_stats() the condition

"sgs->group_utils + FULL_UTIL > threshold_util" turns out to be

(100.001 + 100 > 200), and hence the group fails to act as the group
leader and take on more tasks.


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  8:41       ` Preeti U Murthy
@ 2013-03-21  9:27         ` Alex Shi
  2013-03-21 10:27           ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-21  9:27 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 04:41 PM, Preeti U Murthy wrote:
>> > 
> Yes, I did find this behaviour on a 2 socket, 8 core machine very
> consistently.
> 
> rq->util cannot go to 0, after it has begun accumulating load right?
> 
> Say a load was running on a runqueue which had its rq->util to be at
> 100%. After the load finishes, the runqueue goes idle. For every
> scheduler tick, its utilisation decays. But can never become 0.
> 
> rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period


Did you close all background system services?
In theory rq->avg.runnable_avg_sum should be zero if there has been no
task for a while; otherwise there is a bug in the kernel. Could you
check the value under /proc/sched_debug?


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21  9:27         ` Alex Shi
@ 2013-03-21 10:27           ` Preeti U Murthy
  2013-03-22  1:30             ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-21 10:27 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 02:57 PM, Alex Shi wrote:
> On 03/21/2013 04:41 PM, Preeti U Murthy wrote:
>>>>
>> Yes, I did find this behaviour on a 2 socket, 8 core machine very
>> consistently.
>>
>> rq->util cannot go to 0, after it has begun accumulating load right?
>>
>> Say a load was running on a runqueue which had its rq->util to be at
>> 100%. After the load finishes, the runqueue goes idle. For every
>> scheduler tick, its utilisation decays. But can never become 0.
>>
>> rq->util = rq->avg.runnable_avg_sum/rq->avg.runnable_avg_period
> 
> 
> did you close all of background system services?
> In theory the rq->avg.runnable_avg_sum should be zero if there is no
> task a bit long, otherwise there are some bugs in kernel.

Could you explain why rq->avg.runnable_avg_sum should be zero? What if
some kernel thread ran on this runqueue and is now finished? Its
utilisation would be, say, x. How would that ever drop to 0, even if
nothing ran on it later?

Regards
Preeti U Murthy

> Could you
> check the value under /proc/sched_debug?
> 
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-21 10:27           ` Preeti U Murthy
@ 2013-03-22  1:30             ` Alex Shi
  2013-03-22  5:14               ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-22  1:30 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
>> > did you close all of background system services?
>> > In theory the rq->avg.runnable_avg_sum should be zero if there is no
>> > task a bit long, otherwise there are some bugs in kernel.
> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
> some kernel thread ran on this run queue and is now finished? Its
> utilisation would be say x.How would that ever drop to 0,even if nothing
> ran on it later?

The value comes from decay_load():
 sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
and in decay_load() it is possible for it to be set to zero.

And /proc/sched_debug also confirms this:

  .tg_runnable_contrib           : 0
  .tg->runnable_avg              : 50
  .avg->runnable_avg_sum         : 0
  .avg->runnable_avg_period      : 47507


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-22  1:30             ` Alex Shi
@ 2013-03-22  5:14               ` Preeti U Murthy
  2013-03-25  4:52                 ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-22  5:14 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

On 03/22/2013 07:00 AM, Alex Shi wrote:
> On 03/21/2013 06:27 PM, Preeti U Murthy wrote:
>>>> did you close all of background system services?
>>>> In theory the rq->avg.runnable_avg_sum should be zero if there is no
>>>> task a bit long, otherwise there are some bugs in kernel.
>> Could you explain why rq->avg.runnable_avg_sum should be zero? What if
>> some kernel thread ran on this run queue and is now finished? Its
>> utilisation would be say x.How would that ever drop to 0,even if nothing
>> ran on it later?
> 
> the value get from decay_load():
>  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
> in decay_load it is possible to be set zero.

Yes, you are right, it is possible for it to be set to 0, but only
after a very long time -- to be more precise, nearly 2 seconds. If you
look at decay_load(), only if the period between the last update and now
has crossed (32*63) periods does the runnable_avg_sum become 0;
otherwise it simply decays.

This means that for nearly 2 seconds, consolidation of load may not be
possible even after the runqueues have finished executing the tasks
running on them.

The exact experiment that I performed was running ebizzy with just two
threads. My setup was 2 sockets, 2 cores each, 4 threads per core -- so
a 16 logical cpu machine. When I run ebizzy with the balance policy, the
2 ebizzy threads end up one on each socket, while I would expect them to
be on the same socket. All other cpus, except the ones running the
ebizzy threads, are idle and not running anything on either socket.
I am not running any other processes.

You could run a similar experiment and let me know if you see
otherwise. I am at a loss to understand why else such a spreading of
load would occur, if not because rq->util does not become 0 quickly when
the runqueue is not running anything. I have used trace_printks to track
the util of runqueues that run nothing after some initial load, and it
does not become 0 until the end of the run.

Regards
Preeti U Murthy


> 
> and /proc/sched_debug also approve this:
> 
>   .tg_runnable_contrib           : 0
>   .tg->runnable_avg              : 50
>   .avg->runnable_avg_sum         : 0
>   .avg->runnable_avg_period      : 47507
> 
> 


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 06/15] sched: log the cpu utilization at rq
  2013-02-20 15:22       ` Peter Zijlstra
  2013-02-25  2:26         ` Alex Shi
@ 2013-03-22  8:49         ` Alex Shi
  1 sibling, 0 replies; 90+ messages in thread
From: Alex Shi @ 2013-03-22  8:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: torvalds, mingo, tglx, akpm, arjan, bp, pjt, namhyung, efault,
	vincent.guittot, gregkh, preeti, viresh.kumar, linux-kernel,
	morten.rasmussen

On 02/20/2013 11:22 PM, Peter Zijlstra wrote:
> On Wed, 2013-02-20 at 22:33 +0800, Alex Shi wrote:
>>> You don't actually compute the rq utilization, you only compute the
>>> utilization as per the fair class, so if there's significant RT
>> activity
>>> it'll think the cpu is under-utilized, whihc I think will result in
>> the
>>> wrong thing.
>>
>> yes. A bit complicit to resolve this. Any suggestions on this, guys?
> 
> Shouldn't be too hard seeing as we already track cpu utilization for !
> fair usage; see rq::rt_avg and scale_rt_power.
> 

Hi Peter,

rt_avg will be accumulated the irq time and steal time in
update_rq_clock_task(), if CONFIG_IRQ_TIME_ACCOUNTING or
CONFIG_IRQ_TIME_ACCOUNTING defined. That cause irq/steal time was double
added into rq utilisation, since normal rq->util already include the irq
time. So we do wrongly judgement to think it is a overload cpu. but it
is not.

To resolve this issue, if is it possible to introduce another member in
rq to describe rt_avg non irq/steal beside the rt_avg? If so, what the
name do you like to use?

-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-22  5:14               ` Preeti U Murthy
@ 2013-03-25  4:52                 ` Alex Shi
  2013-03-29 12:42                   ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-25  4:52 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/22/2013 01:14 PM, Preeti U Murthy wrote:
>> > 
>> > the value get from decay_load():
>> >  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
>> > in decay_load it is possible to be set zero.
> Yes you are right, it is possible to be set to 0, but after a very long
> time, to be more precise, nearly 2 seconds. If you look at decay_load(),
> if the period between last update and now has crossed (32*63),only then
> will the runnable_avg_sum become 0, else it will simply decay.
> 
> This means that for nearly 2seconds,consolidation of loads may not be
> possible even after the runqueues have finished executing tasks running
> on them.

Look into decay_load(): since LOAD_AVG_MAX is about 47742 and
2^16 = 65536, the maximum avg sum decays to zero after about 16 * 32ms.
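
As a rough check of that figure, a standalone back-of-the-envelope
sketch (it counts whole half-lives only, ignoring the fractional y^n
scaling inside each 32-period half-life):

#include <stdio.h>

int main(void)
{
	unsigned long sum = 47742;	/* LOAD_AVG_MAX */
	int halvings = 0;

	while (sum) {		/* one halving per 32 periods (~32ms) */
		sum >>= 1;
		halvings++;
	}
	printf("%d halvings, i.e. about %d ms, until the sum hits 0\n",
	       halvings, halvings * 32);	/* prints 16 -> ~512 ms */
	return 0;
}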

Yes, compared with the 345ms accumulation time, the decay is not
symmetric and not precise; it seems there is room to tune it better. But
it is acceptable for now.
> 
> The exact experiment that I performed was running ebizzy, with just two
> threads. My setup was 2 socket,2 cores each,4 threads each core. So a 16
> logical cpu machine.When I begin running ebizzy with balance policy, the
> 2 threads of ebizzy are found one on each socket, while I would expect
> them to be on the same socket. All other cpus, except the ones running
> ebizzy threads are idle and not running anything on either socket.
> I am not running any other processes.

Did you try the simplest benchmark: while true; do :; done ?
I am writing the v6 version, which includes rt_util etc.; you may test
on it after I send it out. :)
> 
> You could run a similar experiment and let me know if you see otherwise.
> I am at a loss to understand why else would such a spreading of load
> occur, if not for the rq->util not becoming 0 quickly,when it is not
> running anything. I have used trace_printks to keep track of runqueue
> util of those runqueues not running anything after maybe some initial
> load and it does not become 0 till the end of the run.


-- 
Thanks Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-25  4:52                 ` Alex Shi
@ 2013-03-29 12:42                   ` Preeti U Murthy
  2013-03-29 13:39                     ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-29 12:42 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi Alex,

On 03/25/2013 10:22 AM, Alex Shi wrote:
> On 03/22/2013 01:14 PM, Preeti U Murthy wrote:
>>>>
>>>> the value get from decay_load():
>>>>  sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
>>>> in decay_load it is possible to be set zero.
>> Yes you are right, it is possible to be set to 0, but after a very long
>> time, to be more precise, nearly 2 seconds. If you look at decay_load(),
>> if the period between last update and now has crossed (32*63),only then
>> will the runnable_avg_sum become 0, else it will simply decay.
>>
>> This means that for nearly 2seconds,consolidation of loads may not be
>> possible even after the runqueues have finished executing tasks running
>> on them.
> 
> Look into the decay_load(), since the LOAD_AVG_MAX is about 47742, so
> after 16 * 32ms, the maximum avg sum will be decay to zero. 2^16 = 65536
> 
> Yes, compare to accumulate time 345ms, the decay is not symmetry, and
> not precise, seems it has space to tune well. But it is acceptable now.
>>
>> The exact experiment that I performed was running ebizzy, with just two
>> threads. My setup was 2 socket,2 cores each,4 threads each core. So a 16
>> logical cpu machine.When I begin running ebizzy with balance policy, the
>> 2 threads of ebizzy are found one on each socket, while I would expect
>> them to be on the same socket. All other cpus, except the ones running
>> ebizzy threads are idle and not running anything on either socket.
>> I am not running any other processes.
> 
> did you try the simplest benchmark: while true; do :; done

Yeah, I tried out this while true; do :; done benchmark on a vm which
emulated 2 sockets, 2 cores per socket and 2 threads per core.
I ran two instances of this loop with the balance policy on, and it was
found that one instance was running on each socket, rather than both
instances getting consolidated on one socket.

But when I apply the change where we do not consider rq->util if the rq
has no nr_running, the two instances of the above benchmark get
consolidated onto one socket.


> I am writing the v6 version which include rt_util etc. you may test on
> it after I send out. :)

Sure will do so.

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-29 12:42                   ` Preeti U Murthy
@ 2013-03-29 13:39                     ` Alex Shi
  2013-03-30 11:25                       ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-29 13:39 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/29/2013 08:42 PM, Preeti U Murthy wrote:
>> > did you try the simplest benchmark: while true; do :; done
> Yeah I tried out this while true; do :; done benchmark on a vm which ran

Thanks a lot for trying!

What do you mean by 'vm'? A virtual machine?

> on 2 socket, 2 cores each socket and 2 threads each core emulation.
> I ran two instances of this loop with balance policy on, and it was
> found that there was one instance running on each socket, rather than
> both instances getting consolidated on one socket.
> 
> But when I apply the change where we do not consider rq->util if it has
> no nr_running on the rq,the two instances of the above benchmark get
> consolidated onto one socket.
> 
> 

I don't know much about virtual machines; I guess unstable VCPU-to-CPU
pinning keeps rq->util large? Did you try pinning the VCPUs to physical
CPUs?

I still give rq->util weight even when nr_running is 0, because some
transitory tasks may have been active on the cpu but just missed at the
balancing point.

I am just wondering whether ignoring rq->util when nr_running = 0 is
really the root cause, if your finding is only on a VM without fixed
VCPU-to-CPU pinning.


-- 
Thanks
    Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-29 13:39                     ` Alex Shi
@ 2013-03-30 11:25                       ` Preeti U Murthy
  2013-03-30 14:04                         ` Alex Shi
  0 siblings, 1 reply; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-30 11:25 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/29/2013 07:09 PM, Alex Shi wrote:
> On 03/29/2013 08:42 PM, Preeti U Murthy wrote:
>>>> did you try the simplest benchmark: while true; do :; done
>> Yeah I tried out this while true; do :; done benchmark on a vm which ran
> 
> Thanks a lot for trying!
> 
> What's do you mean 'vm'? Virtual machine?

Yes.

> 
>> on 2 socket, 2 cores each socket and 2 threads each core emulation.
>> I ran two instances of this loop with balance policy on, and it was
>> found that there was one instance running on each socket, rather than
>> both instances getting consolidated on one socket.
>>
>> But when I apply the change where we do not consider rq->util if it has
>> no nr_running on the rq,the two instances of the above benchmark get
>> consolidated onto one socket.
>>
>>
> 
> I don't know much about virtual machines; could the unstable
> VCPU-to-CPU mapping be keeping rq->util large? Did you try pinning the
> VCPUs to physical CPUs?

No, I hadn't done any VCPU-to-CPU pinning, but then why did the
situation change so drastically towards consolidating the load once the
rq->util of the runqueues with 0 tasks on them was no longer counted as
part of sgs->utils?

> 
> I still give rq->util weight even when nr_running is 0, because some
> transitory tasks may have been active on the cpu but just missed at the
> balancing point.
> 
> I am just wondering whether forgetting rq->util when nr_running = 0 is
> the real root cause, if your finding is only on a VM without fixed
> VCPU-to-CPU pinning.

I see the same situation on a physical machine too, a 2 socket, 4 core
machine. In fact, using trace_printks in the load balancing path, I
found that the reason the load was not getting consolidated onto one
socket was that the rq->util of a runqueue with no processes on it had
not decayed to 0, which made the balancer consider the socket
overloaded and rule out power aware balancing.
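
The instrumentation was roughly of the following form (only a sketch,
not the exact lines I used, and it assumes the rq->util field added by
your "log the cpu utilization at rq" patch is an unsigned int):

/*
 * Called from the power-aware balance path to dump each cpu's
 * nr_running together with its possibly stale rq->util.
 */
static void dump_group_util(struct sched_group *group)
{
	int cpu;

	for_each_cpu(cpu, sched_group_cpus(group)) {
		struct rq *rq = cpu_rq(cpu);

		trace_printk("cpu%d: nr_running=%u util=%u\n",
			     cpu, rq->nr_running, rq->util);
	}
}

It was this output that showed a runqueue with nr_running=0 still
carrying a non-zero util at the balancing point.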

Regards
Preeti U Murthy


> 
> 



* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-30 11:25                       ` Preeti U Murthy
@ 2013-03-30 14:04                         ` Alex Shi
  2013-03-30 15:31                           ` Preeti U Murthy
  0 siblings, 1 reply; 90+ messages in thread
From: Alex Shi @ 2013-03-30 14:04 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

On 03/30/2013 07:25 PM, Preeti U Murthy wrote:
>> > I still give rq->util weight even when nr_running is 0, because some
>> > transitory tasks may have been active on the cpu but just missed at the
>> > balancing point.
>> > 
>> > I am just wondering whether forgetting rq->util when nr_running = 0 is
>> > the real root cause, if your finding is only on a VM without fixed
>> > VCPU-to-CPU pinning.
> I see the same situation on a physical machine too, a 2 socket, 4 core
> machine. In fact, using trace_printks in the load balancing path, I
> found that the reason the load was not getting consolidated onto one
> socket was that the rq->util of a runqueue with no processes on it had
> not decayed to 0, which made the balancer consider the socket
> overloaded and rule out power aware balancing.

Considering this situation, we may stop accounting rq->util when
nr_running is zero. Tasks will be packed a bit more compactly, but
anyway, that is what the powersaving policy is for.

-- 
Thanks
    Alex


* Re: [patch v5 14/15] sched: power aware load balance
  2013-03-30 14:04                         ` Alex Shi
@ 2013-03-30 15:31                           ` Preeti U Murthy
  0 siblings, 0 replies; 90+ messages in thread
From: Preeti U Murthy @ 2013-03-30 15:31 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, peterz, efault, torvalds, tglx, akpm, arjan, bp, pjt,
	namhyung, vincent.guittot, gregkh, viresh.kumar, linux-kernel,
	morten.rasmussen

Hi,

On 03/30/2013 07:34 PM, Alex Shi wrote:
> On 03/30/2013 07:25 PM, Preeti U Murthy wrote:
>>>> I still give rq->util weight even when nr_running is 0, because some
>>>> transitory tasks may have been active on the cpu but just missed at the
>>>> balancing point.
>>>>
>>>> I am just wondering whether forgetting rq->util when nr_running = 0 is
>>>> the real root cause, if your finding is only on a VM without fixed
>>>> VCPU-to-CPU pinning.
>> I see the same situation on a physical machine too, a 2 socket, 4 core
>> machine. In fact, using trace_printks in the load balancing path, I
>> found that the reason the load was not getting consolidated onto one
>> socket was that the rq->util of a runqueue with no processes on it had
>> not decayed to 0, which made the balancer consider the socket
>> overloaded and rule out power aware balancing.
> 
> Considering this situation, we may stop accounting rq->util when
> nr_running is zero. Tasks will be packed a bit more compactly, but
> anyway, that is what the powersaving policy is for.
> 
True, the tasks will be packed a bit more compactly, but we can expect
the behaviour of your patchset of *defaulting to the performance policy
when overloaded* to come to the rescue in such a situation.
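
What I am counting on is something along the lines of the sketch below:
if any group in the domain looks overloaded, give up on power-aware
packing and let the normal performance balancing take over. This is
only the idea; group_is_overloaded() is a hypothetical helper here, not
your actual code:

/*
 * Sketch of the idea only: walk the domain's groups and fall back to
 * performance balancing as soon as one of them looks overloaded.
 */
static bool want_power_balance(struct sched_domain *sd)
{
	struct sched_group *group = sd->groups;

	do {
		if (group_is_overloaded(group))
			return false;	/* busy: balance for performance */
		group = group->next;
	} while (group != sd->groups);

	return true;			/* idle enough: pack for power */
}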

Regards
Preeti U Murthy



Thread overview: 90+ messages
2013-02-18  5:07 [patch v5 0/15] power aware scheduling Alex Shi
2013-02-18  5:07 ` [patch v5 01/15] sched: set initial value for runnable avg of sched entities Alex Shi
2013-02-18  8:28   ` Joonsoo Kim
2013-02-18  9:16     ` Alex Shi
2013-02-18  5:07 ` [patch v5 02/15] sched: set initial load avg of new forked task Alex Shi
2013-02-20  6:20   ` Alex Shi
2013-02-24 10:57     ` Preeti U Murthy
2013-02-25  6:00       ` Alex Shi
2013-02-28  7:03         ` Preeti U Murthy
2013-02-25  7:12       ` Alex Shi
2013-02-18  5:07 ` [patch v5 03/15] Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" Alex Shi
2013-02-18  5:07 ` [patch v5 04/15] sched: add sched balance policies in kernel Alex Shi
2013-02-20  9:37   ` Ingo Molnar
2013-02-20 13:40     ` Alex Shi
2013-02-20 15:41       ` Ingo Molnar
2013-02-21  1:43         ` Alex Shi
2013-02-18  5:07 ` [patch v5 05/15] sched: add sysfs interface for sched_balance_policy selection Alex Shi
2013-02-18  5:07 ` [patch v5 06/15] sched: log the cpu utilization at rq Alex Shi
2013-02-20  9:30   ` Peter Zijlstra
2013-02-20 12:09     ` Preeti U Murthy
2013-02-20 13:34       ` Peter Zijlstra
2013-02-20 14:36         ` Alex Shi
2013-02-20 14:33     ` Alex Shi
2013-02-20 15:20       ` Peter Zijlstra
2013-02-21  1:35         ` Alex Shi
2013-02-20 15:22       ` Peter Zijlstra
2013-02-25  2:26         ` Alex Shi
2013-03-22  8:49         ` Alex Shi
2013-02-20 12:19   ` Preeti U Murthy
2013-02-20 12:39     ` Alex Shi
2013-02-18  5:07 ` [patch v5 07/15] sched: add new sg/sd_lb_stats fields for incoming fork/exec/wake balancing Alex Shi
2013-02-20  9:38   ` Peter Zijlstra
2013-02-20 12:27     ` Alex Shi
2013-02-18  5:07 ` [patch v5 08/15] sched: move sg/sd_lb_stats struct ahead Alex Shi
2013-02-18  5:07 ` [patch v5 09/15] sched: add power aware scheduling in fork/exec/wake Alex Shi
2013-02-20  9:42   ` Peter Zijlstra
2013-02-20 12:09     ` Alex Shi
2013-02-20 13:36       ` Peter Zijlstra
2013-02-20 14:23         ` Alex Shi
2013-02-21 13:33           ` Peter Zijlstra
2013-02-21 14:40             ` Alex Shi
2013-02-22  8:54               ` Peter Zijlstra
2013-02-24  9:27                 ` Alex Shi
2013-02-24  9:49                   ` Preeti U Murthy
2013-02-24 11:55                     ` Alex Shi
2013-02-24 17:51                   ` Preeti U Murthy
2013-02-25  2:23                     ` Alex Shi
2013-02-25  3:23                       ` Mike Galbraith
2013-02-25  9:53                         ` Alex Shi
2013-02-25 10:30                           ` Mike Galbraith
2013-02-18  5:07 ` [patch v5 10/15] sched: packing transitory tasks in wake/exec power balancing Alex Shi
2013-02-18  8:44   ` Joonsoo Kim
2013-02-18  8:56     ` Alex Shi
2013-02-20  5:55       ` Alex Shi
2013-02-20  7:40         ` Mike Galbraith
2013-02-20  8:11           ` Alex Shi
2013-02-20  8:43             ` Mike Galbraith
2013-02-20  8:54               ` Alex Shi
2013-02-18  5:07 ` [patch v5 11/15] sched: add power/performance balance allow flag Alex Shi
2013-02-20  9:48   ` Peter Zijlstra
2013-02-20 12:04     ` Alex Shi
2013-02-20 13:37       ` Peter Zijlstra
2013-02-20 13:48         ` Peter Zijlstra
2013-02-20 14:08           ` Alex Shi
2013-02-20 13:52         ` Alex Shi
2013-02-20 12:12   ` Borislav Petkov
2013-02-20 14:20     ` Alex Shi
2013-02-20 15:22       ` Borislav Petkov
2013-02-21  1:32         ` Alex Shi
2013-02-21  9:42           ` Borislav Petkov
2013-02-21 14:52             ` Alex Shi
2013-02-18  5:07 ` [patch v5 12/15] sched: pull all tasks from source group Alex Shi
2013-02-18  5:07 ` [patch v5 13/15] sched: no balance for prefer_sibling in power scheduling Alex Shi
2013-02-18  5:07 ` [patch v5 14/15] sched: power aware load balance Alex Shi
2013-03-20  4:57   ` Preeti U Murthy
2013-03-21  7:43     ` Alex Shi
2013-03-21  8:41       ` Preeti U Murthy
2013-03-21  9:27         ` Alex Shi
2013-03-21 10:27           ` Preeti U Murthy
2013-03-22  1:30             ` Alex Shi
2013-03-22  5:14               ` Preeti U Murthy
2013-03-25  4:52                 ` Alex Shi
2013-03-29 12:42                   ` Preeti U Murthy
2013-03-29 13:39                     ` Alex Shi
2013-03-30 11:25                       ` Preeti U Murthy
2013-03-30 14:04                         ` Alex Shi
2013-03-30 15:31                           ` Preeti U Murthy
2013-02-18  5:07 ` [patch v5 15/15] sched: lazy power balance Alex Shi
2013-02-18  7:44 ` [patch v5 0/15] power aware scheduling Alex Shi
2013-02-19 12:08 ` Paul Turner
