From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752838Ab3EaBSC (ORCPT <rfc822;w@1wt.eu>);
	Thu, 30 May 2013 21:18:02 -0400
Received: from mga02.intel.com ([134.134.136.20]:29778 "EHLO mga02.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752502Ab3EaBRm (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 30 May 2013 21:17:42 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.87,774,1363158000"; 
   d="scan'208";a="322230562"
Message-ID: <51A7FA14.70902@intel.com>
Date: Fri, 31 May 2013 09:17:08 +0800
From: Alex Shi <alex.shi@intel.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130329 Thunderbird/17.0.5
MIME-Version: 1.0
To: Morten Rasmussen <morten.rasmussen@arm.com>
CC: peterz@infradead.org, mingo@kernel.org, preeti@linux.vnet.ibm.com,
        vincent.guittot@linaro.org, efault@gmx.de, pjt@google.com,
        linux-kernel@vger.kernel.org, linaro-kernel@lists.linaro.org,
        arjan@linux.intel.com, len.brown@intel.com, corbet@lwn.net,
        tglx@linutronix.de
Subject: Re: [RFC] Comparison of power-efficient scheduling patch sets
References: <20130530134718.GB32728@e103034-lin>
In-Reply-To: <20130530134718.GB32728@e103034-lin>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/30/2013 09:47 PM, Morten Rasmussen wrote:
> Hi,
> 
> A number of patch sets related to power-efficient scheduling have been
> posted over the last couple of months. Most of them do not have much
> data to back them up, so I decided to do some testing.
> 
> Common for all of the patch sets that I have tested, except one, is that
> they attempt to pack tasks on as few cpus as possible to allow the
> remaining cpus to enter deeper sleep states - a strategy that should
> make sense on most platforms that support per-cpu power gating and
> multi-socket machines.
> 
> Kernel: 3.9
> 
> Patch sets:
> rlb-v4: sched: use runnable load based balance (Alex Shi)
>         <https://lkml.org/lkml/2013/4/27/13>

Thanks for the valuable comparison!

The runnable load balance patch set targets performance. It still tries to
spread tasks across as many CPUs as possible. :)
The latest version, v7, drops the 6th patch of v4 (the wake_affine change),
fixes a double-counting issue with sleep time, and removes
blocked_load_avg from the tg load.
http://comments.gmane.org/gmane.linux.kernel/1498988
Enjoy!
> pas-v7: sched: power aware scheduling (Alex Shi)
>         <https://lkml.org/lkml/2013/4/3/732>

We are still discussing this patch set internally before updating
it. Sorry for the late response on this patch set!

> pst-v3: sched: packing small tasks (Vincent Guittot)
>         <https://lkml.org/lkml/2013/3/22/183>
> pst-v4: sched: packing small tasks (Vincent Guittot)
>         <https://lkml.org/lkml/2013/4/25/396>
> 
> Configuration:
> pas-v7: Set to "powersaving" mode.
> pst-v4: Set to "Full" packing mode.
> 
> Platform:
> ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.
> 
> Measurement technique:
> Time spent non-idle (not in idle state) for each cpu based on cpuidle
> ftrace events. TC2 does not have per-core power-gating, so packing
> inside the A7 cluster does not lead to any significant power savings.
> Note that any product grade hardware (TC2 is a test-chip) will very
> likely have per-core power-gating, so in those cases packing will have
> an appreciable effect on power savings.
> Measuring non-idle time rather than power should give a more clear idea
> about the effect of the patch sets given that the idle back-end is
> highly implementation specific.
> 
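
As an illustration of the measurement described above, here is a minimal
sketch of how per-cpu non-idle time could be derived from cpuidle ftrace
events. It is not the script used for these results. The trace line layout,
the function name non_idle_percent() and the window arguments are
assumptions; the sketch also assumes every event falls inside the window
and that a cpu is non-idle until its first idle entry.

```python
import re
from collections import defaultdict

# Assumed trace line layout (power:cpu_idle event), for example:
#   <idle>-0 [001] d..1 100.000000: cpu_idle: state=1 cpu_id=1
# state=4294967295 (-1 printed as unsigned) marks the exit from idle.
IDLE_EXIT = 4294967295
EVENT_RE = re.compile(r"\s(\d+\.\d+): cpu_idle: state=(\d+) cpu_id=(\d+)")


def non_idle_percent(lines, t_start, t_end):
    """Return {cpu: non-idle %} over the window [t_start, t_end]."""
    idle_since = {}                   # cpu -> timestamp of idle entry
    idle_total = defaultdict(float)   # cpu -> accumulated idle seconds
    cpus = set()
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue                  # not a cpu_idle event
        ts, state, cpu = float(m.group(1)), int(m.group(2)), int(m.group(3))
        cpus.add(cpu)
        if state == IDLE_EXIT:
            start = idle_since.pop(cpu, None)
            if start is not None:
                idle_total[cpu] += ts - start
        else:
            # Keep the first entry time if idle entries repeat.
            idle_since.setdefault(cpu, ts)
    # Close idle periods that are still open at the end of the window.
    for cpu, start in idle_since.items():
        idle_total[cpu] += t_end - start
    window = t_end - t_start
    return {cpu: 100.0 * (1.0 - idle_total[cpu] / window)
            for cpu in sorted(cpus)}
```

Calling non_idle_percent(open("trace.txt"), t_start, t_end) on a text dump
of the trace (the file name is hypothetical) gives one percentage per cpu,
comparable to the "non-idle %" columns in the tables.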
> Benchmarks:
> audio playback (Android): 30s mp3 file playback on Android.
> bbench+audio (Android): Web page rendering while doing mp3 playback.
> andebench_native (Android): Android benchmark running in native mode.
> cyclictest: Short periodic tasks.
> 
> Results:
> Two runs for each patch set.
> 
> audio playback (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2
> 3.9_1       11.96   2.86   2.48
> 3.9_2       12.64   2.81   1.88
> rlb-v4_1    12.61   2.44   1.90
> rlb-v4_2    12.45   2.44   1.90
> pas-v7_1    16.17   0.03   0.24
> pas-v7_2    16.08   0.28   0.07
> pst-v3_1    15.18   2.76   1.70
> pst-v3_2    15.13   0.80   0.38
> pst-v4_1    16.14   0.05   0.00
> pst-v4_2    16.34   0.06   0.00
> 
> bbench+audio (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2  render time
> 3.9_1       25.00  20.73  21.22   812
> 3.9_2       24.29  19.78  22.34   795
> rlb-v4_1    23.84  19.36  22.74   782
> rlb-v4_2    24.07  19.36  22.74   797
> pas-v7_1    28.29  17.86  16.01   869
> pas-v7_2    28.62  18.54  15.05   908
> pst-v3_1    29.14  20.59  21.72   830
> pst-v3_2    27.69  18.81  20.06   830
> pst-v4_1    42.20  13.63   2.29   880
> pst-v4_2    41.56  14.40   2.17   935
> 
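
To put the render time column in perspective, the following small sketch
averages the two runs of each patch set and expresses the result relative
to plain 3.9. The numbers are copied from the table above; the helper name
relative_change() is only for illustration, and two runs per patch set are
too few to say anything about variance.

```python
from statistics import mean

# Render times from the bbench+audio table, two runs per patch set.
render_time = {
    "3.9": [812, 795],
    "rlb-v4": [782, 797],
    "pas-v7": [869, 908],
    "pst-v3": [830, 830],
    "pst-v4": [880, 935],
}


def relative_change(results, baseline="3.9"):
    """Mean render time of each patch set as a % change vs. the baseline."""
    base = mean(results[baseline])
    return {name: 100.0 * (mean(runs) / base - 1.0)
            for name, runs in results.items()}


if __name__ == "__main__":
    for name, pct in relative_change(render_time).items():
        print(f"{name:8s} {pct:+6.1f}%")
```

With these numbers the change is roughly -1.7% for rlb-v4, +3.3% for
pst-v3, +10.6% for pas-v7 and +12.9% for pst-v4.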
> andebench_native (8 threads) (Android) SMP
> non-idle %  cpu 0  cpu 1  cpu 2  Score
> 3.9_1       99.22  98.88  99.61   4139
> 3.9_2       99.56  99.31  99.46   4148
> rlb-v4_1    99.49  99.61  99.53   4153
> rlb-v4_2    99.56  99.61  99.53   4149
> pas-v7_1    99.53  99.59  99.29   4149
> pas-v7_2    99.42  99.63  99.48   4150
> pst-v3_1    97.89  99.33  99.42   4097
> pst-v3_2    99.16  99.62  99.42   4097
> pst-v4_1    99.34  99.01  99.59   4146
> pst-v4_2    99.49  99.52  99.20   4146
> 
> cyclictest SMP
> non-idle %  cpu 0  cpu 1  cpu 2
> 3.9_1        9.13   8.88   8.41
> 3.9_2       10.27   8.02   6.30
> rlb-v4_1     8.88   8.09   8.11
> rlb-v4_2     8.49   8.09   8.11
> pas-v7_1    10.20   0.02  11.50
> pas-v7_2     7.86  14.31   0.02
> pst-v3_1    20.44   8.68   7.97
> pst-v3_2    20.41   0.78   1.00
> pst-v4_1    21.32   0.21   0.05
> pst-v4_2    21.56   0.21   0.04
> 
> Overall, pas-v7 seems to do a fairly good job at packing. The idle time
> distribution seems to be somewhere between pst-v3 and the more
> aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two
> cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which
> is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest.
> Packing does come at a cost, which can be seen for bbench+audio, where
> pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4 which
> do more aggressive packing. rlb-v4 does not pack, it is only included
> for reference.
> 
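
The "<0.25% non-idle" observation above can be checked directly against
the tables. This is a minimal sketch: the numbers are copied from the audio
playback and cyclictest results, while the names THRESHOLD and
nearly_idle_cpus() are only for illustration.

```python
THRESHOLD = 0.25  # percent non-idle, the "<0.25%" bound quoted above

# Non-idle % for (cpu 0, cpu 1, cpu 2), copied from the result tables.
audio = {
    "pas-v7_1": (16.17, 0.03, 0.24),
    "pas-v7_2": (16.08, 0.28, 0.07),
    "pst-v3_1": (15.18, 2.76, 1.70),
    "pst-v3_2": (15.13, 0.80, 0.38),
    "pst-v4_1": (16.14, 0.05, 0.00),
    "pst-v4_2": (16.34, 0.06, 0.00),
}
cyclictest = {
    "pas-v7_1": (10.20, 0.02, 11.50),
    "pas-v7_2": (7.86, 14.31, 0.02),
    "pst-v3_1": (20.44, 8.68, 7.97),
    "pst-v3_2": (20.41, 0.78, 1.00),
    "pst-v4_1": (21.32, 0.21, 0.05),
    "pst-v4_2": (21.56, 0.21, 0.04),
}


def nearly_idle_cpus(run, threshold=THRESHOLD):
    """Count the cpus in one run whose non-idle % is below the threshold."""
    return sum(1 for pct in run if pct < threshold)


if __name__ == "__main__":
    for bench, table in (("audio", audio), ("cyclictest", cyclictest)):
        for name, run in table.items():
            print(f"{bench:10s} {name:9s} {nearly_idle_cpus(run)}")
```

Both pst-v4 runs have two cpus under the threshold in both benchmarks.
pas-v7 keeps only one cpu under it in each cyclictest run, although its
first audio run (pas-v7_1) also has two.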
> From a packing perspective pst-v4 seems to do the best job for the
> workloads that I have tested on ARM TC2. The less aggressive packing in
> pst-v3 may be a better choice in terms of performance.
> 
> I'm well aware that these tests are heavily focused on mobile workloads.
> I would therefore encourage people to share your test results for your
> workloads on your platforms to complete the picture. Comments are also
> welcome.
> 
> Thanks,
> Morten
> 
> 


-- 
Thanks
    Alex