linux-kernel.vger.kernel.org archive mirror
From: Paul Turner <pjt@google.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alex Shi <alex.shi@intel.com>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Arjan van de Ven <arjan@linux.intel.com>,
	vincent.guittot@linaro.org, svaidy@linux.vnet.ibm.com,
	Ingo Molnar <mingo@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [discussion]sched: a rough proposal to enable power saving in scheduler
Date: Fri, 17 Aug 2012 01:43:25 -0700	[thread overview]
Message-ID: <CAPM31RKZOk92NS5jrbQXiY7hZO5LRdfPWKt9+pSOS3OvkSrRng@mail.gmail.com> (raw)
In-Reply-To: <1345028738.31459.82.camel@twins>

On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power-saving consideration in the CFS scheduler, I
>> have a very rough idea for enabling a new power-saving scheme in CFS.
>
> Adding Thomas, he always delights in poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1. If many tasks crowd the system, letting only a few domains' cpus run
>> while the other cpus idle does not save power. Letting all cpus take the
>> load, finish tasks early, and then go idle will save more power and give
>> a better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2. Scheduling domains and scheduling groups match the hardware, which is
>> also the power-consumption unit. So pulling all tasks out of a domain
>> means that power-consumption unit can potentially go idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, following what Peter mentioned in commit 8e7fbcbc22c ("sched: Remove
>> stale power aware scheduling"), this proposal adopts the
>> sched_balance_policy concept and uses two kinds of policy: performance
>> and power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
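A minimal userspace sketch of that "auto" idea (Python, purely illustrative, not kernel code): the sysfs path below is a common convention but varies by platform, and both function names are invented for this example.

```python
# Hypothetical sketch of the "auto" policy: pick the balance policy
# from AC/battery status. The sysfs path is an assumption (it varies
# by platform); the helper names are invented for illustration.

def pick_policy(ac_online):
    """Map power-supply state to a sched_balance_policy value."""
    return "performance" if ac_online else "power"

def read_ac_online(path="/sys/class/power_supply/AC/online"):
    """Return True when the AC adapter reports online."""
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except OSError:
        return True  # no AC supply node present: assume mains power
```

A real implementation would also react to UPS status and similar events rather than polling, as the paragraph above suggests.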
>
>> In scheduling, two places will consult the policy: load_balance() and
>> task fork/exec via select_task_rq_fair().
>
> ack
>
>> Here is some pseudo-code trying to explain the proposed behaviour in
>> load_balance() and select_task_rq_fair():
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>       update_sd_lb_stats(); // get busiest-group, idlest-group data
>>
>>       if (sd->nr_running > sd's capacity) {
>>               // the power-saving policy is not suitable for this
>>               // scenario; behave like the performance policy
>>               move tasks from the busiest cpu in the busiest group
>>               to the idlest cpu in the idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)

Yes -- I just got back from Africa this week.  It's updated for almost
all the previous comments but I ran out of time before I left to
re-post.  I'm just about caught up enough that I should be able to get
this done over the upcoming weekend.  Monday at the latest.

>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>       } else { // the sd has enough capacity to hold all tasks
>>               if (sg->nr_running > sg's capacity) {
>>                       // imbalance between groups
>>                       if (schedule policy == performance) {
>>                               // when two busiest groups are equally
>>                               // busy, should we prefer the one with
>>                               // the softest group??
>>                               move tasks from the busiest group to
>>                                       the idlest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>                       } else if (schedule policy == power)
>>                               move tasks from the busiest group to the
>>                               idlest group until the busiest is just at
>>                               full capacity;
>>                               // the busiest group can then balance
>>                               // internally on the next LB pass
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that; you can revive some of that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>>               } else {
>>                       // all groups have enough capacity for their tasks
>>                       if (schedule policy == performance)
>>                               // all tasks may have enough cpu
>>                               // resources to run;
>>                               // move tasks from busiest to idlest group?
>>                               // no, at this point it is better to keep
>>                               // each task on its current cpu,
>>                               // so it may be better to balance
>>                               // within each group instead
>>                               for_each_imbalanced_group()
>>                                       move tasks from the busiest cpu to
>>                                       the idlest cpu within the group;
>>                       else if (schedule policy == power) {
>>                               if (no hard pin in the idlest group)
>>                                       move tasks from the idlest group to
>>                                       the busiest until the busiest is full;
>>                               else
>>                                       move unpinned tasks to the biggest
>>                                       hard-pinned group;
>>                       }
>>               }
>>       }
>> }
>
> OK, so you only start to group later.. I think we can do better than
> that.
>
>>
>> sub-proposal:
>> 1. Could we balance tasks onto the idlest cpu rather than the appointed
>> 'balance cpu'? If so, it may save one extra round of balancing.
>> The idlest cpu should prefer a newly idle cpu, and is the least-loaded cpu.
>> 2. The se or task load is good for accounting running time, but it
>> should be the second criterion in load balancing. The first criterion
>> for LB is the number of running tasks in a group/cpu: whatever the
>> weight of a group, if its task count is less than its cpu count, the
>> group still has capacity to take more tasks. (SMT cpu power and
>> big/little cpu capacities on ARM will be considered later.)
>
> Ah, no, we shouldn't balance on nr_running but on the amount of time
> consumed. Imagine two tasks being woken at the same time; both tasks
> will only run a fraction of the available time, and you don't want this
> to count as exceeding your capacity, because run back to back the one
> cpu would still be mostly idle.
>
> What you want is to keep track of a per-cpu utilization level (the
> inverse of idle-time) and, using PJT's per-task runnable avg, see
> whether placing the new task there would exceed the utilization limit.

Observations of the runnable average also have the nice property that
it quickly converges to 100% when over-scheduled.

Since we also have the usage average for a single task the ratio of
used avg:runnable avg is likely a useful pointwise estimate.
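As a rough illustration of the kind of signal involved, here is a toy geometrically decayed runnable average (Python, not the kernel's per-entity load tracking code; the decay factor is an invented example, not the kernel's constant):

```python
# Toy decayed runnable average: each scheduling period contributes the
# fraction of time the task was runnable, and older periods decay
# geometrically. DECAY is an illustrative choice, not the kernel's
# actual constant.

DECAY = 0.5

def runnable_avg(samples):
    """samples: per-period runnable fractions in [0, 1], oldest first.
    Returns a decayed average weighted toward recent periods."""
    acc = 0.0
    weight = 0.0
    for frac in samples:
        acc = acc * DECAY + frac
        weight = weight * DECAY + 1.0
    return acc / weight if weight else 0.0
```

A task that is always runnable yields 1.0, i.e. the signal saturates when over-scheduled, matching the convergence observation above; a used:runnable ratio would then estimate how much of that runnable time was actually consumed.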

>
> I think some of the Linaro people actually played around with this,
> Vincent?
>
>> unsolved issues:
>> 1. Like the current scheduler, this does not handle cpu affinity well
>> in load_balance().
>
> cpu affinity is always 'fun'.. while there's still a few fun sites in
> the current load-balancer we do better than we did a while ago.
>
>> 2. Task groups are not considered well in this rough proposal.
>
> You mean the cgroup mess?
>
>> This proposal is not fully thought through and may contain mistakes. I
>> am just sharing my ideas and hope they become better and workable
>> through your comments and discussion.
>
> Very simplistically the current scheme is a 'spread' the load scheme
> (SD_PREFER_SIBLING if you will). We spread load to maximize per-task
> cache and cpu power.
>
> The power scheme should be a 'pack' scheme, where we minimize the active
> power domains.
>
> One way to implement this is to keep track of an active and
> under-utilized power domain (the target) and fail the regular (pull)
> load-balance for all cpus not in that domain. For the cpus that are in
> that domain we'll have find_busiest select from all other under-utilized
> domains pulling tasks to fill our target, once full, we pick a new
> target, goto 1.
>
>
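That target-filling loop might be sketched like this (Python, purely illustrative; the domain representation and the capacity/utilization numbers are invented, and real code would of course live inside the load balancer):

```python
# Hypothetical sketch of the "pack" scheme: keep filling one active,
# under-utilized power domain (the target) from other under-utilized
# domains; once the target is full, pick a new one. Domains are modeled
# as dicts with invented "util" and "cap" fields for illustration.

def pick_target(domains):
    """Pick an active, under-utilized domain to fill next."""
    active = [d for d in domains if 0 < d["util"] < d["cap"]]
    # Prefer the most-utilized non-full domain so it is topped up first.
    return max(active, key=lambda d: d["util"], default=None)

def pack(domains):
    """Drain under-utilized domains into the target until it is full,
    then pick a new target; stop when no donors remain."""
    target = pick_target(domains)
    while target is not None:
        donors = [d for d in domains
                  if d is not target and 0 < d["util"] < d["cap"]]
        if not donors:
            break
        donor = min(donors, key=lambda d: d["util"])  # drain least busy
        moved = min(target["cap"] - target["util"], donor["util"])
        donor["util"] -= moved
        target["util"] += moved
        if target["util"] >= target["cap"]:
            target = pick_target(domains)  # full: choose a new target
    return domains
```

Running this on three half-idle domains concentrates all load into one domain and leaves the others completely idle, which is the point of the scheme: idle power domains can be powered down.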
