linux-pm.vger.kernel.org archive mirror
* Re: power-efficient scheduling design
       [not found]   ` <20130608112801.GA8120@MacBook-Pro.local>
@ 2013-06-08 14:02     ` Rafael J. Wysocki
  2013-06-09  3:42       ` Preeti U Murthy
  0 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2013-06-08 14:02 UTC (permalink / raw)
  To: Catalin Marinas, Preeti U Murthy
  Cc: Ingo Molnar, Morten Rasmussen, alex.shi, Peter Zijlstra,
	Vincent Guittot, Mike Galbraith, pjt, Linux Kernel Mailing List,
	linaro-kernel, arjan, len.brown, corbet, Andrew Morton,
	Linus Torvalds, Thomas Gleixner, Linux PM list

On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
> > On 06/07/2013 08:21 PM, Catalin Marinas wrote:
> > > I think you are missing Ingo's point. It's not about the scheduler
> > > complying with decisions made by various governors in the kernel
> > > (which may or may not have enough information) but rather the
> > > scheduler being in a better position for making such decisions.
> > 
> > My mail pointed out that I disagree with this design ("the scheduler
> > being in a better position for making such decisions").
> > I think it should be a 2 way co-operation. I have elaborated below.

I agree with that.

> > > Take the cpuidle example, it uses the load average of the CPUs,
> > > however this load average is currently controlled by the scheduler
> > > (load balance). Rather than using a load average that degrades over
> > > time and gradually putting the CPU into deeper sleep states, the
> > > scheduler could predict more accurately that a run-queue won't have
> > > any work over the next x ms and ask for a deeper sleep state from the
> > > beginning.
> > 
> > How will the scheduler know that there will not be work in the near
> > future? How will the scheduler ask for a deeper sleep state?
> > 
> > My answer to the above two questions are, the scheduler cannot know how
> > much work will come up. All it knows is the current load of the
> > runqueues and the nature of the task (thanks to PJT's metric). It
> > can then match the task load to the cpu capacity and schedule the tasks
> > on the appropriate cpus.
> 
> The scheduler can decide to load a single CPU or cluster and let the
> others idle. If the total CPU load can fit into a smaller number of CPUs
> it could as well tell cpuidle to go into deeper state from the
> beginning as it moved all the tasks elsewhere.

So why can't it do that today?  What's the problem?

> Regarding future work, neither cpuidle nor the scheduler know this but
> the scheduler would make a better prediction, for example by tracking
> task periodicity.

Well, basically, two pieces of information are needed to make target idle
state selections: (1) when the CPU (core or package) is going to be used
next time and (2) how much latency for going back to the non-idle state
can be tolerated.  While the scheduler knows (1) to some extent (arguably,
it generally cannot predict when hardware interrupts are going to occur),
I'm not really sure about (2).
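As a rough illustration of what selecting a target idle state from those two pieces of information could look like, here is a userspace sketch (the state table, numbers, and function names are made up for illustration, not real hardware data or kernel code):

```python
# Hypothetical sketch: pick the deepest C-state whose target residency fits
# the predicted idle duration (1) and whose exit latency fits the tolerated
# wakeup latency (2).

# (name, target_residency_us, exit_latency_us) -- illustrative numbers only,
# ordered shallow to deep
C_STATES = [
    ("C1", 2, 1),
    ("C3", 100, 50),
    ("C6", 500, 200),
]

def pick_idle_state(predicted_idle_us, latency_tolerance_us):
    """Deepest state that (a) pays off for the predicted idle time and
    (b) can be exited within the tolerated latency."""
    chosen = None
    for name, residency, exit_lat in C_STATES:
        if predicted_idle_us >= residency and exit_lat <= latency_tolerance_us:
            chosen = name
    return chosen

print(pick_idle_state(1000, 300))   # long idle, relaxed latency -> C6
print(pick_idle_state(1000, 60))    # latency-bound -> C3
print(pick_idle_state(1, 300))      # too short for anything -> None
```

The point of the sketch is that (1) alone is not enough: without the latency tolerance (2), the second call above could not be answered correctly.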

> > As a consequence, it leaves certain cpus idle. The load of these cpus
> > degrade. It is via this load that the scheduler asks for a deeper sleep
> > state. Right here we have scheduler talking to the cpuidle governor.
> 
> So we agree that the scheduler _tells_ the cpuidle governor when to go
> idle (but not how deep).

It does indicate to cpuidle how deep it can go, however, by providing it with
the information about when the CPU is going to be used next time (from the
scheduler's perspective).

> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
> cpuidle does not get enough information from the scheduler (arguably this
> could be fixed)

OK, so what information is missing in your opinion?

> and (2) the scheduler does not have any information about the idle states
> (power gating etc.) to make any informed decision on which/when CPUs should
> go idle.

That's correct, which is a drawback.  However, on some systems it may never
have that information (because hardware coordinates idle states in a way that
is opaque to the OS - e.g. by autopromoting deeper states when idle for
sufficiently long time) and on some systems that information may change over
time (i.e. the availability of specific idle states may depend on factors
that aren't constant).

If you attempted to take all of the possible complications related to hardware
designs in that area into account in the scheduler, you'd end up with a
completely unmaintainable piece of code.

> As you said, it is a non-optimal one-way communication but the solution
> is not feedback loop from cpuidle into scheduler. It's like the
> scheduler managed by chance to get the CPU into a deeper sleep state and
> now you'd like the scheduler to get feedback from cpuidle and not
> disturb that CPU anymore. That's the closed loop I disagree with. Could
> the scheduler not make this informed decision before - it has this total
> load, let's get this CPU into deeper sleep state?

No, it couldn't in general, for the above reasons.

> > I don't see what the problem is with the cpuidle governor waiting for
> > the load to degrade before putting that cpu to sleep. In my opinion,
> > putting a cpu to deeper sleep states should happen gradually.

If we know in advance that the CPU can be put into idle state Cn, there is no
reason to put it into anything shallower than that.

On the other hand, if the CPU is in Cn already and there is a possibility to
put it into a deeper low-power state (which we didn't know about before), it
may make sense to promote it into that state (if that's safe) or even wake it
up and idle it again.

> > This means time will tell the governors what kinds of workloads are running
> > on the system. If the cpu is idle for long, it probably means that the system
> > is less loaded and it makes sense to put the cpus to deeper sleep
> > states. Of course there could be sporadic bursts or quieting down of
> > tasks, but these are corner cases.
> 
> There's nothing wrong with degrading given the information that cpuidle
> currently has. It's a heuristic that has worked ok so far and may continue
> to do so. But see my comments above on why the scheduler could make more
> informed decisions.
> 
> We may not move all the power gating information to the scheduler but
> maybe find a way to abstract this by giving more hints via the CPU and
> cache topology. The cpuidle framework (it may not be much left of a
> governor) would then take hints about estimated idle time and invoke the
> low-level driver about the right C state.

Overall, it looks like it'd be better to split the governor "layer" between the
scheduler and the idle driver with a well defined interface between them.  That
interface needs to be general enough to be independent of the underlying
hardware.

We need to determine what kinds of information should be passed both ways and
how to represent it.
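As a rough sketch of what such a two-way, hardware-independent interface might carry (all type names and fields here are hypothetical assumptions, not an existing kernel API):

```python
# Illustrative-only sketch of the information flowing each way across a
# split governor "layer"; nothing here is real kernel code.
from dataclasses import dataclass

@dataclass
class IdleHint:            # scheduler -> idle layer
    cpu: int
    next_use_ns: int       # when the scheduler expects to need this CPU again
    latency_limit_ns: int  # wakeup latency that can be tolerated

@dataclass
class IdleCapability:      # idle layer -> scheduler
    cpu: int
    min_residency_ns: int  # shortest idle period worth a deep state at all
    wakeup_cost_ns: int    # rough cost of bringing the CPU back

def worth_idling(hint, cap):
    """A generic test both sides can agree on without hardware details."""
    return (hint.next_use_ns >= cap.min_residency_ns
            and cap.wakeup_cost_ns <= hint.latency_limit_ns)

print(worth_idling(IdleHint(0, 500_000, 50_000),
                   IdleCapability(0, 100_000, 20_000)))  # -> True
```

The design point is that neither side needs the other's internals: the scheduler never sees C-state tables, and the idle driver never sees run-queues.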

> > > Of course, you could export more scheduler information to cpuidle,
> > > various hooks (task wakeup etc.) but then we have another framework,
> > > cpufreq. It also decides the CPU parameters (frequency) based on the
> > > load controlled by the scheduler. Can cpufreq decide whether it's
> > > better to keep the CPU at higher frequency so that it gets to idle
> > > quicker and therefore deeper sleep states? I don't think it has enough
> > > information because there are at least three deciding factors
> > > (cpufreq, cpuidle and scheduler's load balancing) which are not
> > > unified.
> > 
> > Why not? When the cpu load is high, cpu frequency governor knows it has
> > to boost the frequency of that CPU. The task gets over quickly, the CPU
> > goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> > sleep state gradually.
> 
> The cpufreq governor boosts the frequency enough to cover the load,
> which means reducing the idle time. It does not know whether it is
> better to boost the frequency twice as high so that it gets to idle
> quicker. You can change the governor's policy but does it have any
> information from cpuidle?

Well, it may get that information directly from the hardware.  Actually,
intel_pstate does that, but intel_pstate is the governor and the scaling
driver combined.

> > Meanwhile the scheduler should ensure that the tasks are retained on
> > that CPU,whose frequency is boosted and should not load balance it, so
> > that they can get over quickly. This I think is what is missing. Again
> > this comes down to the scheduler taking feedback from the CPU frequency
> > governors which is not currently happening.
> 
> Same loop again. The cpu load goes high because (a) there is more work,
> possibly triggered by external events, and (b) the scheduler decided to
> balance the CPUs in a certain way. As for cpuidle above, the scheduler
> has direct influence on the cpufreq decisions. How would the scheduler
> know which CPU not to balance against? Are CPUs in a cluster
> synchronous? Is it better to let the other CPUs idle or more efficient to run
> this cluster at half-speed?
> 
> Let's say there is an increase in the load, does the scheduler wait
> until cpufreq figures this out or tries to take the other CPUs out of
> idle? Who's making this decision? That's currently a potentially
> unstable loop.

Yes, it is and I don't think we currently have good answers here.

The results of many measurements seem to indicate that it generally is better
to do the work as quickly as possible and then go idle again, but there are
costs associated with going back and forth from idle to non-idle etc.

The main problem with cpufreq that I personally have is that the governors
carry out their own sampling with pretty much arbitrary resolution that may
lead to suboptimal decisions.  It would be much better if the scheduler
indicated when to *consider* the changing of CPU performance parameters (that
may not be frequency alone and not even frequency at all in general), more or
less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
should decide what performance points to run at.
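The difference between the governor sampling on its own and the scheduler invoking it can be sketched as follows (a toy userspace model with made-up thresholds and frequencies, not the actual cpufreq code):

```python
# Toy model: the governor re-evaluates the operating point only when the
# scheduler calls it (enqueue, dequeue, a large load change), instead of on
# its own arbitrary-resolution sampling timer.

class EventDrivenGovernor:
    def __init__(self):
        self.freq_khz = 800_000    # illustrative minimum operating point
        self.evaluations = 0
    def consider(self, load_pct):
        """Called by the scheduler when performance may need to change."""
        self.evaluations += 1
        if load_pct > 80:
            self.freq_khz = 2_400_000   # boost
        elif load_pct < 20:
            self.freq_khz = 800_000     # drop back
        # in between: keep the current operating point

gov = EventDrivenGovernor()
for load in (10, 90, 85, 15):   # scheduler-driven events, not timer samples
    gov.consider(load)
print(gov.freq_khz, gov.evaluations)  # -> 800000 4
```

Only four evaluations happen, each at a moment the scheduler knows is meaningful; a sampling governor would have run on every tick of its timer regardless of whether anything changed.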

> > >> I would repeat here that today we interface cpuidle/cpufrequency
> > >> policies with scheduler but not the other way around. They do their bit
> > >> when a cpu is busy/idle. However scheduler does not see that somebody
> > >> else is taking instructions from it and comes back to give different
> > >> instructions!
> > > 
> > > The key here is that cpuidle/cpufreq make their primary decision based
> > > on something controlled by the scheduler: the CPU load (via run-queue
> > > balancing). You would then like the scheduler take such decision back
> > > into account. It just looks like a closed loop, possibly 'unstable' .
> > 
> > Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
> > closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
> > closed loop? Here too the scheduler should be made well aware of the
> > decisions it took in the past right?
> 
> It's more like:
> 
> scheduler -> cpuidle/cpufreq -> hardware operating point
>    ^                                      |
>    +--------------------------------------+
> 
> You can argue that you can make an adaptive loop that works fine but
> there are so many parameters that I don't see how it would work. The
> patches so far don't seem to address this. Small task packing, while
> useful, is just a heuristic at the scheduler level.

I agree.

> With a combined decision maker, you aim to reduce this separate decision
> process and feedback loop. Probably impossible to eliminate the loop
> completely because of hardware latencies, PLLs, CPU frequency not always
> the main factor, but you can make the loop more tolerant to
> instabilities.

Well, in theory. :-)

Another question to ask is whether or not the structure of our software
reflects the underlying problem.  I mean, on the one hand there is the
scheduler that needs to optimally assign work items to computational units
(hyperthreads, CPU cores, packages) and on the other hand there's hardware
with different capabilities (idle states, performance points etc.).  Arguably,
the scheduler internals cannot cover all of the differences between all of the
existing types of hardware Linux can run on, so there needs to be a layer of
code providing an interface between the scheduler and the hardware.  But that
layer of code needs to be just *one*, so why do we have *two* different
frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
the scheduler, but not to each other?

To me, the reason is history, and more precisely the fact that cpufreq had been
there first, then came cpuidle, and only then people started to realize that
some scheduler tweaks may allow us to save energy without sacrificing too
much performance.  However, it looks like there's time to go back and see how
we can integrate all that.  And there's more, because we may need to take power
budgets and thermal management into account as well (i.e. we may not be allowed
to use full performance of the processors all the time because of some
additional limitations) and the CPUs may be members of power domains, so what
we can do with them may depend on the states of other devices.

> > > So I think we either (a) come up with 'clearer' separation of
> > > responsibilities between scheduler and cpufreq/cpuidle 
> > 
> > I agree with this. This is what I have been emphasizing, if we feel that
> > the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
> > information that they use to make their decisions, let us improve them.
> > But this will not yield us any improvement if the scheduler does not
> > have enough information. And IMHO, the next fundamental information that
> > the scheduler needs should come from cpufreq and cpuidle.
> 
> What kind of information? Your suggestion that the scheduler should
> avoid loading a CPU because it went idle is wrong IMHO. It went idle
> because the scheduler decided this in the first instance.
> 
> > Then we should move onto supplying scheduler information from the power
> > domain topology, thermal factors, user policies.
> 
> I agree with this but at this point you get the scheduler to make more
> informed decisions about task placement. It can then give more precise
> hints to cpufreq/cpuidle like the predicted load and those frameworks
> could become dumber in time, just complying with the requested
> performance level (trying to break the loop above).

Well, there's nothing like "predicted load".  At best, we may be able to make
more or less educated guesses about it, so in my opinion it is better to use
the information about what happened in the past for making decisions regarding
the current settings and re-adjust them over time as we get more information.

So how much decision making regarding the idle state to put the given CPU into
should be there in the scheduler?  I believe the only information coming out
of the scheduler regarding that should be "OK, this CPU is now idle and I'll
need it in X nanoseconds from now" plus possibly a hint about the wakeup
latency tolerance (but those hints may come from other places too).  That said,
the decision as to *which* CPU should become idle at the moment may very well require
some information about what options are available from the layer below (for
example, "putting core X into idle for Y of time will save us Z energy" or
something like that).
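To illustrate that last point, a sketch of the which-CPU decision using such estimates from the layer below (the energy numbers, the callback shape, and the function names are purely hypothetical):

```python
# Hypothetical sketch: the scheduler picks which CPU to idle based on
# "putting CPU X into idle for time Y will save us Z energy" estimates
# supplied by the layer below.

def pick_cpu_to_idle(candidates, idle_time_ns):
    """candidates: {cpu: estimate_fn}, where estimate_fn(idle_time_ns)
    returns the energy saved (in some unit) by idling that CPU."""
    best_cpu, best_saving = None, 0
    for cpu, estimate in candidates.items():
        saving = estimate(idle_time_ns)
        if saving > best_saving:
            best_cpu, best_saving = cpu, saving
    return best_cpu

# e.g. CPU 1 is the last busy CPU in a power domain that can be gated,
# so idling it saves more than idling CPU 0 (numbers are invented)
candidates = {
    0: lambda t: t // 1000,        # modest per-CPU savings
    1: lambda t: t // 1000 * 3,    # whole domain can be power-gated
}
print(pick_cpu_to_idle(candidates, 2_000_000))  # -> 1
```

The scheduler itself stays hardware-agnostic; all the hardware knowledge is behind the estimate callbacks.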

And what about performance scaling?  Quite frankly, in my opinion that
requires some more investigation, because there still are some open questions
in that area.  To start with we can just continue using the current heuristics,
but perhaps with the scheduler calling the scaling "governor" when it sees fit
instead of that "governor" running kind of in parallel with it.

> > > or (b) come up
> > > with a unified load-balancing/cpufreq/cpuidle implementation as per
> > > Ingo's request. The latter is harder but, with a good design, has
> > > potentially a lot more benefits.
> > > 
> > > A possible implementation for (a) is to let the scheduler focus on
> > > performance load-balancing but control the balance ratio from a
> > > cpufreq governor (via things like arch_scale_freq_power() or something
> > > new). CPUfreq would not be concerned just with individual CPU
> > > load/frequency but also making a decision on how tasks are balanced
> > > between CPUs based on the overall load (e.g. four CPUs are enough for
> > > the current load, I can shut the other four off by telling the
> > > scheduler not to use them).
> > > 
> > > As for Ingo's preferred solution (b), a proposal forward could be to
> > > factor the load balancing out of kernel/sched/fair.c and provide an
> > > abstract interface (like load_class?) for easier extending or
> > > different policies (e.g. small task packing). 
> > 
> >  Let me elaborate on the patches that have been posted so far on the
> > power awareness of the scheduler. When we say *power aware scheduler*
> > what exactly do we want it to do?
> > 
> > In my opinion, we want it to *avoid touching idle cpus*, so as to keep
> > them in that state longer and *keep more power domains idle*, so as to
> > yield power savings with them turned off. The patches released so far
> > are striving to do the latter. Correct me if I am wrong at this.
> 
> Don't take me wrong, task packing to keep more power domains idle is
> probably in the right direction but it may not address all issues. You
> realised this is not enough since you are now asking for the scheduler
> to take feedback from cpuidle. As I pointed out above, you try to create
> a loop which may or may not work, especially given the wide variety of
> hardware parameters.
> 
> > Also
> > feel free to point out any other expectation from the power aware
> > scheduler if I am missing any.
> 
> If the patches so far are enough and solved all the problems, you are
> not missing any. Otherwise, please see my view above.
> 
> Please define clearly what the scheduler, cpufreq, cpuidle should be
> doing and what communication should happen between them.
> 
> > If I have got Ingo's point right, the issues with them are that they are
> > not taking a holistic approach to meet the said goal.
> 
> Probably because scheduler changes, cpufreq and cpuidle are all trying
> to address the same thing but independently of each other and possibly
> conflicting.
> 
> > Keeping more power
> > domains idle (by packing tasks) would sound much better if the scheduler
> > has taken all aspects of doing such a thing into account, like
> > 
> > 1. How idle are the cpus, on the domain that it is packing
> > 2. Can they go to turbo mode, because if they do,then we cant pack
> > tasks. We would need certain cpus in that domain idle.
> > 3. Are the domains in which we pack tasks power gated?
> > 4. Will there be significant performance drop by packing? Meaning do the
> > tasks share cpu resources? If they do there will be severe contention.
> 
> So by this you add a lot more information about the power configuration
> into the scheduler, getting it to make more informed decisions about
> task scheduling. You may eventually reach a point where cpuidle governor
> doesn't have much to do (which may be a good thing) and reach Ingo's
> goal.
> 
> That's why I suggested maybe starting to take the load balancing out of
> fair.c and make it easily extensible (my opinion, the scheduler guys may
> disagree). Then make it more aware of topology, power configuration so
> that it makes the right task placement decision. You then get it to
> tell cpufreq about the expected performance requirements (frequency
> decided by cpufreq) and cpuidle about how long it could be idle for (you
> detect a periodic task every 1ms, or you don't have any at all because
> they were migrated, the right C state being decided by the governor).

There is another angle to look at that as I said somewhere above.

What if we could integrate cpuidle with cpufreq so that there is one code
layer representing what the hardware can do to the scheduler?  What benefits
can we get from that, if any?

Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: power-efficient scheduling design
  2013-06-08 14:02     ` power-efficient scheduling design Rafael J. Wysocki
@ 2013-06-09  3:42       ` Preeti U Murthy
  2013-06-09 22:53         ` Catalin Marinas
                           ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-06-09  3:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Catalin Marinas, Ingo Molnar, Morten Rasmussen, alex.shi,
	Peter Zijlstra, Vincent Guittot, Mike Galbraith, pjt,
	Linux Kernel Mailing List, linaro-kernel, arjan, len.brown,
	corbet, Andrew Morton, Linus Torvalds, Thomas Gleixner,
	Linux PM list

Hi Rafael,

On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
>> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
>>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>>> I think you are missing Ingo's point. It's not about the scheduler
>>>> complying with decisions made by various governors in the kernel
>>>> (which may or may not have enough information) but rather the
>>>> scheduler being in a better position for making such decisions.
>>>
>>> My mail pointed out that I disagree with this design ("the scheduler
>>> being in a better position for making such decisions").
>>> I think it should be a 2 way co-operation. I have elaborated below.
> 
> I agree with that.
> 
>>>> Take the cpuidle example, it uses the load average of the CPUs,
>>>> however this load average is currently controlled by the scheduler
>>>> (load balance). Rather than using a load average that degrades over
>>>> time and gradually putting the CPU into deeper sleep states, the
>>>> scheduler could predict more accurately that a run-queue won't have
>>>> any work over the next x ms and ask for a deeper sleep state from the
>>>> beginning.
>>>
>>> How will the scheduler know that there will not be work in the near
>>> future? How will the scheduler ask for a deeper sleep state?
>>>
>>> My answer to the above two questions are, the scheduler cannot know how
>>> much work will come up. All it knows is the current load of the
>>> runqueues and the nature of the task (thanks to PJT's metric). It
>>> can then match the task load to the cpu capacity and schedule the tasks
>>> on the appropriate cpus.
>>
>> The scheduler can decide to load a single CPU or cluster and let the
>> others idle. If the total CPU load can fit into a smaller number of CPUs
>> it could as well tell cpuidle to go into deeper state from the
>> beginning as it moved all the tasks elsewhere.
> 
> So why can't it do that today?  What's the problem?

The reason the scheduler does not do this today is the prefer_sibling
logic. Tasks within a core get distributed across cores when there is
more than one of them, since the cpu power of a single core is not high
enough to handle more than one task.

However, at the socket/MC level (a cluster at a low level), there can
be as many tasks as there are cores, because the socket has enough CPU
capacity to handle them. But the prefer_sibling logic moves tasks across
socket/MC-level domains even when load <= domain_capacity.

I think the reason the prefer_sibling logic was introduced is that the
scheduler tries to spread tasks across all the resources it has. It
believes that keeping tasks within a cluster/socket-level domain would
mean the tasks are throttled by having access to only the
cluster/socket-level resources. That is why it spreads.

The prefer_sibling logic is nothing but a flag set at domain level to
communicate to the scheduler that load should be spread across the
groups of this domain. In the above example across sockets/clusters.

But I think it is time we take another look at the prefer_sibling logic
and decide on its worthiness.
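The difference between today's prefer_sibling behavior and a capacity-aware alternative can be sketched like this (the load and capacity units are illustrative, and the capacity model is much simpler than the real one):

```python
# Sketch of the capacity check discussed above: spread only when the load
# does not fit the current group, rather than unconditionally preferring
# to spread across sibling groups.

def should_spread(group_load, group_capacity, prefer_sibling=True):
    """With prefer_sibling the scheduler spreads regardless of capacity;
    the alternative is to spread only on genuine overload."""
    if prefer_sibling:
        return True
    return group_load > group_capacity

# 4 tasks of load 256 each on a socket with capacity 4096:
load, capacity = 4 * 256, 4096
print(should_spread(load, capacity, prefer_sibling=True))   # spreads today
print(should_spread(load, capacity, prefer_sibling=False))  # could stay packed
```

In the second case the other socket's power domain could remain idle, which is exactly the packing opportunity prefer_sibling currently defeats.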

> 
>> Regarding future work, neither cpuidle nor the scheduler know this but
>> the scheduler would make a better prediction, for example by tracking
>> task periodicity.
> 
> Well, basically, two pieces of information are needed to make target idle
> state selections: (1) when the CPU (core or package) is going to be used
> next time and (2) how much latency for going back to the non-idle state
> can be tolerated.  While the scheduler knows (1) to some extent (arguably,
> it generally cannot predict when hardware interrupts are going to occur),
> I'm not really sure about (2).
> 
>>> As a consequence, it leaves certain cpus idle. The load of these cpus
>>> degrade. It is via this load that the scheduler asks for a deeper sleep
>>> state. Right here we have scheduler talking to the cpuidle governor.
>>
>> So we agree that the scheduler _tells_ the cpuidle governor when to go
>> idle (but not how deep).
> 
> It does indicate to cpuidle how deep it can go, however, by providing it with
> the information about when the CPU is going to be used next time (from the
> scheduler's perspective).
> 
>> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
>> cpuidle does not get enough information from the scheduler (arguably this
>> could be fixed)
> 
> OK, so what information is missing in your opinion?
> 
>> and (2) the scheduler does not have any information about the idle states
>> (power gating etc.) to make any informed decision on which/when CPUs should
>> go idle.
> 
> That's correct, which is a drawback.  However, on some systems it may never
> have that information (because hardware coordinates idle states in a way that
> is opaque to the OS - e.g. by autopromoting deeper states when idle for
> sufficiently long time) and on some systems that information may change over
> time (i.e. the availability of specific idle states may depend on factors
> that aren't constant).
> 
> If you attempted to take all of the possible complications related to hardware
> designs in that area into account in the scheduler, you'd end up with a
> completely unmaintainable piece of code.
> 
>> As you said, it is a non-optimal one-way communication but the solution
>> is not feedback loop from cpuidle into scheduler. It's like the
>> scheduler managed by chance to get the CPU into a deeper sleep state and
>> now you'd like the scheduler to get feedback from cpuidle and not
>> disturb that CPU anymore. That's the closed loop I disagree with. Could
>> the scheduler not make this informed decision before - it has this total
>> load, let's get this CPU into deeper sleep state?
> 
> No, it couldn't in general, for the above reasons.
> 
>>> I don't see what the problem is with the cpuidle governor waiting for
>>> the load to degrade before putting that cpu to sleep. In my opinion,
>>> putting a cpu to deeper sleep states should happen gradually.
> 
> If we know in advance that the CPU can be put into idle state Cn, there is no
> reason to put it into anything shallower than that.
> 
> On the other hand, if the CPU is in Cn already and there is a possibility to
> put it into a deeper low-power state (which we didn't know about before), it
> may make sense to promote it into that state (if that's safe) or even wake it
> up and idle it again.

Yes, sorry, I said it wrong in the previous mail. Today the cpuidle
governor is capable of putting a CPU into idle state Cn directly, by
looking at various factors like the current load, the next timer, the
history of interrupts and the exit latency of the states. At the end of
this evaluation it puts the CPU into idle state Cn.

It also verifies that its decision was right. This is with respect to
your statement "if there is a possibility to put it into a deeper low
power state". Before putting the cpu into the idle state, it queues a
timer for a time just after its predicted wake up. If the wake-up
prediction turns out to be wrong, this timer fires, wakes the cpu up,
and the cpu is then put into a deeper sleep state.
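A rough model of that predict-and-repair behavior (much simplified compared to the real governor; the state thresholds are invented):

```python
# Toy model: predict the idle duration, pick a state for it, and arm a
# backup timer just past the prediction; if the CPU is still idle when the
# timer fires, the prediction was too short and a deeper state is chosen.

def idle_with_repair(predicted_us, actual_idle_us, pick_state):
    state = pick_state(predicted_us)
    repair_at = predicted_us + 1            # timer just after predicted wakeup
    if actual_idle_us > repair_at:          # prediction was too short
        state = pick_state(actual_idle_us)  # re-enter a deeper state
    return state

# illustrative state-selection rule
pick = lambda us: "C6" if us >= 500 else "C3" if us >= 100 else "C1"

print(idle_with_repair(150, 120, pick))   # prediction ok   -> C3
print(idle_with_repair(150, 2000, pick))  # underestimated  -> C6
```

This is the "correction" path: a wrong prediction costs one extra wakeup, after which the cpu ends up in the state it should have been in all along.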

> 
>>> This means time will tell the governors what kinds of workloads are running
>>> on the system. If the cpu is idle for long, it probably means that the system
>>> is less loaded and it makes sense to put the cpus to deeper sleep
>>> states. Of course there could be sporadic bursts or quieting down of
>>> tasks, but these are corner cases.
>>
>> There's nothing wrong with degrading given the information that cpuidle
>> currently has. It's a heuristic that has worked ok so far and may continue
>> to do so. But see my comments above on why the scheduler could make more
>> informed decisions.
>>
>> We may not move all the power gating information to the scheduler but
>> maybe find a way to abstract this by giving more hints via the CPU and
>> cache topology. The cpuidle framework (it may not be much left of a
>> governor) would then take hints about estimated idle time and invoke the
>> low-level driver about the right C state.
> 
> Overall, it looks like it'd be better to split the governor "layer" between the
> scheduler and the idle driver with a well defined interface between them.  That
> interface needs to be general enough to be independent of the underlying
> hardware.
> 
> We need to determine what kinds of information should be passed both ways and
> how to represent it.

I agree with this design decision.

>>>> Of course, you could export more scheduler information to cpuidle,
>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>> information because there are at least three deciding factors
>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>> unified.
>>>
>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>> sleep state gradually.
>>
>> The cpufreq governor boosts the frequency enough to cover the load,
>> which means reducing the idle time. It does not know whether it is
>> better to boost the frequency twice as high so that it gets to idle
>> quicker. You can change the governor's policy but does it have any
>> information from cpuidle?
> 
> Well, it may get that information directly from the hardware.  Actually,
> intel_pstate does that, but intel_pstate is the governor and the scaling
> driver combined.

To add to this, cpufreq currently functions in the fashion below. I am
talking about the ondemand governor, since it is the most relevant to
our discussion.

----stepped up frequency------
  ----threshold--------
      -----stepped down freq level1---
        -----stepped down freq level2---
          ---stepped down freq level3----

If the cpu idle time is below a threshold, it boosts the frequency to
the level above the threshold straight away and does not vary it any
further. If the cpu idle time is above the threshold, the frequency is
stepped down by 5% of the current frequency at every sampling period,
provided the cpu behavior stays constant.

I think we can improve this implementation by better interaction with
cpuidle and the scheduler.
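A toy model of the behavior just described (the threshold, frequencies, and the 5% step are stand-ins matching the description above, not the governor's actual tunables):

```python
# Toy ondemand-like step: boost straight to max when idle time drops below
# the threshold, otherwise decay the frequency by 5% per sampling period.
# Integer arithmetic keeps the simulation exact.

def ondemand_step(freq_khz, idle_pct, max_khz=2_400_000, min_khz=800_000,
                  idle_threshold_pct=60):
    if idle_pct < idle_threshold_pct:       # busy: boost straight away
        return max_khz
    return max(freq_khz * 95 // 100, min_khz)  # idle: step down 5%

f = 2_400_000
for idle in (80, 80, 80):                   # three quiet sampling periods
    f = ondemand_step(f, idle)
print(f)                                    # -> 2057700
print(ondemand_step(f, 30))                 # load arrives -> 2400000
```

Note the asymmetry the sketch makes visible: stepping up is immediate and all-or-nothing, while stepping down takes many sampling periods regardless of how idle the cpu actually is.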

When it is stepping up the frequency, it should make the step a
*function of the current cpu load* as well; a function of the idle time
would also do.

When it is stepping down the frequency, it should interact with
cpuidle. It should get from cpuidle information about the idle state
that the cpu is in. The reason is that the cpufreq governor is aware
only of the idle time of the cpu, not of the idle state it is in. If it
learns that the cpu is in a deep idle state, it could step the
frequency down to level n straight away, just like cpuidle puts cpus
into state Cn directly.

Alternatively, just like stepping up, make the stepping down a function
of idle time too, perhaps fn(|threshold - idle_time|).

One more point to note: if cpuidle puts a cpu into an idle state that
clock gates it, then there is nothing for the cpufreq governor to do on
that cpu. cpufreq could check this with cpuidle before it queries the
cpu.
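The proposed step-down interaction could look roughly like this (every
name here is hypothetical; the kernel has no such cross-subsystem query
today):

```c
#include <assert.h>

enum c_state { C0, C1, C2, C3 };	/* C0 = running, C3 = deepest */

/*
 * Proposed step-down: if cpuidle reports that the cpu sleeps deeply,
 * drop to the lowest operating point straight away (the "level n"
 * above) instead of creeping down 5% per sample; if the state clock
 * gates the cpu, there is nothing for the cpufreq governor to do.
 */
static unsigned int stepdown_next_freq(unsigned int cur, unsigned int min,
				       enum c_state st, int clock_gated)
{
	if (clock_gated)
		return cur;	/* skip: the cpu clock is stopped anyway */
	if (st >= C2)
		return min;	/* deep idle: jump to level n directly */
	return cur - cur * 5 / 100;	/* shallow idle: usual gradual step */
}
```

The C2 cut-off and the frequencies are placeholders; the point is only
that the step-down path branches on information owned by cpuidle.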

> 
>>> Meanwhile the scheduler should ensure that the tasks are retained on
>>> that CPU,whose frequency is boosted and should not load balance it, so
>>> that they can get over quickly. This I think is what is missing. Again
>>> this comes down to the scheduler taking feedback from the CPU frequency
>>> governors which is not currently happening.
>>
>> Same loop again. The cpu load goes high because (a) there is more work,
>> possibly triggered by external events, and (b) the scheduler decided to
>> balance the CPUs in a certain way. As for cpuidle above, the scheduler
>> has direct influence on the cpufreq decisions. How would the scheduler
>> know which CPU not to balance against? Are CPUs in a cluster
>> synchronous? Is it better do let other CPU idle or more efficient to run
>> this cluster at half-speed?
>>
>> Let's say there is an increase in the load, does the scheduler wait
>> until cpufreq figures this out or tries to take the other CPUs out of
>> idle? Who's making this decision? That's currently a potentially
>> unstable loop.
> 
> Yes, it is and I don't think we currently have good answers here.

My answer to the above question is that the scheduler does not wait
until cpufreq figures it out. All the scheduler cares about today is
load balancing: spread the load and hope it finishes soon. It is even
possible today that the scheduler spreads the load before the cpufreq
governor has had a chance to boost the frequency.

As for the second question, it will wake up idle cpus if it must, in
order to load balance.

"Does the scheduler wait until cpufreq figures it out?" is a good
question. Currently the answer is no, it does not communicate with
cpufreq at all (except through cpu power, but that is the good part of
the story, so I will not get into it now). But maybe we should change
this. I think we can, in the following way.

When can the scheduler talk to cpufreq? It can do so under the
circumstances below:

1. Load is too high across the system, all cpus are loaded, and there
is no chance of load balancing. Therefore ask the cpufreq governor to
step up the frequency to improve performance.

2. The scheduler finds that if it has to load balance, it has to do so
onto cpus which are in a deep idle state (this logic is not present
currently, but is worth getting in). It then decides instead to
increase the frequency of the already loaded cpus to improve
performance, and calls the cpufreq governor.

3. The scheduler finds that if it has to load balance, it has to do so
onto a different power domain which is currently idle (shallow or
deep). It thinks better of it and calls the cpufreq governor to boost
the frequency of the cpus in the current domain.

While 2 and 3 depend on the scheduler having knowledge about idle
states and power domains, which it currently does not have, 1 can be
achieved with the current code. The scheduler keeps track of failed
load balancing efforts with lb_failed. If it finds that load balancing
from a busy group failed (lb_failed > 0), it can call the cpufreq
governor to step up the frequency of this busy cpu group, via
gov_check_cpu() in the cpufreq governor code.
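Schematically, point 1 could be wired up as below. lb_failed is a real
scheduler statistic and gov_check_cpu() is mentioned above, but the
glue hook itself (post_balance_hook() here) and the stub are
hypothetical:

```c
#include <assert.h>

struct sched_domain {
	unsigned int lb_failed;		/* failed load-balance attempts */
};

static int boost_requests;	/* counts stand-in gov_check_cpu() calls */

/* Stand-in for asking the cpufreq governor to re-evaluate this cpu. */
static void gov_check_cpu_stub(int cpu)
{
	(void)cpu;
	boost_requests++;
}

/* Hypothetical hook run after a balance pass over a busy group. */
static void post_balance_hook(const struct sched_domain *sd, int busiest_cpu)
{
	if (sd->lb_failed > 0)	/* could not spread: boost instead */
		gov_check_cpu_stub(busiest_cpu);
}
```

The scheduler side only signals "balancing failed, this group stays
busy"; deciding the actual frequency remains with the governor.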

> 
> The results of many measurements seem to indicate that it generally is better
> to do the work as quickly as possible and then go idle again, but there are
> costs associated with going back and forth from idle to non-idle etc.

I think we can balance out the costs and benefits of race-to-idle by
choosing when to do it wisely. For example, if points 2 and 3 above
hold (the idle cpus are in deep sleep states, or we would need to load
balance onto a different power domain), then step up the frequency of
the currently working cpus and reap the benefit.

> 
> The main problem with cpufreq that I personally have is that the governors
> carry out their own sampling with pretty much arbitrary resolution that may
> lead to suboptimal decisions.  It would be much better if the scheduler
> indicated when to *consider* the changing of CPU performance parameters (that
> may not be frequency alone and not even frequency at all in general), more or
> less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
> should decide what performance points to run at.

Very true. See points 1, 2 and 3 above, where I list when the scheduler
can call cpufreq. An idea of how the cpufreq governor can decide on the
scaling frequency is also stated above.

> 
>>>>> I would repeat here that today we interface cpuidle/cpufrequency
>>>>> policies with scheduler but not the other way around. They do their bit
>>>>> when a cpu is busy/idle. However scheduler does not see that somebody
>>>>> else is taking instructions from it and comes back to give different
>>>>> instructions!
>>>>
>>>> The key here is that cpuidle/cpufreq make their primary decision based
>>>> on something controlled by the scheduler: the CPU load (via run-queue
>>>> balancing). You would then like the scheduler take such decision back
>>>> into account. It just looks like a closed loop, possibly 'unstable' .
>>>
>>> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
>>> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
>>> closed loop? Here too the scheduler should be made well aware of the
>>> decisions it took in the past right?
>>
>> It's more like:
>>
>> scheduler -> cpuidle/cpufreq -> hardware operating point
>>    ^                                      |
>>    +--------------------------------------+
>>
>> You can argue that you can make an adaptive loop that works fine but
>> there are so many parameters that I don't see how it would work. The
>> patches so far don't seem to address this. Small task packing, while
>> useful, it's some heuristics just at the scheduler level.
> 
> I agree.
> 
>> With a combined decision maker, you aim to reduce this separate decision
>> process and feedback loop. Probably impossible to eliminate the loop
>> completely because of hardware latencies, PLLs, CPU frequency not always
>> the main factor, but you can make the loop more tolerant to
>> instabilities.
> 
> Well, in theory. :-)
> 
> Another question to ask is whether or not the structure of our software
> reflects the underlying problem.  I mean, on the one hand there is the
> scheduler that needs to optimally assign work items to computational units
> (hyperthreads, CPU cores, packages) and on the other hand there's hardware
> with different capabilities (idle states, performance points etc.).  Arguably,
> the scheduler internals cannot cover all of the differences between all of the
> existing types of hardware Linux can run on, so there needs to be a layer of
> code providing an interface between the scheduler and the hardware.  But that
> layer of code needs to be just *one*, so why do we have *two* different
> frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
> the scheduler, but not to each other?
> 
> To me, the reason is history, and more precisely the fact that cpufreq had been
> there first, then came cpuidle and only then poeple started to realize that
> some scheduler tweaks may allow us to save energy without sacrificing too
> much performance.  However, it looks like there's time to go back and see how
> we can integrate all that.  And there's more, because we may need to take power
> budgets and thermal management into account as well (i.e. we may not be allowed
> to use full performance of the processors all the time because of some
> additional limitations) and the CPUs may be members of power domains, so what
> we can do with them may depend on the states of other devices.
> 
>>>> So I think we either (a) come up with 'clearer' separation of
>>>> responsibilities between scheduler and cpufreq/cpuidle 
>>>
>>> I agree with this. This is what I have been emphasizing, if we feel that
>>> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
>>> information that they use to make their decisions, let us improve them.
>>> But this will not yield us any improvement if the scheduler does not
>>> have enough information. And IMHO, the next fundamental information that
>>> the scheduler needs should come from cpufreq and cpuidle.
>>
>> What kind of information? Your suggestion that the scheduler should
>> avoid loading a CPU because it went idle is wrong IMHO. It went idle
>> because the scheduler decided this in first instance.
>>
>>> Then we should move onto supplying scheduler information from the power
>>> domain topology, thermal factors, user policies.
>>
>> I agree with this but at this point you get the scheduler to make more
>> informed decisions about task placement. It can then give more precise
>> hints to cpufreq/cpuidle like the predicted load and those frameworks
>> could become dumber in time, just complying with the requested
>> performance level (trying to break the loop above).
> 
> Well, there's nothing like "predicted load".  At best, we may be able to make
> more or less educated guesses about it, so in my opinion it is better to use
> the information about what happened in the past for making decisions regarding
> the current settings and re-adjust them over time as we get more information.

Agree with this as well. The scheduler can at best supply information
about the historic load and hope that it defines the future as well.
Apart from this I don't know what other information the scheduler can
supply the cpuidle governor with.
> 
> So how much decision making regarding the idle state to put the given CPU into
> should be there in the scheduler?  I believe the only information coming out
> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
> need it in X nanoseconds from now" plus possibly a hint about the wakeup
> latency tolerance (but those hints may come from other places too).  That said
> the decision *which* CPU should become idle at the moment very well may require
> some information about what options are available from the layer below (for
> example, "putting core X into idle for Y of time will save us Z energy" or
> something like that).

Agree. Except that the information should be "OK, this CPU is now idle
and it has not done much work in the recent past; it is a 10% loaded CPU".

This can be said today using PJT's metric. It is then up to the cpuidle
governor to decide which idle state to go to. That is what happens today
too.
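As a sketch of that hand-off: the scheduler hands over a load figure
(from the per-entity load tracking) and the cpuidle governor maps it to
a state. Everything here, including the 10% and 40% cut-offs, is
invented for illustration:

```c
#include <assert.h>

enum c_state { C1, C2, C3 };		/* C3 = deepest sleep state */

/*
 * Pick an idle state from the hint "this cpu has been x% loaded
 * recently": a barely used cpu can afford the deepest state from the
 * start, while a busy one should expect a quick wakeup.
 */
static enum c_state pick_idle_state(unsigned int recent_load_pct)
{
	if (recent_load_pct < 10)
		return C3;
	if (recent_load_pct < 40)
		return C2;
	return C1;
}
```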

> 
> And what about performance scaling?  Quite frankly, in my opinion that
> requires some more investigation, because there still are some open questions
> in that area.  To start with we can just continue using the current heuristics,
> but perhaps with the scheduler calling the scaling "governor" when it sees fit
> instead of that "governor" running kind of in parallel with it.

Exactly. How this can be done is elaborated above. This is one of the
key things we need today, IMHO.

> 
>>>> or (b) come up
>>>> with a unified load-balancing/cpufreq/cpuidle implementation as per
>>>> Ingo's request. The latter is harder but, with a good design, has
>>>> potentially a lot more benefits.
>>>>
>>>> A possible implementation for (a) is to let the scheduler focus on
>>>> performance load-balancing but control the balance ratio from a
>>>> cpufreq governor (via things like arch_scale_freq_power() or something
>>>> new). CPUfreq would not be concerned just with individual CPU
>>>> load/frequency but also making a decision on how tasks are balanced
>>>> between CPUs based on the overall load (e.g. four CPUs are enough for
>>>> the current load, I can shut the other four off by telling the
>>>> scheduler not to use them).
>>>>
>>>> As for Ingo's preferred solution (b), a proposal forward could be to
>>>> factor the load balancing out of kernel/sched/fair.c and provide an
>>>> abstract interface (like load_class?) for easier extending or
>>>> different policies (e.g. small task packing). 
>>>
>>>  Let me elaborate on the patches that have been posted so far on the
>>> power awareness of the scheduler. When we say *power aware scheduler*
>>> what exactly do we want it to do?
>>>
>>> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
>>> them in that state longer and *keep more power domains idle*, so as to
>>> yield power savings with them turned off. The patches released so far
>>> are striving to do the latter. Correct me if I am wrong at this.
>>
>> Don't take me wrong, task packing to keep more power domains idle is
>> probably in the right direction but it may not address all issues. You
>> realised this is not enough since you are now asking for the scheduler
>> to take feedback from cpuidle. As I pointed out above, you try to create
>> a loop which may or may not work, especially given the wide variety of
>> hardware parameters.
>>
>>> Also
>>> feel free to point out any other expectation from the power aware
>>> scheduler if I am missing any.
>>
>> If the patches so far are enough and solved all the problems, you are
>> not missing any. Otherwise, please see my view above.
>>
>> Please define clearly what the scheduler, cpufreq, cpuidle should be
>> doing and what communication should happen between them.
>>
>>> If I have got Ingo's point right, the issues with them are that they are
>>> not taking a holistic approach to meet the said goal.
>>
>> Probably because scheduler changes, cpufreq and cpuidle are all trying
>> to address the same thing but independent of each other and possibly
>> conflicting.
>>
>>> Keeping more power
>>> domains idle (by packing tasks) would sound much better if the scheduler
>>> has taken all aspects of doing such a thing into account, like
>>>
>>> 1. How idle are the cpus, on the domain that it is packing
>>> 2. Can they go to turbo mode, because if they do,then we cant pack
>>> tasks. We would need certain cpus in that domain idle.
>>> 3. Are the domains in which we pack tasks power gated?
>>> 4. Will there be significant performance drop by packing? Meaning do the
>>> tasks share cpu resources? If they do there will be severe contention.
>>
>> So by this you add a lot more information about the power configuration
>> into the scheduler, getting it to make more informed decisions about
>> task scheduling. You may eventually reach a point where cpuidle governor
>> doesn't have much to do (which may be a good thing) and reach Ingo's
>> goal.
>>
>> That's why I suggested maybe starting to take the load balancing out of
>> fair.c and make it easily extensible (my opinion, the scheduler guys may
>> disagree). Then make it more aware of topology, power configuration so
>> that it makes the right task placement decision. You then get it to
>> tell cpufreq about the expected performance requirements (frequency
>> decided by cpufreq) and cpuidle about how long it could be idle for (you
>> detect a periodic task every 1ms, or you don't have any at all because
>> they were migrated, the right C state being decided by the governor).
> 
> There is another angle to look at that as I said somewhere above.
> 
> What if we could integrate cpuidle with cpufreq so that there is one code
> layer representing what the hardware can do to the scheduler?  What benefits
> can we get from that, if any?

We could debate this point; I am a bit confused about it. As I see it,
there is no problem with keeping them separate. One, code readability:
it is easy to understand which parameters the performance of a CPU
depends on, without needing to dig through the code. Two, cpufreq kicks
in primarily during runtime and cpuidle during the idle time of the cpu.

But this would also mean creating well-defined interfaces between them.
Integrating cpufreq and cpuidle is the better case to make, given their
common functionality at the higher level of talking to hardware and
tuning the performance parameters of the cpu. But I disagree that the
scheduler should be put into this common framework as well, as it has
functionality that is totally disjoint from what subsystems such as
cpuidle and cpufreq are intended to do.
> 
> Rafael
> 
> 

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-09  3:42       ` Preeti U Murthy
@ 2013-06-09 22:53         ` Catalin Marinas
  2013-06-10 16:25         ` Daniel Lezcano
  2013-06-11  0:50         ` Rafael J. Wysocki
  2 siblings, 0 replies; 15+ messages in thread
From: Catalin Marinas @ 2013-06-09 22:53 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Rafael J. Wysocki, Ingo Molnar, Morten Rasmussen, alex.shi,
	Peter Zijlstra, Vincent Guittot, Mike Galbraith, pjt,
	Linux Kernel Mailing List, linaro-kernel, arjan, len.brown,
	corbet, Andrew Morton, Linus Torvalds, Thomas Gleixner,
	Linux PM list

Hi Preeti,

(trimming lots of text, hopefully to make it easier to follow)

On Sun, Jun 09, 2013 at 04:42:18AM +0100, Preeti U Murthy wrote:
> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
> >>> Meanwhile the scheduler should ensure that the tasks are retained on
> >>> that CPU,whose frequency is boosted and should not load balance it, so
> >>> that they can get over quickly. This I think is what is missing. Again
> >>> this comes down to the scheduler taking feedback from the CPU frequency
> >>> governors which is not currently happening.
> >>
> >> Same loop again. The cpu load goes high because (a) there is more work,
> >> possibly triggered by external events, and (b) the scheduler decided to
> >> balance the CPUs in a certain way. As for cpuidle above, the scheduler
> >> has direct influence on the cpufreq decisions. How would the scheduler
> >> know which CPU not to balance against? Are CPUs in a cluster
> >> synchronous? Is it better do let other CPU idle or more efficient to run
> >> this cluster at half-speed?
> >>
> >> Let's say there is an increase in the load, does the scheduler wait
> >> until cpufreq figures this out or tries to take the other CPUs out of
> >> idle? Who's making this decision? That's currently a potentially
> >> unstable loop.
> >
> > Yes, it is and I don't think we currently have good answers here.
> 
> My answer to the above question is scheduler does not wait until cpufreq
> figures it out. All that the scheduler cares about today is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before cpu frequency governor can boost the
> frequency of cpu, the scheduler can spread the load.
> 
> As for the second question it will wakeup idle cpus if it must to load
> balance.

That's exactly my point. Such behaviour can become unstable (it probably
won't oscillate but it affects the power or performance).

> It is a good question asked: "does the scheduler wait until cpufreq
> figures it out." Currently the answer is no, it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so the following way.
> 
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
> 
> 1. Load is too high across the systems, all cpus are loaded, no chance
> of load balancing. Therefore ask cpu frequency governor to step up
> frequency to get improve performance.

Too high or too low loads across the whole system are relatively simple
scenarios: for the former boost the frequency (cpufreq can do this on
its own, the scheduler has nowhere to balance anyway), for the latter
pack small tasks (or other heuristics).

But the bigger issue is where some CPUs are idle while others are
running at a smaller frequency. With the current implementation it is
even hard to get into this asymmetric state (some cluster loaded while
the other in deep sleep) unless the load is low and you apply some small
task packing patch.

> 2. The scheduler finds out that if it has to load balance, it has to do
> so on cpus which are in deep idle state( Currently this logic is not
> present, but worth getting it in). It then decides to increase the
> frequency of the already loaded cpus to improve performance. It calls
> cpu freq governor.

So you say that the scheduler decides to increase the frequency of the
already loaded cpus to improve performance. Doesn't this mean that the
scheduler takes on some of the responsibilities of cpufreq? You now add
logic about boosting CPU frequency to the scheduler.

What's even more problematic is that cpufreq has policies decided by the
user (or pre-configured OS policies) but the scheduler is not aware of
them. Let's say the user wants a more conservative cpufreq policy, how
long should the scheduler wait for cpufreq to boost the frequency before
waking idle CPUs?

There are many questions like above. I'm not looking for specific
answers but rather trying get a higher level clear view of the
responsibilities of the three main factors contributing to
power/performance: load balancing (scheduler), cpufreq and cpuidle.

> 3. The scheduler finds out that if it has to load balance, it has to do
> so on a different power domain which is idle currently(shallow/deep). It
> thinks the better of it and calls cpu frequency governor to boost the
> frequency of the cpus in the current domain.

As for 2, the scheduler would make power decisions. Then why not make
a unified implementation? Or remove such decisions from the scheduler.

> > The results of many measurements seem to indicate that it generally is better
> > to do the work as quickly as possible and then go idle again, but there are
> > costs associated with going back and forth from idle to non-idle etc.
> 
> I think we can even out the cost benefit of race to idle, by choosing to
> do it wisely. Like for example if points 2 and 3 above are true (idle
> cpus are in deep sleep states or need to ld balance on a different power
> domain), then step up the frequency of the current working cpus and reap
> its benefit.

And such decision would be made by ...? I guess the scheduler again.

> > And what about performance scaling?  Quite frankly, in my opinion that
> > requires some more investigation, because there still are some open questions
> > in that area.  To start with we can just continue using the current heuristics,
> > but perhaps with the scheduler calling the scaling "governor" when it sees fit
> > instead of that "governor" running kind of in parallel with it.
> 
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today,IMHO.

The scheduler asking the cpufreq governor of what it needs is a too
simplistic view IMHO. What if the governor is conservative? How much
does the scheduler wait until the feedback loop reacts (CPU frequency
raised increasing the idle time so that the scheduler eventually
measures a smaller load)?

The scheduler could get more direct feedback from cpufreq like "I'll get
to this frequency in x ms" or not at all but then the scheduler needs to
make another power-related decision on whether to wait (be conservative)
or wake up an idle CPU. Do you want to add various power policies at the
scheduler level just to match the cpufreq ones?
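The direct-feedback variant sketched above might look like this (both
names and the decision rule are illustrative only, not an existing
interface):

```c
#include <assert.h>

/* Hypothetical cpufreq answer: "I'll reach the target in eta_us". */
struct freq_feedback {
	unsigned int eta_us;
};

/*
 * The choice described above: wait for the frequency ramp (be
 * conservative) or wake an idle cpu now. tolerance_us would encode the
 * user-selected power policy that the scheduler otherwise cannot see.
 */
static int should_wake_idle_cpu(struct freq_feedback fb,
				unsigned int tolerance_us)
{
	return fb.eta_us > tolerance_us;	/* ramp too slow: spread */
}
```

Even in this toy form the problem is visible: the tolerance value is
itself a power policy, duplicated at the scheduler level.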

> >> That's why I suggested maybe starting to take the load balancing out of
> >> fair.c and make it easily extensible (my opinion, the scheduler guys may
> >> disagree). Then make it more aware of topology, power configuration so
> >> that it makes the right task placement decision. You then get it to
> >> tell cpufreq about the expected performance requirements (frequency
> >> decided by cpufreq) and cpuidle about how long it could be idle for (you
> >> detect a periodic task every 1ms, or you don't have any at all because
> >> they were migrated, the right C state being decided by the governor).
> >
> > There is another angle to look at that as I said somewhere above.
> >
> > What if we could integrate cpuidle with cpufreq so that there is one code
> > layer representing what the hardware can do to the scheduler?  What benefits
> > can we get from that, if any?
> 
> We could debate on this point. I am a bit confused about this. As I see
> it, there is no problem with keeping them separately. One, because of
> code readability; it is easy to understand what are the different
> parameters that the performance of CPU depends on, without needing to
> dig through the code. Two, because cpu frequency kicks in during runtime
> primarily and cpuidle during idle time of the cpu.
> 
> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of cpu. But I disagree that
> scheduler should be put into this common framework as well as it has
> functionalities which are totally disjoint from what subsystems such as
> cpuidle and cpufreq are intended to do.

It's not about the whole scheduler but rather the load balancing, task
placement. You can try to create well defined interfaces between them
but first of all let's define clearly what responsibilities each of the
three frameworks have.

As I said in my first email on this subject, we could:

a) let the scheduler focus on performance only but control (restrict)
   the load balancing from cpufreq. For example via cpu_power, a value
   of 0 meaning don't balance against it. Cpufreq changes the frequency
   based on the load and may allow the scheduler to use idle CPUs. Such
   approach requires closer collaboration between cpufreq and cpuidle
   (possibly even merging them) and cpufreq needs to become even more
   aware of CPU topology.

or:

b) Merge the load balancer and cpufreq together (could leave cpuidle
   out initially) with a new design.
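Option (a) could be sketched as below. cpu_power is a real per-cpu
scheduler quantity, but treating 0 as "do not balance here" is the
proposal above, not current behaviour, and the selection function is a
toy:

```c
#include <assert.h>
#include <stddef.h>

/* Capacity per cpu as set by a cpufreq-driven policy; 0 = keep idle. */
static const unsigned int cpu_power[4] = { 1024, 1024, 0, 0 };

/* Pick the least loaded cpu, skipping those cpufreq has restricted. */
static int select_target_cpu(const unsigned int *nr_running, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (cpu_power[i] == 0)
			continue;	/* restricted: don't balance here */
		if (best < 0 || nr_running[i] < nr_running[best])
			best = (int)i;
	}
	return best;
}
```

Here cpus 2 and 3 stay idle no matter how loaded the rest are, until
cpufreq raises their cpu_power again.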

Any other proposals are welcome. So far they were either tweaks in
various places (small task packing) or are relatively vague (like we
need two-way communication between cpuidle and scheduler).

Best regards.

-- 
Catalin


* Re: power-efficient scheduling design
  2013-06-09  3:42       ` Preeti U Murthy
  2013-06-09 22:53         ` Catalin Marinas
@ 2013-06-10 16:25         ` Daniel Lezcano
  2013-06-12  0:27           ` David Lang
  2013-06-11  0:50         ` Rafael J. Wysocki
  2 siblings, 1 reply; 15+ messages in thread
From: Daniel Lezcano @ 2013-06-10 16:25 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Rafael J. Wysocki, Catalin Marinas, Ingo Molnar,
	Morten Rasmussen, alex.shi, Peter Zijlstra, Vincent Guittot,
	Mike Galbraith, pjt, Linux Kernel Mailing List, linaro-kernel,
	arjan, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

On 06/09/2013 05:42 AM, Preeti U Murthy wrote:
> Hi Rafael,
> 
> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
>> On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
>>> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
>>>> On 06/07/2013 08:21 PM, Catalin Marinas wrote:
>>>>> I think you are missing Ingo's point. It's not about the scheduler
>>>>> complying with decisions made by various governors in the kernel
>>>>> (which may or may not have enough information) but rather the
>>>>> scheduler being in a better position for making such decisions.
>>>>
>>>> My mail pointed out that I disagree with this design ("the scheduler
>>>> being in a better position for making such decisions").
>>>> I think it should be a 2 way co-operation. I have elaborated below.
>>
>> I agree with that.
>>
>>>>> Take the cpuidle example, it uses the load average of the CPUs,
>>>>> however this load average is currently controlled by the scheduler
>>>>> (load balance). Rather than using a load average that degrades over
>>>>> time and gradually putting the CPU into deeper sleep states, the
>>>>> scheduler could predict more accurately that a run-queue won't have
>>>>> any work over the next x ms and ask for a deeper sleep state from the
>>>>> beginning.
>>>>
>>>> How will the scheduler know that there will not be work in the near
>>>> future? How will the scheduler ask for a deeper sleep state?
>>>>
>>>> My answer to the above two questions are, the scheduler cannot know how
>>>> much work will come up. All it knows is the current load of the
>>>> runqueues and the nature of the task (thanks to the PJT's metric). It
>>>> can then match the task load to the cpu capacity and schedule the tasks
>>>> on the appropriate cpus.
>>>
>>> The scheduler can decide to load a single CPU or cluster and let the
>>> others idle. If the total CPU load can fit into a smaller number of CPUs
>>> it could as well tell cpuidle to go into deeper state from the
>>> beginning as it moved all the tasks elsewhere.
>>
>> So why can't it do that today?  What's the problem?
> 
> The reason that scheduler does not do it today is due to the
> prefer_sibling logic. The tasks within a core get distributed across
> cores if they are more than 1, since the cpu power of a core is not high
> enough to handle more than one task.
> 
> However at a socket level/ MC level (cluster at a low level), there can
> be as many tasks as there are cores because the socket has enough CPU
> capacity to handle them. But the prefer_sibling logic moves tasks across
> socket/MC level domains even when load<=domain_capacity.
> 
> I think the reason why the prefer_sibling logic was introduced, is that
> scheduler looks at spreading tasks across all the resources it has. It
> believes keeping tasks within a cluster/socket level domain would mean
> tasks are being throttled by having access to only the cluster/socket
> level resources. Which is why it spreads.
> 
> The prefer_sibling logic is nothing but a flag set at domain level to
> communicate to the scheduler that load should be spread across the
> groups of this domain. In the above example across sockets/clusters.
> 
> But I think it is time we take another look at the prefer_sibling logic
> and decide on its worthiness.
> 
>>
>>> Regarding future work, neither cpuidle nor the scheduler know this but
>>> the scheduler would make a better prediction, for example by tracking
>>> task periodicity.
>>
>> Well, basically, two pieces of information are needed to make target idle
>> state selections: (1) when the CPU (core or package) is going to be used
>> next time and (2) how much latency for going back to the non-idle state
>> can be tolerated.  While the scheduler knows (1) to some extent (arguably,
>> it generally cannot predict when hardware interrupts are going to occur),
>> I'm not really sure about (2).
>>
>>>> As a consequence, it leaves certain cpus idle. The load of these cpus
>>>> degrade. It is via this load that the scheduler asks for a deeper sleep
>>>> state. Right here we have scheduler talking to the cpuidle governor.
>>>
>>> So we agree that the scheduler _tells_ the cpuidle governor when to go
>>> idle (but not how deep).
>>
>> It does indicate to cpuidle how deep it can go, however, by providing it with
>> the information about when the CPU is going to be used next time (from the
>> scheduler's perspective).
>>
>>> IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the
>>> cpuidle does not get enough information from the scheduler (arguably this
>>> could be fixed)
>>
>> OK, so what information is missing in your opinion?
>>
>>> and (2) the scheduler does not have any information about the idle states
>>> (power gating etc.) to make any informed decision on which/when CPUs should
>>> go idle.
>>
>> That's correct, which is a drawback.  However, on some systems it may never
>> have that information (because hardware coordinates idle states in a way that
>> is opaque to the OS - e.g. by autopromoting deeper states when idle for
>> sufficiently long time) and on some systems that information may change over
>> time (i.e. the availability of specific idle states may depend on factors
>> that aren't constant).
>>
>> If you attempted to take all of the possible complications related to hardware
>> designs in that area into account in the scheduler, you'd end up with a
>> completely unmaintainable piece of code.
>>
>>> As you said, it is a non-optimal one-way communication but the solution
>>> is not feedback loop from cpuidle into scheduler. It's like the
>>> scheduler managed by chance to get the CPU into a deeper sleep state and
>>> now you'd like the scheduler to get feedback form cpuidle and not
>>> disturb that CPU anymore. That's the closed loop I disagree with. Could
>>> the scheduler not make this informed decision before - it has this total
>>> load, let's get this CPU into deeper sleep state?
>>
>> No, it couldn't in general, for the above reasons.
>>
>>>> I don't see what the problem is with the cpuidle governor waiting for
>>>> the load to degrade before putting that cpu to sleep. In my opinion,
>>>> putting a cpu to deeper sleep states should happen gradually.
>>
>> If we know in advance that the CPU can be put into idle state Cn, there is no
>> reason to put it into anything shallower than that.
>>
>> On the other hand, if the CPU is in Cn already and there is a possibility to
>> put it into a deeper low-power state (which we didn't know about before), it
>> may make sense to promote it into that state (if that's safe) or even wake it
>> up and idle it again.
> 
> Yes, sorry I said it wrong in the previous mail. Today the cpuidle
> governor is capable of putting a CPU in idle state Cn directly, by
> looking at various factors like the current load, next timer, history of
> interrupts, exit latency of states. At the end of this evaluation it
> puts it into idle state Cn.
> 
> Also it cares to check if its decision is right. This is with respect to
> your statement "if there is a possibility to put it into deeper low
> power state". It queues a timer at a time just after its predicted wake
> up time before putting the cpu to idle state. If this time of wakeup
> prediction is wrong, this timer triggers to wake up the cpu and the cpu
> is hence put into a deeper sleep state.
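The state selection Preeti describes, picking Cn directly from the predicted idle time and the exit latencies, can be sketched as follows (a toy model; the real governor weighs more factors such as interrupt history):

```c
#include <assert.h>

struct idle_state_sketch {
	unsigned int exit_latency_us;     /* cost of leaving the state */
	unsigned int target_residency_us; /* min idle time to be worth it */
};

/* Pick the deepest state (highest index) whose target residency fits
 * the predicted idle time and whose exit latency fits the latency
 * requirement; state 0 is the shallowest and is always allowed. */
static int pick_idle_state(const struct idle_state_sketch *states, int n,
			   unsigned int predicted_idle_us,
			   unsigned int latency_req_us)
{
	int i, best = 0;

	for (i = 1; i < n; i++) {
		if (states[i].target_residency_us <= predicted_idle_us &&
		    states[i].exit_latency_us <= latency_req_us)
			best = i;
	}
	return best;
}
```

The safety timer mentioned above is the check on this prediction: if the CPU was still idle past the predicted wakeup, the estimate was too short and a deeper state is justified.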

Some SoCs have a cluster of cpus sharing some resources, eg cache, so
they must enter the same state at the same moment. Besides the
synchronization mechanisms, that adds a dependency on the next event.
For example, the u8500 board has a couple of cpus. In order to make them
enter retention, both must enter the same state, but not necessarily
at the same moment. The first cpu will wait in WFI and the second one
will initiate the retention mode when entering this state.
Unfortunately, some time may have passed while the second cpu entered
this state, and the next event for the first cpu could be too close,
thus violating the criteria the governor used when it chose this state
for the second cpu.

Also the latencies can change with the frequency, so there is a
dependency on cpufreq: the lower the frequency, the higher the
latency. If the scheduler decides to go to a specific state assuming
the exit latency is a given duration, and the frequency then decreases,
the exit latency could increase as well and leave the system less
responsive.
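Assuming, purely for illustration, that exit latency scales linearly with the inverse of the frequency, the correction being alluded to here could look like:

```c
#include <assert.h>

/* Lower frequency -> higher exit latency. Linear scaling relative to
 * the frequency at which the latency was measured is an assumption
 * made for illustration; real hardware need not scale linearly. */
static unsigned int scaled_exit_latency_us(unsigned int measured_latency_us,
					   unsigned int measured_khz,
					   unsigned int cur_khz)
{
	return (unsigned int)((unsigned long long)measured_latency_us *
			      measured_khz / cur_khz);
}
```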

I don't know how the latency values were computed (e.g. worst case,
measured at the lowest frequency or not), but we have just one set of
values, so this issue exists with the current code.

Another point is the timer used to detect a bad decision and go to a
deeper idle state. With the cluster dependency described above, we may
wake up a particular cpu, which turns on the cluster and makes the
entire cluster wake up in order to enter a deeper state; this could
fail because the other cpu may not fulfill the constraint at that
moment.

>>>> This means time will tell the governors what kinds of workloads are running
>>>> on the system. If the cpu is idle for long, it probably means that the system
>>>> is less loaded and it makes sense to put the cpus to deeper sleep
>>>> states. Of course there could be sporadic bursts or quieting down of
>>>> tasks, but these are corner cases.
>>>
>>> It's nothing wrong with degrading given the information that cpuidle
>>> currently has. It's a heuristics that worked ok so far and may continue
>>> to do so. But see my comments above on why the scheduler could make more
>>> informed decisions.
>>>
>>> We may not move all the power gating information to the scheduler but
>>> maybe find a way to abstract this by giving more hints via the CPU and
>>> cache topology. The cpuidle framework (it may not be much left of a
>>> governor) would then take hints about estimated idle time and invoke the
>>> low-level driver about the right C state.
>>
>> Overall, it looks like it'd be better to split the governor "layer" between the
>> scheduler and the idle driver with a well defined interface between them.  That
>> interface needs to be general enough to be independent of the underlying
>> hardware.
>>
>> We need to determine what kinds of information should be passed both ways and
>> how to represent it.
> 
> I agree with this design decision.
> 
>>>>> Of course, you could export more scheduler information to cpuidle,
>>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>>> information because there are at least three deciding factors
>>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>>> unified.
>>>>
>>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>>> sleep state gradually.
>>>
>>> The cpufreq governor boosts the frequency enough to cover the load,
>>> which means reducing the idle time. It does not know whether it is
>>> better to boost the frequency twice as high so that it gets to idle
>>> quicker. You can change the governor's policy but does it have any
>>> information from cpuidle?
>>
>> Well, it may get that information directly from the hardware.  Actually,
>> intel_pstate does that, but intel_pstate is the governor and the scaling
>> driver combined.
> 
> To add to this, cpufreq currently functions in the below fashion. I am
> talking of the on demand governor, since it is more relevant to our
> discussion.
> 
> ----stepped up frequency------
>   ----threshold--------
>       -----stepped down freq level1---
>         -----stepped down freq level2---
>           ---stepped down freq level3----
> 
> If the cpu idle time is below a threshold , it boosts the frequency to
> one level above straight away and does not vary it any further. If the
> cpu idle time is below a threshold there is a step down in frequency
> levels by 5% of the current frequency at every sampling period, provided
> the cpu behavior is constant.
> 
> I think we can improve this implementation by better interaction with
> cpuidle and scheduler.
> 
> When it is stepping up frequency, it should do it in steps of frequency
> being a *function of the current cpu load* also, or function of idle
> time will also do.
> 
> When it is stepping down frequency, it should interact with cpuidle. It
> should get from cpuidle information regarding the idle state that the
> cpu is in. The reason is that the cpufreq governor is aware of only the
> idle time of the cpu, not the idle state it is in. If it gets to know
> that the cpu is in a deep idle state, it could step down frequency
> levels to level n straight away, just like cpuidle does to put cpus
> into state Cn.
> 
> Or an alternate option could be just like stepping up, make the stepping
> down also a function of idle time. Perhaps
> fn(|threshold-idle_time|).
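The suggested fn(|threshold-idle_time|) step could be sketched like this (the linear scaling is an assumption; any monotonic function of the distance from the threshold would fit the proposal):

```c
#include <assert.h>

/* Step size proportional to how far the observed idle time is from the
 * threshold: far below -> big step up; far above -> big step down.
 * The scaling constant is illustrative. */
static unsigned int next_freq_khz(unsigned int cur_khz,
				  unsigned int idle_pct,
				  unsigned int threshold_pct)
{
	unsigned int delta = idle_pct > threshold_pct ?
			     idle_pct - threshold_pct :
			     threshold_pct - idle_pct;
	unsigned int step = cur_khz * delta / 100;

	if (idle_pct < threshold_pct)
		return cur_khz + step;   /* busy cpu: step up */
	return cur_khz - step;           /* idle cpu: step down */
}
```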
> 
> Also one more point to note is that if cpuidle puts cpus into idle
> states that clock gate them, then there is no need to run the cpufreq
> governor for those cpus. cpufreq can check with cpuidle on this front
> before it queries a cpu.
> 
>>
>>>> Meanwhile the scheduler should ensure that the tasks are retained on
>>>> that CPU,whose frequency is boosted and should not load balance it, so
>>>> that they can get over quickly. This I think is what is missing. Again
>>>> this comes down to the scheduler taking feedback from the CPU frequency
>>>> governors which is not currently happening.
>>>
>>> Same loop again. The cpu load goes high because (a) there is more work,
>>> possibly triggered by external events, and (b) the scheduler decided to
>>> balance the CPUs in a certain way. As for cpuidle above, the scheduler
>>> has direct influence on the cpufreq decisions. How would the scheduler
>>> know which CPU not to balance against? Are CPUs in a cluster
>>> synchronous? Is it better to let the other CPU idle or more efficient to run
>>> this cluster at half-speed?
>>>
>>> Let's say there is an increase in the load, does the scheduler wait
>>> until cpufreq figures this out or tries to take the other CPUs out of
>>> idle? Who's making this decision? That's currently a potentially
>>> unstable loop.
>>
>> Yes, it is and I don't think we currently have good answers here.
> 
> My answer to the above question is that the scheduler does not wait
> until cpufreq figures it out. All that the scheduler cares about today
> is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before cpu frequency governor can boost the
> frequency of cpu, the scheduler can spread the load.
> 
> As for the second question it will wakeup idle cpus if it must to load
> balance.
> 
> The question asked is a good one: "does the scheduler wait until cpufreq
> figures it out?" Currently the answer is no; it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so the following way.
> 
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
> 
> 1. Load is too high across the system, all cpus are loaded, and there is
> no chance of load balancing. Therefore ask the cpu frequency governor to
> step up the frequency to improve performance.
> 
> 2. The scheduler finds out that if it has to load balance, it has to do
> so on cpus which are in a deep idle state (currently this logic is not
> present, but worth getting in). It then decides to increase the
> frequency of the already loaded cpus to improve performance. It calls
> the cpu freq governor.
> 
> 3. The scheduler finds out that if it has to load balance, it has to do
> so on a different power domain which is currently idle (shallow/deep). It
> thinks better of it and calls the cpu frequency governor to boost the
> frequency of the cpus in the current domain.
> 
> While 2 and 3 depend on the scheduler having knowledge about idle states
> and power domains, which it currently does not have, 1 can be achieved
> with the current code. The scheduler keeps track of failed load balancing
> efforts with lb_failed. If it finds that load balancing from a busy group
> failed (lb_failed > 0), it can call the cpufreq governor to step up the
> cpu frequency of this busy cpu group, with gov_check_cpu() in the cpufreq
> governor code.
> 
>>
>> The results of many measurements seem to indicate that it generally is better
>> to do the work as quickly as possible and then go idle again, but there are
>> costs associated with going back and forth from idle to non-idle etc.
> 
> I think we can even out the cost benefit of race to idle, by choosing to
> do it wisely. Like for example if points 2 and 3 above are true (idle
> cpus are in deep sleep states or need to ld balance on a different power
> domain), then step up the frequency of the current working cpus and reap
> its benefit.
> 
>>
>> The main problem with cpufreq that I personally have is that the governors
>> carry out their own sampling with pretty much arbitrary resolution that may
>> lead to suboptimal decisions.  It would be much better if the scheduler
>> indicated when to *consider* the changing of CPU performance parameters (that
>> may not be frequency alone and not even frequency at all in general), more or
>> less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
>> should decide what performance points to run at.
> 
> Very true. See the points 1,2 and 3 above where I list out when
> scheduler can call cpu frequency. Also an idea about how cpu frequency
> governor can decide on the scaling frequency is stated above.
> 
>>
>>>>>> I would repeat here that today we interface cpuidle/cpufrequency
>>>>>> policies with scheduler but not the other way around. They do their bit
>>>>>> when a cpu is busy/idle. However scheduler does not see that somebody
>>>>>> else is taking instructions from it and comes back to give different
>>>>>> instructions!
>>>>>
>>>>> The key here is that cpuidle/cpufreq make their primary decision based
>>>>> on something controlled by the scheduler: the CPU load (via run-queue
>>>>> balancing). You would then like the scheduler take such decision back
>>>>> into account. It just looks like a closed loop, possibly 'unstable' .
>>>>
>>>> Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a
>>>> closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a
>>>> closed loop? Here too the scheduler should be made well aware of the
>>>> decisions it took in the past right?
>>>
>>> It's more like:
>>>
>>> scheduler -> cpuidle/cpufreq -> hardware operating point
>>>    ^                                      |
>>>    +--------------------------------------+
>>>
>>> You can argue that you can make an adaptive loop that works fine but
>>> there are so many parameters that I don't see how it would work. The
>>> patches so far don't seem to address this. Small task packing, while
>>> useful, is just a heuristic at the scheduler level.
>>
>> I agree.
>>
>>> With a combined decision maker, you aim to reduce this separate decision
>>> process and feedback loop. Probably impossible to eliminate the loop
>>> completely because of hardware latencies, PLLs, CPU frequency not always
>>> the main factor, but you can make the loop more tolerant to
>>> instabilities.
>>
>> Well, in theory. :-)
>>
>> Another question to ask is whether or not the structure of our software
>> reflects the underlying problem.  I mean, on the one hand there is the
>> scheduler that needs to optimally assign work items to computational units
>> (hyperthreads, CPU cores, packages) and on the other hand there's hardware
>> with different capabilities (idle states, performance points etc.).  Arguably,
>> the scheduler internals cannot cover all of the differences between all of the
>> existing types of hardware Linux can run on, so there needs to be a layer of
>> code providing an interface between the scheduler and the hardware.  But that
>> layer of code needs to be just *one*, so why do we have *two* different
>> frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to
>> the scheduler, but not to each other?
>>
>> To me, the reason is history, and more precisely the fact that cpufreq had been
>> there first, then came cpuidle and only then people started to realize that
>> some scheduler tweaks may allow us to save energy without sacrificing too
>> much performance.  However, it looks like there's time to go back and see how
>> we can integrate all that.  And there's more, because we may need to take power
>> budgets and thermal management into account as well (i.e. we may not be allowed
>> to use full performance of the processors all the time because of some
>> additional limitations) and the CPUs may be members of power domains, so what
>> we can do with them may depend on the states of other devices.
>>
>>>>> So I think we either (a) come up with 'clearer' separation of
>>>>> responsibilities between scheduler and cpufreq/cpuidle 
>>>>
>>>> I agree with this. This is what I have been emphasizing, if we feel that
>>>> the cpufrequency/ cpuidle subsystems are suboptimal in terms of the
>>>> information that they use to make their decisions, let us improve them.
>>>> But this will not yield us any improvement if the scheduler does not
>>>> have enough information. And IMHO, the next fundamental information that
>>>> the scheduler needs should come from cpufreq and cpuidle.
>>>
>>> What kind of information? Your suggestion that the scheduler should
>>> avoid loading a CPU because it went idle is wrong IMHO. It went idle
>>> because the scheduler decided this in first instance.
>>>
>>>> Then we should move onto supplying scheduler information from the power
>>>> domain topology, thermal factors, user policies.
>>>
>>> I agree with this but at this point you get the scheduler to make more
>>> informed decisions about task placement. It can then give more precise
>>> hints to cpufreq/cpuidle like the predicted load and those frameworks
>>> could become dumber in time, just complying with the requested
>>> performance level (trying to break the loop above).
>>
>> Well, there's nothing like "predicted load".  At best, we may be able to make
>> more or less educated guesses about it, so in my opinion it is better to use
>> the information about what happened in the past for making decisions regarding
>> the current settings and re-adjust them over time as we get more information.
> 
> Agree with this as well. scheduler can at best supply information
> regarding the historic load and hope that it is what defines the future
> as well. Apart from this I don't know what other information the
> scheduler can supply the cpuidle governor with.
>>
>> So how much decision making regarding the idle state to put the given CPU into
>> should be there in the scheduler?  I believe the only information coming out
>> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
>> need it in X nanoseconds from now" plus possibly a hint about the wakeup
>> latency tolerance (but those hints may come from other places too).  That said
>> the decision *which* CPU should become idle at the moment very well may require
>> some information about what options are available from the layer below (for
>> example, "putting core X into idle for Y of time will save us Z energy" or
>> something like that).
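The split Rafael outlines could be captured by an interface along these lines (all names hypothetical; the scheduler passes a next-use prediction and latency tolerance down, the idle layer reports energy estimates up):

```c
#include <assert.h>

struct sched_idle_hint {
	unsigned long long next_use_ns;          /* "needed in X ns" */
	unsigned long long latency_tolerance_ns;
};

struct idle_option {
	unsigned long long break_even_ns;   /* min idle time to save energy */
	unsigned long long exit_latency_ns;
	long long energy_saved_uj;          /* estimate reported upward */
};

/* Among the options the layer below advertises, pick the one that fits
 * the scheduler's hint and saves the most energy; -1 means no state
 * fits and the cpu stays in the shallowest idle. */
static int choose_option(const struct idle_option *opt, int n,
			 const struct sched_idle_hint *hint)
{
	int i, best = -1;
	long long best_saved = -1;

	for (i = 0; i < n; i++) {
		if (opt[i].break_even_ns <= hint->next_use_ns &&
		    opt[i].exit_latency_ns <= hint->latency_tolerance_ns &&
		    opt[i].energy_saved_uj > best_saved) {
			best = i;
			best_saved = opt[i].energy_saved_uj;
		}
	}
	return best;
}
```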
> 
> Agree. Except that the information should be "Ok, this CPU is now idle
> and it has not done much work in the recent past; it is a 10% loaded CPU".
> 
> This can be said today using PJT's metric. It is now for the cpuidle
> governor to decide the idle state to go to. That's what happens today too.
> 
>>
>> And what about performance scaling?  Quite frankly, in my opinion that
>> requires some more investigation, because there still are some open questions
>> in that area.  To start with we can just continue using the current heuristics,
>> but perhaps with the scheduler calling the scaling "governor" when it sees fit
>> instead of that "governor" running kind of in parallel with it.
> 
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today, IMHO.
> 
>>
>>>>> or (b) come up
>>>>> with a unified load-balancing/cpufreq/cpuidle implementation as per
>>>>> Ingo's request. The latter is harder but, with a good design, has
>>>>> potentially a lot more benefits.
>>>>>
>>>>> A possible implementation for (a) is to let the scheduler focus on
>>>>> performance load-balancing but control the balance ratio from a
>>>>> cpufreq governor (via things like arch_scale_freq_power() or something
>>>>> new). CPUfreq would not be concerned just with individual CPU
>>>>> load/frequency but also making a decision on how tasks are balanced
>>>>> between CPUs based on the overall load (e.g. four CPUs are enough for
>>>>> the current load, I can shut the other four off by telling the
>>>>> scheduler not to use them).
>>>>>
>>>>> As for Ingo's preferred solution (b), a proposal forward could be to
>>>>> factor the load balancing out of kernel/sched/fair.c and provide an
>>>>> abstract interface (like load_class?) for easier extending or
>>>>> different policies (e.g. small task packing). 
>>>>
>>>>  Let me elaborate on the patches that have been posted so far on the
>>>> power awareness of the scheduler. When we say *power aware scheduler*
>>>> what exactly do we want it to do?
>>>>
>>>> In my opinion, we want it to *avoid touching idle cpus*, so as to keep
>>>> them in that state longer and *keep more power domains idle*, so as to
>>>> yield power savings with them turned off. The patches released so far
>>>> are striving to do the latter. Correct me if I am wrong at this.
>>>
>>> Don't take me wrong, task packing to keep more power domains idle is
>>> probably in the right direction but it may not address all issues. You
>>> realised this is not enough since you are now asking for the scheduler
>>> to take feedback from cpuidle. As I pointed out above, you try to create
>>> a loop which may or may not work, especially given the wide variety of
>>> hardware parameters.
>>>
>>>> Also
>>>> feel free to point out any other expectation from the power aware
>>>> scheduler if I am missing any.
>>>
>>> If the patches so far are enough and solved all the problems, you are
>>> not missing any. Otherwise, please see my view above.
>>>
>>> Please define clearly what the scheduler, cpufreq, cpuidle should be
>>> doing and what communication should happen between them.
>>>
>>>> If I have got Ingo's point right, the issues with them are that they are
>>>> not taking a holistic approach to meet the said goal.
>>>
>>> Probably because scheduler changes, cpufreq and cpuidle are all trying
>>> to address the same thing but independent of each other and possibly
>>> conflicting.
>>>
>>>> Keeping more power
>>>> domains idle (by packing tasks) would sound much better if the scheduler
>>>> has taken all aspects of doing such a thing into account, like
>>>>
>>>> 1. How idle are the cpus, on the domain that it is packing
>>>> 2. Can they go to turbo mode? Because if they do, then we can't pack
>>>> tasks. We would need certain cpus in that domain idle.
>>>> 3. Are the domains in which we pack tasks power gated?
>>>> 4. Will there be significant performance drop by packing? Meaning do the
>>>> tasks share cpu resources? If they do there will be severe contention.
>>>
>>> So by this you add a lot more information about the power configuration
>>> into the scheduler, getting it to make more informed decisions about
>>> task scheduling. You may eventually reach a point where cpuidle governor
>>> doesn't have much to do (which may be a good thing) and reach Ingo's
>>> goal.
>>>
>>> That's why I suggested maybe starting to take the load balancing out of
>>> fair.c and make it easily extensible (my opinion, the scheduler guys may
>>> disagree). Then make it more aware of topology, power configuration so
>>> that it makes the right task placement decision. You then get it to
>>> tell cpufreq about the expected performance requirements (frequency
>>> decided by cpufreq) and cpuidle about how long it could be idle for (you
>>> detect a periodic task every 1ms, or you don't have any at all because
>>> they were migrated, the right C state being decided by the governor).
>>
>> There is another angle to look at that as I said somewhere above.
>>
>> What if we could integrate cpuidle with cpufreq so that there is one code
>> layer representing what the hardware can do to the scheduler?  What benefits
>> can we get from that, if any?
> 
> We could debate this point. I am a bit confused about it. As I see
> it, there is no problem with keeping them separate. One, because of
> code readability; it is easy to understand what the different
> parameters are that the performance of the CPU depends on, without
> needing to dig through the code. Two, because cpufreq kicks in
> primarily during runtime and cpuidle during the idle time of the cpu.
> 
> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of the cpu. But I disagree that
> the scheduler should be put into this common framework as well, as it
> has functionality which is totally disjoint from what subsystems such
> as cpuidle and cpufreq are intended to do.
>>
>> Rafael
>>
>>
> 
> Regards
> Preeti U Murthy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-09  3:42       ` Preeti U Murthy
  2013-06-09 22:53         ` Catalin Marinas
  2013-06-10 16:25         ` Daniel Lezcano
@ 2013-06-11  0:50         ` Rafael J. Wysocki
  2013-06-13  4:32           ` Preeti U Murthy
  2 siblings, 1 reply; 15+ messages in thread
From: Rafael J. Wysocki @ 2013-06-11  0:50 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Catalin Marinas, Ingo Molnar, Morten Rasmussen, alex.shi,
	Peter Zijlstra, Vincent Guittot, Mike Galbraith, pjt,
	Linux Kernel Mailing List, linaro-kernel, arjan, len.brown,
	corbet, Andrew Morton, Linus Torvalds, Thomas Gleixner,
	Linux PM list

On Sunday, June 09, 2013 09:12:18 AM Preeti U Murthy wrote:
> Hi Rafael,

Hi Preeti,

> On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
> > On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
> >> On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:

[...]

> >> The scheduler can decide to load a single CPU or cluster and let the
> >> others idle. If the total CPU load can fit into a smaller number of CPUs
> >> it could as well tell cpuidle to go into deeper state from the
> >> beginning as it moved all the tasks elsewhere.
> > 
> > So why can't it do that today?  What's the problem?
> 
> The reason that the scheduler does not do it today is the
> prefer_sibling logic. The tasks within a core get distributed across
> cores if there is more than one, since the cpu power of a core is not
> high enough to handle more than one task.
> 
> However at a socket level/ MC level (cluster at a low level), there can
> be as many tasks as there are cores because the socket has enough CPU
> capacity to handle them. But the prefer_sibling logic moves tasks across
> socket/MC level domains even when load<=domain_capacity.
> 
> I think the reason why the prefer_sibling logic was introduced, is that
> scheduler looks at spreading tasks across all the resources it has. It
> believes keeping tasks within a cluster/socket level domain would mean
> tasks are being throttled by having access to only the cluster/socket
> level resources. Which is why it spreads.
> 
> The prefer_sibling logic is nothing but a flag set at domain level to
> communicate to the scheduler that load should be spread across the
> groups of this domain. In the above example across sockets/clusters.
> 
> But I think it is time we take another look at the prefer_sibling logic
> and decide on its worthiness.

Well, it does look like something that would be good to reconsider.

Some results indicate that for a given CPU package (cluster/socket) there
is a threshold number of tasks such that it is beneficial to pack tasks
into that package as long as the total number of tasks running on it does
not exceed that number.  It may be 1 (which is the value used currently with
prefer_sibling set if I understood what you said correctly), but it very
well may be 2 or more (depending on the hardware characteristics).
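The threshold-based packing described here could be sketched as follows (threshold value and fallback policy are illustrative, not a definitive implementation):

```c
#include <assert.h>

/* Fill the first package up to the per-hardware task threshold, then
 * overflow to the next; if all packages are at the threshold, fall
 * back to the least loaded one. */
static int pick_package(const int *tasks, int npkg, int threshold)
{
	int i, best = 0;

	for (i = 0; i < npkg; i++)
		if (tasks[i] < threshold)
			return i;       /* pack here */
	for (i = 1; i < npkg; i++)
		if (tasks[i] < tasks[best])
			best = i;
	return best;
}
```

With threshold 1 this degenerates to the current prefer_sibling-style spreading; with 2 or more it packs as suggested by the measurements.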

[...]

> > If we know in advance that the CPU can be put into idle state Cn, there is no
> > reason to put it into anything shallower than that.
> > 
> > On the other hand, if the CPU is in Cn already and there is a possibility to
> > put it into a deeper low-power state (which we didn't know about before), it
> > may make sense to promote it into that state (if that's safe) or even wake it
> > up and idle it again.
> 
> Yes, sorry I said it wrong in the previous mail. Today the cpuidle
> governor is capable of putting a CPU in idle state Cn directly, by
> looking at various factors like the current load, next timer, history of
> interrupts, exit latency of states. At the end of this evaluation it
> puts it into idle state Cn.
> 
> Also it cares to check if its decision is right. This is with respect to
> your statement "if there is a possibility to put it into deeper low
> power state". It queues a timer at a time just after its predicted wake
> up time before putting the cpu to idle state. If this time of wakeup
> prediction is wrong, this timer triggers to wake up the cpu and the cpu
> is hence put into a deeper sleep state.

So I don't think we need to modify that behavior. :-)

> >>> This means time will tell the governors what kinds of workloads are running
> >>> on the system. If the cpu is idle for long, it probably means that the system
> >>> is less loaded and it makes sense to put the cpus to deeper sleep
> >>> states. Of course there could be sporadic bursts or quieting down of
> >>> tasks, but these are corner cases.
> >>
> >> It's nothing wrong with degrading given the information that cpuidle
> >> currently has. It's a heuristics that worked ok so far and may continue
> >> to do so. But see my comments above on why the scheduler could make more
> >> informed decisions.
> >>
> >> We may not move all the power gating information to the scheduler but
> >> maybe find a way to abstract this by giving more hints via the CPU and
> >> cache topology. The cpuidle framework (it may not be much left of a
> >> governor) would then take hints about estimated idle time and invoke the
> >> low-level driver about the right C state.
> > 
> > Overall, it looks like it'd be better to split the governor "layer" between the
> > scheduler and the idle driver with a well defined interface between them.  That
> > interface needs to be general enough to be independent of the underlying
> > hardware.
> > 
> > We need to determine what kinds of information should be passed both ways and
> > how to represent it.
> 
> I agree with this design decision.

OK, so let's try to take one step more and think about what part should belong
to the scheduler and what part should be taken care of by the "idle" driver.

Do you have any specific view on that?

> >>>> Of course, you could export more scheduler information to cpuidle,
> >>>> various hooks (task wakeup etc.) but then we have another framework,
> >>>> cpufreq. It also decides the CPU parameters (frequency) based on the
> >>>> load controlled by the scheduler. Can cpufreq decide whether it's
> >>>> better to keep the CPU at higher frequency so that it gets to idle
> >>>> quicker and therefore deeper sleep states? I don't think it has enough
> >>>> information because there are at least three deciding factors
> >>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
> >>>> unified.
> >>>
> >>> Why not? When the cpu load is high, the cpu frequency governor knows it
> >>> has to boost the frequency of that CPU. The task finishes quickly, and
> >>> the CPU goes idle. Then the cpuidle governor kicks in to put the CPU
> >>> into a deeper sleep state gradually.
> >>
> >> The cpufreq governor boosts the frequency enough to cover the load,
> >> which means reducing the idle time. It does not know whether it is
> >> better to boost the frequency twice as high so that it gets to idle
> >> quicker. You can change the governor's policy but does it have any
> >> information from cpuidle?
> > 
> > Well, it may get that information directly from the hardware.  Actually,
> > intel_pstate does that, but intel_pstate is the governor and the scaling
> > driver combined.
> 
> To add to this, cpufreq currently functions in the below fashion. I am
> talking of the on demand governor, since it is more relevant to our
> discussion.
> 
> ----stepped up frequency------
>   ----threshold--------
>       -----stepped down freq level1---
>         -----stepped down freq level2---
>           ---stepped down freq level3----
> 
> If the cpu idle time is below a threshold, it boosts the frequency to

Did you mean "above the threshold"?

> one level above straight away and does not vary it any further. If the
> cpu idle time is below a threshold, there is a step down in frequency
> levels by 5% of the current frequency at every sampling period, provided
> the cpu behavior is constant.
> 
> I think we can improve this implementation by better interaction with
> cpuidle and scheduler.
> 
> When it is stepping up the frequency, it should do so in steps whose size
> is a *function of the current cpu load* (or, equivalently, of the idle
> time).
> 
> When it is stepping down the frequency, it should interact with cpuidle.
> It should get from cpuidle information about the idle state that the
> cpu is in. The reason is that the cpu frequency governor is aware only of
> the idle time of the cpu, not of the idle state it is in. If it learns
> that the cpu is in a deep idle state, it could step down to frequency
> level n straight away, just like cpuidle does to put cpus into state Cn.
> 
> Alternatively, just like stepping up, the stepping down could also be
> made a function of idle time, perhaps
> fn(|threshold - idle_time|).
> 
> Also note that if cpuidle puts cpus into idle states that clock-gate
> them, then there is no need for the cpu frequency governor on those
> cpus. cpufreq can check this with cpuidle before it queries a cpu.

cpufreq ondemand (or intel_pstate for that matter) doesn't touch idle CPUs,
because it uses deferrable timers.  It basically only handles CPUs that aren't
idle at the moment.

However, it doesn't exactly know when the given CPU stopped being idle, because
its sampling is not generally synchronized with the scheduler's operations.
That, among other things, is why I'm thinking that it might be better if the
scheduler told cpufreq (or intel_pstate) when to try to adjust frequencies so
that it doesn't need to sample by itself.

[...]

> >>
> >> Let's say there is an increase in the load, does the scheduler wait
> >> until cpufreq figures this out or tries to take the other CPUs out of
> >> idle? Who's making this decision? That's currently a potentially
> >> unstable loop.
> > 
> > Yes, it is and I don't think we currently have good answers here.
> 
> My answer to the above question is scheduler does not wait until cpufreq
> figures it out. All that the scheduler cares about today is load
> balancing. Spread the load and hope it finishes soon. There is a
> possibility today that even before the cpu frequency governor can boost
> the frequency of a cpu, the scheduler will have spread the load.

That is a valid observation, but I wanted to say that we didn't really
understand how those things should be arranged.

> As for the second question it will wakeup idle cpus if it must to load
> balance.
> 
> It is a good question: "does the scheduler wait until cpufreq
> figures it out?" Currently the answer is no, it does not communicate
> with cpu frequency at all (except through cpu power, but that is the
> good part of the story, so I will not get there now). But maybe we
> should change this. I think we can do so in the following way.
> 
> When can a scheduler talk to cpu frequency? It can do so under the below
> circumstances:
> 
> 1. The load is too high across the system, all cpus are loaded, and there
> is no scope for load balancing. Therefore ask the cpu frequency governor
> to step up the frequency to improve performance.
> 
> 2. The scheduler finds out that if it has to load balance, it has to do
> so onto cpus which are in a deep idle state (currently this logic is not
> present, but it is worth adding). It then decides instead to increase the
> frequency of the already loaded cpus to improve performance, and calls
> the cpu freq governor.
> 
> 3. The scheduler finds out that if it has to load balance, it has to do
> so onto a different power domain which is currently idle (shallow/deep).
> It thinks better of it and calls the cpu frequency governor to boost the
> frequency of the cpus in the current domain instead.
> 
> While 2 and 3 depend on the scheduler having knowledge about idle states
> and power domains, which it currently does not have, 1 can be achieved
> with the current code. The scheduler keeps track of failed load balancing
> efforts with lb_failed. If it finds that load balancing from a busy group
> failed (lb_failed > 0), it can call the cpu freq governor to step up the
> cpu frequency of this busy cpu group, via gov_check_cpu() in the cpufreq
> governor code.

Well, if the model is that the scheduler tells cpufreq when to modify
frequencies, then it'll need to do that on a regular basis, like every time
a task is scheduled or similar.

> > The results of many measurements seem to indicate that it generally is better
> > to do the work as quickly as possible and then go idle again, but there are
> > costs associated with going back and forth from idle to non-idle etc.
> 
> I think we can even out the costs and benefits of race-to-idle by
> choosing when to do it wisely. For example, if points 2 and 3 above hold
> (idle cpus are in deep sleep states, or we would need to load balance
> onto a different power domain), then step up the frequency of the
> currently working cpus and reap the benefit.
> 
> > 
> > The main problem with cpufreq that I personally have is that the governors
> > carry out their own sampling with pretty much arbitrary resolution that may
> > lead to suboptimal decisions.  It would be much better if the scheduler
> > indicated when to *consider* the changing of CPU performance parameters (that
> > may not be frequency alone and not even frequency at all in general), more or
> > less the same way it tells cpuidle about idle CPUs, but I'm not sure if it
> > should decide what performance points to run at.
> 
> Very true. See points 1, 2 and 3 above, where I list when the
> scheduler could call into cpufreq.

Well, as I said above, I think that'd need to be done more frequently.

> Also, an idea about how the cpu frequency governor can decide on the
> scaling frequency is stated above.

Actually, intel_pstate uses a PID controller for making those decisions and
I think this may be just the right thing to do.

[...]

> > 
> > Well, there's nothing like "predicted load".  At best, we may be able to make
> > more or less educated guesses about it, so in my opinion it is better to use
> > the information about what happened in the past for making decisions regarding
> > the current settings and re-adjust them over time as we get more information.
> 
> Agree with this as well. The scheduler can at best supply information
> about the historic load and hope that it defines the future as well.
> Apart from this, I don't know what other information the scheduler
> can supply the cpuidle governor with.
> > 
> > So how much decision making regarding the idle state to put the given CPU into
> > should be there in the scheduler?  I believe the only information coming out
> > of the scheduler regarding that should be "OK, this CPU is now idle and I'll
> > need it in X nanoseconds from now" plus possibly a hint about the wakeup
> > latency tolerance (but those hints may come from other places too).  That said
> > the decision *which* CPU should become idle at the moment very well may require
> > some information about what options are available from the layer below (for
> > example, "putting core X into idle for Y of time will save us Z energy" or
> > something like that).
> 
> Agree. Except that the information should be "OK, this CPU is now idle
> and it has not done much work in the recent past; it is a 10% loaded CPU".

And what would that be useful for to the "idle" layer?  What matters is the
"I'll need it in X nanoseconds from now" part.

Yes, the load part would be interesting to the "frequency" layer.

> This can be said today using PJT's metric. It is then up to the cpuidle
> governor to decide which idle state to go to. That's what happens today too.
> 
> > 
> > And what about performance scaling?  Quite frankly, in my opinion that
> > requires some more investigation, because there still are some open questions
> > in that area.  To start with we can just continue using the current heuristics,
> > but perhaps with the scheduler calling the scaling "governor" when it sees fit
> > instead of that "governor" running kind of in parallel with it.
> 
> Exactly. How this can be done is elaborated above. This is one of the
> key things we need today, IMHO.
> 
> > 

[...]

> > 
> > There is another angle to look at that as I said somewhere above.
> > 
> > What if we could integrate cpuidle with cpufreq so that there is one code
> > layer representing what the hardware can do to the scheduler?  What benefits
> > can we get from that, if any?
> 
> We could debate this point. I am a bit confused about it. As I see
> it, there is no problem with keeping them separate. One, because of
> code readability: it is easy to understand what the different
> parameters are that the performance of the CPU depends on, without
> needing to dig through the code. Two, because cpufreq kicks in
> primarily at runtime and cpuidle during the idle time of the cpu.

That's a very useful observation.  Indeed, there's the "idle" part that needs
to be invoked when the CPU goes idle (and it should decide what idle state to
put that CPU into), and there's the "scaling" part that needs to be invoked
when the CPU has work to do (and it should decide what performance point to
put that CPU into).  The question is, though, if it's better to have two
separate frameworks for those things (which is what we have today) or to make
them two parts of the same framework (like two callbacks one of which will be
executed for CPUs that have just become idle and the other will be invoked
for CPUs that have just got work to do).

> But this would also mean creating well defined interfaces between them.
> Integrating cpufreq and cpuidle seems like a better argument to make due
> to their common functionality at a higher level of talking to hardware
> and tuning the performance parameters of cpu. But I disagree that
> scheduler should be put into this common framework as well as it has
> functionalities which are totally disjoint from what subsystems such as
> cpuidle and cpufreq are intended to do.

That's correct.  The role of the scheduler, in my opinion, may be to call the
"idle" and "scaling" functions at the right time and to give them information
needed to make optimal choices.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-10 16:25         ` Daniel Lezcano
@ 2013-06-12  0:27           ` David Lang
  2013-06-12  1:48             ` Arjan van de Ven
  2013-06-12  9:50             ` Daniel Lezcano
  0 siblings, 2 replies; 15+ messages in thread
From: David Lang @ 2013-06-12  0:27 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Preeti U Murthy, Rafael J. Wysocki, Catalin Marinas, Ingo Molnar,
	Morten Rasmussen, alex.shi, Peter Zijlstra, Vincent Guittot,
	Mike Galbraith, pjt, Linux Kernel Mailing List, linaro-kernel,
	arjan, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

On Mon, 10 Jun 2013, Daniel Lezcano wrote:

> Some SoCs have a cluster of cpus sharing some resources, e.g. the cache,
> so they must enter the same state at the same moment. Besides the
> synchronization mechanisms, that adds a dependency on the next event.
> For example, the u8500 board has a couple of cpus. In order to make them
> enter retention, both must enter the same state, but not necessarily
> at the same moment. The first cpu will wait in WFI and the second one
> will initiate the retention mode when entering this state.
> Unfortunately, some time could have passed while the second cpu entered
> this state, and the next event for the first cpu could be too close, thus
> violating the criteria of the governor when it chose this state for the
> second cpu.
>
> Also the latencies can change with the frequencies, so there is a
> dependency on cpufreq: the lower the frequency, the higher the
> latency. If the scheduler decides to go to a specific state assuming
> the exit latency is a given duration, and the frequency then decreases,
> this exit latency could increase as well and make the system less
> responsive.
>
> I don't know how the latencies were computed (e.g. worst case, taken
> at the lowest frequency or not), but we have just one set of
> values. That is the situation with the current code.
>
> Another point is the timer that allows detecting a bad decision and
> going to a deeper idle state. With the cluster dependency described
> above, we may wake up a particular cpu, which turns on the cluster and
> makes the entire cluster wake up in order to enter a deeper state; this
> could fail because the other cpu may not fulfill the constraint at that
> moment.

Nobody is saying that this sort of thing should be in the fastpath of the 
scheduler.

But if the scheduler has a table that tells it the possible states, and the cost 
to get from the current state to each of these states (and to get back and/or 
wake up to full power), then the scheduler can make the decision on what to do, 
invoke a routine to make the change (and in the meantime, not be fighting the 
change by trying to schedule processes on a core that's about to be powered 
off), and then when the change happens, the scheduler will have a new version of 
the table of possible states and costs.

This isn't in the fastpath, it's in the rebalancing logic.

David Lang

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12  0:27           ` David Lang
@ 2013-06-12  1:48             ` Arjan van de Ven
  2013-06-12  9:48               ` Amit Kucheria
  2013-06-12 10:20               ` Catalin Marinas
  2013-06-12  9:50             ` Daniel Lezcano
  1 sibling, 2 replies; 15+ messages in thread
From: Arjan van de Ven @ 2013-06-12  1:48 UTC (permalink / raw)
  To: David Lang
  Cc: Daniel Lezcano, Preeti U Murthy, Rafael J. Wysocki,
	Catalin Marinas, Ingo Molnar, Morten Rasmussen, alex.shi,
	Peter Zijlstra, Vincent Guittot, Mike Galbraith, pjt,
	Linux Kernel Mailing List, linaro-kernel, len.brown, corbet,
	Andrew Morton, Linus Torvalds, Thomas Gleixner, Linux PM list

On 6/11/2013 5:27 PM, David Lang wrote:
>
> Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
>
> But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to
> full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to
> schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
>
> This isn't in the fastpath, it's in the rebalancing logic.

the reality is much more complex unfortunately.
C and P states hang together tightly, and even C state on
one core impacts other cores' performance, just like P state selection
on one core impacts other cores.

(at least for x86, we should really stop talking as if the OS picks the "frequency",
that's just not the case anymore)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12  1:48             ` Arjan van de Ven
@ 2013-06-12  9:48               ` Amit Kucheria
  2013-06-12 16:22                 ` David Lang
  2013-06-12 10:20               ` Catalin Marinas
  1 sibling, 1 reply; 15+ messages in thread
From: Amit Kucheria @ 2013-06-12  9:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Lang, len.brown, alex.shi, corbet, Peter Zijlstra,
	Catalin Marinas, Linux PM list, Rafael J. Wysocki,
	Linux Kernel Mailing List, Morten Rasmussen, Linus Torvalds,
	linaro-kernel, Mike Galbraith, Preeti U Murthy, Andrew Morton,
	pjt, Ingo Molnar

On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 6/11/2013 5:27 PM, David Lang wrote:
>>
>>
>> Nobody is saying that this sort of thing should be in the fastpath of the
>> scheduler.
>>
>> But if the scheduler has a table that tells it the possible states, and
>> the cost to get from the current state to each of these states (and to get
>> back and/or wake up to
>> full power), then the scheduler can make the decision on what to do,
>> invoke a routine to make the change (and in the meantime, not be fighting
>> the change by trying to
>> schedule processes on a core that's about to be powered off), and then
>> when the change happens, the scheduler will have a new version of the table
>> of possible states and costs
>>
>> This isn't in the fastpath, it's in the rebalancing logic.
>
>
> the reality is much more complex unfortunately.
> C and P states hang together tightly, and even C state on
> one core impacts other cores' performance, just like P state selection
> on one core impacts other cores.
>
> (at least for x86, we should really stop talking as if the OS picks the
> "frequency",
> that's just not the case anymore)

This is true of ARM platforms too. As Daniel pointed out in an earlier
email, the operating point (frequency, voltage) has a bearing on the
c-state latency too.

An additional complexity is thermal constraints. E.g. On a quad-core
Cortex-A15 processor capable of say 1.5GHz, you won't be able to run
all 4 cores at that speed for very long w/o exceeding the thermal
envelope. These overdrive frequencies (turbo in x86-speak) impact the
rest of the system by either constraining the frequency of other cores
or requiring aggressive thermal management.

Do we really want to track these details in the scheduler or just let
the scheduler provide notifications to the existing subsystems
(cpufreq, cpuidle, thermal, etc.) with some sort of feedback going
back to the scheduler to influence future decisions?

Feedback to the scheduler could be something like the following (pardon
the names):

1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might
use this to synchronise cores for a cluster shutdown; the thermal
framework could use this as idle injection to reduce temperature
2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this
core - cpufreq might use this to cap overall energy since overdrive
operating points are very expensive; thermal might use this to slow
down the rate of increase of die temperature

Regards,
Amit

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12  0:27           ` David Lang
  2013-06-12  1:48             ` Arjan van de Ven
@ 2013-06-12  9:50             ` Daniel Lezcano
  2013-06-12 16:30               ` David Lang
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Lezcano @ 2013-06-12  9:50 UTC (permalink / raw)
  To: David Lang
  Cc: Preeti U Murthy, Rafael J. Wysocki, Catalin Marinas, Ingo Molnar,
	Morten Rasmussen, alex.shi, Peter Zijlstra, Vincent Guittot,
	Mike Galbraith, pjt, Linux Kernel Mailing List, linaro-kernel,
	arjan, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

On 06/12/2013 02:27 AM, David Lang wrote:
> On Mon, 10 Jun 2013, Daniel Lezcano wrote:
> 
>> Some SoCs have a cluster of cpus sharing some resources, e.g. the cache,
>> so they must enter the same state at the same moment. Besides the
>> synchronization mechanisms, that adds a dependency on the next event.
>> For example, the u8500 board has a couple of cpus. In order to make them
>> enter retention, both must enter the same state, but not necessarily
>> at the same moment. The first cpu will wait in WFI and the second one
>> will initiate the retention mode when entering this state.
>> Unfortunately, some time could have passed while the second cpu entered
>> this state, and the next event for the first cpu could be too close, thus
>> violating the criteria of the governor when it chose this state for the
>> second cpu.
>>
>> Also the latencies can change with the frequencies, so there is a
>> dependency on cpufreq: the lower the frequency, the higher the
>> latency. If the scheduler decides to go to a specific state assuming
>> the exit latency is a given duration, and the frequency then decreases,
>> this exit latency could increase as well and make the system less
>> responsive.
>>
>> I don't know how the latencies were computed (e.g. worst case, taken
>> at the lowest frequency or not), but we have just one set of
>> values. That is the situation with the current code.
>>
>> Another point is the timer that allows detecting a bad decision and
>> going to a deeper idle state. With the cluster dependency described
>> above, we may wake up a particular cpu, which turns on the cluster and
>> makes the entire cluster wake up in order to enter a deeper state; this
>> could fail because the other cpu may not fulfill the constraint at that
>> moment.
> 
> Nobody is saying that this sort of thing should be in the fastpath of
> the scheduler.
> 
> But if the scheduler has a table that tells it the possible states, and
> the cost to get from the current state to each of these states (and to
> get back and/or wake up to full power), then the scheduler can make the
> decision on what to do, invoke a routine to make the change (and in the
> meantime, not be fighting the change by trying to schedule processes on
> a core that's about to be powered off), and then when the change
> happens, the scheduler will have a new version of the table of possible
> states and costs
> 
> This isn't in the fastpath, it's in the rebalancing logic.

As Arjan mentioned, it is not as simple as that.

We want the scheduler to take some decisions with the knowledge of idle
latencies. In other words, move the governor logic into the scheduler.

The scheduler can take the decision and the backend driver provides the
interface to go to the idle state.

But unfortunately each piece of hardware behaves in different ways, and
describing such behaviors will help us find the correct design. I am not
trying to raise a lot of issues, just to enumerate the constraints we
have.

What is the correct decision when a lot of pm blocks are tied together
and the

In the example given by Arjan, the frequencies could be per cluster,
so decreasing the frequency for one core will decrease the frequency of
the other core. So if the scheduler decides to put one core into a
specific idle state, based on the target residency and the exit
latency when the frequency is at max (the other core is doing
something), and the frequency then decreases, the exit latency may
increase, and the idle cpu will take more time to exit the idle state
than expected, thus adding latency to the system.

What would be the correct decision in this case? Wake up the idle cpu
when the frequency changes, to re-evaluate the idle state? Provide idle
latencies for the min frequency only? Or is it acceptable to have such
latency added when the frequency decreases?

Also, an interesting question is how we get these latencies.

They are all written in the C-state tables, but we don't know the
accuracy of these values. Were they measured at max or min frequency?

Were they measured with a driver powering down the peripherals, or without?

For embedded systems, we may have different implementations and
maybe different latencies. Would it make sense to pass these values
through the device tree and let the SoC vendor specify the right
values? (IMHO, only the SoC vendor can do a correct measurement with an
oscilloscope.)

I know there are lot of questions :)

-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12  1:48             ` Arjan van de Ven
  2013-06-12  9:48               ` Amit Kucheria
@ 2013-06-12 10:20               ` Catalin Marinas
  2013-06-12 15:24                 ` Arjan van de Ven
  1 sibling, 1 reply; 15+ messages in thread
From: Catalin Marinas @ 2013-06-12 10:20 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Lang, Daniel Lezcano, Preeti U Murthy, Rafael J. Wysocki,
	Ingo Molnar, Morten Rasmussen, alex.shi, Peter Zijlstra,
	Vincent Guittot, Mike Galbraith, pjt, Linux Kernel Mailing List,
	linaro-kernel, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

Hi Arjan,

On Wed, Jun 12, 2013 at 02:48:58AM +0100, Arjan van de Ven wrote:
> On 6/11/2013 5:27 PM, David Lang wrote:
> > Nobody is saying that this sort of thing should be in the fastpath
> > of the scheduler.
> >
> > But if the scheduler has a table that tells it the possible states,
> > and the cost to get from the current state to each of these states
> > (and to get back and/or wake up to full power), then the scheduler
> > can make the decision on what to do, invoke a routine to make the
> > change (and in the meantime, not be fighting the change by trying to
> > schedule processes on a core that's about to be powered off), and
> > then when the change happens, the scheduler will have a new version
> > of the table of possible states and costs
> >
> > This isn't in the fastpath, it's in the rebalancing logic.
> 
> the reality is much more complex unfortunately.
> C and P states hang together tightly, and even C state on one core
> impacts other cores' performance, just like P state selection on one
> core impacts other cores.
> 
> (at least for x86, we should really stop talking as if the OS picks
> the "frequency", that's just not the case anymore)

I agree, the reality is very complex. But we should go back and analyse
what problem we are trying to solve, what each framework is trying to
address.

When viewed separately from the scheduler, cpufreq and cpuidle governors
do the right thing. But they both base their action on the CPU load
(balance) decided by the scheduler and it's the latter that we are
trying to adjust (and we are still debating what the right approach is).

Since such information seems too complex to be moved into the scheduler,
why don't we get cpufreq in charge of restricting the load balancing to
certain CPUs? It already tracks the load/idle time to (gradually) change
the P state. Depending on the governor/policy, it could decide that (for
example) 4 CPUs running at higher power P state are enough, telling the
scheduler to ignore the other CPUs. It won't pick a frequency, but (as
it currently does) adjust it to keep a minimal idle state on those CPUs.
If that's no longer possible (high load), it can remove the restriction
and let the scheduler use the other idle CPUs (cpufreq could even do a
direct load_balance() call). This is a governor decision and the user
is in control of which governors are used.

Cpuidle I think for now can stay the same, gradually entering deeper
sleep states. It could be later unified with cpufreq if there are any
benefits. In deciding the load balancing restrictions, maybe cpufreq
should be aware of C-state latencies.

Cpufreq would need to get more knowledge of the power topology and
thermal management. It would still be the framework restricting the P
state or changing the load balancing restrictions to let CPUs cool down.
More hooks could be added if needed for better responsiveness (like
entering idle or task wake-up).

With the above, the scheduler will just focus on performance (given the
restrictions imposed by cpufreq) and it only needs to be aware of the
CPU topology from a performance perspective (caches, hyperthreading)
together with the cpu_power parameter for the weighted load.

-- 
Catalin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12 10:20               ` Catalin Marinas
@ 2013-06-12 15:24                 ` Arjan van de Ven
  2013-06-12 17:04                   ` Catalin Marinas
  0 siblings, 1 reply; 15+ messages in thread
From: Arjan van de Ven @ 2013-06-12 15:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: David Lang, Daniel Lezcano, Preeti U Murthy, Rafael J. Wysocki,
	Ingo Molnar, Morten Rasmussen, alex.shi, Peter Zijlstra,
	Vincent Guittot, Mike Galbraith, pjt, Linux Kernel Mailing List,
	linaro-kernel, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

>>> This isn't in the fastpath, it's in the rebalancing logic.
>>
>> the reality is much more complex unfortunately.
>> C and P states hang together tightly, and even C state on one core
>> impacts other cores' performance, just like P state selection on one
>> core impacts other cores.
>>
>> (at least for x86, we should really stop talking as if the OS picks
>> the "frequency", that's just not the case anymore)
>
> I agree, the reality is very complex. But we should go back and analyse
> what problem we are trying to solve, what each framework is trying to
> address.
>
> When viewed separately from the scheduler, cpufreq and cpuidle governors
> do the right thing. But they both base their action on the CPU load
> (balance) decided by the scheduler and it's the latter that we are
> trying to adjust (and we are still debating what the right approach is).
>
> Since such information seems too complex to be moved into the scheduler,
> why don't we get cpufreq in charge of restricting the load balancing to
> certain CPUs? It already tracks the load/idle time to (gradually) change
> the P state. Depending on the governor/policy, it could decide that (for

(btw, in case you missed it, for Intel HW we no longer use cpufreq)


> Cpuidle I think for now can stay the same, gradually entering deeper
> sleep states. It could be later unified with cpufreq if there are any
> benefits. In deciding the load balancing restrictions, maybe cpufreq
> should be aware of C-state latencies.

on the Intel side, we're likely to merge the Intel idle driver and P state driver
in the near future fwiw.
We'll keep using the cpuidle framework (since it doesn't do all that much other
than provide a nice hook for the idle loop), but we will likely add hw-specific
selection logic there.

I do agree the scheduler needs to be integrated a bit better, in that it
needs some better knowledge, and to be honest, we likely need to switch from
giving tasks credit for "time consumed" to giving them credit for something like
"cycles consumed" or "instructions executed" or a mix thereof,
so that a task that runs on a slower CPU (for either policy-choice reasons or
due to hardware capabilities) gets charged less than when it runs fast.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: power-efficient scheduling design
  2013-06-12  9:48               ` Amit Kucheria
@ 2013-06-12 16:22                 ` David Lang
  0 siblings, 0 replies; 15+ messages in thread
From: David Lang @ 2013-06-12 16:22 UTC (permalink / raw)
  To: Amit Kucheria
  Cc: Arjan van de Ven, len.brown, alex.shi, corbet, Peter Zijlstra,
	Catalin Marinas, Linux PM list, Rafael J. Wysocki,
	Linux Kernel Mailing List, Morten Rasmussen, Linus Torvalds,
	linaro-kernel, Mike Galbraith, Preeti U Murthy, Andrew Morton,
	pjt, Ingo Molnar

On Wed, 12 Jun 2013, Amit Kucheria wrote:

> On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 6/11/2013 5:27 PM, David Lang wrote:
>>>
>>>
>>> Nobody is saying that this sort of thing should be in the fastpath of the
>>> scheduler.
>>>
>>> But if the scheduler has a table that tells it the possible states, and
>>> the cost to get from the current state to each of these states (and to get
>>> back and/or wake up to
>>> full power), then the scheduler can make the decision on what to do,
>>> invoke a routine to make the change (and in the meantime, not be fighting
>>> the change by trying to
>>> schedule processes on a core that's about to be powered off), and then
>>> when the change happens, the scheduler will have a new version of the table
>>> of possible states and costs
>>>
>>> This isn't in the fastpath, it's in the rebalancing logic.
>>
>>
>> the reality is much more complex unfortunately.
>> C and P states hang together tightly, and even C state on
>> one core impacts other cores' performance, just like P state selection
>> on one core impacts other cores.
>>
>> (at least for x86, we should really stop talking as if the OS picks the
>> "frequency",
>> that's just not the case anymore)
>
> This is true of ARM platforms too. As Daniel pointed out in an earlier
> email, the operating point (frequency, voltage) has a bearing on the
> c-state latency too.
>
> An additional complexity is thermal constraints. E.g. On a quad-core
> Cortex-A15 processor capable of say 1.5GHz, you won't be able to run
> all 4 cores at that speed for very long w/o exceeding the thermal
> envelope. These overdrive frequencies (turbo in x86-speak) impact the
> rest of the system by either constraining the frequency of other cores
> or requiring aggresive thermal management.
>
> Do we really want to track these details in the scheduler or just let
> the scheduler provide notifications to the existing subsystems
> (cpufreq, cpuidle, thermal, etc.) with some sort of feedback going
> back to the scheduler to influence future decisions?
>
> Feeback to the scheduler could be something like the following (pardon
> the names):
>
> 1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might
> use this to synchronise cores for a cluster shutdown, thermal
> framework could use this as idle injection to reduce temperature
> 2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this
> core - cpufreq might use this to cap overall energy since overdrive
> operating points are very expensive, thermal might use this to slow
> down rate of increase of die temperature

How much data are you going to have to move back and forth between the different 
systems?

Do you really only want the all-or-nothing "use this core as much as possible" 
vs "don't use this core at all"? Or do you need the ability to indicate how much 
to use a particular core (something that is needed anyway for asymmetrical cores, 
I think)?

If there is too much information that needs to be moved back and forth between 
these 'subsystems' for the 'right' thing to happen, then it would seem like it 
makes more sense to combine them.

Even combined, there are parts that are still pretty modular (like the details 
of shifting from one state to another, and the different high level strategies 
to follow for different modes of operation), but having access to all the 
information rather than only bits and pieces of the information at lower 
granularity would seem like an improvement.

David Lang


* Re: power-efficient scheduling design
  2013-06-12  9:50             ` Daniel Lezcano
@ 2013-06-12 16:30               ` David Lang
  0 siblings, 0 replies; 15+ messages in thread
From: David Lang @ 2013-06-12 16:30 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Preeti U Murthy, Rafael J. Wysocki, Catalin Marinas, Ingo Molnar,
	Morten Rasmussen, alex.shi, Peter Zijlstra, Vincent Guittot,
	Mike Galbraith, pjt, Linux Kernel Mailing List, linaro-kernel,
	arjan, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

On Wed, 12 Jun 2013, Daniel Lezcano wrote:

>> On Mon, 10 Jun 2013, Daniel Lezcano wrote:
>>
>>> Some SoCs can have a cluster of cpus sharing some resources, e.g. cache, so
>>> they must enter the same state at the same moment. Besides the
>>> synchronization mechanisms, that adds a dependency on the next event.
>>> For example, the u8500 board has a couple of cpus. In order to make them
>>> enter retention, both must enter the same state, but not necessarily
>>> at the same moment. The first cpu will wait in WFI and the second one
>>> will initiate the retention mode when entering this state.
>>> Unfortunately, some time could have passed while the second cpu entered
>>> this state, and the next event for the first cpu could be too close, thus
>>> violating the criteria the governor used when it chose this state for the
>>> second cpu.
>>>
>>> Also the latencies could change with the frequencies, so there is a
>>> dependency on cpufreq: the lower the frequency, the higher the
>>> latency. If the scheduler takes the decision to go to a specific
>>> state assuming the exit latency is a given duration, and the frequency
>>> then decreases, this exit latency could increase as well and make the
>>> system less responsive.
>>>
>>> I don't know how the latency figures were computed (e.g. worst case,
>>> taken at the lowest frequency or not), but we have just one set of
>>> values. That is what happens with the current code.
>>>
>>> Another point is the timer that allows detecting a bad decision and going
>>> to a deeper idle state. With the cluster dependency described above, we may
>>> wake up a particular cpu, which turns on the cluster and makes the entire
>>> cluster wake up in order to enter a deeper state; this could fail
>>> because the other cpu may not fulfill the constraint at that moment.
>>
>> Nobody is saying that this sort of thing should be in the fastpath of
>> the scheduler.
>>
>> But if the scheduler has a table that tells it the possible states, and
>> the cost to get from the current state to each of these states (and to
>> get back and/or wake up to full power), then the scheduler can make the
>> decision on what to do, invoke a routine to make the change (and in the
>> meantime, not be fighting the change by trying to schedule processes on
>> a core that's about to be powered off), and then when the change
>> happens, the scheduler will have a new version of the table of possible
>> states and costs
>>
>> This isn't in the fastpath, it's in the rebalancing logic.
>
> As Arjan mentioned, it is not as simple as this.
>
> We want the scheduler to take some decisions with the knowledge of idle
> latencies. In other words move the governor logic into the scheduler.
>
> The scheduler can take decision and the backend driver provides the
> interface to go to the idle state.
>
> But unfortunately each hardware is behaving in different ways and
> describing such behaviors will help to find the correct design, I am not
> raising a lot of issues but just trying to enumerate the constraints we
> have.
>
> What is the correct decision when a lot of pm blocks are tied together
> and the
>
> In the example given by Arjan, the frequencies could be per cluster,
> hence decreasing the frequency for one core will decrease the frequency of
> the other core. So if the scheduler takes the decision to put one core
> into a specific idle state, based on the target residency and the exit
> latency when the frequency is at max (the other core is doing
> something), and the frequency then decreases, the exit latency may
> increase, and the idle cpu will take more time to exit the idle state
> than expected, thus adding latency to the system.
>
> What would be the correct decision in this case? Wake up the idle cpu
> when the frequency changes to re-evaluate its idle state? Provide idle
> latencies for the min freq only? Or is it acceptable to have such
> latency added when the frequency decreases?
>
> Also, an interesting question is how we get these latencies.
>
> They are all written in the C-state tables, but do we know the
> accuracy of these values? Were they measured at max or min frequency?
>
> Were they measured with a driver powering down the peripherals or without?
>
> For embedded systems, we may have different implementations and
> maybe different latencies. Would it make sense to pass these values
> through a device tree and let the SoC vendor specify the right values?
> (IMHO, only the SoC vendor can do a correct measurement with an
> oscilloscope.)
>
> I know there are a lot of questions :)

well, I have two immediate reactions.

First, use the values provided by the vendor; if they are wrong, performance is 
not optimal and people will pick a different vendor (so they have an incentive 
to be right :-)

Second, "measure them" :-)

Have the device tree enumerate the modes of operation, but then at bootup, run 
through a series of tests that bounce between the different modes and measure how 
long it takes to move back and forth. If the system can't measure the difference 
against its own clocks, then the user isn't going to see the difference either, so 
there's no need to be as accurate as a lab bench with a scope. What matters is 
how much work can end up getting done for the user, not the number of 
nanoseconds between voltage changes (the latter will affect the former, but it's 
the former that you really care about).

Remember, perfect is the enemy of good enough. You don't have to have a perfect 
mapping of every possible change, you just need to be close enough to make 
reasonable decisions. You can't really predict the future anyway, so you are 
making a guess at what the load on the system is going to be in the future. 
Sometimes you will guess wrong no matter how accurate your latency measurements 
are. You have to accept that, and once you accept that, the severity of being 
wrong in some corner cases becomes less significant.

David Lang


* Re: power-efficient scheduling design
  2013-06-12 15:24                 ` Arjan van de Ven
@ 2013-06-12 17:04                   ` Catalin Marinas
  0 siblings, 0 replies; 15+ messages in thread
From: Catalin Marinas @ 2013-06-12 17:04 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Lang, Daniel Lezcano, Preeti U Murthy, Rafael J. Wysocki,
	Ingo Molnar, Morten Rasmussen, alex.shi, Peter Zijlstra,
	Vincent Guittot, Mike Galbraith, pjt, Linux Kernel Mailing List,
	linaro-kernel, len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

On Wed, Jun 12, 2013 at 04:24:52PM +0100, Arjan van de Ven wrote:
> >>> This isn't in the fastpath, it's in the rebalancing logic.
> >>
> >> the reality is much more complex unfortunately.
> >> C and P states hang together tightly, and even C state on one core
> >> impacts other cores' performance, just like P state selection on one
> >> core impacts other cores.
> >>
> >> (at least for x86, we should really stop talking as if the OS picks
> >> the "frequency", that's just not the case anymore)
> >
> > I agree, the reality is very complex. But we should go back and analyse
> > what problem we are trying to solve, what each framework is trying to
> > address.
> >
> > When viewed separately from the scheduler, cpufreq and cpuidle governors
> > do the right thing. But they both base their action on the CPU load
> > (balance) decided by the scheduler and it's the latter that we are
> > trying to adjust (and we are still debating what the right approach is).
> >
> > Since such information seems too complex to be moved into the scheduler,
> > why don't we get cpufreq in charge of restricting the load balancing to
> > certain CPUs? It already tracks the load/idle time to (gradually) change
> > the P state. Depending on the governor/policy, it could decide that (for
> 
> (btw in case you missed it, for Intel HW we no longer use cpufreq)

Do you mean the intel_pstate.c code? It indeed doesn't use much of
cpufreq, just setpolicy and it's on its own afterwards. Separating this
from the framework probably has real benefits for the Intel processors
but it would make a unified scheduler/cpufreq/cpuidle solution harder
(just a remark, I'm not saying it's good or bad; there are many
opinions against the unified solution, and ARM could do the same for
configurations like big.LITTLE).

But such a driver could still interact with the scheduler to control its
load balancing. At a quick look (I'm not familiar with this driver), it
tracks the per-CPU load and increases or decreases the P-state (similar
to a cpufreq governor). It could as well track the total load and,
depending on the hardware configuration, put some CPUs in a lower
performance P-state (or even C-state) and tell the scheduler to avoid
them.

One way to control load-balancing ratio is via something like
arch_scale_freq_power(). We could tweak the scheduler further so that
something like cpu_power==0 means do not schedule anything there.

So my proposal is to move the load-balancing hints (load ratio, avoiding
CPUs etc.) outside the scheduler into drivers like intel_pstate.c or
cpufreq governors. We then focus on getting the best performance out of
the scheduler (like quicker migration) but it would not be concerned
with the power consumption.

> I do agree the scheduler needs to get integrated a bit better, in that it
> has some better knowledge, and to be honest, we likely need to switch from
> giving tasks credit for "time consumed" to giving them credit for something like
> "cycles consumed" or "instructions executed" or a mix thereof,
> so that a task that runs on a slower CPU (whether for policy reasons or
> due to hardware capabilities) gets charged less than when it runs fast.

I agree, this would be useful in optimising the scheduler so that it
makes the right task placement/migration decisions (but as I said above,
make the power aspect transparent to the scheduler).

-- 
Catalin


* Re: power-efficient scheduling design
  2013-06-11  0:50         ` Rafael J. Wysocki
@ 2013-06-13  4:32           ` Preeti U Murthy
  0 siblings, 0 replies; 15+ messages in thread
From: Preeti U Murthy @ 2013-06-13  4:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Catalin Marinas, Ingo Molnar, arjan,
	David Lang, daniel.lezcano, Amit Kucheria
  Cc: Morten Rasmussen, alex.shi, Peter Zijlstra, Vincent Guittot,
	Mike Galbraith, pjt, Linux Kernel Mailing List, linaro-kernel,
	len.brown, corbet, Andrew Morton, Linus Torvalds,
	Thomas Gleixner, Linux PM list

Hi,

On 06/11/2013 06:20 AM, Rafael J. Wysocki wrote:
> 
> OK, so let's try to take one step more and think about what part should belong
> to the scheduler and what part should be taken care of by the "idle" driver.
> 
> Do you have any specific view on that?

I gave it some thought and went through Ingo's mail once again. I have
some view points which I have stated at the end of this mail.

>>>>>> Of course, you could export more scheduler information to cpuidle,
>>>>>> various hooks (task wakeup etc.) but then we have another framework,
>>>>>> cpufreq. It also decides the CPU parameters (frequency) based on the
>>>>>> load controlled by the scheduler. Can cpufreq decide whether it's
>>>>>> better to keep the CPU at higher frequency so that it gets to idle
>>>>>> quicker and therefore deeper sleep states? I don't think it has enough
>>>>>> information because there are at least three deciding factors
>>>>>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>>>>>> unified.
>>>>>
>>>>> Why not? When the cpu load is high, cpu frequency governor knows it has
>>>>> to boost the frequency of that CPU. The task gets over quickly, the CPU
>>>>> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
>>>>> sleep state gradually.
>>>>
>>>> The cpufreq governor boosts the frequency enough to cover the load,
>>>> which means reducing the idle time. It does not know whether it is
>>>> better to boost the frequency twice as high so that it gets to idle
>>>> quicker. You can change the governor's policy but does it have any
>>>> information from cpuidle?
>>>
>>> Well, it may get that information directly from the hardware.  Actually,
>>> intel_pstate does that, but intel_pstate is the governor and the scaling
>>> driver combined.
>>
>> To add to this, cpufreq currently functions in the below fashion. I am
>> talking of the on demand governor, since it is more relevant to our
>> discussion.
>>
>> ----stepped up frequency------
>>   ----threshold--------
>>       -----stepped down freq level1---
>>         -----stepped down freq level2---
>>           ---stepped down freq level3----
>>
>> If the cpu idle time is below a threshold, it boosts the frequency to
> 
> Did you mean "above the threshold"?

No, I meant "below". I am referring to the cpu *idle* time.

>> Also an idea about how cpu frequency governor can decide on the scaling
>> frequency is stated above.
> 
> Actaully, intel_pstate uses a PID controller for making those decisions and
> I think this may be just the right thing to do.

But don't you think we need to include the current cpu load in this
decision making as well? I mean an fn(idle_time) logic in the cpu frequency
governor, which is currently absent. Today, it just checks if idle_time
< threshold, and sets one specific frequency. Of course the PID could
then make the decision about which frequencies are candidates for
scaling up, but the cpu freq governor could decide which among these to pick
based on fn(idle_time).

> 
> [...]
> 
>>>
>>> Well, there's nothing like "predicted load".  At best, we may be able to make
>>> more or less educated guesses about it, so in my opinion it is better to use
>>> the information about what happened in the past for making decisions regarding
>>> the current settings and re-adjust them over time as we get more information.
>>
>> Agree with this as well. scheduler can at best supply information
>> regarding the historic load and hope that it is what defines the future
>> as well. Apart from this I dont know what other information scheduler
>> can supply cpuidle governor with.
>>>
>>> So how much decision making regarding the idle state to put the given CPU into
>>> should be there in the scheduler?  I believe the only information coming out
>>> of the scheduler regarding that should be "OK, this CPU is now idle and I'll
>>> need it in X nanoseconds from now" plus possibly a hint about the wakeup
>>> latency tolerance (but those hints may come from other places too).  That said
>>> the decision *which* CPU should become idle at the moment very well may require
>>> some information about what options are available from the layer below (for
>>> example, "putting core X into idle for Y of time will save us Z energy" or
>>> something like that).
>>
>> Agree. Except that the information should be "Ok , this CPU is now idle
>> and it has not done much work in the recent past,it is a 10% loaded CPU".
> 
> And what would that be useful for to the "idle" layer?  What matters is the
> "I'll need it in X nanoseconds from now" part.
> 
> Yes, the load part would be interesting to the "frequency" layer.

>>> What if we could integrate cpuidle with cpufreq so that there is one code
>>> layer representing what the hardware can do to the scheduler?  What benefits
>>> can we get from that, if any?
>>
>> We could debate on this point. I am a bit confused about this. As I see
>> it, there is no problem with keeping them separately. One, because of
>> code readability; it is easy to understand what are the different
>> parameters that the performance of CPU depends on, without needing to
>> dig through the code. Two, because cpu frequency kicks in during runtime
>> primarily and cpuidle during idle time of the cpu.
> 
> That's a very useful observation.  Indeed, there's the "idle" part that needs
> to be invoked when the CPU goes idle (and it should decide what idle state to
> put that CPU into), and there's the "scaling" part that needs to be invoked
> when the CPU has work to do (and it should decide what performance point to
> put that CPU into).  The question is, though, if it's better to have two
> separate frameworks for those things (which is what we have today) or to make
> them two parts of the same framework (like two callbacks one of which will be
> executed for CPUs that have just become idle and the other will be invoked
> for CPUs that have just got work to do).
> 
>> But this would also mean creating well defined interfaces between them.
>> Integrating cpufreq and cpuidle seems like a better argument to make due
>> to their common functionality at a higher level of talking to hardware
>> and tuning the performance parameters of cpu. But I disagree that
>> scheduler should be put into this common framework as well as it has
>> functionalities which are totally disjoint from what subsystems such as
>> cpuidle and cpufreq are intended to do.
> 
> That's correct.  The role of the scheduler, in my opinion, may be to call the
> "idle" and "scaling" functions at the right time and to give them information
> needed to make optimal choices.

Having looked at the points being brought about in this discussion and
the mail that Ingo sent out regarding his view points, I have a few
points to make.

Daniel Lezcano made a valid point when he stated that we need to
*move the cpufreq and cpuidle governor logic into the scheduler while
retaining their driver functionality in those subsystems.*

It is true that I was strongly against moving the governor logic into
the scheduler, thinking it would be simpler to enhance the communication
interface between the scheduler and the governors.
But having given this some thought, I think that approach would leave
greater scope for loopholes.

Catalin pointed this out well with an example in one of his mails:
suppose the scheduler ends up telling the cpu frequency governor when
to boost/lower the frequency. The scheduler is not aware of the user
policies that go into deciding whether the cpu frequency governor will
actually do what the scheduler is asking of it.

Only the cpu frequency governor is aware of these user policies, not
the scheduler. So how long should the scheduler wait for the cpu
frequency governor to boost the frequency? What if the user has
selected a powersave mode, and the cpu frequency cannot rise any
further? That would mean the cpu frequency governor telling the
scheduler that it can't do what the scheduler is asking of it.
The scheduler's decision is then a waste of time, since it gets
rejected by the cpufreq governor and nothing comes of it.

Very clearly the scheduler not being aware of the user policy is a big
drawback; had it known the user policies beforehand it would not even
have considered boosting the cpu frequency of the cpu in question.

This point that Ingo made is something we need to look hard at: "Today
the power saving landscape is fragmented." The scheduler today does not
know what in the world is the end result of its decisions. cpuidle and
cpufreq could take decisions that are totally counter-intuitive to the
scheduler's. Improving the communication between them would surely
mean exporting more and more information back and forth, whose end
result would probably be to merge the governors and the scheduler. If
this vision, that "they will eventually get so close that we will end
up merging them", is agreed upon, then it might be best to merge them
right away without wasting effort on adding logic that tries to
communicate between them, or even on trying to separate the
functionalities between the scheduler and the governors.

I don't think removing certain scheduler functionalities and putting
them into the governors instead is the right thing to do. The
scheduler's functions are tightly coupled with one another; breaking
one will, in my opinion, break a lot of things.

There have been points brought out strongly about how the scheduler
should have a global view of cores, so that it knows the effect on a
socket when it decides what to do with a core, for instance. This could
be the next step in its enhancement. Taking up one of the examples that
Daniel brought out: "Putting one of the cpus into an idle state could
lower the frequency of the socket, thus hampering the exit latency of
this idle state." (Not the exact words, but this is the point.)

Notice that for the scheduler to understand the above statement, it
first needs to be aware of the cpu frequency and idle state details.
*Therefore, as a first step, we need better knowledge in the scheduler
before it makes global decisions*.

Also note that, under the above circumstances, the scheduler cannot
talk back and forth with the governors to begin to learn about idle
states and frequencies at that point. This simply does not make sense.
(True, at this point I am heavily contradicting my previous arguments
:P. I felt that the existing communication was good enough and all that
was needed was a few more additions, but that does not seem to be the
case.)

Arjan also pointed out how a task running on a slower core should be
charged less than when it runs on a faster core. Right here is a use
case for the scheduler to be aware of the cpu frequency of a core,
since today it is the one which charges a task, but it is not aware of
what cpu frequency the task is running at. (It is aware of the cpu
frequency of a core through cpu power stats, but it uses it only for
load balancing today and not when it charges a task for its run time.)

My suggestion at this point is :

1. Begin to move the cpuidle and cpufreq *governor* logic into the
scheduler little by little.

2. Scheduler is already aware of the topology details, maybe enhance
that as the next step.

At this point, we would have a scheduler well aware of the effect of its
load balancing decisions to some extent.

3. Add the logic for the scheduler to get a global view of cpufreq
and cpuidle.

4. Then get system user policies (powersave/performance) to alter
scheduler behavior accordingly.

At this point, if we bring in today's patchsets (power aware scheduling
and packing tasks), they could deliver their intended benefits in most
cases, as opposed to sporadic behaviour, because the scheduler would be
aware of the whole picture and would do what these patches command only
if it is right all the way down to idle states and cpu frequencies, not
just down to load balancing.

I would appreciate all of your feedback on the above. I think at this
point we are in a position to judge what the next move in this
direction should be, and to make that move soon.


Regards
Preeti U Murthy





end of thread, other threads:[~2013-06-13  4:35 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20130530134718.GB32728@e103034-lin>
     [not found] ` <51B221AF.9070906@linux.vnet.ibm.com>
     [not found]   ` <20130608112801.GA8120@MacBook-Pro.local>
2013-06-08 14:02     ` power-efficient scheduling design Rafael J. Wysocki
2013-06-09  3:42       ` Preeti U Murthy
2013-06-09 22:53         ` Catalin Marinas
2013-06-10 16:25         ` Daniel Lezcano
2013-06-12  0:27           ` David Lang
2013-06-12  1:48             ` Arjan van de Ven
2013-06-12  9:48               ` Amit Kucheria
2013-06-12 16:22                 ` David Lang
2013-06-12 10:20               ` Catalin Marinas
2013-06-12 15:24                 ` Arjan van de Ven
2013-06-12 17:04                   ` Catalin Marinas
2013-06-12  9:50             ` Daniel Lezcano
2013-06-12 16:30               ` David Lang
2013-06-11  0:50         ` Rafael J. Wysocki
2013-06-13  4:32           ` Preeti U Murthy
