* RFC: A proposal for power capping through forced idle in the Linux  Kernel
@ 2009-12-14 23:11 Salman Qazi
  2009-12-14 23:21 ` Andi Kleen
                   ` (7 more replies)
  0 siblings, 8 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-14 23:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Andrew Morton, Michael Rubin, Taliver Heath

Greetings,

Google is implementing power capping, a technology that improves the
power efficiency of data centers. There are also some interesting
applications of this technology for laptops and cell phones.  Google
aims to send most of its Linux technology upstream. So, how can we get
this feature into the mainline kernel?

Overview:

Data centers are typically statically and pessimistically populated
based on the limitations of the power infrastructure in them.  Peak
power consumption of machines is determined, and based on this, the
number of machines and their placement in the hierarchy is limited to
not exceed the available power in the worst case.  Google is looking
at moving away from this static allocation of power to machines, to a
more dynamic model.  A key component of this model is power capping
done in software.

The idea is to place more machines in the data center than there is
power available to support (when all machines are operating at peak)
and then run the machines with a power cap.  The aim of the
project is to utilize more of the available power in the data center
than is possible with static provisioning.  As the amount of work
available changes through the day, the power caps on various machines
are changed as well, while staying within the infrastructure
constraints.  Power can be moved from the more idle parts of the data
center to the busier ones.

Since not all of our existing hardware is able to provide good power
measurements to the software running on it, we have decided to model
power in terms of CPU usage [0].

Current Interface used by Google:

The component of the kernel that we have built to implement software
power capping is called the "Idle Cycle Injector".

It has the following inputs, provided through procfs:

Forced Idle Percentage: This is the minimum percentage of time the CPU
is promised to be idle over the enforcement interval.

Enforcement interval: This is the length of time over which the power
cap is promised.

Aside from this, every cgroup has a new quantity added to the CPU
component called "Power Capping Priority".  This quantity indicates
the order in which the scheduler attributes the time spent injecting
idle cycles to specific processes.  This allows us to discriminate
among processes when it comes to accounting for the injected idle
time.  There is also an indication of interactivity versus batch for
the cgroup provided in the CPU component of the cgroup.
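
As a rough illustration of how the two procfs inputs above relate, the
sketch below derives the busy time allowed per enforcement interval
from the forced idle percentage.  The function name, the nanosecond
units and the round-up policy are assumptions made for this example;
they are not taken from the Idle Cycle Injector itself.

```python
def busy_budget_ns(interval_ns: int, forced_idle_pct: int) -> int:
    """Return the most busy time allowed within one enforcement interval."""
    if interval_ns <= 0:
        raise ValueError("enforcement interval must be positive")
    if not 0 <= forced_idle_pct <= 100:
        raise ValueError("forced idle percentage must be within 0..100")
    # Round the idle requirement up so the promised minimum is never undershot.
    min_idle_ns = (interval_ns * forced_idle_pct + 99) // 100
    return interval_ns - min_idle_ns


# Example: 30% forced idle over a 100 ms interval leaves 70 ms of busy time.
budget = busy_budget_ns(100_000_000, 30)
```

Integer arithmetic is used here only to keep the rounding explicit.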

Basic Algorithm:

Rather than blindly blasting the machine with the minimum required
idle cycles, our implementation keeps track of naturally occurring
idle cycles as follows:

0.  Set a timer (the hrtimer API is used) for the earlier of: the end of
the enforcement interval (clock time constraint) and the expected time
when we run out of allowed busy cycles if the CPU were entirely busy
from now on (CPU time constraint).
1.  When this timer expires, determine which constraint has been reached.
    a) If it is the clock time constraint, then we must start a new
       interval and go back to step 0.
    b) If it is the CPU time constraint, then the rest of the
       enforcement interval must be spent idling.  Continue to step 2.
2.  Set up a timer for the end of the enforcement interval and start
calling the idle function in a loop.  In our current implementation
we wake up a real-time kernel thread to do this.  Once finished,
account any injected idle time in the vruntime of processes, taken in
the order of power capping priority.  Finally, go back to step 0 and
start a new interval.
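
The steps above can be sketched as a small user-space model.  This is
not the kernel code: the Interval class, the function names and the
"rearm" outcome (the timer fired, but naturally occurring idle time
means neither constraint has been reached yet) are assumptions made for
illustration.  Time is in arbitrary integer units.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: int          # start of the enforcement interval
    length: int         # enforcement interval length
    busy_budget: int    # busy time allowed within the interval
    busy_used: int = 0  # busy time consumed so far

    @property
    def end(self) -> int:
        return self.start + self.length


def next_deadline(iv: Interval, now: int) -> int:
    """Step 0: earlier of interval end and budget exhaustion if busy from now on."""
    budget_exhausted_at = now + (iv.busy_budget - iv.busy_used)
    return min(iv.end, budget_exhausted_at)


def expired_constraint(iv: Interval, now: int) -> str:
    """Step 1: decide which constraint the timer expiry corresponds to."""
    if now >= iv.end:
        return "clock"   # 1a: start a new interval
    if iv.busy_used >= iv.busy_budget:
        return "cpu"     # 1b: idle until the end of the interval
    return "rearm"       # assumption: natural idle left budget; set the timer again


def run_interval(length: int, busy_budget: int, demand):
    """Replay (duration, wants_cpu) segments through one interval.

    Returns (busy, natural_idle, injected_idle).  Time in which the
    workload wanted the CPU but the budget was spent counts as injected.
    """
    now = busy = natural = injected = 0
    for duration, wants_cpu in demand:
        span = min(now + duration, length) - now
        if span <= 0:
            break
        if wants_cpu:
            ran = min(span, busy_budget - busy)
            busy += ran
            injected += span - ran
        else:
            natural += span
        now += span
    return busy, natural, injected
```

With a budget of 70 out of 100 units, a fully busy workload ends up
with 30 units of injected idle, while a workload that is naturally idle
for 40 units needs no injection at all.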

Eager Injection:

An interactive task may be prevented from running sufficiently early
by the presence of a batch task and end up wanting to run in the capped
portion of the interval.  But, since it cannot run in the capped
portion, it sees a severe latency hiccup.  To counter this, we
discriminate between the two classes through the concept of eager
injection.  The idea is that while we are below our desired minimum
idle quota, we do not let batch tasks run, but instead idle the CPU.
However, during this time, we let interactive tasks run (should they
happen to be runnable).  Once we are past the minimum idle quota,
everyone is free to run.  If the interactive tasks are abusive and
exhaust the CPU time, then idle cycles have to be injected to avoid
exceeding the quota.
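
The eager injection rule can be written down as a single admission
check.  The sketch below is only a model of the description above; the
function name, the string task classes and the explicit counters are
assumptions made for the example.

```python
def may_run(task_class: str, idle_so_far: int, min_idle_quota: int,
            busy_so_far: int, busy_budget: int) -> bool:
    """Decide whether a runnable task may use the CPU right now."""
    if busy_so_far >= busy_budget:
        # Capped portion of the interval: idle cycles must be injected,
        # even if the CPU time was exhausted by interactive tasks.
        return False
    if task_class == "interactive":
        # Interactive tasks may run while the idle quota is still unmet.
        return True
    # Batch tasks wait until the minimum idle quota has been reached.
    return idle_so_far >= min_idle_quota
```

For instance, with a minimum idle quota of 30 and a busy budget of 70,
a batch task is held back at the start of the interval while an
interactive task is admitted immediately.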

Known Limitations of Current Implementation:

0.  The major limitation of injecting in the thread context is that we
cannot prevent soft IRQ handlers from running and using up power.

1.  At sufficiently high forced idle percentages, the Idle Cycle Injector
starts working against itself.  In such cases, it is better to use
other means to make the CPU idle.

2.  Needs some work for SMT support.


Why not use voltage and frequency scaling?

Forced Idle Injection is more effective[1] and more widely available.
Even with voltage and frequency scaling, interpolation is needed
between the available settings.  So, if we did use voltage and
frequency scaling, we would still have to use a timer to take
measurements every so often and adjust the settings.  It would save us
from having to take over the CPU and actively inject, though.
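
One way to read "interpolation between the available settings" is time
slicing between the two adjacent settings that enclose the target.  The
sketch below assumes exactly that, and further assumes that average
power is linear in the share of time spent at each setting.  The
function names and the abstract power levels are hypothetical.

```python
def bracket(settings, target):
    """Return the adjacent available settings (low, high) enclosing target."""
    ordered = sorted(settings)
    if not ordered[0] <= target <= ordered[-1]:
        raise ValueError("target lies outside the available settings")
    for low, high in zip(ordered, ordered[1:]):
        if low <= target <= high:
            return low, high
    return ordered[0], ordered[0]  # only a single setting is available


def high_fraction(target, low, high):
    """Share of time to spend at `high` so the average equals `target`."""
    if high == low:
        return 0.0
    return (target - low) / (high - low)
```

A periodic timer would still be needed to switch between the two
settings in that proportion, which is the point made above.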

Application to Laptops and Cellphones:

Imagine being in a tent in Death Valley with a laptop.  You are bored,
and you want to watch a movie.  However, you also want to do your best
to make the battery last and watch as much of the movie as possible.
Forced idle power capping is a solution.  If your machine has a knob
that allows you to control the available power, you can turn that knob
until your video starts getting choppy.  And then, turn the knob back
a little bit.  Now, you have your video playing just as you like it,
with the minimal amount of power available to the machine.  With eager
injection and the power capping priority, your machine should spend
power on work that you care about, rather than background processes.

What does this have to do with mainline Linux?

We'd like to get as much of our stuff upstream as we can.  Given that
this is a somewhat sizable chunk of work, it would be impolite of me
to just send out a bunch of patches without hearing the concerns of
the community.  What are your thoughts on our design and what do we
need to change to get this to be more acceptable to the community?  I
also would like to know if there are any existing pieces of
infrastructure that this can utilize.

Relevant papers:

[0]. http://research.google.com/pubs/pub32980.html
[1]. http://www.cs.cmu.edu/~anshulg/weed2009.pdf
[2]. http://www.springerlink.com/index/D6287205272LK822.pdf

Regards,

Salman Qazi.


* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-14 23:11 RFC: A proposal for power capping through forced idle in the Linux Kernel Salman Qazi
@ 2009-12-14 23:21 ` Andi Kleen
  2009-12-14 23:51   ` tytso
  2009-12-14 23:51   ` tytso
  2009-12-14 23:21 ` Andi Kleen
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2009-12-14 23:21 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin,
	Taliver Heath, lenb

Salman Qazi <sqazi@google.com> writes:
>
> We'd like to get as much of our stuff upstream as we can.  Given that
> this is a somewhat sizable chunk of work, it would be impolite of me
> to just send out a bunch of patches without hearing the concerns of
> the community.  What are your thoughts on our design and what do we
> need to change to get this to be more acceptable to the community?  I
> also would like to know if there are any existing pieces of
> infrastructure that this can utilize.

There were a lot of discussions on this a few months ago in context
of the ACPI 4 "power aggregator" which is a similar (perhaps
slightly less sophisticated) concept. 

While there was a lot of talk about teaching the scheduler about this 
the end result was just a driver which just starts real time threads
and then idles in them. This is in current mainline.

It might be a good idea to review these discussions in the archives.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-14 23:21 ` Andi Kleen
@ 2009-12-14 23:51   ` tytso
  2009-12-15  0:42     ` Salman Qazi
                       ` (3 more replies)
  2009-12-14 23:51   ` tytso
  1 sibling, 4 replies; 44+ messages in thread
From: tytso @ 2009-12-14 23:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Salman Qazi, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath, lenb

On Tue, Dec 15, 2009 at 12:21:07AM +0100, Andi Kleen wrote:
> Salman Qazi <sqazi@google.com> writes:
> >
> > We'd like to get as much of our stuff upstream as we can.  Given that
> > this is a somewhat sizable chunk of work, it would be impolite of me
> > to just send out a bunch of patches without hearing the concerns of
> > the community.  What are your thoughts on our design and what do we
> > need to change to get this to be more acceptable to the community?  I
> > also would like to know if there are any existing pieces of
> > infrastructure that this can utilize.
> 
> There were a lot of discussions on this a few months ago in context
> of the ACPI 4 "power aggregator" which is a similar (perhaps
> slightly less sophisticated) concept. 
> 
> While there was a lot of talk about teaching the scheduler about this 
> the end result was just a driver which just starts real time threads
> and then idles in them. This is in current mainline.
> 
> It might be a good idea to review these discussions in the archives.

It should be noted that most of the heat from those discussions was
over adding the ACPI 4 mechanism to accept requests from the hardware
platform to add idle cycles in the case of thermal/power emergencies,
before we had the scheduler improvements to be able to do so in the
most efficient way possible.  See the description of commit 8e0af5141:

   ACPI 4.0 created the logical "processor aggregator device" as a
   mechanism for platforms to ask the OS to force otherwise busy
   processors to enter (power saving) idle.

   The intent is to lower power consumption to ride-out transient
   electrical and thermal emergencies, rather than powering off the
   server....

   Vaidyanathan Srinivasan has proposed scheduler enhancements to
   allow injecting idle time into the system. This driver doesn't
   depend on those enhancements, but could cut over to them when they
   are available.

   Peter Z. does not favor upstreaming this driver until those
   scheduler enhancements are in place. However, we favor upstreaming
   this driver now because it is useful now, and can be enhanced over
   time.

It looks to me that the scheme Salman has proposed for adding idle
cycles is quite sophisticated, probably more than Vaidyanathan's, and
the main difference is that Google wants to be able to control the
system's power/thermal envelope from userspace, as opposed to letting
the hardware request it in an emergency situation.  This makes
sense, if you are trying to balance the power/thermal requirements
across a large number of systems, as opposed to responding to a local
power/thermal emergency signalled from the platform's firmware.

So it would seem to me that Salman's suggestions are very similar to
what Peter requested before this commit went in (over his objections).

Regards,

     	   	     	    	 	     	- Ted



* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-14 23:11 RFC: A proposal for power capping through forced idle in the Linux Kernel Salman Qazi
  2009-12-14 23:21 ` Andi Kleen
  2009-12-14 23:21 ` Andi Kleen
@ 2009-12-15  0:19 ` Arjan van de Ven
  2009-12-15  0:36   ` Salman Qazi
                     ` (3 more replies)
  2009-12-15  0:19 ` Arjan van de Ven
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 44+ messages in thread
From: Arjan van de Ven @ 2009-12-15  0:19 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Mon, 14 Dec 2009 15:11:47 -0800
Salman Qazi <sqazi@google.com> wrote:


I like the general idea.  I have one request (that I didn't quite see in
your explanation): please make sure that all CPUs in the system do
their idle injection at the same time, so that memory can go into power
saving mode as well during this time etc etc...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15  0:19 ` Arjan van de Ven
@ 2009-12-15  0:36   ` Salman Qazi
  2009-12-15  1:06     ` Arjan van de Ven
                       ` (3 more replies)
  2009-12-15  0:36   ` Salman Qazi
                     ` (2 subsequent siblings)
  3 siblings, 4 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-15  0:36 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven <arjan@infradead.org> wrote:
> On Mon, 14 Dec 2009 15:11:47 -0800
> Salman Qazi <sqazi@google.com> wrote:
>
>
> I like the general idea, I have one request (that I didn't see quite in
> your explanation): Please make sure that all cpus in the system do
> their idle injection at the same time, so that memory can go into power
> saving mode as well during this time etc etc...
>

With the current interface, the forced idle percentages on the CPUs
are controlled independently.  There's a trade-off here.  If we inject
idle cycles on all the CPUs at the same time, our machine
responsiveness also degrades: essentially every CPU becomes equally
bad for an interactive task to run on.  Our aim at the moment is to
try to concentrate the idle cycles on a small set of CPUs, to strive
to leave some CPUs where interactive tasks can run unhindered.  But,
given a different workload and goals the correct policy may be
different.

Simultaneously idling multiple "cores" becomes necessary in the SMT
case, as there is no point in idling a single thread while the other
thread is running full tilt.  So, in such a case it is necessary to
idle all the threads making up the physical core.  This feature has
not been implemented yet.

I think the best approach may be to provide a way to specify the
policy from user space.  Basically, let the user decide at what
level of the CPU hierarchy the forced idle percentages are specified.
Then, at the levels below, we simply inject at the same time.
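
To make the idea concrete, here is a sketch of how a user-selected
hierarchy level could be turned into groups of CPUs that inject idle
time together.  The topology map (CPU number to package and core), the
level names and the function name are all hypothetical; no such
interface exists in the current implementation.

```python
from collections import defaultdict


def injection_groups(topology, level):
    """Group CPUs that must inject idle cycles at the same time.

    topology: dict mapping cpu -> (package, core)
    level:    "package", "core" or "thread"; the level at which forced
              idle percentages are specified.  Everything below that
              level ends up in one group and injects simultaneously.
    """
    groups = defaultdict(list)
    for cpu, (package, core) in sorted(topology.items()):
        if level == "package":
            key = (package,)
        elif level == "core":
            key = (package, core)
        elif level == "thread":
            key = (package, core, cpu)
        else:
            raise ValueError(f"unknown hierarchy level: {level}")
        groups[key].append(cpu)
    return list(groups.values())


# One package, two cores, two SMT threads per core.
topology = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1)}
```

With level "core", the two SMT threads of each physical core land in
the same group, which covers the SMT case described above.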

>
> --
> Arjan van de Ven        Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org
>


* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-14 23:51   ` tytso
  2009-12-15  0:42     ` Salman Qazi
@ 2009-12-15  0:42     ` Salman Qazi
  2009-12-22 19:48     ` Peter Zijlstra
  2009-12-22 19:48     ` Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-15  0:42 UTC (permalink / raw)
  To: tytso, Andi Kleen, Salman Qazi, linux-kernel, linux-pm,
	Andrew Morton, Michael Rubin, Taliver Heath, lenb

On Mon, Dec 14, 2009 at 3:51 PM,  <tytso@mit.edu> wrote:
> On Tue, Dec 15, 2009 at 12:21:07AM +0100, Andi Kleen wrote:
>> Salman Qazi <sqazi@google.com> writes:
>> >
>> > We'd like to get as much of our stuff upstream as we can.  Given that
>> > this is a somewhat sizable chunk of work, it would be impolite of me
>> > to just send out a bunch of patches without hearing the concerns of
>> > the community.  What are your thoughts on our design and what do we
>> > need to change to get this to be more acceptable to the community?  I
>> > also would like to know if there are any existing pieces of
>> > infrastructure that this can utilize.
>>
>> There were a lot of discussions on this a few months ago in context
>> of the ACPI 4 "power aggregator" which is a similar (perhaps
>> slightly less sophisticated) concept.
>>
>> While there was a lot of talk about teaching the scheduler about this
>> the end result was just a driver which just starts real time threads
>> and then idles in them. This is in current mainline.
>>
>> It might be a good idea to review these discussions in the archives.
>
> It should be noted that most of the heat from those discussions was
> over adding the ACPI 4 mechanism to accept requests from the hardware
> platform to add idle cycles in the case of thermal/power emergencies,
> before we had the scheduler improvements to be able to do so in the
> most efficient way possible.  See the description of commit 8e0af5141:
>
>   ACPI 4.0 created the logical "processor aggregator device" as a
>   mechanism for platforms to ask the OS to force otherwise busy
>   processors to enter (power saving) idle.
>
>   The intent is to lower power consumption to ride-out transient
>   electrical and thermal emergencies, rather than powering off the
>   server....
>
>   Vaidyanathan Srinivasan has proposed scheduler enhancements to
>   allow injecting idle time into the system. This driver doesn't
>   depend on those enhancements, but could cut over to them when they
>   are available.
>
>   Peter Z. does not favor upstreaming this driver until those
>   scheduler enhancements are in place. However, we favor upstreaming
>   this driver now because it is useful now, and can be enhanced over
>   time.
>
> It looks to me that scheme that Salman has proposed for adding idle
> cycles is quite sophisticated, probably more than Vaidyanathan's, and
> the main difference is that Google wants the ability to be able to
> control the system's power/thermal envelope from userspace, as opposed
> to letting the hardware request in an emergency situation.  This makes
> sense, if you are trying to balance the power/thermal requirements
> across a large number of systems, as opposed to responding to a local
> power/thermal emergency signalled from the platform's firmware.
>

This is correct.  The source of our signal is user space software on
remote machines ("hierarchical regulators") that have information
about power usage across multiple machines at various levels of the power
hierarchy.  This is ultimately delivered to a local daemon, which
communicates it to the kernel.

As we are not using this as an emergency measure, we are very
concerned about the performance implications.

> So it would seem to me that Salman's suggestions are very similar to
> what Peter requested before this commit went in (over his objections).
>
> Regards,
>
>                                                - Ted
>
>


* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15  0:36   ` Salman Qazi
@ 2009-12-15  1:06     ` Arjan van de Ven
  2009-12-15 20:15       ` Salman Qazi
  2009-12-15 20:15       ` Salman Qazi
  2009-12-15  1:06     ` Arjan van de Ven
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 44+ messages in thread
From: Arjan van de Ven @ 2009-12-15  1:06 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Mon, 14 Dec 2009 16:36:20 -0800
Salman Qazi <sqazi@google.com> wrote:

> On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven
> <arjan@infradead.org> wrote:
> > On Mon, 14 Dec 2009 15:11:47 -0800
> > Salman Qazi <sqazi@google.com> wrote:
> >
> >
> > I like the general idea, I have one request (that I didn't see
> > quite in your explanation): Please make sure that all cpus in the
> > system do their idle injection at the same time, so that memory can
> > go into power saving mode as well during this time etc etc...
> >
> 
> With the current interface, the forced idle percentages on the CPUs
> are controlled independently.  There's a trade-off here.  If we inject

I'm fine with that... just want to ask that even if we inject different
percentages, that we inject them for maximum overlap
(having the memory power in a machine suddenly be half or less is a
huge step in power, for something, the alignment itself, that does not
cost much if any extra performance over randomly distributed idle
insertions)

> idle cycles on all the CPU at the same time, our machine
> responsiveness also degrades: essentially every CPU becomes equally
> bad for an interactive task to run on.  Our aim at the moment is to
> try to concentrate the idle cycles on a small set of CPUs, to strive
> to leave some CPUs where interactive tasks can run unhindered.  But,
> given a different workload and goals the correct policy may be
> different.

as long as the tentative portion of the idle time gets injected at the
same time.. I suspect there can be a decent balance here where most of
the time we get the full CPU *and* memory savings, while we degrade
gracefully for the case where we get increasingly more interactive
activity.

 
> Simultaneously idling multiple "cores" becomes necessary in the SMT
> case: as there is no point in idling a single thread, while the other
> thread is running full tilt.  

I can argue the same for package level btw ;)


> I think the best approach may be to provide a way to specify the
> policy from the user space.  Basically let the user decide at what
> level of CPU hierarchy the forced idle percentages are specified.
> Then, in the levels below, we simply inject at the same time.

it's not so much about the specification part; per logical cpu is a nice
place to specify things... as long as we, in the execution part, align
things up smart.




-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-15  0:36   ` Salman Qazi
                       ` (2 preceding siblings ...)
  2009-12-15 10:29     ` Vaidyanathan Srinivasan
@ 2009-12-15 10:29     ` Vaidyanathan Srinivasan
  2009-12-15 11:50         ` Vaidyanathan Srinivasan
                         ` (2 more replies)
  3 siblings, 3 replies; 44+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-12-15 10:29 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Arjan van de Ven, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

* Salman Qazi <sqazi@google.com> [2009-12-14 16:36:20]:

> On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven <arjan@infradead.org> wrote:
> > On Mon, 14 Dec 2009 15:11:47 -0800
> > Salman Qazi <sqazi@google.com> wrote:
> >
> >
> > I like the general idea, I have one request (that I didn't see quite in
> > your explanation): Please make sure that all cpus in the system do
> > their idle injection at the same time, so that memory can go into power
> > saving mode as well during this time etc etc...
> >

The value of the overall idea is well understood but the
implementation and benefits in terms of power savings were the major
point of discussion earlier.
 
> With the current interface, the forced idle percentages on the CPUs
> are controlled independently.  There's a trade-off here.  If we inject
> idle cycles on all the CPU at the same time, our machine
> responsiveness also degrades: essentially every CPU becomes equally
> bad for an interactive task to run on.  Our aim at the moment is to
> try to concentrate the idle cycles on a small set of CPUs, to strive
> to leave some CPUs where interactive tasks can run unhindered.  But,
> given a different workload and goals the correct policy may be
> different.
> 
> Simultaneously idling multiple "cores" becomes necessary in the SMT
> case: as there is no point in idling a single thread, while the other
> thread is running full tilt.  So, in such a case it is necessary to
> idle all the threads making up the physical core.  This feature has
> not been implemented yet.
> 
> I think the best approach may be to provide a way to specify the
> policy from the user space.  Basically let the user decide at what
> level of CPU hierarchy the forced idle percentages are specified.
> Then, in the levels below, we simply inject at the same time.

Synchronising the idle times across multiple cores and also selecting
sibling threads belonging to the same core is important.  The current
ACPI forced idle driver can inject idle time, but the injection is not
synchronized across multiple cores.
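
A minimal sketch of the sibling-selection step, assuming the sibling
information comes from strings in the format Linux exposes in
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list (for example
"0,4" or "0-1").  The function names and the "take the largest
percentage requested for any sibling" policy are hypothetical choices
made for illustration; they are not part of the proposed interface.

```python
def parse_cpu_list(text):
    """Parse a CPU list such as "0,4" or "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_by_core(sibling_lists):
    """Turn {cpu: sibling-list string} into a sorted list of unique sibling groups."""
    groups = {tuple(parse_cpu_list(text)) for text in sibling_lists.values()}
    return sorted(groups)


def core_idle_percent(groups, per_cpu_percent):
    """Assumed policy: every thread of a core gets the largest percentage
    requested for any of its siblings, so the whole core idles together."""
    result = {}
    for group in groups:
        pct = max(per_cpu_percent.get(cpu, 0) for cpu in group)
        for cpu in group:
            result[cpu] = pct
    return result


# Hypothetical 2-core, 4-thread machine.
siblings = {0: "0,4", 4: "0,4", 1: "1,5", 5: "1,5"}
groups = group_by_core(siblings)
plan = core_idle_percent(groups, {0: 30, 5: 10})
```

With these inputs the threads of the first core share one percentage and
the threads of the second core share another, which is the grouping the
paragraph above asks for.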

Allowing the scheduler load balancer to avoid using a part of the
sched domain tree will allow easy grouping of sibling threads and
sibling cores if that saves more power.

However as Arjan mentioned, new architectures have significant power
savings at full system idle where memory power is reduced.  Injecting
idle time on any of the cores will actually increase the utilisation on
the other cores (unless the system is fully loaded) and reduce the full
system idle time opportunity.  Basically injecting idle time on some
of the cores in the system goes against the race-to-idle policy
thereby decreasing overall system operating efficiency.
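
The effect can be illustrated with a deliberately simplified model.
The assumptions, none of which come from measurements: the work in one
period is perfectly divisible, it is spread evenly over the CPUs that
are allowed to run, and every CPU starts its busy phase at the same
instant.  Under those assumptions the whole system is idle for whatever
part of the period is left after the busy phase.

```python
def full_system_idle_fraction(total_work, cpus_available):
    """Fraction of a period in which every CPU is idle at the same time.

    total_work is expressed in CPU-periods (2.0 means two CPUs' worth of
    work per period).  The result is clamped at zero when the available
    CPUs are saturated.
    """
    busy = min(1.0, total_work / cpus_available)
    return 1.0 - busy


# Hypothetical numbers: two CPU-periods of work on an 8-CPU machine.
all_cpus = full_system_idle_fraction(2.0, 8)
# Same work after four CPUs have been forced idle for the whole period.
half_cpus = full_system_idle_fraction(2.0, 4)
```

In this toy model, taking CPUs away lengthens the busy phase on the
remaining ones and so shrinks the window in which memory could enter a
power-saving state.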

Can you please clarify the following questions:

* What is the typical duration of idle time injected?
        - Tens of milliseconds?  Are CPUs expected to reach the
          lowest-power idle state within this time?

* You mentioned that natural idle time in the system is taken into
  account before injecting forced idle time, which is a good feature
  to have.
        - In most workloads, as the utilisation drops, all the cpus
          have similar idle times.  This is favourable for exploiting
          memory power saving.  
        - Now when more idle time needs to be inserted, is it
          uniformly spread across all CPUs?

Suggestions:

* Can cgroup hardlimits help here to inject idle times?
  http://lkml.org/lkml/2009/11/17/191

  The problem of distributing idle time equally across CPUs and
  relating sibling threads is still an issue, but can be worked out.
  As of now hardlimits can distribute idle time across CPUs thereby
  enabling full system idle.

--Vaidy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-15 10:29     ` Vaidyanathan Srinivasan
@ 2009-12-15 11:50         ` Vaidyanathan Srinivasan
  2009-12-15 20:50       ` Salman Qazi
  2009-12-15 20:50       ` Salman Qazi
  2 siblings, 0 replies; 44+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-12-15 11:50 UTC (permalink / raw)
  To: Salman Qazi
  Cc: Arjan van de Ven, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

* Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2009-12-15 15:59:09]:

> * Salman Qazi <sqazi@google.com> [2009-12-14 16:36:20]:
> 
> > On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven <arjan@infradead.org> wrote:
> > > On Mon, 14 Dec 2009 15:11:47 -0800
> > > Salman Qazi <sqazi@google.com> wrote:
> > >
> > >
> > > I like the general idea, I have one request (that I didn't see quite in
> > > your explanation): Please make sure that all cpus in the system do
> > > their idle injection at the same time, so that memory can go into power
> > > saving mode as well during this time etc etc...
> > >
> 
> The value of the overall idea is well understood but the
> implementation and benefits in terms of power savings were the major
> point of discussion earlier.
> 
> > With the current interface, the forced idle percentages on the CPUs
> > are controlled independently.  There's a trade-off here.  If we inject
> > idle cycles on all the CPU at the same time, our machine
> > responsiveness also degrades: essentially every CPU becomes equally
> > bad for an interactive task to run on.  Our aim at the moment is to
> > try to concentrate the idle cycles on a small set of CPUs, to strive
> > to leave some CPUs where interactive tasks can run unhindered.  But,
> > given a different workload and goals the correct policy may be
> > different.
> > 
> > Simultaneously idling multiple "cores" becomes necessary in the SMT
> > case: as there is no point in idling a single thread, while the other
> > thread is running full tilt.  So, in such a case it is necessary to
> > idle all the threads making up the physical core.  This feature has
> > not been implemented yet.
> > 
> > I think the best approach may be to provide a way to specify the
> > policy from the user space.  Basically let the user decide at what
> > level of CPU hierarchy the forced idle percentages are specified.
> > Then, in the levels below, we simply inject at the same time.
> 
> Synchronising the idle times across multiple cores and also selecting
> sibling threads belonging to the same core is important.  The current
> ACPI forced idle driver can inject idle time, but the injection is not
> synchronized across multiple cores.
> 
> Allowing the scheduler load balancer to avoid using a part of the
> sched domain tree will allow easy grouping of sibling threads and
> sibling cores if that saves more power.
> 
> However as Arjan mentioned, new architectures have significant power
> savings at full system idle where memory power is reduced.  Injecting
> idle time on any of the cores will actually increase the utilisation on
> the other cores (unless the system is fully loaded) and reduce the full
> system idle time opportunity.  Basically injecting idle time on some
> of the cores in the system goes against the race-to-idle policy
> thereby decreasing overall system operating efficiency.
> 
> Can you please clarify the following questions:
> 
> * What is the typical duration of idle time injected?
>         - Tens of milliseconds?  Are CPUs expected to reach the
>           lowest-power idle state within this time?
> 
> * You mentioned that natural idle time in the system is taken into
>   account before injecting forced idle time, which is a good feature
>   to have.
>         - In most workloads, as the utilisation drops, all the cpus
>           have similar idle times.  This is favourable for exploiting
>           memory power saving.  
>         - Now when more idle time needs to be inserted, is it
>           uniformly spread across all CPUs?

* How is the fairness issue in the scheduler handled?  Inserting idle
  time may affect interactivity and fairness badly.

--Vaidy

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15  1:06     ` Arjan van de Ven
@ 2009-12-15 20:15       ` Salman Qazi
  2009-12-17 11:01         ` Arjan van de Ven
  2009-12-17 11:01         ` Arjan van de Ven
  2009-12-15 20:15       ` Salman Qazi
  1 sibling, 2 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-15 20:15 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Mon, Dec 14, 2009 at 5:06 PM, Arjan van de Ven <arjan@infradead.org> wrote:
> On Mon, 14 Dec 2009 16:36:20 -0800
> Salman Qazi <sqazi@google.com> wrote:
>
>> On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven
>> <arjan@infradead.org> wrote:
>> > On Mon, 14 Dec 2009 15:11:47 -0800
>> > Salman Qazi <sqazi@google.com> wrote:
>> >
>> >
>> > I like the general idea, I have one request (that I didn't see
>> > quite in your explanation): Please make sure that all cpus in the
>> > system do their idle injection at the same time, so that memory can
>> > go into power saving mode as well during this time etc etc...
>> >
>>
>> With the current interface, the forced idle percentages on the CPUs
>> are controlled independently.  There's a trade-off here.  If we inject
>
> I'm fine with that... just want to ask that even if we inject different
> percentages, that we inject them for maximum overlap
> (having the memory power in a machine suddenly be half or less is a
> huge step in power, for something, the alignment itself, that does not
> cost much if any extra performance over randomly distributed idle
> insertions)

I think there is a difference in goals here.  Our goal is to free up
power from one machine, so that it can be used elsewhere.  This means
that any power that we save has to be predictably saved.  We have
models that map between CPU time and power usage, that have to account
for the worst case.  So, if we can save some extra power by
opportunistic means, unfortunately we have no way of using that power
elsewhere.

Having said that, energy savings is a worthy goal by itself.  But, we
have to make sure to balance it with performance.
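
To make the "predictable" requirement concrete, here is a sketch of how
a power cap might be turned into a forced idle percentage.  It assumes
a linear worst-case model between an idle power and a peak power; the
thread only says that models map CPU time to power and account for the
worst case, so the linear form, the function name and the wattages
below are all hypothetical.

```python
def forced_idle_percent(cap_watts, idle_watts, peak_watts):
    """Forced idle percentage needed to stay under cap_watts.

    Assumed model: power(u) = idle_watts + (peak_watts - idle_watts) * u,
    where u in [0, 1] is CPU utilisation and peak_watts is the worst case.
    """
    if peak_watts <= idle_watts:
        raise ValueError("peak power must exceed idle power")
    max_util = (cap_watts - idle_watts) / (peak_watts - idle_watts)
    max_util = min(1.0, max(0.0, max_util))  # clamp to a valid utilisation
    return 100.0 * (1.0 - max_util)


# Hypothetical machine: 100 W idle, 300 W worst-case peak, 250 W cap.
pct = forced_idle_percent(250.0, 100.0, 300.0)
```

Because the model is built around the worst case, any extra saving from
memory entering a low-power state does not change the percentage, which
is why such opportunistic savings cannot be handed to another machine.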

>
>> idle cycles on all the CPU at the same time, our machine
>> responsiveness also degrades: essentially every CPU becomes equally
>> bad for an interactive task to run on.  Our aim at the moment is to
>> try to concentrate the idle cycles on a small set of CPUs, to strive
>> to leave some CPUs where interactive tasks can run unhindered.  But,
>> given a different workload and goals the correct policy may be
>> different.
>
> as long as the tentative portion of the idle time gets injected at the
> same time.. I suspect there can be a decent balance here where most of
> the time we get the full CPU *and* memory savings, while we degrade
> gracefully for the case where we get increasingly more interactive
> activity.

Let me rephrase what you just said to verify that I understand you
correctly:  we should align the enforcement intervals across CPUs and
thereby eagerly inject over the same time period across multiple CPUs.
If there is no interactive workload then we would get maximal CPU and
memory savings.  If there are interactive tasks, then we just let them
run as we would with eager injection in the original design.

If I understand correctly, then I like your idea.
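
A small sketch of that alignment, under the assumption that every CPU
shares one enforcement interval and places its forced idle time at the
start of it.  The helper names and the example percentages are
hypothetical; the sketch ignores interactive tasks preempting the
injected idle time.

```python
def aligned_windows(period_ms, idle_percent):
    """Place each CPU's forced idle window at the start of the shared period."""
    return {cpu: (0.0, period_ms * pct / 100.0)
            for cpu, pct in idle_percent.items()}


def common_idle_ms(windows):
    """Length of time during which every CPU is inside its idle window."""
    start = max(s for s, _ in windows.values())
    end = min(e for _, e in windows.values())
    return max(0.0, end - start)


# Different percentages per CPU, one 100 ms enforcement interval.
aligned = aligned_windows(100.0, {0: 20, 1: 50, 2: 30})
overlap_aligned = common_idle_ms(aligned)

# The same amounts of idle time placed back to back instead.
staggered = {0: (0.0, 20.0), 1: (20.0, 70.0), 2: (70.0, 100.0)}
overlap_staggered = common_idle_ms(staggered)
```

With aligned windows the system-wide idle time equals the smallest
per-CPU window; with the staggered placement there is no moment when
all CPUs are idle together, even though each CPU idles for just as long.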

>
>
>> Simultaneously idling multiple "cores" becomes necessary in the SMT
>> case: as there is no point in idling a single thread, while the other
>> thread is running full tilt.
>
> I can argue the same for package level btw ;)
>
>
>> I think the best approach may be to provide a way to specify the
>> policy from the user space.  Basically let the user decide at what
>> level of CPU hierarchy the forced idle percentages are specified.
>> Then, in the levels below, we simply inject at the same time.
>
> it's not so much about the specification part; per logical cpu is a nice
> place to specify things... as long as we, in the execution part, align
> things up smart.
>
>
>
>
> --
> Arjan van de Ven        Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15 10:29     ` Vaidyanathan Srinivasan
  2009-12-15 11:50         ` Vaidyanathan Srinivasan
@ 2009-12-15 20:50       ` Salman Qazi
  2009-12-15 20:50       ` Salman Qazi
  2 siblings, 0 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-15 20:50 UTC (permalink / raw)
  To: svaidy
  Cc: Arjan van de Ven, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

On Tue, Dec 15, 2009 at 2:29 AM, Vaidyanathan Srinivasan
<svaidy@linux.vnet.ibm.com> wrote:
> * Salman Qazi <sqazi@google.com> [2009-12-14 16:36:20]:
>
>> On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven <arjan@infradead.org> wrote:
>> > On Mon, 14 Dec 2009 15:11:47 -0800
>> > Salman Qazi <sqazi@google.com> wrote:
>> >
>> >
>> > I like the general idea, I have one request (that I didn't see quite in
>> > your explanation): Please make sure that all cpus in the system do
>> > their idle injection at the same time, so that memory can go into power
>> > saving mode as well during this time etc etc...
>> >
>
> The value of the overall idea is well understood but the
> implementation and benefits in terms of power savings was the major
> point of discussion earlier.
>
>> With the current interface, the forced idle percentages on the CPUs
>> are controlled independently.  There's a trade-off here.  If we inject
>> idle cycles on all the CPU at the same time, our machine
>> responsiveness also degrades: essentially every CPU becomes equally
>> bad for an interactive task to run on.  Our aim at the moment is to
>> try to concentrate the idle cycles on a small set of CPUs, to strive
>> to leave some CPUs where interactive tasks can run unhindered.  But,
>> given a different workload and goals the correct policy may be
>> different.
>>
>> Simultaneously idling multiple "cores" becomes necessary in the SMT
>> case: as there is no point in idling a single thread, while the other
>> thread is running full tilt.  So, in such a case it is necessary to
>> idle all the threads making up the physical core.  This feature has
>> not been implemented yet.
>>
>> I think the best approach may be to provide a way to specify the
>> policy from the user space.  Basically let the user decide at what
>> level of CPU hierarchy the forced idle percentages are specified.
>> Then, in the levels below, we simply inject at the same time.
>
> Synchronising the idle times across multiple cores and also selecting
> sibling threads belonging to the same core is important.  The current
> ACPI forced idle driver can inject idle time but not synchronized
> across multiple cores.
>
> Allowing the scheduler load balancer to avoid using a part of the
> sched domain tree will allow easy grouping of sibling threads and
> sibling cores if that saves more power.
>
> However as Arjan mentioned, new architectures have significant power
> savings at full system idle where memory power is reduced.  Injecting
> idle time in any of the core will actually increase the utilisation on
> the other cores (unless the system is full loaded) and reduce the full
> system idle time opportunity.  Basically injecting idle time on some
> of the cores in the system goes against the race-to-idle policy
> thereby decreasing overall system operating efficiency.
>
> Can you please clarify the following questions:
>
> * What is the typical duration of idle time injected?
>        - 10s of milli seconds?  CPUs are expected to goto lowest
>          power idle state within this time?

This depends on the specific user.  I can only speak for Google's
intentions for this.  The duration of the injected time would
typically be single digit milliseconds.  We don't need the CPUs to go
into the lowest power idle state for our purposes.  We care more about
the predictable component of the power savings, as this is the
component that we can use elsewhere.  Given that there may be
interrupts that prevent us from reaching the lowest power idle state,
we should really not rely on that in our power models.  Therefore,
while it is great from an energy savings point of view to reach the
lowest power idle state, it doesn't help us from a power shifting
point of view.

>
> * You mentioned that natural idle time in the system is taken into
>  account before injecting forced idle time, which is a good feature
>  to have.
>        - In most workloads, as the utilisation drops, all the cpus
>          have similar idle times.  This is favourable for exploiting
>          memory power saving.
>        - Now when more idle time need to be inserted, is it
>          uniformly spread across all CPUs?

The settings at the moment are per-CPU and so is enforcement.  The
current implementation does not do any CPU cross talk.  Each CPU
simply maintains its own minimum forced idle percentage and these
cycles are not horse traded across CPUs.  So, the answer to your
question in the general case is no.  The user may even choose to not
set any kind of a cap on some subset of CPUs.
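
As a rough illustration of this per-CPU accounting (a sketch only, not
the actual kernel code): each CPU compares its own minimum forced idle
percentage with the idle time it already got naturally, and injects only
the shortfall.  The 10 ms interval, the function name and the sample
numbers are made up for the example.

```python
# Hypothetical sketch: per-CPU forced idle with no cross-CPU trading.
INTERVAL_MS = 10.0  # assumed enforcement interval, not from the design


def forced_idle_needed(interval_ms, min_idle_pct, natural_idle_ms):
    """Idle time this CPU must still inject in the current interval."""
    required_ms = interval_ms * min_idle_pct / 100.0
    # Naturally occurring idle time counts toward the requirement.
    return max(0.0, required_ms - natural_idle_ms)


# CPUs 0 and 1 have no cap at all; CPUs 2 and 3 must idle 50% of the time.
min_idle_pct = {0: 0, 1: 0, 2: 50, 3: 50}
natural_idle_ms = {0: 1.0, 1: 0.0, 2: 2.0, 3: 6.0}

injected = {
    cpu: forced_idle_needed(INTERVAL_MS, pct, natural_idle_ms[cpu])
    for cpu, pct in min_idle_pct.items()
}
print(injected)
```

Only CPU 2 ends up injecting here: CPU 3 was already idle for longer
than its requirement, and that surplus is not handed over to CPU 2.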

>
> Suggestions:
>
> * Can cgroup hardlimits help here to inject idle times
>  http://lkml.org/lkml/2009/11/17/191
>
>  The problem of distributing idle time equally across CPUs and
>  relating sibling threads is still and issue, but can be worked out.
>  As of now hardlimits can distribute idle time across CPUs thereby
>  enabling full system idle.

Sibling threads are a major issue here.  If all of the idle cycles are
not injected simultaneously on both threads, then the resulting power
savings will not match the expected power savings.  Since we care
about predictability of the power savings, such savings would not help
us at all.  So, having a heuristic that improves the probability of
the right thing happening is not sufficient.  Hard assurances are
required.

Aside from that, our current implementation discriminates between
batch and interactive tasks.  In the first phase called "eager
injection", we let the interactive tasks run but prevent batch tasks
from running (preferring to idle the machine instead).  This allows us
to reduce the impact on interactive tasks by preventing batch tasks
from forcing the interactive tasks into the fully idle part of the
time period.  Thus, interactive tasks should not incur any additional
latency due to the behavior of the batch tasks.  If we are going to
use cgroup hardlimits, an equivalent feature would need to be added.
Basically, have an initial "protected period" for interactive tasks
where we do not let batch tasks run.



>
> --Vaidy
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15 11:50         ` Vaidyanathan Srinivasan
@ 2009-12-15 21:00           ` Salman Qazi
  -1 siblings, 0 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-15 21:00 UTC (permalink / raw)
  To: svaidy
  Cc: Arjan van de Ven, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

On Tue, Dec 15, 2009 at 3:50 AM, Vaidyanathan Srinivasan
<svaidy@linux.vnet.ibm.com> wrote:
> * Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> [2009-12-15 15:59:09]:
>
>> * Salman Qazi <sqazi@google.com> [2009-12-14 16:36:20]:
>>
>> > On Mon, Dec 14, 2009 at 4:19 PM, Arjan van de Ven <arjan@infradead.org> wrote:
>> > > On Mon, 14 Dec 2009 15:11:47 -0800
>> > > Salman Qazi <sqazi@google.com> wrote:
>> > >
>> > >
>> > > I like the general idea, I have one request (that I didn't see quite in
>> > > your explanation): Please make sure that all cpus in the system do
>> > > their idle injection at the same time, so that memory can go into power
>> > > saving mode as well during this time etc etc...
>> > >
>>
>> The value of the overall idea is well understood but the
>> implementation and benefits in terms of power savings was the major
>> point of discussion earlier.
>>
>> > With the current interface, the forced idle percentages on the CPUs
>> > are controlled independently.  There's a trade-off here.  If we inject
>> > idle cycles on all the CPU at the same time, our machine
>> > responsiveness also degrades: essentially every CPU becomes equally
>> > bad for an interactive task to run on.  Our aim at the moment is to
>> > try to concentrate the idle cycles on a small set of CPUs, to strive
>> > to leave some CPUs where interactive tasks can run unhindered.  But,
>> > given a different workload and goals the correct policy may be
>> > different.
>> >
>> > Simultaneously idling multiple "cores" becomes necessary in the SMT
>> > case: as there is no point in idling a single thread, while the other
>> > thread is running full tilt.  So, in such a case it is necessary to
>> > idle all the threads making up the physical core.  This feature has
>> > not been implemented yet.
>> >
>> > I think the best approach may be to provide a way to specify the
>> > policy from the user space.  Basically let the user decide at what
>> > level of CPU hierarchy the forced idle percentages are specified.
>> > Then, in the levels below, we simply inject at the same time.
>>
>> Synchronising the idle times across multiple cores and also selecting
>> sibling threads belonging to the same core is important.  The current
>> ACPI forced idle driver can inject idle time but not synchronized
>> across multiple cores.
>>
>> Allowing the scheduler load balancer to avoid using a part of the
>> sched domain tree will allow easy grouping of sibling threads and
>> sibling cores if that saves more power.
>>
>> However as Arjan mentioned, new architectures have significant power
>> savings at full system idle where memory power is reduced.  Injecting
>> idle time in any of the core will actually increase the utilisation on
>> the other cores (unless the system is full loaded) and reduce the full
>> system idle time opportunity.  Basically injecting idle time on some
>> of the cores in the system goes against the race-to-idle policy
>> thereby decreasing overall system operating efficiency.
>>
>> Can you please clarify the following questions:
>>
>> * What is the typical duration of idle time injected?
>>         - 10s of milli seconds?  CPUs are expected to goto lowest
>>           power idle state within this time?
>>
>> * You mentioned that natural idle time in the system is taken into
>>   account before injecting forced idle time, which is a good feature
>>   to have.
>>         - In most workloads, as the utilisation drops, all the cpus
>>           have similar idle times.  This is favourable for exploiting
>>           memory power saving.
>>         - Now when more idle time need to be inserted, is it
>>           uniformly spread across all CPUs?
>
> * How is the fairness issue in the scheduler handled?  Inserting idle
>  time may affect interactivity and fairness badly.

As mentioned in the design, we have two features to make this work.
First, we have an "Eager Injection" phase, where we do not let any batch
tasks run but permit interactive tasks to run.  This phase lasts until
we are either sure that we have enough idle cycles (in which case
everyone is free to run) or we are sure that we have to spend the rest
of the interval injecting.  This latter scenario is called the lazy
injection phase.
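
The phase decision above can be sketched roughly as follows.  This is
only an illustration under assumed names and units (a single interval,
times in milliseconds), not the code we actually run:

```python
# Hypothetical sketch of the eager/lazy phase decision for one interval.
def injection_phase(interval, required_idle, elapsed, idle_so_far):
    """Return which phase a CPU is in at time `elapsed` of the interval."""
    remaining = interval - elapsed
    shortfall = required_idle - idle_so_far
    if shortfall <= 0:
        # Enough idle cycles already: everyone is free to run.
        return "free"
    if shortfall >= remaining:
        # The rest of the interval must be spent injecting.
        return "lazy"
    # Still undecided: batch tasks are held back, interactive tasks run.
    return "eager"


print(injection_phase(10, 4, 2, 1))  # early in the interval
print(injection_phase(10, 4, 5, 4))  # requirement already met
print(injection_phase(10, 4, 7, 1))  # no slack left
```

With a 10 ms interval and 4 ms of required idle time, the three calls
land in the eager, free-running and lazy phases respectively.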

Second, we have "power capping priority", a per-cgroup value which
determines the order in which the "blame" is assigned for the injected
cycles.  For the purposes of scheduling decisions, we pretend that the
lowest power capping priority job was running when we were injecting
idle cycles.  If the lowest priority job did not deserve sufficient
run time in the period in question, then we move to the next higher
priority job and so on.  Thus, we penalize the jobs in the power
capping priority order for the time spent injecting idle cycles.  This
allows us to make sure that important jobs get to use the available
power, and the less important jobs are the first to suffer.
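
A small sketch of this blame assignment, with made-up job names and
numbers.  It assumes that a lower number means a lower power capping
priority and that each job's "deserved" run time for the period is
already known; neither detail is spelled out in the design:

```python
# Hypothetical sketch: charge injected idle time in priority order.
def assign_blame(injected, jobs):
    """jobs: list of (name, power_capping_priority, deserved_runtime)."""
    charges = {}
    remaining = injected
    # Lowest power capping priority is blamed first.
    for name, _priority, deserved in sorted(jobs, key=lambda j: j[1]):
        # A job cannot be charged more than it deserved to run.
        charge = min(remaining, deserved)
        charges[name] = charge
        remaining -= charge
    return charges, remaining


jobs = [("web", 10, 6.0), ("batch", 1, 2.0), ("index", 5, 3.0)]
charges, uncharged = assign_blame(4.0, jobs)
print(charges, uncharged)
```

Here 4 ms of injected idle time is charged as 2 ms to "batch" (all it
deserved), then 2 ms to "index", and nothing to "web".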

>
> --Vaidy
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-15 20:15       ` Salman Qazi
  2009-12-17 11:01         ` Arjan van de Ven
@ 2009-12-17 11:01         ` Arjan van de Ven
  1 sibling, 0 replies; 44+ messages in thread
From: Arjan van de Ven @ 2009-12-17 11:01 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Tue, 15 Dec 2009 12:15:30 -0800
Salman Qazi <sqazi@google.com> wrote:

> > (having the memory power in a machine suddenly be half or less is a
> > huge step in power, for something, the alignment itself, that does
> > not cost much if any extra performance over randomly distributed
> > idle insertions)
> 
> I think there is a difference in goals here.  Our goal is to free up
> power from one machine, so that it can be used elsewhere.  This means
> that any power that we save has to be predictably saved. 

unless your population of machines is big enough, and you get "close
enough" in terms of predictability.

> We have
> models that map between CPU time and power usage, that have to account
> for the worst case.

the delta between worst case and average/typical even for cpu time <->
power usage is enormous (2x to 3x would not surprise me)... adding
memory to that equation does not change the fundamentals much.



> Having said that, energy savings is a worthy goal by itself.  But, we
> have to make sure to balance it with performance.

I would think the people who pay the power bill at the end of the month
would notice a 10% reduction ;-)


> > as long as the tentative portion of the idle time gets injected at
> > the same time.. I suspect there can be a decent balance here where
> > most of the time we get the full CPU *and* memory savings, while we
> > degrade gracefully for the case where we get increasingly more
> > interactive activity.
> 
> Let me rephrase what you just said to verify that I understand you
> correctly:  we should align the enforcement intervals across CPUs and
> thereby eagerly inject over the same time period across multiple CPUs.

we should at least try hard to do this. it's ok to not always get it, 
but when we can we should.

>  If there is no interactive workload then we would get maximal CPU and
> memory savings.  If there are interactive tasks, then we just let them
> run like we would in the eager injection in original design.

the interactive tasks would indeed run "normal"; but since they are
interactive that is supposedly "only sporadic".

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-14 23:11 RFC: A proposal for power capping through forced idle in the Linux Kernel Salman Qazi
                   ` (3 preceding siblings ...)
  2009-12-15  0:19 ` Arjan van de Ven
@ 2009-12-18 17:04 ` Pavel Machek
  2009-12-22 21:10   ` Salman Qazi
  2009-12-22 21:10   ` Salman Qazi
  2009-12-18 17:04 ` Pavel Machek
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Pavel Machek @ 2009-12-18 17:04 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

Hi


> Why not use voltage and frequency scaling?
> 
> Forced Idle Injection is more effective[1] and more widely available.
> Even with voltage and frequency scaling, interpolation is needed
> between the available settings.  So, if we did use voltage and

It is only more efficient on new hardware.

You should also explain 'why not throttling' because that is actually
designed for power capping. 

> Application to Laptops and Cellphones:
> 
> Imagine being in a tent in Death Valley with a laptop.  You are bored,
> and you want to watch a movie.  However, you also want to do your best
> to make the battery last and watch as much of the movie as possible.
> Forced idle power capping is a solution.  If your machine has a knob
> that allows you to control the available power, you can turn that knob
> until your video starts getting choppy.  And then, turn the knob back

That's a bad example. A video player should already sleep between frames.

A better example would be 'make the video so choppy that the expected
battery time rises over the length of the movie'.

(And yes, this would have been useful for me on a notebook with a failed
fan and ineffective throttling).


								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-14 23:11 RFC: A proposal for power capping through forced idle in the Linux Kernel Salman Qazi
                   ` (6 preceding siblings ...)
  2009-12-21  8:57 ` Pavel Machek
@ 2009-12-21  8:57 ` Pavel Machek
  2009-12-22 21:15   ` Salman Qazi
  2009-12-22 21:15   ` Salman Qazi
  7 siblings, 2 replies; 44+ messages in thread
From: Pavel Machek @ 2009-12-21  8:57 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

Hi!

> Google is implementing power capping, a technology that improves the
> power efficiency of data centers. There are also some interesting
> applications of this technology for laptops and cell phones.  Google
> aims to send most of its Linux technology upstream. So, how can we get
> this feature into the mainline kernel?
...
> Aside from this, every cgroup has a new quantity added to the CPU
> component called "Power Capping Priority".  This quantity indicates
> the order in which the scheduler attributes the time spent injecting
> idle cycles to specific processes.  This allows us to discriminate
> among processes when it comes to accounting for the injected idle
> time.  There is also an indication of interactivity versus batch for
> the cgroup provided in the CPU component of the cgroup.
> 
> Basic Algorithm:
> 
> Rather than blindly blasting the machine with the minimum required
> idle cycles, our implementation keeps track of naturally occurring
> idle cycles as follows:

(Rather complex algorithm snipped)

Well.. having all this complexity just for forcing idle... And it
still will not work, right? Linux kernel is not real time, so you
can't guarantee anything.

OTOH realtime people already have tools you could make good use of:
your power capping approach looks like 'high priority idle task that
needs to run for 2 seconds every 5 seconds' or something...
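
To make the comparison concrete, here is a minimal sketch of the "idle for 2
seconds out of every 5" duty cycle described above. It is only an
illustration: the function names and the millisecond units are assumptions,
not part of any existing kernel or RT interface.

```python
# Hypothetical model of a periodic forced-idle task: idle for idle_ms at
# the start of every period_ms. Names and units are illustrative only.

def in_forced_idle(t_ms: int, period_ms: int = 5000, idle_ms: int = 2000) -> bool:
    """Return True if time t_ms falls inside the forced-idle part of its period."""
    if not 0 <= idle_ms <= period_ms:
        raise ValueError("idle_ms must be between 0 and period_ms")
    return (t_ms % period_ms) < idle_ms


def idle_fraction(period_ms: int = 5000, idle_ms: int = 2000) -> float:
    """Fraction of each period that is spent in forced idle."""
    return idle_ms / period_ms
```

With the numbers from the message, the CPU would be forced idle for 40% of
the time, regardless of how much natural idle time it already has.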

Talk to rt people?
								Pavel 
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-14 23:51   ` tytso
  2009-12-15  0:42     ` Salman Qazi
  2009-12-15  0:42     ` Salman Qazi
@ 2009-12-22 19:48     ` Peter Zijlstra
  2009-12-22 19:48     ` Peter Zijlstra
  3 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2009-12-22 19:48 UTC (permalink / raw)
  To: tytso
  Cc: Andi Kleen, Salman Qazi, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath, lenb, Ingo Molnar,
	Gautham R Shenoy, Balbir Singh

On Mon, 2009-12-14 at 18:51 -0500, tytso@mit.edu wrote:
> On Tue, Dec 15, 2009 at 12:21:07AM +0100, Andi Kleen wrote:
> > Salman Qazi <sqazi@google.com> writes:
> > >
> > > We'd like to get as much of our stuff upstream as we can.  Given that
> > > this is a somewhat sizable chunk of work, it would be impolite of me
> > > to just send out a bunch of patches without hearing the concerns of
> > > the community.  What are your thoughts on our design and what do we
> > > need to change to get this to be more acceptable to the community?  I
> > > also would like to know if there are any existing pieces of
> > > infrastructure that this can utilize.
> > 
> > There were a lot of discussions on this a few months ago in context
> > of the ACPI 4 "power aggregator" which is a similar (perhaps
> > slightly less sophisticated) concept. 
> > 
> > While there was a lot of talk about teaching the scheduler about this 
> > the end result was just a driver which just starts real time threads
> > and then idles in them. This is in current mainline.
> > 
> > It might be a good idea to review these discussions in the archives.
> 
> It should be noted that most of the heat from those discussions was
> over adding the ACPI 4 mechanism to accept requests from the hardware
> platform to add idle cycles in the case of thermal/power emergencies,
> before we had the scheduler improvements to be able to do so in the
> most efficient way possible.  See the description of commit 8e0af5141:
> 
>    ACPI 4.0 created the logical "processor aggregator device" as a
>    mechanism for platforms to ask the OS to force otherwise busy
>    processors to enter (power saving) idle.
> 
>    The intent is to lower power consumption to ride-out transient
>    electrical and thermal emergencies, rather than powering off the
>    server....
> 
>    Vaidyanathan Srinivasan has proposed scheduler enhancements to
>    allow injecting idle time into the system. This driver doesn't
>    depend on those enhancements, but could cut over to them when they
>    are available.
> 
>    Peter Z. does not favor upstreaming this driver until those
>    scheduler enhancements are in place. However, we favor upstreaming
>    this driver now because it is useful now, and can be enhanced over
>    time.
> 
> It looks to me that the scheme that Salman has proposed for adding idle
> cycles is quite sophisticated, probably more than Vaidyanathan's, and
> the main difference is that Google wants the ability to be able to
> control the system's power/thermal envelope from userspace, as opposed
> to letting the hardware request in an emergency situation.  This makes
> sense, if you are trying to balance the power/thermal requirements
> across a large number of systems, as opposed to responding to a local
> power/thermal emergency signalled from the platform's firmware.
> 
> So it would seem to me that Salman's suggestions are very similar to
> what Peter requested before this commit went in (over his objections).

Right, so the power scheduling guys from IBM were working on something
sensible in this regard, which with a feedback control interface should
provide adequate controls to manage power consumption in a rack.

So their solution is to pack tasks into smaller sched domains, allowing
up to an overload parameter. This works nicely together with things like
cpusets, which can partition the load-balancing system.

[ If you configure your system into 1-cpu load-balance domains then
  this will of course fail, but then that's exactly what you asked for ]

Also, since it affects SCHED_OTHER tasks only, it does not affect
determinism of RT tasks.

So what this needs is a cluster controller increasing/decreasing the
overload numbers as the power consumption gets near/far from the limit.
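
A minimal sketch of one step of such a controller, assuming a single integer
overload setting per machine and a measured power figure in watts. The
function name, the margin, the step size of 1 and the clamping range are all
hypothetical; the thread does not specify any of them.

```python
# Hypothetical feedback step: raise the overload number (pack tasks harder)
# when power is near the limit, lower it when power is far below the limit.

def adjust_overload(overload: int, power_w: float, limit_w: float,
                    margin_w: float = 50.0, lo: int = 0, hi: int = 100) -> int:
    """Return the new overload setting after one control step."""
    if power_w > limit_w - margin_w:
        overload += 1          # near (or over) the limit: pack harder
    elif power_w < limit_w - 2 * margin_w:
        overload -= 1          # far from the limit: relax packing
    # Otherwise stay put; the dead band avoids oscillating around the limit.
    return max(lo, min(hi, overload))
```

The dead band between the two thresholds is a design choice of this sketch,
added so that small fluctuations in measured power do not flip the setting
back and forth on every step.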

The problem with the ACPI 4.0 spec is that it only signals a single 'do
something or we'll kill you hard soon'. Which is kinda useless.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-15  0:19 ` Arjan van de Ven
  2009-12-15  0:36   ` Salman Qazi
  2009-12-15  0:36   ` Salman Qazi
@ 2009-12-22 19:48   ` Peter Zijlstra
  2009-12-22 19:57     ` Arjan van de Ven
  2009-12-22 19:57     ` Arjan van de Ven
  2009-12-22 19:48   ` Peter Zijlstra
  3 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2009-12-22 19:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Salman Qazi, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

On Mon, 2009-12-14 at 16:19 -0800, Arjan van de Ven wrote:

> I like the general idea, I have one request (that I didn't see quite in
> your explanation): Please make sure that all cpus in the system do
> their idle injection at the same time, so that memory can go into power
> saving mode as well during this time etc etc...

And then you're going to ask that it scales too, right? :-)

Gang-scheduling is inherently non scalable, be it for idle time or not.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux  Kernel
  2009-12-22 19:48   ` Peter Zijlstra
  2009-12-22 19:57     ` Arjan van de Ven
@ 2009-12-22 19:57     ` Arjan van de Ven
  1 sibling, 0 replies; 44+ messages in thread
From: Arjan van de Ven @ 2009-12-22 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Salman Qazi, linux-kernel, linux-pm, Andrew Morton,
	Michael Rubin, Taliver Heath

On Tue, 22 Dec 2009 20:48:24 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, 2009-12-14 at 16:19 -0800, Arjan van de Ven wrote:
> 
> > I like the general idea, I have one request (that I didn't see
> > quite in your explanation): Please make sure that all cpus in the
> > system do their idle injection at the same time, so that memory can
> > go into power saving mode as well during this time etc etc...
> 
> And then you're going to ask that it scales too, right? :-)
> 
> Gang-scheduling is inherently non scalable, be it for idle time or
> not.

well... there's many ways to do this... one option is to agree, ahead
of time, which jiffies values you're going to do the idle thing on.
Say every 100 jiffies where jiffies % 100 is 0....

then the scalability thing isn't a big deal.. and you still do it all
at the same time. Or at least "enough" at the same time for it to not
matter
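
A rough illustration of the idea, written as a small simulation rather than
kernel code. Each CPU decides on its own from the shared jiffies value, so no
cross-CPU coordination is needed. The period of 100 comes from the example
above; the idle window length of 10 jiffies and all names are assumptions.

```python
# Each CPU applies the same rule to the same global tick counter, so the
# idle windows line up without any communication between CPUs.

PERIOD = 100    # jiffies between idle windows (from the example above)
IDLE_LEN = 10   # hypothetical length of each idle window, in jiffies


def should_idle(jiffies: int, period: int = PERIOD, idle_len: int = IDLE_LEN) -> bool:
    """True while jiffies is inside the window starting at jiffies % period == 0."""
    return jiffies % period < idle_len


def idle_ticks_per_cpu(n_cpus: int, start: int, end: int) -> list:
    """Simulate n_cpus CPUs, each independently evaluating should_idle()."""
    return [[j for j in range(start, end) if should_idle(j)]
            for _ in range(n_cpus)]
```

Every simulated CPU ends up with the same list of idle ticks, which is the
property needed for memory to enter a power saving mode during the window.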



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-18 17:04 ` Pavel Machek
  2009-12-22 21:10   ` Salman Qazi
@ 2009-12-22 21:10   ` Salman Qazi
  2009-12-23  9:49     ` Pavel Machek
  2009-12-23  9:49     ` Pavel Machek
  1 sibling, 2 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-22 21:10 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Fri, Dec 18, 2009 at 9:04 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi
>
>
>> Why not use voltage and frequency scaling?
>>
>> Forced Idle Injection is more effective[1] and more widely available.
>> Even with voltage and frequency scaling, interpolation is needed
>> between the available settings.  So, if we did use voltage and
>
> It is only more efficient on new hardware.
>
> You should also explain 'why not throttling' because that is actually
> designed for power capping.

Do you mean t-states?

>
>> Application to Laptops and Cellphones:
>>
>> Imagine being in a tent in Death Valley with a laptop.  You are bored,
>> and you want to watch a movie.  However, you also want to do your best
>> to make the battery last and watch as much of the movie as possible.
>> Forced idle power capping is a solution.  If your machine has a knob
>> that allows you to control the available power, you can turn that knob
>> until your video starts getting choppy.  And then, turn the knob back
>
> That's bad example. Video player should already sleep between frames.

Yes, the video player should sleep.  However, there will be other
things running.  And certainly, it is possible to cap the power and
discriminate so that those things are prevented from running while the
video player is allowed to run with minimal latency impact.

>
> A better example would be 'make the video so choppy that the expected
> battery time rises over the length of the movie'.
>
> (And yes, this would have been useful for me on notebook with failed
> fan and ineffective throttling).
>
>
>                                                                Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the  Linux Kernel
  2009-12-21  8:57 ` Pavel Machek
  2009-12-22 21:15   ` Salman Qazi
@ 2009-12-22 21:15   ` Salman Qazi
  2009-12-23  9:52     ` Pavel Machek
  2009-12-23  9:52     ` Pavel Machek
  1 sibling, 2 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-22 21:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Mon, Dec 21, 2009 at 12:57 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> Google is implementing power capping, a technology that improves the
>> power efficiency of data centers. There are also some interesting
>> applications of this technology for laptops and cell phones.  Google
>> aims to send most of its Linux technology upstream. So, how can we get
>> this feature into the mainline kernel?
> ...
>> Aside from this, every cgroup has a new quantity added to the CPU
>> component called "Power Capping Priority".  This quantity indicates
>> the order in which the scheduler attributes the time spent injecting
>> idle cycles to specific processes.  This allows us to discriminate
>> among processes when it comes to accounting for the injected idle
>> time.  There is also an indication of interactivity versus batch for
>> the cgroup provided in the CPU component of the cgroup.
>>
>> Basic Algorithm:
>>
>> Rather than blindly blasting the machine with the minimum required
>> idle cycles, our implementation keeps track of naturally occurring
>> idle cycles as follows:
>
> (Rather complex algorithm snipped)
>
> Well.. having all this complexity just for forcing idle... And it
> still will not work, right? Linux kernel is not real time, so you
> can't guarantee anything.

The purpose of all the "complexity" is to avoid injecting idle cycles
when the machine is already sufficiently idle.  That is, to lower the
impact when the feature is not needed.  And you are right, there are
no hard guarantees.  A lot of the practical use will rest on empirical
data.
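
As a simplified illustration of that point (not the actual implementation,
whose details are not reproduced in this message), the injected idle time can
be thought of as a top-up over whatever idle time occurred naturally in an
interval. The names, the microsecond units and the per-interval accounting
are assumptions.

```python
# Hypothetical per-interval accounting: only inject the idle time that is
# still missing after counting naturally occurring idle time.

def idle_to_inject(interval_us: int, min_idle_percent: int,
                   natural_idle_us: int) -> int:
    """Return how many microseconds of idle must still be injected."""
    required_us = interval_us * min_idle_percent // 100
    return max(0, required_us - natural_idle_us)
```

When the machine is already idle enough, the function returns 0 and nothing
is injected, which is the low-impact case described above.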

>
> OTOH realtime people already have tools you could make good use of:
> your power capping approach looks like 'high priority idle task that
> needs to run for 2 seconds every 5 seconds' or something...
>
> Talk to rt people?

At the core of it, you are correct.  However, in our implementation it
also avoids running when the system is already idle and operates at
much finer granularities than seconds.

Which specific tools are you referring to?  Real-time Linux as a whole
is a trade off:  one gets predictable latency in exchange for some
performance.  Any specific contacts that I should direct my inquiries
to?

>                                                                Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-22 21:10   ` Salman Qazi
  2009-12-23  9:49     ` Pavel Machek
@ 2009-12-23  9:49     ` Pavel Machek
  1 sibling, 0 replies; 44+ messages in thread
From: Pavel Machek @ 2009-12-23  9:49 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin, Taliver Heath

On Tue 2009-12-22 13:10:36, Salman Qazi wrote:
> On Fri, Dec 18, 2009 at 9:04 AM, Pavel Machek <pavel@ucw.cz> wrote:
> > Hi
> >
> >
> >> Why not use voltage and frequency scaling?
> >>
> >> Forced Idle Injection is more effective[1] and more widely available.
> >> Even with voltage and frequency scaling, interpolation is needed
> >> between the available settings.  So, if we did use voltage and
> >
> > It is only more efficient on new hardware.
> >
> > You should also explain 'why not throttling' because that is actually
> > designed for power capping.
> 
> Do you mean t-states?

Yes.

> >> Application to Laptops and Cellphones:
> >>
> >> Imagine being in a tent in Death Valley with a laptop.  You are bored,
> >> and you want to watch a movie.  However, you also want to do your best
> >> to make the battery last and watch as much of the movie as possible.
> >> Forced idle power capping is a solution.  If your machine has a knob
> >> that allows you to control the available power, you can turn that knob
> >> until your video starts getting choppy.  And then, turn the knob back
> >
> > That's bad example. Video player should already sleep between frames.
> 
> Yes, the video player should sleep.  However, there will be other
> things running.  And certainly, it is possible to cap the power and
> discriminate so that those things are prevented from running while the
> video player is allowed to run with minimal latency impact.

I don't see how it would work without a lot of extra setup. Let's say
your window manager wants to do some work, and you starve it
indefinitely?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: RFC: A proposal for power capping through forced idle in the Linux Kernel
  2009-12-22 21:15   ` Salman Qazi
@ 2009-12-23  9:52     ` Pavel Machek
  2009-12-23  9:52     ` Pavel Machek
  1 sibling, 0 replies; 44+ messages in thread
From: Pavel Machek @ 2009-12-23  9:52 UTC (permalink / raw)
  To: Salman Qazi
  Cc: linux-kernel, linux-pm, Andrew Morton, Michael Rubin,
	Taliver Heath, Ingo Molnar, Peter Zijlstra

Hi!

> > OTOH realtime people already have tools you could make good use of:
> > your power capping approach looks like 'high priority idle task that
> > needs to run for 2 seconds every 5 seconds' or something...
> >
> > Talk to rt people?
> 
> At the core of it, you are correct.  However, in our implementation it
> also avoids running when the system is already idle and operates at
> much finer granularities than seconds.

Seconds were just examples; I suspect rt kernels need finer
granularities, too.

> Which specific tools are you referring to?  Real-time Linux as a whole
> is a trade off:  one gets predictable latency in exchange for some
> performance.  Any specific contacts that I should direct my inquiries
> to?

I guess Peter and Ingo (added to the Cc)....

Anyway, I guess that what you really want is to be able to change
priority of the idle threads, even making them realtime...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* RFC: A proposal for power capping through forced idle in the Linux Kernel
@ 2009-12-14 23:11 Salman Qazi
  0 siblings, 0 replies; 44+ messages in thread
From: Salman Qazi @ 2009-12-14 23:11 UTC (permalink / raw)
  To: linux-kernel, linux-pm; +Cc: Andrew Morton, Taliver Heath, Michael Rubin

Greetings,

Google is implementing power capping, a technology that improves the
power efficiency of data centers. There are also some interesting
applications of this technology for laptops and cell phones.  Google
aims to send most of its Linux technology upstream. So, how can we get
this feature into the mainline kernel?

Overview:

Data centers are typically statically and pessimistically populated
based on the limitations of the power infrastructure in them.  Peak
power consumption of machines is determined, and based on this, the
number of machines and their placement in the hierarchy is limited to
not exceed the available power in the worst case.  Google is looking
at moving away from this static allocation of power to machines, to a
more dynamic model.  A key component of this model is power capping
done in software.

The idea is to place more machines in the data center than there is
power available to support (when all machines are operating at peak)
and then run the machines with a power cap.  The aim of the
project is to utilize more of the available power in the data center
than possible with static provisioning.  As the amount of work
available changes through the day, the power caps on various machines
are changed as well, while staying within the infrastructure
constraints.  Power can be moved from the more idle parts of the data
center to the busier ones.

Since not all of our existing hardware is able to provide good power
measurements to the software running on it, we have decided to model
power in terms of CPU usage [0].
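
As a rough illustration of what "power in terms of CPU usage" can look
like, here is a minimal sketch that assumes a simple linear model
between idle and peak power.  The linear form, the function names and
the wattage figures are assumptions made for this example only; they
are not taken from our implementation.

```python
def estimated_power_watts(cpu_utilization, idle_watts, peak_watts):
    """Assumed linear model: idle_watts at 0% CPU, peak_watts at 100% CPU."""
    if not 0.0 <= cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be between 0 and 1")
    return idle_watts + (peak_watts - idle_watts) * cpu_utilization


def max_utilization_for_cap(cap_watts, idle_watts, peak_watts):
    """Invert the model: highest CPU utilization that stays within cap_watts."""
    if peak_watts <= idle_watts:
        raise ValueError("peak_watts must exceed idle_watts")
    fraction = (cap_watts - idle_watts) / (peak_watts - idle_watts)
    # Clamp: a cap below idle power cannot be met by idling alone,
    # and a cap above peak power needs no idling at all.
    return min(1.0, max(0.0, fraction))


# Hypothetical machine: 100 W idle, 300 W at full load, capped at 250 W.
busy = max_utilization_for_cap(250.0, 100.0, 300.0)
forced_idle_percent = 100.0 * (1.0 - busy)
print(f"max busy fraction {busy:.2f}, forced idle {forced_idle_percent:.0f}%")
```

Under these assumed numbers a 250 W cap translates into a forced idle
percentage of 25%, which is the kind of quantity the interface below
takes as input.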

Current Interface used by Google:

The component of the kernel that we have built to implement software
power capping is called the "Idle Cycle Injector".

It has the following inputs, provided through procfs:

Forced Idle Percentage: This is the minimum percentage of time the CPU
is promised to be idle over the enforcement interval.

Enforcement interval: This is the length of time over which the power
cap is promised.

Aside from this, every cgroup has a new quantity added to the CPU
component called "Power Capping Priority".  This quantity indicates
the order in which the scheduler attributes the time spent injecting
idle cycles to specific processes.  This allows us to discriminate
among processes when it comes to accounting for the injected idle
time.  There is also an indication of interactivity versus batch for
the cgroup provided in the CPU component of the cgroup.
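
To make the inputs concrete, the sketch below models them as plain
data.  All class, field and cgroup names are hypothetical, the
microsecond unit is an assumption, and so is the rule that a lower
priority value is charged first; the real procfs and cgroup files are
not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectorSettings:
    """The two machine-wide inputs (field names are made up)."""
    forced_idle_percent: float  # minimum idle share promised per interval
    interval_us: int            # enforcement interval; microseconds assumed

    def idle_quota_us(self):
        """Idle time that must be reached within one interval."""
        return self.interval_us * self.forced_idle_percent / 100.0

    def busy_budget_us(self):
        """Busy time that may be used within one interval."""
        return self.interval_us - self.idle_quota_us()


@dataclass(frozen=True)
class CgroupPowerSettings:
    """Per-cgroup additions to the CPU component (field names are made up)."""
    name: str
    power_capping_priority: int
    interactive: bool  # True for interactive, False for batch


def charge_order(groups):
    """Order in which injected idle time is attributed to cgroups.

    Assumption: a lower priority value is charged first.
    """
    return sorted(groups, key=lambda g: g.power_capping_priority)


settings = InjectorSettings(forced_idle_percent=20.0, interval_us=100_000)
groups = [
    CgroupPowerSettings("video", power_capping_priority=5, interactive=True),
    CgroupPowerSettings("indexer", power_capping_priority=1, interactive=False),
]
print(settings.idle_quota_us(), settings.busy_budget_us())
print([g.name for g in charge_order(groups)])
```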

Basic Algorithm:

Rather than blindly blasting the machine with the minimum required
idle cycles, our implementation keeps track of naturally occurring
idle cycles as follows:

0.  Set a timer (hrtimer API is used) for the earlier of: the end of
the enforcement interval (clock time constraint) and the expected time
at which we would run out of allowed busy cycles if the CPU were
entirely busy from now on (CPU time constraint).
1.  When this timer expires, determine which constraint has been reached.
    a) If it is the clock time constraint, then we must start with a
       new interval and go back to step 0.
    b) If it is the CPU time constraint, then the rest of the
       enforcement interval must be spent idling.  Continue to step 2.
2.  Set up a timer for the end of the enforcement interval and start
calling the idle function in a loop.   In our current implementation
we wake up a real time kernel thread to do this.  Once finished,
account any injected idle time in the vruntime of processes taken in
the order of power capping priority.  Finally, go back to step 0 and
start a new interval.
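
The steps above can be sketched as a small userspace simulation of a
single enforcement interval.  This is not the kernel code: time is
divided into ticks, the function name is invented, the vruntime
accounting of step 2 is left out, and re-arming the timer when natural
idle has pushed back the CPU time constraint is an assumption about
how step 1 behaves in that case.

```python
def run_interval(wants_cpu, forced_idle_percent):
    """Simulate one enforcement interval, one list entry per tick.

    wants_cpu[t] is True if the workload would run at tick t when uncapped.
    Returns (busy_ticks, natural_idle_ticks, injected_idle_ticks).
    """
    interval = len(wants_cpu)
    budget = interval * (100 - forced_idle_percent) // 100  # allowed busy ticks
    busy = natural_idle = 0
    now = 0
    while True:
        # Step 0: arm the timer for the earlier of the interval end and the
        # moment the busy budget would run out if the CPU stayed fully busy.
        timer = min(interval, now + (budget - busy))
        # Until the timer fires the workload runs untouched.
        for tick in range(now, timer):
            if wants_cpu[tick]:
                busy += 1
            else:
                natural_idle += 1
        now = timer
        # Step 1a: clock time constraint reached, the interval is over.
        if now == interval:
            return busy, natural_idle, 0
        # Step 1b and step 2: budget exhausted, idle for the rest.
        if busy == budget:
            return busy, natural_idle, interval - now
        # Otherwise natural idle left budget unused: re-arm (assumption).


# Fully busy workload, 40% forced idle: the last 4 of 10 ticks are injected.
print(run_interval([True] * 10, 40))
# Workload that already idles half the time: nothing needs to be injected.
print(run_interval([True, False] * 5, 40))
```

The second call shows the point of tracking naturally occurring idle
cycles: a workload that already idles enough is never interrupted.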

Eager Injection:

An interactive task may be prevented from running sufficiently early
by the presence of a batch task and end up wanting to run in the capped
portion of the interval.  But, since it cannot run in the capped
portion, it sees a severe latency hiccup.  To counter this, we
discriminate between the two classes through the concept of eager
injection.  The idea is that while we are below our desired minimum
idle quota, we do not let batch tasks run, but instead idle the CPU.
However, during this time, we let interactive tasks run (should they
happen to be runnable).  Once we are past the minimum idle quota,
everyone is free to run.  If the interactive tasks are abusive and
exhaust the CPU time, then idle cycles have to be injected to avoid
exceeding the quota.
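
A tick-based sketch of this policy, under the same simplifications as
before (userspace simulation, invented names, one interactive and one
batch class), might look like the following.  It only illustrates the
ordering rule; it makes no claim about how the scheduler hooks are
actually implemented.

```python
def eager_interval(interactive_wants, batch_wants, forced_idle_percent):
    """Return what the CPU does at each tick of one enforcement interval.

    Each entry of the result is "interactive", "batch" or "idle".
    """
    interval = len(interactive_wants)
    idle_quota = interval * forced_idle_percent // 100
    budget = interval - idle_quota  # busy ticks allowed in this interval
    idle = busy = 0
    timeline = []
    for tick in range(interval):
        if busy >= budget:
            # CPU time exhausted (e.g. by abusive interactive tasks):
            # idle must be injected to avoid exceeding the quota.
            choice = "idle"
        elif interactive_wants[tick]:
            # Interactive tasks may run even before the idle quota is met.
            choice = "interactive"
        elif batch_wants[tick] and idle >= idle_quota:
            # Batch tasks run only once the minimum idle quota is met.
            choice = "batch"
        else:
            choice = "idle"
        if choice == "idle":
            idle += 1
        else:
            busy += 1
        timeline.append(choice)
    return timeline


# Batch work is always pending; the interactive task wakes at ticks 2 and 7.
interactive = [t in (2, 7) for t in range(10)]
batch = [True] * 10
print(eager_interval(interactive, batch, 40))
```

In this example the four idle ticks are taken at the start of the
interval, so the interactive task still finds the CPU available when
it wakes up at tick 7 instead of landing in a capped tail.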

Known Limitations of Current Implementation:

0.  The major limitation of injecting in the thread context is that we
cannot prevent soft IRQ handlers from running and using up power.

1.  At sufficiently high forced idle percentages, the Idle Cycle Injector
starts working against itself.  In such cases, it is better to use
other means to make the CPU idle.

2.  Needs some work for SMT support.


Why not use voltage and frequency scaling?

Forced Idle Injection is more effective[1] and more widely available.
Even with voltage and frequency scaling, interpolation is needed
between the available settings.  So, if we did use voltage and
frequency scaling, we would still have to use a timer to take
measurements every so often and adjust the settings.  It would, however,
save us from having to take over the CPU and actively inject idle cycles.
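
To show what interpolating between available settings means, the
sketch below time-shares between the two nearest settings so that the
average hits a target in between.  The frequency table is made up, and
interpolating on frequency rather than on measured power is a
simplifying assumption for the example.

```python
import bisect


def interpolate_setting(available, target):
    """Pick the two neighbouring settings around target.

    available must be sorted in ascending order (for example, MHz).
    Returns (low, high, share_high), where share_high is the fraction of
    time to spend at the higher setting so that the average equals target.
    """
    if not available[0] <= target <= available[-1]:
        raise ValueError("target is outside the available range")
    i = bisect.bisect_left(available, target)
    if available[i] == target:
        # The target is itself an available setting; no interpolation needed.
        return target, target, 1.0
    low, high = available[i - 1], available[i]
    return low, high, (target - low) / (high - low)


# Hypothetical frequency table in MHz, aiming for an average of 1800 MHz.
print(interpolate_setting([1000, 1600, 2000, 2400], 1800))
```

Hitting the 1800 MHz average here means spending half of each period
at 1600 MHz and half at 2000 MHz, which is why a periodic timer is
still needed even with voltage and frequency scaling.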

Application to Laptops and Cellphones:

Imagine being in a tent in Death Valley with a laptop.  You are bored,
and you want to watch a movie.  However, you also want to do your best
to make the battery last and watch as much of the movie as possible.
Forced idle power capping is a solution.  If your machine has a knob
that allows you to control the available power, you can turn that knob
until your video starts getting choppy.  And then, turn the knob back
a little bit.  Now, you have your video playing just as you like it,
with the minimal amount of power available to the machine.  With eager
injection and the power capping priority, your machine should spend
power on work that you care about, rather than background processes.

What does this have to do with mainline Linux?

We'd like to get as much of our stuff upstream as we can.  Given that
this is a somewhat sizable chunk of work, it would be impolite of me
to just send out a bunch of patches without hearing the concerns of
the community.  What are your thoughts on our design and what do we
need to change to get this to be more acceptable to the community?  I
also would like to know if there are any existing pieces of
infrastructure that this can utilize.

Relevant papers:

[0]. http://research.google.com/pubs/pub32980.html
[1]. http://www.cs.cmu.edu/~anshulg/weed2009.pdf
[2]. http://www.springerlink.com/index/D6287205272LK822.pdf

Regards,

Salman Qazi.


Thread overview: 44+ messages
2009-12-14 23:11 RFC: A proposal for power capping through forced idle in the Linux Kernel Salman Qazi
2009-12-14 23:21 ` Andi Kleen
2009-12-14 23:51   ` tytso
2009-12-15  0:42     ` Salman Qazi
2009-12-15  0:42     ` Salman Qazi
2009-12-22 19:48     ` Peter Zijlstra
2009-12-22 19:48     ` Peter Zijlstra
2009-12-14 23:51   ` tytso
2009-12-14 23:21 ` Andi Kleen
2009-12-15  0:19 ` Arjan van de Ven
2009-12-15  0:36   ` Salman Qazi
2009-12-15  1:06     ` Arjan van de Ven
2009-12-15 20:15       ` Salman Qazi
2009-12-17 11:01         ` Arjan van de Ven
2009-12-17 11:01         ` Arjan van de Ven
2009-12-15 20:15       ` Salman Qazi
2009-12-15  1:06     ` Arjan van de Ven
2009-12-15 10:29     ` Vaidyanathan Srinivasan
2009-12-15 10:29     ` Vaidyanathan Srinivasan
2009-12-15 11:50       ` Vaidyanathan Srinivasan
2009-12-15 11:50         ` Vaidyanathan Srinivasan
2009-12-15 21:00         ` Salman Qazi
2009-12-15 21:00           ` Salman Qazi
2009-12-15 20:50       ` Salman Qazi
2009-12-15 20:50       ` Salman Qazi
2009-12-15  0:36   ` Salman Qazi
2009-12-22 19:48   ` Peter Zijlstra
2009-12-22 19:57     ` Arjan van de Ven
2009-12-22 19:57     ` Arjan van de Ven
2009-12-22 19:48   ` Peter Zijlstra
2009-12-15  0:19 ` Arjan van de Ven
2009-12-18 17:04 ` Pavel Machek
2009-12-22 21:10   ` Salman Qazi
2009-12-22 21:10   ` Salman Qazi
2009-12-23  9:49     ` Pavel Machek
2009-12-23  9:49     ` Pavel Machek
2009-12-18 17:04 ` Pavel Machek
2009-12-21  8:57 ` Pavel Machek
2009-12-21  8:57 ` Pavel Machek
2009-12-22 21:15   ` Salman Qazi
2009-12-22 21:15   ` Salman Qazi
2009-12-23  9:52     ` Pavel Machek
2009-12-23  9:52     ` Pavel Machek
2009-12-14 23:11 Salman Qazi
