* [0/11] Energy-aware scheduling use-cases and scheduler issues
@ 2013-12-20 16:45 Morten Rasmussen
  2013-12-20 16:45 ` [1/11] issue 1: Missing power topology information in scheduler Morten Rasmussen
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Hi,

One of the requests from the scheduler maintainers at the Energy-aware
Scheduling workshop at Kernel Summit this year was to provide plain text
descriptions of use-cases (workloads) and system topologies. To get that
moving I have written some short texts about some use-cases. In
addition, I have described a list of issues that today prevent mainly
the scheduler from achieving a good energy/performance balance in
common use-cases.
The follow-up emails are structured as follows:

1-6:	Current issues related to energy/performance balance.
7-10:	Use-cases (overall behaviour and energy/performance goals)
11:	DVFS example (for reference)

I'm hoping that this provides some of the background for why I'm
interested in improving energy-awareness in the scheduler. I'm aware
that the use-cases and issues/wishlist don't cover everyone's area of
interest. Input is needed to fix that.

Comments and input are appreciated.

Morten


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [1/11] issue 1: Missing power topology information in scheduler
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-22 15:19   ` mark gross
  2013-12-20 16:45 ` [2/11] issue 2: Energy-awareness for heterogeneous systems Morten Rasmussen
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

The current mainline scheduler has no power topology information
available to enable it to make energy-aware decisions. The energy cost
of running a cpu at different frequencies and the energy cost of waking
up another cpu are needed.

One example where this could be useful is audio on Android. With the
current mainline scheduler, audio playback utilizes three cpus when
active. Due to the small size of the tasks it is still possible to meet
the performance criteria when execution is serialized on a single cpu.
Depending on the power topology, leaving two cpus idle and running one
longer may lead to energy savings if the cpus can be power-gated
individually.

The audio performance requirements can be satisfied by most cpus at the
lowest frequency. Video is a more interesting use-case due to its higher
performance requirements. Running all tasks on a single cpu is likely to
require a higher frequency than if the tasks are spread out across
more cpus.

Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
and 4 cpus online has led to the following power measurements
(normalized):

video 720p (Android)
cpus	power
1	1.59
2	1.00
4	1.10

Restricting the number of cpus to one forces the frequency up to cope
with the load, but the overall cpu load is only ~60% (busy %-age). Using
two cpus keeps the frequency in the more power efficient range and gives
a ~37% power reduction. With four cpus the power consumption is worse,
likely due to the ~100% increase in wake/idle transitions.

For this use-case it appears that the optimal busy %-age is ~30% (use
two cpus). However, that is likely to vary depending on the use-case.

Proposed solution: Represent the energy costs of each P-state and C-state
in the topology to enable the scheduler to estimate the energy cost of
the scheduling decisions. Coupled with P-state awareness that would
allow the scheduler to avoid expensive high P-states.




* [2/11] issue 2: Energy-awareness for heterogeneous systems
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
  2013-12-20 16:45 ` [1/11] issue 1: Missing power topology information in scheduler Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [3/11] issue 3: No understanding of potential cpu capacity Morten Rasmussen
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

While performance is non-deterministic with the mainline scheduler
(described in issue 6), energy consumption is non-deterministic too.
The first step is to get performance right, but if we don't keep energy
in mind, heterogeneous systems will end up with high performance at
high energy consumption.

To save energy, low-intensity workloads should not be scheduled on
fast cpus as these are generally less energy efficient. Audio playback
is an example where the performance offered by the slow cpus in today's
heterogeneous systems, like ARM big.LITTLE, is more than sufficient.

The mainline scheduler may schedule it on any cpu, leading to
non-deterministic energy consumption. For Android mp3 audio playback on
ARM TC2 (2xA15+3xA7), running on just the A15s costs 3.63x the energy
of running on just the A7s.

If we run multiple workloads at the same time, e.g. audio and
webbrowsing, both performance and energy are non-deterministic. Because
of audio we may even get poor webbrowsing performance and high energy
consumption at the same time.

Running that scenario on Android on ARM TC2 gives the following
execution times and energy measurements for 10 runs (normalized to avg):

Run     Exec    Energy
1       1.03    1.04
2       1.12    1.11    Worst energy
3       0.85    1.08
4       0.85    1.08    Best performance
5       0.94    1.06
6       1.01    0.78
7       0.90    0.63    Best performance/energy and best energy
8       1.22    1.08    Worst performance/energy and worst performance
9       0.94    1.08
10      1.14    1.07

Run 7 had a very good schedule: it led to both the lowest energy and
good performance at the same time. That is not generally the case. Run 2
is an example of a poor schedule where performance is 12% worse than
average and energy is 11% higher. The best performance (runs 3 and 4)
comes at the cost of high energy.

While run 7 seems to be ideal from an energy-awareness point of view,
it may be disqualified by performance constraints. Hence, ideally the
performance level should be tunable.

Possible solution: We know that a simple heuristic that controls task
placement based on tracked load works rather well for most smartphone
workloads. However, realistic patterns exist that defeat this heuristic.




* [3/11] issue 3: No understanding of potential cpu capacity
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
  2013-12-20 16:45 ` [1/11] issue 1: Missing power topology information in scheduler Morten Rasmussen
  2013-12-20 16:45 ` [2/11] issue 2: Energy-awareness for heterogeneous systems Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [4/11] issue 4: Tracking idle states Morten Rasmussen
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

To minimize energy it may sometimes be better to put waking tasks on
partially loaded cpus instead of powering up more cpus (particularly if
it implies powering up a new cluster/group of cpus with associated
caches). To make that call, information about the potential spare cycles
on the busy cpus is required.

Currently, the CFS scheduler has no knowledge about frequency scaling.
Frequency scaling governors generally try to match the frequency to
the load, which means that the idle time has no absolute meaning. The
potential spare cpu capacity may be much higher than indicated by the
idle time if the cpu is running at a low P-state.

The energy trade-off may justify putting another task on a loaded cpu
even if it causes a change to a higher P-state to handle the extra load.
Related issues are frequency (and cpu micro architecture) invariant task
load and power topology information, which are both needed to enable the
scheduler for energy-aware task placement. This is covered in more
detail in issue 5.

The potential cpu capacity cannot be assumed to be constant as thermal
management may restrict the usage of high performance P-states
dynamically.




* [4/11] issue 4: Tracking idle states
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (2 preceding siblings ...)
  2013-12-20 16:45 ` [3/11] issue 3: No understanding of potential cpu capacity Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [5/11] issue 5: Frequency and uarch invariant task load Morten Rasmussen
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Similar to the issue of knowing the potential capacity of a cpu, the CFS
scheduler also needs to know the idle state of idle cpus. Currently, an
idle cpu is found using cpumask_first() when an extra cpu is needed (for
nohz_idle_balance in find_new_ilb() in sched/fair.c). The energy
trade-off between waking another cpu and putting tasks on already busy
cpus depends on this information.

The cost of waking up a cpu in terms of latency and energy depends on
the idle state the cpu is in. Deeper idle states typically affect more
than a single cpu. Waking up a single cpu from such a state is more
expensive as it also affects the idle states of its related cpus.

Energy costs are not currently represented in the cpuidle framework, but
latency is. Take ARM TC2 as an example [1]: it has two idle states,
per-core clock-gating (WFI) and cluster power-down (power down all
related cpus and caches). The target residencies and exit latencies
specified in the driver give an idea about the cost involved in
entering/exiting these states.

			Target		Exit
			residency	latency
Clock-gating (WFI)	1		1
Cluster power-down	2000/2500	500/700		(big/LITTLE)

Picking the cheapest idle cpu would also have the effect that wake-ups
are likely to happen on the same cpu, leaving the remaining cpus idle
for longer.

Potential solution: Make the scheduler idle state aware by either moving
idle handling into the scheduler or let the idle framework (cpuidle)
maintain a cpumask of the cheapest cpus to wake up which is accessible
to the scheduler.

[1] drivers/cpuidle/cpuidle-big_little.c



* [5/11] issue 5: Frequency and uarch invariant task load
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (3 preceding siblings ...)
  2013-12-20 16:45 ` [4/11] issue 4: Tracking idle states Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [6/11] issue 6: Poor and non-deterministic performance on heterogeneous systems Morten Rasmussen
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Related to the issue of potential cpu capacity, task load is influenced
directly by the current P-state of the cpu it is running on. For
energy-aware task placement decisions the scheduler would need to
estimate the energy impact of scheduling a specific task on a specific
cpu. Depending on the resulting P-state, it may be more energy
efficient to wake up another cpu (see system 1 in mail 11 for an energy
efficiency example).

The frequency and uarch impact can be rather significant. On modern
systems frequency scaling covers a range of 5-6x. On top of that uarch
differences may give another 1.5-3x for a total cpu capacity range
covering >10x.

Measurements on ARM TC2 for a simple periodic test workload (single
task, 16 ms period):

        cpu load        load_avg_contrib (10 sample avg.)
Freq    A7      A15     A7      A15
500     16.76%   9.94%  ~201    ~135
700     12.06%   6.95%  ~145     ~87
1000     8.19%   5.23%  ~103     ~65

The cpu load estimate used for load balancing is based on
load_avg_contrib which means that for this example the load estimate may
vary 3x depending on where tasks are scheduled and the frequency scaling
governors used.

Potential solution: Frequency invariance has been proposed before [1]
where the task load is scaled by the cur/max freq ratio. Another
possibility is to use hardware counters if such are available on the
platform.

[1] https://lkml.org/lkml/2013/4/16/289



* [6/11] issue 6: Poor and non-deterministic performance on heterogeneous systems
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (4 preceding siblings ...)
  2013-12-20 16:45 ` [5/11] issue 5: Frequency and uarch invariant task load Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [7/11] use-case 1: Webbrowsing on Android Morten Rasmussen
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

The current mainline scheduler doesn't give optimum performance on
heterogeneous systems for workloads with few tasks (#tasks <= #cpus).
Using cpu_power (in its current form) to inform the scheduler about the
relative compute capacity of the cpus is not sufficient.

1. cpu_power is not used on wake-up which means that new tasks may end
up anywhere. Periodic load-balance generally bails out if there is only
one task running on a cpu, so the task isn't moved later. Hence, the
execution time of the task may be anywhere between the execution it
would have had running exclusively on the fastest cpu and running
exclusively on the slowest cpu.

Running a single cpu intensive task on an otherwise idle system while
measuring its execution time will show this problem. On ARM TC2
(big.LITTLE) we get the following numbers:

cpu_power       1024    606/1441
		default	slow/fast
execution time:
(100 runs)
Max             4.33    4.33
Min             2.09    2.91
Distribution:
Runs within
5% of Min       14      11
5% of Max       86      89

Only a few runs randomly ended up on a fast cpu irrespective of the
cpu_power settings. The distribution can easily change depending on
other tasks, reordering the cpus, or changing the topology.

The problem can also be observed for smartphone workloads like
webbrowsing where page rendering times vary significantly as the threads
are randomly scheduled on fast and slow cpus.

2. Using cpu_power to represent the relative performance of the cpus
leads to undesirable task balance in common scenarios. group_power =
sum(cpu_power) for a group of cpus and is used in the periodic
load-balance, idle balance, and nohz idle balance to determine the
number of tasks that should be in each group. However, depending on the
number of cpus in the groups, that causes one group to be overloaded
while another has idle cpus if the number of tasks is equal to the
number of cpus (or slightly larger).

Running a simple parallel workload (OpenMP) will reveal this as it uses
one worker thread per cpu by default. On ARM TC2 we get the following
behaviour:

cpu_power       1024    606/1441 (slow/fast)
execution time:
(20 runs)
avg             8.63    9.87            14.34% (slowdown)
stdev           0.01    0.01

The kernelshark trace reveals that the 606/1441 configuration puts three
tasks on the two fast cpus and two tasks on the three slow cpus, leaving
one of them idle. The 1024 case has one task per cpu.

Overall, cpu_power in its current form does not solve any of the
performance issues on heterogeneous systems. It even makes them worse
for some common workload scenarios.




* [7/11] use-case 1: Webbrowsing on Android
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (5 preceding siblings ...)
  2013-12-20 16:45 ` [6/11] issue 6: Poor and non-deterministic performance on heterogeneous systems Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [8/11] use-case 2: Audio playback " Morten Rasmussen
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Common webbrowsing use-cases (no embedded videos, but dynamic content
is ok) typically exhibit three distinct modes of operation depending on
what the browser is doing in relation to the user:

1)
Mode: Page load and rendering.

Behaviour: The duration depends highly on the website but is
relatively short, typically a few seconds. Page loading time impacts
user experience directly. Minor performance drops may be acceptable if
they come with good overall energy savings.

Performance criteria: Complete as fast as possible.
 

2)
Mode: Display website (user reading, no user interaction)

Behaviour: Low load. Only minor updates of dynamic content.

Performance criteria: Minimize energy.


3)
Mode: Page scrolling.

Behaviour: Relatively short in duration. Rendering of content that was
previously off screen.

Performance criteria: Ensure smooth UI interaction. Without UI
experience feedback (lag, etc.), optimizing for best performance might
be the only way to get the necessary performance.


Task behaviour

The task descriptions are based on traces from a modern ARM platform
with a fairly recent version of Android. They may differ on other
platforms and software stacks. This serves just as an example.

There are three main tasks involved on the browser side, and one or more
tasks related to the graphics driver. Each of them behaves differently
in each of the three modes of operation.

Render task: Mainly active in mode 1, but also active in mode 3.
Accounts for about a third of the total cpu time. May run for a more or
less continuous burst of 1-2 s in mode 1.

Texture task: Active in all modes, but mainly in modes 1 and 3. Active
after the render task burst in mode 1. Somewhat periodic behaviour
during modes 1 and 3 indicating dependencies on other tasks. Accounts
for about a sixth of the cpu time.

Browser task: Active in all modes. Blocks often. Only running about half
the time when it is active. Short occasional periods of activity in mode
2 along with the texture task. Accounts for about a third of the cpu
time.

Graphics driver task: Mainly active in modes 1 and 3, with very little
activity in mode 2 and only when the browser and texture tasks are
active. Runs for a short amount of time but frequently when active.
Accounts for about a sixth of the cpu time.



* [8/11] use-case 2: Audio playback on Android
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (6 preceding siblings ...)
  2013-12-20 16:45 ` [7/11] use-case 1: Webbrowsing on Android Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2014-01-07 12:15   ` Peter Zijlstra
  2013-12-20 16:45 ` [9/11] use-case 3: Video " Morten Rasmussen
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Audio playback is a low load periodic application that has little/no
variation in period and load over time. It consists of tasks involved in
decoding the audio stream and communicating with audio frameworks and
drivers.

Performance Criteria

All tasks must have completed before the next request to fill the audio
buffer. Most modern hardware should be able to deal with the load even
at the lowest P-state.

Task behaviour

The task load pattern period is dictated by the audio interrupt. On a
modern ARM-based example system this occurs every ~6 ms. The decoding
work is triggered every fourth interrupt, i.e. a ~24 ms period. No tasks
are scheduled at the intermediate interrupts. The tasks involved are:

Main audio framework task (AudioOut): The first task to be scheduled
after the interrupt; it continues running until decoding has completed,
i.e. ~5 ms. Runs at nice=-19.

Audio framework task 2 (AudioTrack): Woken up by the main task ~250-300
us after the main audio task is scheduled. Runs for ~300 us at nice=-16.

Decoder task (mp3.decoder): Woken up by audio framework task 2 when
that finishes (serialized). Runs for ~1 ms until it wakes a third
Android task on which it blocks, and continues for another ~150 us
afterwards (serialized). Runs at nice=-2.

Android task 3 (OMXCallbackDisp): Woken by the decoder task. Runs for
~300 us at nice=-2.




* [9/11] use-case 3: Video playback on Android
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (7 preceding siblings ...)
  2013-12-20 16:45 ` [8/11] use-case 2: Audio playback " Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [10/11] use-case 4: Game " Morten Rasmussen
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Depending on the platform hardware, video is a low to medium load
periodic application. There may be some variation in the load depending
on the video codec, content, and resolution. The load pattern is roughly
synchronized to the video frame-rate (typically 30 FPS). Video playback
also includes audio playback as part of the workload.

Performance Criteria

Video decoding must be done in time to avoid dropped frames. Similarly,
audio must be decoded in time to never let the audio buffer run empty.

Task behaviour

Based on video playback (720p and 1080p) on a modern ARM SoC, the cpu
load is generally modest. Video resolution has only a minor impact on
the overall cpu load. The load pattern is repeating every ~33 ms.

Rendering task: The main Android graphics rendering task accounts for
about 18% of the total cpu load. It is active for ~6-8 ms each period,
during which it blocks a couple of times. It appears to run in sync
with a handful of other tasks, most of which are related to timed
queues in Android and graphics.

Timed events queue tasks: An Android TimedEventQueue task for each cpu
is active during video playback. In total, they account for around 30%
of the cpu load. They all follow the ~33 ms period and run for
250-700 us (average) when scheduled.

Audio decoding task: Accounts for ~10% of the cpu load. The load
pattern repeats every ~23 ms. Since this differs from the period of
video rendering, audio decoding may run in parallel with video
rendering from time to time. It runs for about 2 ms each period when
scheduled.

A lot of smaller tasks are involved in the periodic load pattern.



* [10/11] use-case 4: Game on Android
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (8 preceding siblings ...)
  2013-12-20 16:45 ` [9/11] use-case 3: Video " Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-20 16:45 ` [11/11] system 1: Saving energy using DVFS Morten Rasmussen
  2013-12-22 16:28 ` [0/11] Energy-aware scheduling use-cases and scheduler issues mark gross
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Games generally have a periodic load pattern synchronized to the
frame-rate (30 or 60 Hz). Game workloads typically involve both
graphics rendering (game engine) and audio mixing.

Performance Criteria

Keep the frame-rate as close to the target as possible. Variations are
acceptable. Audio must be handled before the audio buffer runs empty.

Task behaviour

This description is based on one particular Android game, but similar
patterns have been observed for a number of games. Overall, 10+ threads
are active and context switches happen very often. Key game engine tasks
and graphics driver tasks are scheduled ~200-700 times per second. The
top 10 tasks (by cpu time) consist of: one game task, one main game
engine task, three graphics related tasks, three audio tasks, one event
handling task, and one kworker task.

Game engine task: By far the most cpu intensive task. Accounts for about
50% of all cpu load. It is scheduled ~375 times per second (average).
The scheduling pattern repeats every ~16 ms (~60 Hz), where the task
runs for ~12 ms, followed by three shorter periods of activity where the
longest is ~2 ms (unless it is preempted by other tasks). In addition,
the game engine has a worker thread for each cpu. Each worker thread
accounts for ~0.4% of the load, is scheduled ~115 times per second
(average), and only runs for ~56 us (average).

Rendering task: Accounts for ~6% of the load. Scheduled ~200 times per
second (average) and runs for ~420 us (average).

Graphics driver task: Accounts for ~6% of the load. Scheduled ~700 times
per second (average) and runs for ~11 us (average).

Game main task: Accounts for ~4% of the load. Scheduled ~170 times per
second (average) and runs for ~37 us (average).

Audio system task: Accounts for ~3% of the load. Scheduled ~120 times
per second (average) and runs for ~42 us (average).

kworker task: Accounts for ~3% of the load. Scheduled ~320 times per
second (average) and runs for ~13 us (average).



* [11/11] system 1: Saving energy using DVFS
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (9 preceding siblings ...)
  2013-12-20 16:45 ` [10/11] use-case 4: Game " Morten Rasmussen
@ 2013-12-20 16:45 ` Morten Rasmussen
  2013-12-22 16:28 ` [0/11] Energy-aware scheduling use-cases and scheduler issues mark gross
  11 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-20 16:45 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm

Most modern systems use DVFS to save power by slowing down computation
throughput when less performance is necessary. The power/performance
relation is platform specific. Some platforms may have better energy
savings (energy per instruction) than others at low frequencies.

To have something to relate to, here is an anonymized example based on
a modern ARM platform:

Performance	Energy/instruction
1.0		1.0
1.3		1.6
1.7		1.8
2.0		1.9
2.3		2.1
2.7		2.4
3.0		2.7

Performance is frequency (~instruction issue rate) and
energy/instruction is the energy cost of executing one instruction (or
a fixed number of instructions) at that level of performance
(frequency). For this example, it costs 2.7x as much energy per
instruction to increase the performance from 1.0 to 3.0 (3x). That is,
the amount of work (instructions) that can be done on one battery
charge is reduced by a factor of 2.7 (to ~37%) if you run as fast as
possible (3.0) compared to running at the slowest frequency (1.0).

A lot of things haven't been accounted for in this simplified example.
There are a number of factors that influence energy efficiency,
including whether the cpu is the only one awake in its frequency/power
domain. The numbers shown above won't be accurate for all workloads.
They are meant as ballpark figures.

To save energy, the higher frequencies should be avoided and only used
when the application performance requirements cannot be satisfied
otherwise (e.g. spread tasks across more cpus if possible).

When considering the total system power, it may save energy in some
scenarios to run the cpu faster so that other power-hungry parts of the
system can be shut down sooner. However, this is highly platform and
application dependent.



* Re: [1/11] issue 1: Missing power topology information in scheduler
  2013-12-20 16:45 ` [1/11] issue 1: Missing power topology information in scheduler Morten Rasmussen
@ 2013-12-22 15:19   ` mark gross
  2013-12-30 14:00     ` Morten Rasmussen
  0 siblings, 1 reply; 26+ messages in thread
From: mark gross @ 2013-12-22 15:19 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, rjw, markgross, vincent.guittot, catalin.marinas,
	linux-pm

On Fri, Dec 20, 2013 at 04:45:41PM +0000, Morten Rasmussen wrote:
> The current mainline scheduler has no power topology information
> available to enable it to make energy-aware decisions. The energy cost
> of running a cpu at different frequencies and the energy cost of waking
> up another cpu are needed.
> 
> One example where this could be useful is audio on Android. With the
> current mainline scheduler it would utilize three cpus when active. Due
> to the size of the tasks it is still possible to meet the performance
> criteria when execution is serialized on a single cpu. Depending on the
> power topology leaving two cpus idle and running one longer may lead to
> energy savings if the cpus can be power-gated individually.
> 
> The audio performance requirements can be satisfied by most cpus at the
> lowest frequency. Video is a more interesting use-case due to its higher
> performance requirements. Running all tasks on a single cpu is likely to
> require a higher frequency than if the tasks are spread out across
> more cpus.
> 
> Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
> and 4 cpus online has led to the following power measurements
> (normalized):
> 
> video 720p (Android)
> cpus	power
> 1	1.59
> 2	1.00
> 4	1.10

I wonder what 3 CPUs shows?  Also, is this "display-on" power measured from
the battery?  The variance seems too big for a display-on measurement.

Here we seem to have a workload consisting of about 2 threads, where if we
use more than 2 CPUs we pay a penalty for task migration.  There is no tie to
cpu L2 or power rail topology in this example.  From this data alone the
scheduler simply needs to avoid using more CPUs until the workload truly has
more threads.

What data do you have on the actual video workload in terms of threads?  My
guess is we are looking at audio decode and video decode processing.  Is this
video playback measurement including any rendering power, or is it all CPU?

I guess what I'm calling out is that it is not clear what the right thing is
for the scheduler to do, as there is no physical model coupling power to SoC
topology and workload characteristics.

I'll see if the folks at work who are hands-on with similar KPI measuring can
share similar data.  (They read this list too.)  It may be easier for them to
share if we can agree on a normalization of the power data.  Say 100 "lumps"
(of coal) measured from the battery or psu output rails as the power burned on
a workload if run by booting with MAXCPUS=1 kernel command line?  (Or should it
be measured from the SoC power rails?)  That way we don't need to worry as much
about exposing competitively sensitive data in physical units.  FWIW I would go
with display-off measurements (in "airplane mode"?) from the battery or
equivalent.

BTW remember my comment about power measuring being a path to hell?  Agreeing
on what workloads to measure, how to normalize, what to measure and from where
on a device, and how to report it (statistical data across multiple runs) is a
pain.  Details on screen on vs off and reproducibility of data by a third party
quickly come into play.

--mark
 
> Restricting the number of cpus to one forces the frequency up to cope
> with the load, but the overall cpu load is only ~60% (busy %-age). Using
> two cpus keeps the frequency in the more power efficient range and gives
> a ~37% power reduction. With four cpus the power consumption is worse,
> likely due to the increase in wake/idle transitions (~100%).
> 
> For this use-case it appears that the optimal busy %-age is ~30% (use
> two cpus). However, that is likely to vary depending on the use-case.
> 
> Proposed solution: Represent energy costs for each P-state and C-state
> in the topology to enable the scheduler to estimate the energy cost of
> the scheduling decisions. Coupled with P-state awareness that would
> allow the scheduler to avoid expensive high P-states.
> 
> 


* Re: [0/11] Energy-aware scheduling use-cases and scheduler issues
  2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
                   ` (10 preceding siblings ...)
  2013-12-20 16:45 ` [11/11] system 1: Saving energy using DVFS Morten Rasmussen
@ 2013-12-22 16:28 ` mark gross
  2013-12-30 12:10   ` Morten Rasmussen
  11 siblings, 1 reply; 26+ messages in thread
From: mark gross @ 2013-12-22 16:28 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, rjw, markgross, vincent.guittot, catalin.marinas,
	linux-pm

On Fri, Dec 20, 2013 at 04:45:40PM +0000, Morten Rasmussen wrote:
> Hi,
> 
> One of the requests from the scheduler maintainers at the Energy-aware
> Scheduling workshop at Kernel Summit this year was to provide plain text
> descriptions of use-cases (workloads) and system topologies. To get that
> moving I have written some short texts about some use-cases. In addition
> I described a list of issues that today prevent mainly the scheduler
> from achieving a good energy/performance balance in common use-cases.
> The follow-up emails are structured as follows:
> 
> 1-6:	Current issues related to energy/performance balance.
We have seen some of these issues as well.  Still from my point of view (which
may not be the most well informed) most of my issues are related to bad choices
on task migrations.

> 7-10:	Use-cases (overall behaviour and energy/performance goals)
I really like your breakdown of the use cases.  I like the Android focus as
well.  However, can we get some similar workload breakdowns for representative
data center workloads from other folks?

> 11:	DVFS example (for reference)
> 
> I'm hoping that this provides some of the background for why I'm
> interested in improving energy-awareness in the scheduler. I'm aware
> that the use-cases and issues/wishlist don't cover everyone's area of
> interest. Input is needed to fix that.
> 
> Comments and input are appreciated.
What is missing is more data or modeling tying the SoC characteristics to
scheduling choices.  You have some (energy per instruction at different
P-states) but there are a lot more topological differences that are important
for proper scheduler choices.  Specifically, shared L2s between some cores and
not others, or shared power rails, or whether the cores are hyper-threaded, or
whether there are multiple sockets.

--mark 


* Re: [0/11] Energy-aware scheduling use-cases and scheduler issues
  2013-12-22 16:28 ` [0/11] Energy-aware scheduling use-cases and scheduler issues mark gross
@ 2013-12-30 12:10   ` Morten Rasmussen
  2014-01-12 16:47     ` mark gross
  0 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-30 12:10 UTC (permalink / raw)
  To: mark gross; +Cc: peterz, mingo, rjw, vincent.guittot, Catalin Marinas, linux-pm

On Sun, Dec 22, 2013 at 04:28:22PM +0000, mark gross wrote:
> On Fri, Dec 20, 2013 at 04:45:40PM +0000, Morten Rasmussen wrote:
> > Hi,
> > 
> > One of the requests from the scheduler maintainers at the Energy-aware
> > Scheduling workshop at Kernel Summit this year was to provide plain text
> > descriptions of use-cases (workloads) and system topologies. To get that
> > moving I have written some short texts about some use-cases. In addition
> > I described a list of issues that today prevent mainly the scheduler
> > from achieving a good energy/performance balance in common use-cases.
> > The follow-up emails are structured as follows:
> > 
> > 1-6:	Current issues related to energy/performance balance.
> We have seen some of these issues as well.  Still from my point of view (which
> may not be the most well informed) most of my issues are related to bad choices
> on task migrations.

Thanks for sharing your view. In my opinion, all of these issues relate
to task migration choices in one way or another. Lack of knowledge about
the power topology, frequency scaling, and different types of cores
(e.g. big.LITTLE) leads to bad migration choices.

> 
> > 7-10:	Use-cases (overall behaviour and energy/performance goals)
> I really like your breakdown of the use cases.  I like the Android focus as
> well.  However, can we get some similar workload breakdowns for representative
> data center workloads from other folks?

I don't have much insight into data center workloads, so I was hoping
for input from other folks.

> 
> > 11:	DVFS example (for reference)
> > 
> > I'm hoping that this provides some of the background for why I'm
> > interested in improving energy-awareness in the scheduler. I'm aware
> > that the use-cases and issues/wishlist don't cover everyone's area of
> > interest. Input is needed to fix that.
> > 
> > Comments and input are appreciated.
> What is missing is more data or modeling tying the SoC characteristics to
> scheduling choices.  You have some (energy per instruction at different
> P-states) but there are a lot more topological differences that are important
> for proper scheduler choices.  Specifically, shared L2s between some cores and
> not others, or shared power rails, or whether the cores are hyper-threaded, or
> whether there are multiple sockets.

Agree. This is the missing power topology information in the scheduler.
Power domain information (power rail sharing), including the cost of
waking up the first cpu and additional cpus in the domain, is required.
I guess multi-socket can be modelled that way too?

Most aspects of power management are implementation-dependent on ARM, but
a typical big.LITTLE system looks like this:

      little  big
cpu   0   1   2   3
L1   |-| |-| |-| |-|
L2   |-----| |-----|

Two clusters (cpu groups), one little and one big, each with a shared L2
cache. cpus have (depending on implementation) per-cpu C-states, and
deeper C-states apply to the entire cluster including the L2. P-states
often apply to the entire cluster (cpus 0-1 and 2-3 in this example).
Clusters may have 1-4 cpus each and don't have to be the same size for
all clusters (e.g. ARM TC2 is 2+3).

Is that fundamentally different from the systems that you are working
on?

I was hoping that we could come up with a fairly simplistic energy model
that could guide the scheduling decisions based on data provided by the
vendor. I would start with something very simple and see how far we can
get and which data is necessary.
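For illustration only, the kind of vendor-provided data I have in mind could
look something like the sketch below for the little cluster in the example
above. All names and normalized numbers are made up; this is not an existing
kernel interface, just a strawman of what "very simple" might mean:

```c
/*
 * Hypothetical sketch of vendor-provided energy data for the example
 * big.LITTLE system above.  All names and normalized numbers are made
 * up for illustration; this is not an existing kernel interface.
 */

struct pstate_cost {
	unsigned int freq_khz;		/* cluster P-state */
	unsigned int busy_power;	/* normalized power per busy cpu */
};

struct power_domain {
	const struct pstate_cost *pstates;
	int nr_pstates;
	unsigned int first_cpu_wake;	/* energy: power up cluster + first cpu */
	unsigned int extra_cpu_wake;	/* energy: wake another cpu in live cluster */
};

/* Example little cluster (cpus 0-1) with two P-states. */
static const struct pstate_cost little_pstates[] = {
	{  600000, 10 },
	{ 1000000, 20 },
};

static const struct power_domain little = {
	little_pstates, 2, 800, 100,
};

/*
 * Normalized energy estimate for running busy_us microseconds at P-state
 * idx, including the cost of one wake-up.
 */
static unsigned long run_energy(const struct power_domain *pd, int idx,
				unsigned long busy_us, int cluster_was_idle)
{
	unsigned long wake = cluster_was_idle ? pd->first_cpu_wake
					      : pd->extra_cpu_wake;

	return wake + (unsigned long)pd->pstates[idx].busy_power * busy_us;
}
```

With numbers like these the scheduler could, for example, compare running one
cpu longer at a high P-state against waking a second cpu at a low one.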

Morten


* Re: [1/11] issue 1: Missing power topology information in scheduler
  2013-12-22 15:19   ` mark gross
@ 2013-12-30 14:00     ` Morten Rasmussen
  2014-01-13 20:23       ` Rafael J. Wysocki
  0 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2013-12-30 14:00 UTC (permalink / raw)
  To: mark gross; +Cc: peterz, mingo, rjw, vincent.guittot, Catalin Marinas, linux-pm

On Sun, Dec 22, 2013 at 03:19:05PM +0000, mark gross wrote:
> On Fri, Dec 20, 2013 at 04:45:41PM +0000, Morten Rasmussen wrote:
> > The current mainline scheduler has no power topology information
> > available to enable it to make energy-aware decisions. The energy cost
> > of running a cpu at different frequencies and the energy cost of waking
> > up another cpu are needed.
> > 
> > One example where this could be useful is audio on Android. With the
> > current mainline scheduler it would utilize three cpus when active. Due
> > to the size of the tasks it is still possible to meet the performance
> > criteria when execution is serialized on a single cpu. Depending on the
> > power topology leaving two cpus idle and running one longer may lead to
> > energy savings if the cpus can be power-gated individually.
> > 
> > The audio performance requirements can be satisfied by most cpus at the
> > lowest frequency. Video is a more interesting use-case due to its higher
> > performance requirements. Running all tasks on a single cpu is likely to
> > require a higher frequency than if the tasks are spread out across
> > more cpus.
> > 
> > Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
> > and 4 cpus online has led to the following power measurements
> > (normalized):
> > 
> > video 720p (Android)
> > cpus	power
> > 1	1.59
> > 2	1.00
> > 4	1.10
> 
> I wonder what 3 CPUs show?  Also, is this "display-on" power measured from
> the battery?  The variance seems too big for a display-on measurement.

These are cpus-only measurements excluding gpu, dram, and other
peripherals. So yes, the relative total power saving is much smaller.

I don't have numbers for 3 cpus, but I will see if I can get them. Based
on the traces for 2 and 4 cpus my guess is that 3 cpus would be very
close to 2 cpus if not slightly better. The available parallelism seems
limited. The fourth cpu is hardly used and the third is only used for
short periods. 

> 
> Here we seem to have a workload consisting of about 2 threads where, if we
> use more than 2 CPUs, we pay a penalty for task migration.  There is no tie to
> cpu L2 or power rail topology in this example.  From this data alone the
> scheduler simply needs to avoid using more CPUs until the workload truly has
> more threads.

We have more threads in this workload (use-case 3), but rarely more than
two of them running (or runnable) simultaneously. I agree, that the
scheduler needs to avoid using more cpus than necessary.

The scheduler is particularly bad at this for thread patterns like the
ones observed for audio and video playback. Both have chains of threads
that wake up the next thread and then go to sleep. Since the current
thread continues to run for a moment after the next one has been woken
up, the wakee tends to be placed on a new cpu rather than waiting a few
tens of microseconds for the current cpu to be vacant.

            t0                     t0
cpu 0	===========            ===========
	         |             ^
                 v   t1        |
cpu 1	         ===========   |
                          |    |
                          v t2 |
cpu 2                     =======


> 
> What data do you have on the actual video workload in terms of threads?  My
> guess is we are looking at audio decode and video decode processing.  Does this
> video playback measurement include any rendering power or is it all CPU?

As said above, these are cpus-only power numbers, so any gpu rendering is
excluded. The cpu workload includes both partial video decoding and
audio decoding. I'm not sure how much is offloaded to the gpu. use-case
3 has a short description of the main threads.

What sort of data are you looking for?

> 
> I guess what I'm calling out is that it is not clear what the right thing is
> for the scheduler to do, as there is no physical model coupling power to SoC
> topology and workload characteristics.
> 
> I'll see if the folks at work who are hands-on with similar KPI measuring can
> share similar data.  (They read this list too.)  It may be easier for them to
> share if we can agree on a normalization of the power data.  Say 100 "lumps"
> (of coal) measured from the battery or psu output rails as the power burned on
> a workload if run by booting with MAXCPUS=1 kernel command line?  (Or should it
> be measured from the SoC power rails?)  That way we don't need to worry as much
> about exposing competitively sensitive data in physical units.  FWIW I would go
> with display-off measurements (in "airplane mode"?) from the battery or
> equivalent.

Good question. System power (battery) seems right if you have your SoC
on a board which is fairly similar to the end product. That is not always
the case, so I tend to look at SoC power instead. How large is the
difference if the display is off and in airplane mode? Would it work if
we just state the measurement method when posting numbers?

Normalizing to single cpu (MAXCPUS=1 or equivalent) would work I think.

> 
> BTW remember my comment about power measuring being a path to hell?  Agreeing
> on what workloads to measure, how to normalize, what to measure and from where
> on a device, and how to report it (statistical data across multiple runs) is a
> pain.  Details on screen on vs off and reproducibility of data by a third party
> quickly come into play.

Yes :) I don't think we necessarily have to have a fully specified test
suite. As long as each interested party makes sure to test whatever
patches might eventually come out of this, we have proof of whether
it works or not. Third-party reproducibility is more difficult.

Morten


* Re: [8/11] use-case 2: Audio playback on Android
  2013-12-20 16:45 ` [8/11] use-case 2: Audio playback " Morten Rasmussen
@ 2014-01-07 12:15   ` Peter Zijlstra
  2014-01-07 12:16     ` Peter Zijlstra
  2014-01-07 15:55     ` Morten Rasmussen
  0 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-07 12:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, rjw, markgross, vincent.guittot, catalin.marinas, linux-pm

On Fri, Dec 20, 2013 at 04:45:48PM +0000, Morten Rasmussen wrote:
> Audio playback is a low load periodic application that has little/no
> variation in period and load over time. It consists of tasks involved in
> decoding the audio stream and communicating with audio frameworks and
> drivers.
> 
> Performance Criteria
> 
> All tasks must have completed before the next request to fill the audio
> buffer. Most modern hardware should be able to deal with the load even
> at the lowest P-state.
> 
> Task behaviour
> 
> The task load pattern period is dictated by the audio interrupt. On an
> example modern ARM based system this occurs every ~6 ms. The decoding
> work is triggered every fourth interrupt, i.e. a ~24 ms period. No tasks
> are scheduled at the intermediate interrupts. The tasks involved are:
> 
> Main audio framework task (AudioOut): The first task to be scheduled
> after the interrupt and continues running until decoding has completed.
> That is ~5 ms. Runs at nice=-19.
> 
> Audio framework task 2 (AudioTrack): Woken up by the main task ~250-300
> us after the main audio task is scheduled. Runs for ~300 us at nice=-16.
> 
> Decoder task (mp3.decoder): Woken up by the audio task 2 when that
> finishes (serialized). Runs for ~1 ms until it wakes a third Android
> task on which it blocks and continues for another ~150 us afterwards
> (serialized). Runs at nice=-2.
> 
> Android task 3 (OMXCallbackDisp): Woken by decoder task. Runs for ~300
> us at nice=-2.

I probably shouldn't ask; but..

Why would the AudioOut task keep running while waiting for the
mp3.decoder thing to provide content? That doesn't make sense.

One would expect something simple like:

DMA buffer reaches low mark, sends interrupt, interrupt wakes task, task fills
up buffer, goto 1.

Instead we get this merry dance of far too many tasks.

And even if you want to add mixing multiple streams in software (and do
not optimize the single active stream) and you want to do this with
multiple tasks instead of chaining calls in the same task, you still
get a normal blocking task chain with at most 1 runnable task.

Something seems fucked with Android if it needs more than 1 running task
to fill an audio buffer.

/me continues reading..


* Re: [8/11] use-case 2: Audio playback on Android
  2014-01-07 12:15   ` Peter Zijlstra
@ 2014-01-07 12:16     ` Peter Zijlstra
  2014-01-07 16:02       ` Morten Rasmussen
  2014-01-07 15:55     ` Morten Rasmussen
  1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-07 12:16 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, rjw, markgross, vincent.guittot, catalin.marinas, linux-pm



Oh, I just noticed, we're hiding this on the linux-pm list :-(

I suppose I should stop reading for I might be tricked into replying
more..


* Re: [8/11] use-case 2: Audio playback on Android
  2014-01-07 12:15   ` Peter Zijlstra
  2014-01-07 12:16     ` Peter Zijlstra
@ 2014-01-07 15:55     ` Morten Rasmussen
  1 sibling, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2014-01-07 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rjw, markgross, vincent.guittot, Catalin Marinas, linux-pm

On Tue, Jan 07, 2014 at 12:15:00PM +0000, Peter Zijlstra wrote:
> On Fri, Dec 20, 2013 at 04:45:48PM +0000, Morten Rasmussen wrote:
> > Audio playback is a low load periodic application that has little/no
> > variation in period and load over time. It consists of tasks involved in
> > decoding the audio stream and communicating with audio frameworks and
> > drivers.
> > 
> > Performance Criteria
> > 
> > All tasks must have completed before the next request to fill the audio
> > buffer. Most modern hardware should be able to deal with the load even
> > at the lowest P-state.
> > 
> > Task behaviour
> > 
> > The task load pattern period is dictated by the audio interrupt. On an
> > example modern ARM based system this occurs every ~6 ms. The decoding
> > work is triggered every fourth interrupt, i.e. a ~24 ms period. No tasks
> > are scheduled at the intermediate interrupts. The tasks involved are:
> > 
> > Main audio framework task (AudioOut): The first task to be scheduled
> > after the interrupt and continues running until decoding has completed.
> > That is ~5 ms. Runs at nice=-19.
> > 
> > Audio framework task 2 (AudioTrack): Woken up by the main task ~250-300
> > us after the main audio task is scheduled. Runs for ~300 us at nice=-16.
> > 
> > Decoder task (mp3.decoder): Woken up by the audio task 2 when that
> > finishes (serialized). Runs for ~1 ms until it wakes a third Android
> > task on which it blocks and continues for another ~150 us afterwards
> > (serialized). Runs at nice=-2.
> > 
> > Android task 3 (OMXCallbackDisp): Woken by decoder task. Runs for ~300
> > us at nice=-2.
> 
> I probably shouldn't ask; but..
> 
> Why would the AudioOut task keep running while waiting for the
> mp3.decoder thing to provide content? That doesn't make sense.
> 
> One would expect something simple like:
> 
> DMA buffer reaches low mark, sends interrupt, interrupt wakes task, task fills
> up buffer, goto 1.
> 
> Instead we get this merry dance of far too many tasks.
> 
> And even if you want to add mixing multiple streams in software (and do
> not optimize the single active stream) and you want to do this with
> multiple tasks instead of chaining calls in the same task, you still
> get a normal blocking task chain with at most 1 runnable task.
> 
> Something seems fucked with Android if it needs more than 1 running task
> to fill an audio buffer.

I'm not an Android internals expert so I won't try to defend why it is
like that. I did the descriptions by staring at ftrace output and trying
to figure out the task dependencies. These become fairly clear when
changing the number of cpus online so that the scheduling patterns change.

Not helping much, I know.


* Re: [8/11] use-case 2: Audio playback on Android
  2014-01-07 12:16     ` Peter Zijlstra
@ 2014-01-07 16:02       ` Morten Rasmussen
  0 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2014-01-07 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, rjw, markgross, vincent.guittot, Catalin Marinas, linux-pm

On Tue, Jan 07, 2014 at 12:16:00PM +0000, Peter Zijlstra wrote:
> 
> 
> Oh, I just noticed, we're hiding this on the linux-pm list :-(

It was requested at KS that we have these discussions on linux-pm. Some
of the Intel people don't follow lkml.

I should have added lkml too, sorry.

> 
> I suppose I should stop reading for I might be tricked into replying
> more..

I can repost the series with lkml on cc.


* Re: [0/11] Energy-aware scheduling use-cases and scheduler issues
  2013-12-30 12:10   ` Morten Rasmussen
@ 2014-01-12 16:47     ` mark gross
  2014-01-13 12:04       ` Catalin Marinas
  0 siblings, 1 reply; 26+ messages in thread
From: mark gross @ 2014-01-12 16:47 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mark gross, peterz, mingo, rjw, vincent.guittot, Catalin Marinas,
	linux-pm

On Mon, Dec 30, 2013 at 12:10:10PM +0000, Morten Rasmussen wrote:
> On Sun, Dec 22, 2013 at 04:28:22PM +0000, mark gross wrote:
> > On Fri, Dec 20, 2013 at 04:45:40PM +0000, Morten Rasmussen wrote:
> > > Hi,
> > > 
> > > One of the requests from the scheduler maintainers at the Energy-aware
> > > Scheduling workshop at Kernel Summit this year was to provide plain text
> > > descriptions of use-cases (workloads) and system topologies. To get that
> > > moving I have written some short texts about some use-cases. In addition
> > > I described a list of issues that today prevent mainly the scheduler
> > > from achieving a good energy/performance balance in common use-cases.
> > > The follow-up emails are structured as follows:
> > > 
> > > 1-6:	Current issues related to energy/performance balance.
> > We have seen some of these issues as well.  Still from my point of view (which
> > may not be the most well informed) most of my issues are related to bad choices
> > on task migrations.
> 
> Thanks for sharing your view. In my opinion, all of these issues relate
> to task migration choices in one way or another. Lack of knowledge about
> the power topology, frequency scaling, and different types of cores
> (e.g. big.LITTLE) leads to bad migration choices.
> 
> > 
> > > 7-10:	Use-cases (overall behaviour and energy/performance goals)
> > I really like your breakdown of the use cases.  I like the Android focus as
> > well.  However, can we get some similar workload breakdowns for representative
> > data center workloads from other folks?
> 
> I don't have much insight into data center workloads, so I was hoping
> for input from other folks.
> 
> > 
> > > 11:	DVFS example (for reference)
> > > 
> > > I'm hoping that this provides some of the background for why I'm
> > > interested in improving energy-awareness in the scheduler. I'm aware
> > > that the use-cases and issues/wishlist don't cover everyone's area of
> > > interest. Input is needed to fix that.
> > > 
> > > Comments and input are appreciated.
> > What is missing is more data or modeling tying the SoC characteristics to
> > scheduling choices.  You have some (energy per instruction at different
> > P-states) but there are a lot more topological differences that are important
> > for proper scheduler choices.  Specifically, shared L2s between some cores and
> > not others, or shared power rails, or whether the cores are hyper-threaded, or
> > whether there are multiple sockets.
> 
> Agree. This is the missing power topology information in the scheduler.
> Power domain information (power rail sharing), including the cost of
> waking up the first cpu and additional cpus in the domain, is required.
> I guess multi-socket can be modelled that way too?
> 
> Most aspects of power management are implementation-dependent on ARM, but
> a typical big.LITTLE system looks like this:
> 
>       little  big
> cpu   0   1   2   3
> L1   |-| |-| |-| |-|
> L2   |-----| |-----|
> 
> Two clusters (cpu groups), one little and one big, each with a shared L2
> cache. cpus have (depending on implementation) per-cpu C-states, and
> deeper C-states apply to the entire cluster including the L2. P-states
> often apply to the entire cluster (cpus 0-1 and 2-3 in this example).
> Clusters may have 1-4 cpus each and don't have to be the same size for
> all clusters (e.g. ARM TC2 is 2+3).
> 
> Is that fundamentally different from the systems that you are working
> on?
This is not so different from the SoCs I've interacted with.  I've seen a few
more variants (as do the ARM SoCs).

One of the newer Intel CPUs has a cache picture that looks like a cut and paste
of yours (minus the little/big):

 cpu   0   1   2   3
 L1   |-| |-| |-| |-|
 L2   |-----| |-----|

An older CPU looks just like this but adds SMT to each CPU core.

> I was hoping that we could come up with a fairly simplistic energy model
> that could guide the scheduling decisions based on data provided by the
> vendor. I would start with something very simple and see how far we can
> get and which data is necessary.
>
I keep flip-flopping in my mind over what is more important: energy modeling
or latency performance measuring.

I mean, one way to look at the world is that given a workload with minimal
latency and throughput expectations, we need to deliver those first and then
optimize power.

With poor load balancing we do not deliver on performance expectations,
typically in the area of latency.  Note, Linux does well on throughput IMO
because that is easier to measure with kstats and other sampling.

What sorts of missing things are needed to measure and understand when wrong
choices are being made?  What basic information do we need to capture to know
if we are doing a good job or not?

--mark
> Morten


* Re: [0/11] Energy-aware scheduling use-cases and scheduler issues
  2014-01-12 16:47     ` mark gross
@ 2014-01-13 12:04       ` Catalin Marinas
  0 siblings, 0 replies; 26+ messages in thread
From: Catalin Marinas @ 2014-01-13 12:04 UTC (permalink / raw)
  To: mark gross
  Cc: Morten Rasmussen, peterz, mingo, rjw, vincent.guittot, linux-pm

On Sun, Jan 12, 2014 at 04:47:59PM +0000, mark gross wrote:
> On Mon, Dec 30, 2013 at 12:10:10PM +0000, Morten Rasmussen wrote:
> > I was hoping that we could come up with a fairly simplistic energy model
> > that could guide the scheduling decisions based on data provided by the
> > vendor. I would start with something very simple and see how far we can
> > get and which data is necessary.
>
> I keep flip-flopping in my mind over what is more important: energy modeling
> or latency performance measuring.

Both ;)

> I mean, one way to look at the world is that given a workload with minimal
> latency and throughput expectations, we need to deliver those first and then
> optimize power.

I agree, that's why I don't think blindly packing tasks to the left would
work either for power or performance.

> With poor load balancing we do not deliver on performance expectations,
> typically in the area of latency.  Note, Linux does well on throughput IMO
> because that is easier to measure with kstats and other sampling.
> 
> What sorts of missing things are needed to measure and understand when wrong
> choices are being made?  What basic information do we need to capture to know
> if we are doing a good job or not?

I think whatever power awareness we add to the scheduler should aim to
optimise the power consumption (based on some simple model of measuring
the idle time/states and transitions on certain platforms and estimating
the energy) but with *minimal* effect on the latency and throughput.
Standard latency/performance benchmarks should always be run to ensure
there are no regressions. Morten's use-cases try to describe scenarios
where the scheduler can do better from a power perspective but without
(drastically) affecting other parameters.
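As a rough illustration of the kind of simple model I mean, energy over an
interval could be estimated from per-state residencies and transition counts.
The state names and all power/energy numbers below are invented, normalized
values, not real platform data:

```c
/*
 * Rough sketch of estimating energy from idle-state residencies and
 * transition counts.  State powers and transition energies are invented,
 * normalized numbers, not real platform data.
 */

struct cstate {
	unsigned int power;		/* normalized power while in this state */
	unsigned int transition_cost;	/* normalized energy per entry */
};

/* index 0 = running, deeper C-states follow */
static const struct cstate states[] = {
	{ 100,  0 },	/* running */
	{  10,  2 },	/* shallow idle */
	{   1, 40 },	/* cluster-off idle */
};

/*
 * residency_us[i]: time spent in state i over the interval;
 * entries[i]: number of times state i was entered.
 */
static unsigned long interval_energy(const unsigned long *residency_us,
				     const unsigned long *entries, int n)
{
	unsigned long e = 0;
	int i;

	for (i = 0; i < n; i++)
		e += states[i].power * residency_us[i]
		     + states[i].transition_cost * entries[i];

	return e;
}
```

A model like this makes the trade-off visible: a deep C-state only pays off
when the residency is long enough to amortize the transition cost.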

If you have a predictable workload, the scheduler can make the right
decision to optimise for power while keeping the latency under control.
The problem is when the workload changes, the latency would be affected
if tasks need to migrate or the CPU frequency needs to be increased (and
for the latter we currently rely on a cpufreq governor or driver to
detect the workload change and this introduces additional latencies).
Given these pretty independent cpufreq decisions, the best heuristics
for now wrt latency is probably to spread the workload among all the
CPUs and leave enough room for workload changes.

But even with latency under certain limits, you may have for example
small threads (like audio decoding) that could still fit on a CPU when
running at the minimal P-state, with the risk of a big sudden change in
the workload of such a thread. That's a trade-off between optimising for
performance and power. A power-aware scheduler does not aim to trade the
latency or throughput for power but rather how well it copes with
workload unpredictability, what margins are guaranteed.

IMHO, adding power awareness to the scheduler could be done in two
(main) ways:

1. Heuristics like packing small tasks with tunables like what "small"
   actually means, how many such "small" tasks and such parameters would
   be specific to each SoC.

2. Power model in the scheduler (I proposed a simplistic one at the end
   of last year) where the scheduler can associate an energy cost with
   its actions (e.g. migrating a task to a CPU) and it would try to
   optimise the overall system energy consumption while preserving the
   latency and throughput.
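A minimal sketch of how the second approach could let the scheduler compare
two candidate placements for a task (made-up, relative energy units; no claim
about the actual proposal):

```c
/*
 * Minimal sketch of approach 2: attach an energy cost to a scheduling
 * action and pick the cheaper candidate.  Units are relative and the
 * numbers hypothetical.
 */

struct candidate {
	unsigned int busy_power;	/* power at the P-state the task would need */
	unsigned int wake_cost;		/* energy to wake this cpu, 0 if running */
};

/* Energy of placing a task with the given busy time on this candidate. */
static unsigned long placement_energy(const struct candidate *c,
				      unsigned long busy_us)
{
	return c->wake_cost + (unsigned long)c->busy_power * busy_us;
}

/* Returns 1 if candidate b is strictly cheaper; ties keep a (e.g. prev_cpu). */
static int prefer_b(const struct candidate *a, const struct candidate *b,
		    unsigned long busy_us)
{
	return placement_energy(b, busy_us) < placement_energy(a, busy_us);
}
```

For a short task the wake-up cost dominates and the task stays put; for a long
enough task the cheaper P-state on the idle cpu wins.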

I consider the second approach to be better as you can extend it to other
things like power budgets. But it doesn't always go down well with hardware
people who don't want to expose real numbers (they don't even need to be
real W or J, just some relative numbers).

-- 
Catalin


* Re: [1/11] issue 1: Missing power topology information in scheduler
  2013-12-30 14:00     ` Morten Rasmussen
@ 2014-01-13 20:23       ` Rafael J. Wysocki
  2014-01-14 16:21         ` Morten Rasmussen
  0 siblings, 1 reply; 26+ messages in thread
From: Rafael J. Wysocki @ 2014-01-13 20:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mark gross, peterz, mingo, vincent.guittot, Catalin Marinas, linux-pm

On Monday, December 30, 2013 02:00:25 PM Morten Rasmussen wrote:
> On Sun, Dec 22, 2013 at 03:19:05PM +0000, mark gross wrote:
> > On Fri, Dec 20, 2013 at 04:45:41PM +0000, Morten Rasmussen wrote:
> > > The current mainline scheduler has no power topology information
> > > available to enable it to make energy-aware decisions. The energy cost
> > > of running a cpu at different frequencies and the energy cost of waking
> > > up another cpu are needed.
> > > 
> > > One example where this could be useful is audio on Android. With the
> > > current mainline scheduler it would utilize three cpus when active. Due
> > > to the size of the tasks it is still possible to meet the performance
> > > criteria when execution is serialized on a single cpu. Depending on the
> > > power topology leaving two cpus idle and running one longer may lead to
> > > energy savings if the cpus can be power-gated individually.
> > > 
> > > The audio performance requirements can be satisfied by most cpus at the
> > > lowest frequency. Video is a more interesting use-case due to its higher
> > > performance requirements. Running all tasks on a single cpu is likely to
> > > require a higher frequency than if the tasks are spread out across
> > > more cpus.
> > > 
> > > Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
> > > and 4 cpus online has led to the following power measurements
> > > (normalized):
> > > 
> > > video 720p (Android)
> > > cpus	power
> > > 1	1.59
> > > 2	1.00
> > > 4	1.10
> > 
> > I wonder what 3 CPUs shows?  Also, is this "display-on" power measured from
> > the battery?  The variance seems too big for a display-on measurement.
> 
> These are cpus-only measurements excluding gpu, dram, and other
> peripherals. So yes, the relative total power saving is much smaller.

And that number is what actually matters, because that's what the battery life
depends on.

> I don't have numbers for 3 cpus, but I will see if I can get them. Based
> on the traces for 2 and 4 cpus my guess is that 3 cpus would be very
> close to 2 cpus if not slightly better. The available parallelism seems
> limited. The fourth cpu is hardly used and the third is only used for
> short periods. 
> 
> > 
> > Here we seem to have a workload consisting of about 2 threads, where if we
> > use more than 2 CPUs we pay a penalty for task migration.  There is no tie to
> > cpu L2 or power rail topology in this example.  From this data alone the
> > scheduler simply needs to avoid using more CPUs until the workload truly has
> > more threads.
> 
> We have more threads in this workload (use-case 3), but rarely more than
> two of them running (or runnable) simultaneously. I agree, that the
> scheduler needs to avoid using more cpus than necessary.
> 
> The scheduler is particularly bad at this for thread patterns like the
> ones observed for audio and video playback. Both have chains of threads
> that wake up the next thread and then go to sleep. Since the current
> thread continues to run for a moment after the next one has been woken
> up, the wakee tends to be placed on a new cpu rather than waiting a few
> tens of microseconds for the current cpu to be vacant.
> 
>             t0                     t0
> cpu 0	===========            ===========
> 	         |             ^
>                  v   t1        |
> cpu 1	         ===========   |
>                           |    |
>                           v t2 |
> cpu 2                     =======
> 

This appears to be a property of the workload that the scheduler can't really
learn by itself (it can't possibly know, unless told somehow, that t0 is going
to go to sleep shortly after t1 has been started).

It looks like that might be covered by a new thread flag meaning "start my
children on the same CPU".  Or a clone flag meaning "run this child on the same
CPU".

> > > What data do you have on the actual video workload in terms of threads?  My
> > > guess is we are looking at audio decode and video decode processing.  Is this
> > > video playback measurement including any rendering power, or is it all CPU?
> 
> As said above, these are cpus-only power numbers, so any gpu rendering is
> excluded. The cpu workload includes both partial video decoding and
> audio decoding. I'm not sure how much is offloaded to the gpu. use-case
> 3 has a short description of the main threads.
> 
> What sort of data are you looking for?
> 
> > 
> > > I guess what I'm calling out is that it is not clear what the right thing
> > > for the scheduler to do is, as there is no physical model coupling power to
> > > SoC topology and workload characteristics.
> > 
> > > I'll see if the folks at work who are hands-on with similar KPI measuring can
> > > share similar data.  (they read this list too) It may be easier for them to
> > > share if we can agree on a normalization of the power data.  Say 100 "lumps"
> > > (of coal) measured from the battery or psu output rails as the power burned on
> > > a workload if run by booting with MAXCPUS=1 kernel command line?  (or should it
> > > be measured from the SoC power rails?) That way we don't need to worry as much
> > > about exposing competitively sensitive data in physical units.  FWIW I would go
> > > with display-off measurements (in "airplane mode"?) from the battery or
> > > equivalent.
> 
> Good question. System power (battery) seems right if you have your SoC
> on a board which is fairly similar to the end product. That is not always
> the case, so I tend to look at SoC power instead. How large is the
> difference if the display is off and in airplane mode? Would it work if
> we just state the measurement method when posting numbers?

Well, that's absolutely necessary in my opinion.

I guess the main point would be to distinguish "improvements" from "regressions",
so if there is a convenient way to measure things on a given system, it will be
fine as long as it is exactly reproducible (same kernel with the same hardware
setup should give the same result and if it doesn't, we need to know what the
distribution of the results is, at least roughly, and how it changes after a
given patch).

Now, if you see an "improvement" in an artificial measurement setup, there still
is a question how much of a real improvement will be seen by a user of a final
product.  That needs to be known as well, so that we can write off "improvements"
that won't really improve anything in practice.

> Normalizing to single cpu (MAXCPUS=1 or equivalent) would work I think.
> 
> > 
> > BTW remember my comment about power measuring being a path to hell?  Agreeing
> > on what workloads to measure, how to normalize, what to measure and from where
> > on a device, and how to report it (statistical data across multiple runs) is a
> > pain.  Details on screen on vs off and reproducibility of data by a third party
> > quickly come into play. 
> 
> Yes :) I don't think we necessarily have to have a fully specified test
> suite. As long as each interested party makes sure to test whatever
> patches that might eventually come out of this, we have the proof
> whether it works or not. Third party reproducibility is more difficult.

And probably not necessary.

Thanks!

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/11] issue 1: Missing power topology information in scheduler
  2014-01-13 20:23       ` Rafael J. Wysocki
@ 2014-01-14 16:21         ` Morten Rasmussen
  2014-01-14 17:09           ` Peter Zijlstra
  0 siblings, 1 reply; 26+ messages in thread
From: Morten Rasmussen @ 2014-01-14 16:21 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: mark gross, peterz, mingo, vincent.guittot, Catalin Marinas, linux-pm

On Mon, Jan 13, 2014 at 08:23:32PM +0000, Rafael J. Wysocki wrote:
> On Monday, December 30, 2013 02:00:25 PM Morten Rasmussen wrote:
> > On Sun, Dec 22, 2013 at 03:19:05PM +0000, mark gross wrote:
> > > On Fri, Dec 20, 2013 at 04:45:41PM +0000, Morten Rasmussen wrote:
> > > > The current mainline scheduler has no power topology information
> > > > available to enable it to make energy-aware decisions. The energy cost
> > > > of running a cpu at different frequencies and the energy cost of waking
> > > > up another cpu are needed.
> > > > 
> > > > One example where this could be useful is audio on Android. With the
> > > > current mainline scheduler it would utilize three cpus when active. Due
> > > > to the size of the tasks it is still possible to meet the performance
> > > > criteria when execution is serialized on a single cpu. Depending on the
> > > > power topology leaving two cpus idle and running one longer may lead to
> > > > energy savings if the cpus can be power-gated individually.
> > > > 
> > > > The audio performance requirements can be satisfied by most cpus at the
> > > > lowest frequency. Video is a more interesting use-case due to its higher
> > > > performance requirements. Running all tasks on a single cpu is likely to
> > > > require a higher frequency than if the tasks are spread out across
> > > > more cpus.
> > > > 
> > > > Running Android video playback on an ARM Cortex-A7 platform with 1, 2,
> > > > and 4 cpus online has led to the following power measurements
> > > > (normalized):
> > > > 
> > > > video 720p (Android)
> > > > cpus	power
> > > > 1	1.59
> > > > 2	1.00
> > > > 4	1.10
> > > 
> > > I wonder what 3 CPUs shows?  Also, is this "display-on" power measured from
> > > the battery?  The variance seems too big for a display-on measurement.
> > 
> > These are cpus-only measurements excluding gpu, dram, and other
> > peripherals. So yes, the relative total power saving is much smaller.
> 
> And that number is what actually matters, because that's what the battery life
> depends on.

Agreed. I can add that the above experiment didn't show any changes in
GPU power consumption.

> 
> > I don't have numbers for 3 cpus, but I will see if I can get them. Based
> > on the traces for 2 and 4 cpus my guess is that 3 cpus would be very
> > close to 2 cpus if not slightly better. The available parallelism seems
> > limited. The fourth cpu is hardly used and the third is only used for
> > short periods. 
> > 
> > > 
> > > Here we seem to have a workload consisting of about 2 threads, where if we
> > > use more than 2 CPUs we pay a penalty for task migration.  There is no tie to
> > > cpu L2 or power rail topology in this example.  From this data alone the
> > > scheduler simply needs to avoid using more CPUs until the workload truly has
> > > more threads.
> > 
> > We have more threads in this workload (use-case 3), but rarely more than
> > two of them running (or runnable) simultaneously. I agree, that the
> > scheduler needs to avoid using more cpus than necessary.
> > 
> > The scheduler is particularly bad at this for thread patterns like the
> > ones observed for audio and video playback. Both have chains of threads
> > that wake up the next thread and then go to sleep. Since the current
> > thread continues to run for a moment after the next one has been woken
> > up, the wakee tends to be placed on a new cpu rather than waiting a few
> > tens of microseconds for the current cpu to be vacant.
> > 
> >             t0                     t0
> > cpu 0	===========            ===========
> > 	         |             ^
> >                  v   t1        |
> > cpu 1	         ===========   |
> >                           |    |
> >                           v t2 |
> > cpu 2                     =======
> > 
> 
> This appears to be a property of the workload that the scheduler can't really
> learn by itself (it can't possibly know, unless told somehow, that t0 is going
> to go to sleep shortly after t1 has been started).
> 
> It looks like that might be covered by a new thread flag meaning "start my
> children on the same CPU".  Or a clone flag meaning "run this child on the same
> CPU".

It could work for multithreaded applications, but I'm not sure if it would
work for Android middleware threads. Android audio playback as described
in another post exhibits a pattern similar to the above. It may not be
possible to have a parent/child setup for the Android audio subsystem
threads that gives the desired behaviour.

> 
> > > What data do you have on the actual video workload in terms of threads?  My
> > > guess is we are looking at audio decode and video decode processing.  Is this
> > > video playback measurement including any rendering power, or is it all CPU?
> > 
> > As said above, these are cpus-only power numbers, so any gpu rendering is
> > excluded. The cpu workload includes both partial video decoding and
> > audio decoding. I'm not sure how much is offloaded to the gpu. use-case
> > 3 has a short description of the main threads.
> > 
> > What sort of data are you looking for?
> > 
> > > 
> > > I guess what I'm calling out is that it is not clear what the right thing
> > > for the scheduler to do is, as there is no physical model coupling power to
> > > SoC topology and workload characteristics.
> > > 
> > > I'll see if the folks at work who are hands-on with similar KPI measuring can
> > > share similar data.  (they read this list too) It may be easier for them to
> > > share if we can agree on a normalization of the power data.  Say 100 "lumps"
> > > (of coal) measured from the battery or psu output rails as the power burned on
> > > a workload if run by booting with MAXCPUS=1 kernel command line?  (or should it
> > > be measured from the SoC power rails?) That way we don't need to worry as much
> > > about exposing competitively sensitive data in physical units.  FWIW I would go
> > > with display-off measurements (in "airplane mode"?) from the battery or
> > > equivalent.
> > 
> > Good question. System power (battery) seems right if you have your SoC
> > on a board which is fairly similar to the end product. That is not always
> > the case, so I tend to look at SoC power instead. How large is the
> > difference if the display is off and in airplane mode? Would it work if
> > we just state the measurement method when posting numbers?
> 
> Well, that's absolutely necessary in my opinion.
> 
> I guess the main point would be to distinguish "improvements" from "regressions",
> so if there is a convenient way to measure things on a given system, it will be
> fine as long as it is exactly reproducible (same kernel with the same hardware
> setup should give the same result and if it doesn't, we need to know what the
> distribution of the results is, at least roughly, and how it changes after a
> given patch).
> 
> Now, if you see an "improvement" in an artificial measurement setup, there still
> is a question how much of a real improvement will be seen by a user of a final
> product.  That needs to be known as well, so that we can write off "improvements"
> that won't really improve anything in practice.

Agreed. The power measurements must include enough of the system that it
is possible to determine whether power is merely being shifted from one
part to another without improving the total, or whether it is a true
reduction in total power consumption. As I said earlier, it may not make
sense to do PSU-level power measurements on development boards.

Morten

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/11] issue 1: Missing power topology information in scheduler
  2014-01-14 16:21         ` Morten Rasmussen
@ 2014-01-14 17:09           ` Peter Zijlstra
  0 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2014-01-14 17:09 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Rafael J. Wysocki, mark gross, mingo, vincent.guittot,
	Catalin Marinas, linux-pm

On Tue, Jan 14, 2014 at 04:21:15PM +0000, Morten Rasmussen wrote:
> > This appears to be a property of the workload that the scheduler can't really
> > learn by itself (it can't possibly know, unless told somehow, that t0 is going
> > to go to sleep shortly after t1 has been started).
> > 
> > It looks like that might be covered by a new thread flag meaning "start my
> > children on the same CPU".  Or a clone flag meaning "run this child on the same
> > CPU".
> 
> It could work for multithreaded applications, but I'm not sure if it would
> work for Android middleware threads. Android audio playback as described
> in another post exhibits a pattern similar to the above. It may not be
> possible to have a parent/child setup for the Android audio subsystem
> threads that gives the desired behaviour.

We used to have a guesstimator at some point that would measure the
overlap (the time we keep running after waking another task), which we
used to auto select WF_SYNC.

The fact that we no longer have this only says that we weren't smart
enough to make it work reliably enough :-)

ISTR there were also people poking at having special FUTEX flags to
indicate such situations, but I can't remember whatever became of that.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [8/11] use-case 2: Audio playback on Android
  2014-01-07 16:19 [0/11][REPOST] " Morten Rasmussen
@ 2014-01-07 16:19 ` Morten Rasmussen
  0 siblings, 0 replies; 26+ messages in thread
From: Morten Rasmussen @ 2014-01-07 16:19 UTC (permalink / raw)
  To: peterz, mingo
  Cc: rjw, markgross, vincent.guittot, catalin.marinas,
	morten.rasmussen, linux-pm, linux-kernel

Audio playback is a low load periodic application that has little/no
variation in period and load over time. It consists of tasks involved in
decoding the audio stream and communicating with audio frameworks and
drivers.

Performance Criteria

All tasks must have completed before the next request to fill the audio
buffer. Most modern hardware should be able to deal with the load even
at the lowest P-state.

Task behaviour

The task load pattern period is dictated by the audio interrupt. On an
example modern ARM based system this occurs every ~6 ms. The decoding
work is triggered every fourth interrupt, i.e. a ~24 ms period. No tasks
are scheduled at the intermediate interrupts. The tasks involved are:

Main audio framework task (AudioOut): The first task to be scheduled
after the interrupt and continues running until decoding has completed.
That is ~5 ms. Runs at nice=-19.

Audio framework task 2 (AudioTrack): Woken up by the main task ~250-300
us after it is scheduled. Runs for ~300 us at nice=-16.

Decoder task (mp3.decoder): Woken up by the audio task 2 when that
finishes (serialized). Runs for ~1 ms until it wakes a third Android
task on which it blocks and continues for another ~150 us afterwards
(serialized). Runs at nice=-2.

Android task 3 (OMXCallbackDisp): Woken by decoder task. Runs for ~300
us at nice=-2.



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-01-14 17:09 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-20 16:45 [0/11] Energy-aware scheduling use-cases and scheduler issues Morten Rasmussen
2013-12-20 16:45 ` [1/11] issue 1: Missing power topology information in scheduler Morten Rasmussen
2013-12-22 15:19   ` mark gross
2013-12-30 14:00     ` Morten Rasmussen
2014-01-13 20:23       ` Rafael J. Wysocki
2014-01-14 16:21         ` Morten Rasmussen
2014-01-14 17:09           ` Peter Zijlstra
2013-12-20 16:45 ` [2/11] issue 2: Energy-awareness for heterogeneous systems Morten Rasmussen
2013-12-20 16:45 ` [3/11] issue 3: No understanding of potential cpu capacity Morten Rasmussen
2013-12-20 16:45 ` [4/11] issue 4: Tracking idle states Morten Rasmussen
2013-12-20 16:45 ` [5/11] issue 5: Frequency and uarch invariant task load Morten Rasmussen
2013-12-20 16:45 ` [6/11] issue 6: Poor and non-deterministic performance on heterogeneous systems Morten Rasmussen
2013-12-20 16:45 ` [7/11] use-case 1: Webbrowsing on Android Morten Rasmussen
2013-12-20 16:45 ` [8/11] use-case 2: Audio playback " Morten Rasmussen
2014-01-07 12:15   ` Peter Zijlstra
2014-01-07 12:16     ` Peter Zijlstra
2014-01-07 16:02       ` Morten Rasmussen
2014-01-07 15:55     ` Morten Rasmussen
2013-12-20 16:45 ` [9/11] use-case 3: Video " Morten Rasmussen
2013-12-20 16:45 ` [10/11] use-case 4: Game " Morten Rasmussen
2013-12-20 16:45 ` [11/11] system 1: Saving energy using DVFS Morten Rasmussen
2013-12-22 16:28 ` [0/11] Energy-aware scheduling use-cases and scheduler issues mark gross
2013-12-30 12:10   ` Morten Rasmussen
2014-01-12 16:47     ` mark gross
2014-01-13 12:04       ` Catalin Marinas
2014-01-07 16:19 [0/11][REPOST] " Morten Rasmussen
2014-01-07 16:19 ` [8/11] use-case 2: Audio playback on Android Morten Rasmussen
