* [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
@ 2016-10-28  7:10 Vikram Mulukutla
  2016-10-28  7:10 ` [RFC PATCH 1/3] sched: Introduce structures necessary for WALT Vikram Mulukutla
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla


We propose Window-Assisted Load Tracking (WALT) as an alternative or additional
load tracking scheme to PELT, one that in our estimation better tracks task
demand and CPU utilization, especially for mobile device use cases. WALT was
conceived by Srivatsa Vaddagiri to provide better perf-per-watt numbers on
asymmetric CPU (frequency and/or IPC) implementations, specifically on Qualcomm
Snapdragon chipsets running Android, and its metrics have been used to guide
task placement and p-state selection (load balancing in CFS still uses PELT
statistics). WALT is now present in the android-common kernel as well. This RFC
is not an attempt to supplant PELT statistics; it is more about starting a
discussion, perhaps one that results in changes to PELT to address the
requirements listed below.

Background
__________

1) Classifying a task as "heavy" faster helps in mobile-world use cases such
as scrolling a UI or browsing a web page, where tasks exhibit sporadically
heavy load. This allows, for instance, faster migration to a more capable CPU
(assuming the scheduler uses task util and CPU capacity as inputs during task
placement decision making, e.g. EAS). Reclassification of a light task as
heavy is also important - for example, a rendering thread may change its
demand depending on what is being shown on-screen. We believe that quickly
associating a task with its necessary resources requires a more dynamic demand
or utilization signal than PELT provides; PELT is subject to the geometric
series and does not differentiate between task ramp-up and ramp-down. E.g.,
with a 32ms half-life, PELT takes ~74ms to reach 80% nice-0-load_avg/util_avg
for a continuously executing 100% task, and ~139ms to reach 95%.
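
For reference, these ramp times fall out of the geometric series directly.
A minimal userspace sketch (not kernel code; it uses the continuous
approximation of a per-millisecond decay factor y = 0.5^(1/32)):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* 32ms half-life => per-millisecond decay factor y = 0.5^(1/32) */
	double y = pow(0.5, 1.0 / 32.0);

	/*
	 * A continuously running task's util fraction after n ms is
	 * 1 - y^n, so fraction f is first reached at n = log(1-f)/log(y).
	 */
	printf("80%%: %.0f ms\n", log(1.0 - 0.80) / log(y)); /* ~74 ms */
	printf("95%%: %.0f ms\n", log(1.0 - 0.95) / log(y)); /* prints ~138;
					the ~139 above reflects PELT's
					discrete 1024us periods */
	return 0;
}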

2) Historical task demand must be retained (even if a task sleeps for, say,
300ms due to network delay) so that both CPU and frequency resources can be
restored to meet performance throughput. PELT with a 32ms 'half-life' takes
just ~213ms to decay utilization/load to (effectively) zero, forgetting that
task's history and requiring the task to re-execute its workload to regain
that load_avg/util_avg.

3) There is a need to quickly ramp up CPU frequency in response to heavy CPU
utilization, and to reduce frequency whenever possible to save power.
Consider as an example a single 100% thread running continuously on one core,
using schedutil with a 10ms transition latency: on an i7 8-core SMP machine
with SMT, PELT takes 60-70ms to ramp from FMIN to FMAX.

4) PELT's blocked-util decay implies that the underlying cpufreq governor
would have to delay dropping frequency for longer than necessary (especially
for mobile-world use cases). PELT with a 32ms 'half-life' takes ~108ms to
reduce the runqueue utilization/nice-0-load from max to 10%. Blocked load/util
tracking is of course a conservative approach, recognizing that load might
return to the runqueue, but one that we feel is inflexible for certain
real-world use cases.
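
The decay-side figures in (2) and (4) follow from the same closed form as the
ramp-up sketch above: a sleeping task's remaining fraction after n ms is y^n
with y = 0.5^(1/32), so n = 32 * log2(1/f). f = 0.10 gives ~106ms (~108ms once
PELT's discrete 1024us periods are accounted for), and f = 0.01 - effectively
zero at the signal's resolution - gives ~213ms.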

WALT vs PELT
____________

WALT retains the per-entity properties of PELT; demand and utilization are
accounted for at the task level, and a CPU rq's utilization contribution is
the sum of its children's contributions. Wallclock time, however, is divided
into a series of windows - to be clear about what WALT tracks exactly, we
introduce the terms task demand and CPU busy-time here and show the benefits
versus PELT that apply to the four points mentioned above:

1) Task Demand - The contribution of a task's running and runnable time to a
window, averaged over N sliding windows. Note that a window has to be
*completed* before its contribution is considered. A task's demand is the
maximum of its contribution to the most recently completed window and its
average demand over the past N windows. This metric can be roughly thought of as
an unweighted load_avg without blocked load.
WALT can take as little as one window of execution to report max task demand,
although with freq and capacity invariance, this tends to be about 3 windows
with a good frequency governor (so that's about 30ms with a 10ms window). This
allows faster and more accurate task placement decisions.
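
As a sketch of the demand policy just described - assuming a per-task ring
buffer of the last N completed window contributions (sum_history[] and
RAVG_HIST_SIZE as in the full android-common walt.c; the limited patch in
this RFC does not carry task demand):

static u32 walt_task_demand(struct ravg *ravg)
{
	u32 recent = ravg->sum_history[0]; /* most recently completed window */
	u64 sum = 0;
	int i;

	for (i = 0; i < RAVG_HIST_SIZE; i++)
		sum += ravg->sum_history[i];

	/* demand = max(most recent completed window, average over N windows) */
	return max(recent, (u32)div64_u64(sum, RAVG_HIST_SIZE));
}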

2) WALT "forgets" blocked time entirely, i.e. the sliding windows exist only
when the task is on the runqueue or running. This allows rapid reclassification
of the same task as heavy after a short sleep. Thus a task does not need to
re-execute to gain its demand once more and can be migrated up to a big CPU
immediately.

3) CPU busy time - The sum of execution times of all tasks in the most recently
completed window. For the same machine described above and with schedutil, WALT
with a 10ms window takes around 30-50ms to ramp to FMAX, and with a 20ms window
takes between 60-100ms (note that it is assumed that schedutil has a transition
latency equal to or greater than the window size).

4) WALT "forgets" cpu utilization as soon as tasks are taken off the
runqueue, and thus the cpufreq governor can choose to drop frequency after
just one window, potentially saving power while leaving the governor free to
apply whatever hysteresis it deems necessary on top of the more accurate
utilization data.
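
To illustrate how a cpufreq governor can consume the busy-time statistic, the
sketch below (walt_next_freq() is a hypothetical helper, not part of these
patches) normalizes prev_runnable_sum the way patch 2's cpu_walt_util() does,
then applies schedutil's existing next_f = 1.25 * max_freq * util / max
heuristic:

static unsigned int walt_next_freq(struct rq *rq, struct cpufreq_policy *pol)
{
	/* busy fraction of the last completed window, in capacity units */
	unsigned long util = (rq->prev_runnable_sum * rq->cpu_capacity_orig) /
						walt_ravg_window;
	unsigned long max = rq->cpu_capacity_orig;
	unsigned int fmax = pol->cpuinfo.max_freq;

	/* schedutil's headroom: next_f = (fmax + fmax/4) * util / max */
	return (fmax + (fmax >> 2)) * util / max;
}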

RFC patch
_________

We realize that WALT has been commercialized along with several other
out-of-tree changes to the Linux scheduler, and evaluating the code on top of
all those changes is not trivial. Therefore, this RFC is an attempt to provide
a simple, limited implementation of WALT - only the CPU utilization/frequency
guidance portion - that can be run on the X86 architecture as well. It provides
the basic building blocks in terms of the windowed view of time and how WALT
hooks into various scheduling points. The WALT CPU utilization metric is fed to
the schedutil governor which then sets the appropriate p-state. The intent
here is to show that we don't harm server or desktop workload performance
while still providing potential power savings.

The full WALT implementation can be browsed at:
https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/walt.c

This RFC patch has been tested on live X86 machines with the following sanity
and benchmark results (thanks to Juri Lelli, Dietmar Eggemann and Patrick
Bellasi for initial code reviews):

(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics,
with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
baseline. WALT window size was 10ms. Only deltas above 3% are considered
non-noise. Power was measured with Intel RAPL counters.)

+--------------------+----------------+----------------+----------+------------+
|                    | Schedutil+WALT | Schedutil+PELT | Ondemand | Delta      |
+--------------------+----------------+----------------+----------+------------+
| Chrome power test  |                |                |          | WALT saves |
| 1080p 60fps video  | 922 Joules     | 967 J          | 1011 J   | 4.6%       |
+--------------------+----------------+----------------+----------+------------+
| Firefox power test |                |                |          | WALT saves |
| 1080p 60fps video  | 1079 Joules    | 1210 J         | 1248 J   | 10.82%     |
+--------------------+----------------+----------------+----------+------------+
| perf sched bench   |                |                |          |            |
| messaging          |                |                |          |            |
| -g 50 -l 5000      | 16.94s         | 16.9326s       | 16.9318s | No diff    |
+--------------------+----------------+----------------+----------+------------+
| phoronix apache    | 26177          |                |          |            |
| benchmark          | requests/s     | 26088          | 26170    | No diff    |
+--------------------+----------------+----------------+----------+------------+
| phoronix pgbench   | 295            |                |          |            |
| database r/w test  | transactions/s | 300            | 302      | No diff    |
+--------------------+----------------+----------------+----------+------------+
| phoronix           |                |                |          |            |
| unigine-valley     | 49fps          | 49fps          | 49fps    | No diff    |
+--------------------+----------------+----------------+----------+------------+

+-----------------+-------+--------+--------+-------------+-----------+--------+
| cpufreq-bench (from tools/power/cpupower/bench/)                             |
| measures % of work done against the performance governor                     |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| Schedutil + WALT| round | load   | sleep  | performance | schedutil | %      |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 0     | 50000  | 50000  | 50255       | 74574     | 67.389 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 1     | 100000 | 100000 | 99931       | 125724    | 80.367 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 2     | 150000 | 150000 | 150496      | 175244    | 85.878 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 3     | 200000 | 200000 | 200772      | 224903    | 89.270 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 4     | 250000 | 250000 | 250716      | 275955    | 90.854 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 |       |        |        |             |           |        |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| Schedutil + PELT| round | load   | sleep  | performance | schedutil | %      |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 0     | 50000  | 50000  | 50055       | 65408     | 76.528 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 1     | 100000 | 100000 | 100494      | 131842    | 76.223 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 2     | 150000 | 150000 | 150423      | 182352    | 82.490 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 3     | 200000 | 200000 | 200605      | 234065    | 85.705 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
|                 | 4     | 250000 | 250000 | 251724      | 284186    | 88.577 |
+-----------------+-------+--------+--------+-------------+-----------+--------+

ARM data with EAS (Energy Aware Scheduler) and sched-freq
_________________________________________________________

The following is data collected on a Snapdragon 820 device. The WALT
implementation in this kernel is used by the EAS scheduler coupled with the
sched-freq governor, and is a complete one with both task demand and cpu
utilization tracking (as opposed to the limited patch in this RFC). Power
is measured at the battery. EAS uses cfs_rq->avg.util_avg and se->avg.util_avg
when using PELT, and rq->prev_runnable_sum and p->ravg.demand when using
WALT metrics.

The "Jankbench" and power data represent real-world use cases where WALT
provides the most benefit, while the benchmarks are often used as a metric to
compare OEM devices. WALT uses a 10ms window in all of the following data.

+---------------------------------+---------+----------+----------------------+
| ARM data (ARMv8 based Snapdragon 820)                                       |
| Two clusters with 2 Kryo CPUs each, 1.8GHz and 2.4GHz FMAX each.            |
| Running Android with EAS + WALT and EAS + PELT, with the sched-freq governor|
+---------------------------------+---------+----------+----------------------+
|                                 | WALT    | PELT     |                      |
+---------------------------------+---------+----------+----------------------+
| Geekbench 3 Single Core         |  No difference                            |
+---------------------------------+---------+----------+----------------------+
| Geekbench 3 Multicore           |  No difference                            |
+---------------------------------+---------+----------+----------------------+
| Geekbench 4 Single Core         |  No difference                            |
+---------------------------------+---------+----------+----------------------+
| Geekbench 4 Multicore           |  No difference                            |
+---------------------------------+---------+----------+----------------------+
| MT Linpack Native               | 525     | 483      | WALT is 8.6% better  |
+---------------------------------+---------+----------+----------------------+
| MT Linpack Java                 | 682     | 710      | PELT is 4.1% better  |
+---------------------------------+---------+----------+----------------------+
| Threadbench                     | 200     | 103      | WALT is 94.1% better |
+---------------------------------+---------+----------+----------------------+
| sysbench                        | 230     | 259      | PELT is 12% better   |
+---------------------------------+---------+----------+----------------------+
| Parsec                          | 287     | 220      | WALT is 30% better   |
+---------------------------------+---------+----------+----------------------+
| IPC (Android Binder)            | 158     | 167      | PELT is 5.6% better  |
+---------------------------------+---------+----------+----------------------+
|                                 |         |          |                      |
+---------------------------------+---------+----------+----------------------+
| Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms   |
| or less. Lower is better.                                                   |
| This test scrolls text and graphics on a mobile screen in Android.          |
| (list view, imagelist view subtests)                                        |
+---------------------------------+---------+----------+----------------------+
|                                 | WALT    | PELT     |                      |
+---------------------------------+---------+----------+----------------------+
| 90th percentile                 | 6.32ms  | 6.50ms   | WALT is 3% better    |
+---------------------------------+---------+----------+----------------------+
| 95th percentile                 | 7.20ms  | 7.50ms   | WALT is 4% better    |
+---------------------------------+---------+----------+----------------------+
| 99th percentile                 | 8.50ms  | 8.53ms   | No diff              |
+---------------------------------+---------+----------+----------------------+
| Power (avg, inst. current)      | 115.1mA | 116.8mA  | No difference        |
+---------------------------------+---------+----------+----------------------+
|                                 |         |          |                      |
+---------------------------------+---------+----------+----------------------+
| Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms   |
| or less. Lower is better.                                                   |
| This test scrolls text and graphics on a mobile screen in Android.          |
| (high bit-rate text render, edit text input, overdraw subtests)             |
+---------------------------------+---------+----------+----------------------+
|                                 | WALT    | PELT     |                      |
+---------------------------------+---------+----------+----------------------+
| 90th percentile                 | 17.45ms | 18.23ms  | WALT is 4.2% better  |
+---------------------------------+---------+----------+----------------------+
| 95th percentile                 | 19.20ms | 19.68ms  | No diff              |
+---------------------------------+---------+----------+----------------------+
| 99th percentile                 | 38.4ms  | 41.44ms  | WALT is 7.33% better |
+---------------------------------+---------+----------+----------------------+
| Power (avg, inst. current)      | 378.0mA | 380.0mA  | No difference        |
+---------------------------------+---------+----------+----------------------+
|                                 |         |          |                      |
+---------------------------------+---------+----------+----------------------+
| Video 1080p (big buck bunny)  60 seconds of playback, sampling rate 1000/s  |
+---------------------------------+---------+----------+----------------------+
| Avg Instantaneous Current       | 103mA   | 114mA    | WALT is 9.6% better  |
+---------------------------------+---------+----------+----------------------+
|                                 |         |          |                      |
+---------------------------------+---------+----------+----------------------+
| Video 2160p (big buck bunny)  60 seconds of playback, sampling rate 1000/s  |
+---------------------------------+---------+----------+----------------------+
| Avg Instantaneous Current       | 160mA   | 164.5mA  | WALT is 3% better    |
+---------------------------------+---------+----------+----------------------+
|                                 |         |          |                      |
+---------------------------------+---------+----------+----------------------+

This patch series is based on v4.8.2

Srivatsa Vaddagiri (3):
 sched: Introduce structures necessary for WALT
 sched: Introduce Window-Assisted CPU utilization Tracking
 sched: Introduce WALT hooks into core and scheduling classes

 include/linux/sched.h        |  35 ++++++
 include/linux/sched/sysctl.h |   1 +
 init/Kconfig                 |   9 ++
 kernel/sched/Makefile        |   1 +
 kernel/sched/core.c          |  29 ++++-
 kernel/sched/cputime.c       |  10 +-
 kernel/sched/deadline.c      |   7 ++
 kernel/sched/debug.c         |  10 ++
 kernel/sched/fair.c          |  11 +-
 kernel/sched/rt.c            |   6 +
 kernel/sched/sched.h         |  12 ++
 kernel/sched/walt.c          | 433 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/walt.h          |  69 ++++++++++
 kernel/sysctl.c              |  11 ++
 14 files changed, 638 insertions(+), 6 deletions(-)

Some references:
Steve Muckle's 2014 presentation on Qualcomm's HMP scheduler (with WALT):
  http://www.slideshare.net/linaroorg/lcu14-406-a-quick-take-on-energyaware-scheduling
  https://www.youtube.com/watch?v=2xb0vOV-E6E
2016 presentation of Qualcomm data on WALT (was called WinLT then!):
  http://www.slideshare.net/linaroorg/bkk16208-eas
  https://www.youtube.com/watch?v=g4srUao5ahg
Qualcomm data on WALT vs PELT CPU utilization on X86:
  https://www.youtube.com/watch?v=hc0aufpfWvY
Steve Muckle's presentation including WALT+schedutil data on a HiKey Board:
  http://www.slideshare.net/linaroorg/las16307-benchmarking-schedutil-in-android
  https://www.youtube.com/watch?v=Wxu5ApDRYH4 
The WALT wiki page
  https://wiki.codeaurora.org/xwiki/bin/QKERNEL/Window+Assisted+Load+Tracking/

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


* [RFC PATCH 1/3] sched: Introduce structures necessary for WALT
  2016-10-28  7:10 [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Vikram Mulukutla
@ 2016-10-28  7:10 ` Vikram Mulukutla
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla,
	Vikram Mulukutla

From: Srivatsa Vaddagiri <vatsa@codeaurora.org>

Add the per-task and per-runqueue data structures that
will later be used by Window Assisted Load Tracking (WALT)
to estimate task demand and CPU utilization.

Move cap_scale into sched.h as that will be needed by WALT
as well to implement frequency and capacity invariance.

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
---
 include/linux/sched.h | 39 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c   |  2 --
 kernel/sched/sched.h  |  8 ++++++++
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..64f8bec 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -315,6 +315,21 @@ extern char ___assert_task_state[1 - 2*!!(
 /* Task command name length */
 #define TASK_COMM_LEN 16
 
+/*
+ * These events may be replaced with a combination of existing scheduler flags
+ * provided that that doesn't make the implementation too fragile.
+ */
+enum task_event {
+	PUT_PREV_TASK   = 0,
+	PICK_NEXT_TASK  = 1,
+	TASK_WAKE       = 2,
+	TASK_MIGRATE    = 3,
+	TASK_UPDATE     = 4,
+	IRQ_UPDATE      = 5,
+};
+
+extern char *task_event_names[];
+
 #include <linux/spinlock.h>
 
 /*
@@ -1320,6 +1335,25 @@ struct sched_statistics {
 };
 #endif
 
+#ifdef CONFIG_SCHED_WALT
+
+/* ravg represents capacity scaled cpu-usage of tasks */
+struct ravg {
+	/*
+	 * 'mark_start' marks the most recent event for a task
+	 *
+	 * 'curr_window' represents task's cpu usage in its most recent
+	 * window
+	 *
+	 * 'prev_window' represents task's cpu usage in the window prior
+	 * to the one represented by 'curr_window'
+	*/
+	u64 mark_start;
+	u32 curr_window, prev_window;
+};
+#endif
+
+
 struct sched_entity {
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
@@ -1480,6 +1514,11 @@ struct task_struct {
 	const struct sched_class *sched_class;
 	struct sched_entity se;
 	struct sched_rt_entity rt;
+
+#ifdef CONFIG_SCHED_WALT
+	struct ravg ravg;
+#endif
+
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *sched_task_group;
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce8b244..39c826d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2674,8 +2674,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..9bf6925 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -65,6 +65,8 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
 # define scale_load_down(w)	(w)
 #endif
 
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
 /*
  * Task weight (visible to users) and its load (invisible to users) have
  * independent resolution, but they should be well calibrated. We use
@@ -664,6 +666,12 @@ struct rq {
 	u64 max_idle_balance_cost;
 #endif
 
+#ifdef CONFIG_SCHED_WALT
+	u64 window_start;
+	u64 curr_runnable_sum;
+	u64 prev_runnable_sum;
+#endif /* CONFIG_SCHED_WALT */
+
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64 prev_irq_time;
 #endif
-- 
TheMan


* [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:10 [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Vikram Mulukutla
  2016-10-28  7:10 ` [RFC PATCH 1/3] sched: Introduce structures necessary for WALT Vikram Mulukutla
@ 2016-10-28  7:10 ` Vikram Mulukutla
  2016-10-28  7:38   ` Peter Zijlstra
                     ` (3 more replies)
  2016-10-28  7:10 ` [RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes Vikram Mulukutla
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla,
	Vikram Mulukutla

From: Srivatsa Vaddagiri <vatsa@codeaurora.org>

This patch implements an alternative window-based CPU utilization
tracking mechanism in the scheduler. Per task and per CPU counters are
updated with utilization statistics using a synchronized (across CPUs)
time source and a single statistic (prev_runnable_sum) is fed to schedutil
in an upcoming patch. A windowed view of time (window size determined by
walt_ravg_window) is used to determine CPU utilization.

There are two per-CPU-rq quantities maintained by Window Assisted Load Tracking
(WALT), both normalized to the max possible frequency and the max efficiency
(IPC) of that CPU (provided the arch supports the invariance APIs):

    curr_runnable_sum: aggregate utilization of all tasks that executed
    during the current (not yet completed) window

    prev_runnable_sum: aggregate utilization of all tasks that executed
    during the most recent completed window
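
A rough picture of the windowed view of time (window size W):

   ...--|<------ W ------>|<------ W ------>|<------ W (incomplete) ---
         older windows      most recently     current window
         (discarded)        completed window  -> curr_runnable_sum
                            -> prev_runnable_sum

When the wallclock crosses a window boundary, curr_runnable_sum rolls over
into prev_runnable_sum (prev_runnable_sum is zeroed instead if more than one
full window elapsed with no activity) and accounting restarts for the new
current window.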

prev_runnable_sum is the primary statistic used to guide CPU frequency in
lieu of PELT's cfs_rq->avg.util_avg. No additional policy is imposed on this
statistic, the assumption being that the consumer (e.g., schedutil) will
perform appropriate policy decisions (e.g., headroom) before deciding the
next P-state.

Corresponding to the aggregate statistics, WALT also maintains the
following stats per task:

    curr_window - represents the cpu utilization of the task in its most
    recently tracked window

    prev_window - represents cpu utilization of task in the window prior
    to the one being tracked by curr_window

WALT statistic updates are event driven, with updates occurring in
scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch),
at task wakeup, and during task migration. Migration simply involves removing
a task's curr_window and prev_window from the source CPU's curr_runnable_sum
and prev_runnable_sum, and adding the per-task counters to the destination
CPU's aggregate CPU counters. Execution time in an IRQ handler is accounted
in a CPU's curr_runnable_sum statistic, provided that the CPU was also
executing the idle task for the duration of the interrupt handler.

Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU
rq has tasks blocked on IO, idle-task execution is accounted in per-task
and per-CPU counters. Setting walt_io_is_busy will also cause interrupt
handlers in the idle task to update counters as if the idle task was
executing (instead of just the interrupt handler execution time).

The major tunable provided by WALT is walt_ravg_window, which represents
window size (in nanoseconds) and is set to 20ms by default. walt_io_is_busy
(described above) is set to 0 by default.
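
For example, a 10ms window can be requested on the kernel command line and
the schedutil input switched back to PELT at runtime (window sizes outside
the 5ms-1s bounds disable WALT entirely):

    walt_ravg_window=10000000                           (boot, nanoseconds)
    echo 0 > /proc/sys/kernel/sched_use_walt_metrics    (runtime switch)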

Potential upcoming changes/improvements include: the use of sched_clock
instead of ktime_get as a time source and support for an unsynchronized
(across CPUs) time source. The event codes (PUT_PREV_TASK etc.) may also
potentially be replaced by combinations of existing flags, provided that
doesn't make the implementation too fragile.

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
---
 include/linux/sched/sysctl.h |   1 +
 init/Kconfig                 |   9 +
 kernel/sched/Makefile        |   1 +
 kernel/sched/cputime.c       |  10 +-
 kernel/sched/walt.c          | 428 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/walt.h          |  69 +++++++
 kernel/sysctl.c              |  11 ++
 7 files changed, 528 insertions(+), 1 deletion(-)
 create mode 100644 kernel/sched/walt.c
 create mode 100644 kernel/sched/walt.h

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..7007815 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
 extern unsigned int sysctl_numa_balancing_scan_period_min;
 extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_sched_use_walt_metrics;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..1e629ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -390,6 +390,15 @@ config IRQ_TIME_ACCOUNTING
 
 	  If in doubt, say N here.
 
+config SCHED_WALT
+	bool "Support window based load tracking"
+	depends on SMP
+	help
+	This feature will allow the scheduler to maintain a tunable window
+	based set of metrics for tasks and runqueues. These metrics can be
+	used to guide task placement as well as task frequency requirements
+	for cpufreq governors.
+
 config BSD_PROCESS_ACCT
 	bool "BSD Process Accounting"
 	depends on MULTIUSER
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..41ada04 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
 obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
 obj-y += wait.o swait.o completion.o idle.o
 obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
+obj-$(CONFIG_SCHED_WALT) += walt.o
 obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..704a6f6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -51,12 +51,15 @@ void irqtime_account_irq(struct task_struct *curr)
 {
 	s64 delta;
 	int cpu;
+	u64 wallclock;
+	bool account = true;
 
 	if (!sched_clock_irqtime)
 		return;
 
 	cpu = smp_processor_id();
-	delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
+	wallclock = sched_clock_cpu(cpu);
+	delta = wallclock - __this_cpu_read(irq_start_time);
 	__this_cpu_add(irq_start_time, delta);
 
 	irq_time_write_begin();
@@ -70,6 +73,11 @@ void irqtime_account_irq(struct task_struct *curr)
 		__this_cpu_add(cpu_hardirq_time, delta);
 	else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
 		__this_cpu_add(cpu_softirq_time, delta);
+	else
+		account = false;
+
+	if (account && is_idle_task(curr))
+		walt_account_irqtime(cpu, curr, delta, wallclock);
 
 	irq_time_write_end();
 }
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
new file mode 100644
index 0000000..ed9fde1
--- /dev/null
+++ b/kernel/sched/walt.c
@@ -0,0 +1,428 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ *
+ * Window Assisted Load Tracking (WALT) implementation credits:
+ * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
+ * Pavan Kumar Kondeti, Olav Haugan
+ *
+ * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
+ *             and Todd Kjos
+ * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
+ *             and Vikram Mulukutla
+ */
+
+#include <linux/syscore_ops.h>
+#include <linux/cpufreq.h>
+#include "sched.h"
+#include "walt.h"
+
+
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
+				"TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
+				"IRQ_UPDATE"};
+
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+
+static __read_mostly unsigned int walt_freq_account_wait_time;
+static __read_mostly unsigned int walt_io_is_busy;
+
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */
+static unsigned int __read_mostly walt_disabled;
+
+/* Window size (in ns) */
+__read_mostly unsigned int walt_ravg_window = 20000000;
+
+/* Min window size (in ns) = 5ms */
+#define MIN_SCHED_RAVG_WINDOW 5000000
+
+/* Max window size (in ns) = 1s */
+#define MAX_SCHED_RAVG_WINDOW 1000000000
+
+static unsigned int sync_cpu;
+static ktime_t ktime_last;
+static bool walt_ktime_suspended;
+
+u64 walt_ktime_clock(void)
+{
+	if (unlikely(walt_ktime_suspended))
+		return ktime_to_ns(ktime_last);
+	return ktime_get_ns();
+}
+
+static void walt_resume(void)
+{
+	walt_ktime_suspended = false;
+}
+
+static int walt_suspend(void)
+{
+	ktime_last = ktime_get();
+	walt_ktime_suspended = true;
+	return 0;
+}
+
+static struct syscore_ops walt_syscore_ops = {
+	.resume	= walt_resume,
+	.suspend = walt_suspend
+};
+
+static int __init walt_init_ops(void)
+{
+	register_syscore_ops(&walt_syscore_ops);
+	return 0;
+}
+late_initcall(walt_init_ops);
+
+static int __init set_walt_ravg_window(char *str)
+{
+	get_option(&str, &walt_ravg_window);
+
+	walt_disabled = (walt_ravg_window < MIN_SCHED_RAVG_WINDOW ||
+				walt_ravg_window > MAX_SCHED_RAVG_WINDOW);
+	return 0;
+}
+
+early_param("walt_ravg_window", set_walt_ravg_window);
+
+static void
+update_window_start(struct rq *rq, u64 wallclock)
+{
+	s64 delta;
+	int nr_windows;
+	u64 prev_sum = 0;
+
+	delta = wallclock - rq->window_start;
+	BUG_ON(delta < 0);
+	if (delta < walt_ravg_window)
+		return;
+
+	nr_windows = div64_u64(delta, walt_ravg_window);
+	if (nr_windows == 1)
+		prev_sum = rq->curr_runnable_sum;
+
+	rq->prev_runnable_sum = prev_sum;
+	rq->curr_runnable_sum = 0;
+
+	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;
+}
+
+static u64 scale_exec_time(u64 delta, struct rq *rq)
+{
+	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu_of(rq));
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu_of(rq));
+	u64 scaled_delta = cap_scale(delta, scale_freq);
+
+	return cap_scale(scaled_delta, scale_cpu);
+}
+
+static int cpu_is_waiting_on_io(struct rq *rq)
+{
+	if (!walt_io_is_busy)
+		return 0;
+
+	return atomic_read(&rq->nr_iowait);
+}
+
+static int account_cpu_busy_time(struct rq *rq, struct task_struct *p,
+				     u64 irqtime, int event)
+{
+	if (is_idle_task(p)) {
+		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
+		if (event == PICK_NEXT_TASK)
+			return 0;
+
+		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
+		return irqtime || cpu_is_waiting_on_io(rq);
+	}
+
+	if (event == TASK_WAKE)
+		return 0;
+
+	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
+					 event == TASK_UPDATE)
+		return 1;
+
+	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
+	return walt_freq_account_wait_time;
+}
+
+/*
+ * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
+ */
+static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
+	     int event, u64 wallclock, u64 irqtime)
+{
+	int new_window, nr_full_windows = 0;
+	u64 mark_start = p->ravg.mark_start;
+	u64 window_start = rq->window_start;
+	u32 window_size = walt_ravg_window;
+	u64 delta;
+
+	new_window = mark_start < window_start;
+	if (new_window)
+		nr_full_windows = div64_u64((window_start - mark_start),
+								window_size);
+
+	/* Handle window rollover */
+	if (new_window) {
+		if (!is_idle_task(p)) {
+			u32 curr_window = 0;
+
+			if (!nr_full_windows)
+				curr_window = p->ravg.curr_window;
+
+			p->ravg.prev_window = curr_window;
+			p->ravg.curr_window = 0;
+		}
+	}
+
+	if (!account_cpu_busy_time(rq, p, irqtime, event))
+		return;
+
+	if (!new_window) {
+		if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
+			delta = wallclock - mark_start;
+		else
+			delta = irqtime;
+		delta = scale_exec_time(delta, rq);
+		rq->curr_runnable_sum += delta;
+		if (!is_idle_task(p))
+			p->ravg.curr_window += delta;
+
+		return;
+	}
+
+	if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
+		if (!nr_full_windows) {
+			/*
+			 * A full window hasn't elapsed, account partial
+			 * contribution to previous completed window.
+			 */
+			delta = scale_exec_time(window_start - mark_start, rq);
+			p->ravg.prev_window += delta;
+		} else {
+			/*
+			 * Since at least one full window has elapsed,
+			 * the contribution to the previous window is the
+			 * full window (window_size).
+			 */
+			delta = scale_exec_time(window_size, rq);
+			p->ravg.prev_window = delta;
+		}
+		rq->prev_runnable_sum += delta;
+
+		/* Account piece of busy time in the current window. */
+		delta = scale_exec_time(wallclock - window_start, rq);
+		rq->curr_runnable_sum += delta;
+		p->ravg.curr_window = delta;
+
+		return;
+	}
+
+	if (irqtime) {
+		/* IRQ busy time start = wallclock - irqtime */
+		mark_start = wallclock - irqtime;
+
+		if (mark_start > window_start) {
+			rq->curr_runnable_sum += scale_exec_time(irqtime, rq);
+			return;
+		}
+
+		/*
+		 * IRQ busy time spanned multiple windows. Process the
+		 * busy time preceding the current window first
+		 */
+		delta = window_start - mark_start;
+		if (delta > window_size)
+			delta = window_size;
+		delta = scale_exec_time(delta, rq);
+		rq->prev_runnable_sum += delta;
+
+		/* Process the remaining IRQ busy time in the current window. */
+		delta = wallclock - window_start;
+		rq->curr_runnable_sum += scale_exec_time(delta, rq);
+
+		return;
+	}
+
+	BUG();
+}
+
+/* Reflect task activity on its demand and cpu's busy time statistics */
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+	     enum task_event event, u64 wallclock, u64 irqtime)
+{
+	if (walt_disabled || !rq->window_start)
+		return;
+
+	lockdep_assert_held(&rq->lock);
+
+	update_window_start(rq, wallclock);
+
+	if (!p->ravg.mark_start)
+		goto done;
+
+	update_cpu_busy_time(p, rq, event, wallclock, irqtime);
+
+done:
+	p->ravg.mark_start = wallclock;
+}
+
+void walt_mark_task_starting(struct task_struct *p)
+{
+	u64 wallclock;
+	struct rq *rq = task_rq(p);
+
+	if (!rq->window_start)
+		return;
+
+	wallclock = walt_ktime_clock();
+	p->ravg.mark_start = wallclock;
+}
+
+void walt_set_window_start(struct rq *rq)
+{
+	int cpu = cpu_of(rq);
+	struct rq *sync_rq = cpu_rq(sync_cpu);
+	unsigned long flags;
+
+	if (rq->window_start || walt_ktime_clock() < walt_ravg_window)
+		return;
+
+	if (cpu == sync_cpu) {
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		rq->window_start = walt_ktime_clock();
+		rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
+	} else {
+		local_irq_save(flags);
+		double_rq_lock(rq, sync_rq);
+		rq->window_start = cpu_rq(sync_cpu)->window_start;
+		rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+		double_rq_unlock(rq, sync_rq);
+		local_irq_restore(flags);
+	}
+}
+
+void walt_migrate_sync_cpu(int cpu)
+{
+	if (cpu == sync_cpu)
+		sync_cpu = smp_processor_id();
+}
+
+void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
+{
+	u64 wallclock;
+	unsigned long flags;
+
+	if (!p->on_rq && p->state != TASK_WAKING)
+		return;
+
+	if (locked == false)
+		raw_spin_lock_irqsave(&dest_rq->lock, flags);
+
+	lockdep_assert_held(&dest_rq->lock);
+
+	wallclock = walt_ktime_clock();
+
+	/* Update counters on destination CPU */
+	walt_update_task_ravg(dest_rq->curr, dest_rq,
+				TASK_UPDATE, wallclock, 0);
+
+	/* We may be in a new window. Update task counters */
+	walt_update_task_ravg(p, dest_rq, TASK_MIGRATE, wallclock, 0);
+
+	if (p->ravg.curr_window) {
+		if (!dest_rq->window_start) {
+			p->ravg.curr_window = 0;
+			p->ravg.mark_start = 0;
+		}
+		dest_rq->curr_runnable_sum += p->ravg.curr_window;
+	}
+
+	if (p->ravg.prev_window) {
+		if (!dest_rq->window_start)
+			p->ravg.prev_window = 0;
+		dest_rq->prev_runnable_sum += p->ravg.prev_window;
+	}
+
+	if (locked == false)
+		raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
+}
+
+void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
+{
+	u64 wallclock;
+	unsigned long flags;
+
+	if (!p->on_rq && p->state != TASK_WAKING)
+		return;
+
+	if (locked == false)
+		raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+	lockdep_assert_held(&src_rq->lock);
+
+	/* Note that same wallclock reference is used for all 3 events below */
+	wallclock = walt_ktime_clock();
+
+	/* Update counters on source CPU */
+	walt_update_task_ravg(task_rq(p)->curr, task_rq(p),
+			TASK_UPDATE, wallclock, 0);
+
+	/* Update task's counters */
+	walt_update_task_ravg(p, task_rq(p), TASK_MIGRATE, wallclock, 0);
+
+	/* Fixup busy time */
+	if (p->ravg.curr_window)
+		src_rq->curr_runnable_sum -= p->ravg.curr_window;
+
+	if (p->ravg.prev_window)
+		src_rq->prev_runnable_sum -= p->ravg.prev_window;
+
+	if ((s64)src_rq->prev_runnable_sum < 0) {
+		src_rq->prev_runnable_sum = 0;
+		WARN_ON(1);
+	}
+	if ((s64)src_rq->curr_runnable_sum < 0) {
+		src_rq->curr_runnable_sum = 0;
+		WARN_ON(1);
+	}
+
+	if (locked == false)
+		raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+}
+
+void walt_account_irqtime(int cpu, struct task_struct *curr,
+				u64 delta, u64 wallclock)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	/*
+	 * cputime (wallclock) uses sched_clock so use the same here for
+	 * consistency.
+	 */
+	delta += sched_clock_cpu(cpu) - wallclock;
+
+	walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+void walt_init_new_task_load(struct task_struct *p)
+{
+	memset(&p->ravg, 0, sizeof(struct ravg));
+}
diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h
new file mode 100644
index 0000000..6dad2e1
--- /dev/null
+++ b/kernel/sched/walt.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __WALT_H
+#define __WALT_H
+
+#ifdef CONFIG_SCHED_WALT
+
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+			   enum task_event event, u64 wallclock, u64 irqtime);
+void walt_prepare_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_finish_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_init_new_task_load(struct task_struct *p);
+void walt_mark_task_starting(struct task_struct *p);
+void walt_set_window_start(struct rq *rq);
+void walt_migrate_sync_cpu(int cpu);
+u64 walt_ktime_clock(void);
+void walt_account_irqtime(int cpu, struct task_struct *curr, u64 delta,
+			  u64 wallclock);
+
+extern unsigned int sysctl_sched_use_walt_metrics;
+extern unsigned int walt_ravg_window;
+
+/* Fold into cpu_util */
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+	if (!sysctl_sched_use_walt_metrics)
+		return rq->cfs.avg.util_avg;
+
+	return (rq->prev_runnable_sum * rq->cpu_capacity_orig) /
+					walt_ravg_window;
+}
+
+#else /* CONFIG_SCHED_WALT */
+
+static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+		int event, u64 wallclock, u64 irqtime) { }
+static inline void walt_prepare_migrate(struct task_struct *p, struct rq *rq,
+					bool locked) { }
+static inline void walt_finish_migrate(struct task_struct *p, struct rq *rq,
+				       bool locked) { }
+static inline void walt_init_new_task_load(struct task_struct *p) { }
+static inline void walt_mark_task_starting(struct task_struct *p) { }
+static inline void walt_set_window_start(struct rq *rq) { }
+static inline void walt_migrate_sync_cpu(int cpu) { }
+static inline u64 walt_ktime_clock(void) { return 0; }
+static inline void walt_account_irqtime(int cpu, struct task_struct *curr,
+				u64 delta, u64 wallclock) { }
+
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+	return rq->cfs.avg.util_avg;
+}
+
+#endif /* CONFIG_SCHED_WALT */
+
+extern unsigned int walt_ravg_window;
+
+#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a13bbda..e0f61db 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -439,6 +439,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_SCHED_WALT
+	{
+		.procname	= "sched_use_walt_metrics",
+		.data		= &sysctl_sched_use_walt_metrics,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif /* CONFIG_SCHED_WALT */
 #ifdef CONFIG_CFS_BANDWIDTH
 	{
 		.procname	= "sched_cfs_bandwidth_slice_us",
-- 
TheMan


* [RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes
  2016-10-28  7:10 [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Vikram Mulukutla
  2016-10-28  7:10 ` [RFC PATCH 1/3] sched: Introduce structures necessary for WALT Vikram Mulukutla
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
@ 2016-10-28  7:10 ` Vikram Mulukutla
  2016-10-28  7:57   ` Peter Zijlstra
  2016-10-28  7:29 ` [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Peter Zijlstra
  2016-10-28  7:49 ` Peter Zijlstra
  4 siblings, 1 reply; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla,
	Vikram Mulukutla

From: Srivatsa Vaddagiri <vatsa@codeaurora.org>

Add the necessary hooks to core and the various scheduling
classes that will allow WALT to track CPU utilization and
handle task migration between CPUs as well.

With CONFIG_SCHED_WALT enabled, schedutil will use WALT's cpu
utilization metric by default. This can be switched to PELT's
util_avg at runtime by the following command:

echo 0 > /proc/sys/kernel/sched_use_walt_metrics

Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
---
 kernel/sched/core.c     | 29 ++++++++++++++++++++++++++++-
 kernel/sched/deadline.c |  7 +++++++
 kernel/sched/debug.c    |  9 +++++++++
 kernel/sched/fair.c     |  9 +++++++--
 kernel/sched/rt.c       |  6 ++++++
 5 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..3b7f67d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -91,6 +91,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/sched.h>
 
+#include "walt.h"
+
 DEFINE_MUTEX(sched_domains_mutex);
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -991,6 +993,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 
 	p->on_rq = TASK_ON_RQ_MIGRATING;
 	dequeue_task(rq, p, 0);
+	walt_prepare_migrate(p, rq, true);
 	set_task_cpu(p, new_cpu);
 	raw_spin_unlock(&rq->lock);
 
@@ -998,6 +1001,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 
 	raw_spin_lock(&rq->lock);
 	BUG_ON(task_cpu(p) != new_cpu);
+	walt_finish_migrate(p, rq, true);
 	enqueue_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	check_preempt_curr(rq, p, 0);
@@ -1257,7 +1261,9 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
 
 		p->on_rq = TASK_ON_RQ_MIGRATING;
 		deactivate_task(src_rq, p, 0);
+		walt_prepare_migrate(p, src_rq, true);
 		set_task_cpu(p, cpu);
+		walt_finish_migrate(p, dst_rq, true);
 		activate_task(dst_rq, p, 0);
 		p->on_rq = TASK_ON_RQ_QUEUED;
 		check_preempt_curr(dst_rq, p, 0);
@@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	raw_spin_lock(&task_rq(p)->lock);
+	walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0);
+	raw_spin_unlock(&task_rq(p)->lock);
+
 	p->sched_contributes_to_load = !!task_contributes_to_load(p);
 	p->state = TASK_WAKING;
 
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+		walt_prepare_migrate(p, task_rq(p), false);
 		set_task_cpu(p, cpu);
+		walt_finish_migrate(p, cpu_rq(cpu), false);
 	}
 #endif /* CONFIG_SMP */
 
@@ -2129,8 +2141,10 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie
 
 	trace_sched_waking(p);
 
-	if (!task_on_rq_queued(p))
+	if (!task_on_rq_queued(p)) {
+		walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0);
 		ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+	}
 
 	ttwu_do_wakeup(rq, p, 0, cookie);
 	if (schedstat_enabled())
@@ -2196,6 +2210,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
+	walt_init_new_task_load(p);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
@@ -2570,6 +2585,8 @@ void wake_up_new_task(struct task_struct *p)
 	rq = __task_rq_lock(p, &rf);
 	post_init_entity_util_avg(&p->se);
 
+	walt_mark_task_starting(p);
+
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	trace_sched_wakeup_new(p);
@@ -3071,6 +3088,7 @@ void scheduler_tick(void)
 	update_rq_clock(rq);
 	curr->sched_class->task_tick(rq, curr, 0);
 	cpu_load_update_active(rq);
+	walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, walt_ktime_clock(), 0);
 	calc_global_load_tick(rq);
 	raw_spin_unlock(&rq->lock);
 
@@ -3322,6 +3340,7 @@ static void __sched notrace __schedule(bool preempt)
 	struct pin_cookie cookie;
 	struct rq *rq;
 	int cpu;
+	u64 wallclock;
 
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -3385,6 +3404,9 @@ static void __sched notrace __schedule(bool preempt)
 		update_rq_clock(rq);
 
 	next = pick_next_task(rq, prev, cookie);
+	wallclock = walt_ktime_clock();
+	walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0);
+	walt_update_task_ravg(next, rq, PICK_NEXT_TASK, wallclock, 0);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 	rq->clock_skip_update = 0;
@@ -7284,6 +7306,8 @@ static void sched_rq_cpu_starting(unsigned int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	walt_set_window_start(rq);
+
 	rq->calc_load_update = calc_load_update;
 	update_max_interval();
 }
@@ -7304,6 +7328,9 @@ int sched_cpu_dying(unsigned int cpu)
 	/* Handle pending wakeups and then migrate everything off */
 	sched_ttwu_pending();
 	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	walt_migrate_sync_cpu(cpu);
+
 	if (rq->rd) {
 		BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
 		set_rq_offline(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1ce8867..0dd3c1f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -15,6 +15,7 @@
  *                    Fabio Checconi <fchecconi@gmail.com>
  */
 #include "sched.h"
+#include "walt.h"
 
 #include <linux/slab.h>
 
@@ -278,7 +279,9 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
 	 * By now the task is replenished and enqueued; migrate it.
 	 */
 	deactivate_task(rq, p, 0);
+	walt_prepare_migrate(p, rq, true);
 	set_task_cpu(p, later_rq->cpu);
+	walt_finish_migrate(p, later_rq, true);
 	activate_task(later_rq, p, 0);
 
 	if (!fallback)
@@ -1512,7 +1515,9 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	walt_prepare_migrate(next_task, rq, true);
 	set_task_cpu(next_task, later_rq->cpu);
+	walt_finish_migrate(next_task, later_rq, true);
 	activate_task(later_rq, next_task, 0);
 	ret = 1;
 
@@ -1600,7 +1605,9 @@ static void pull_dl_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			walt_prepare_migrate(p, src_rq, true);
 			set_task_cpu(p, this_cpu);
+			walt_finish_migrate(p, this_rq, true);
 			activate_task(this_rq, p, 0);
 			dmin = p->dl.deadline;
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2a0a999..ab10031 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -607,6 +607,15 @@ do {									\
 	P(nr_switches);
 	P(nr_load_updates);
 	P(nr_uninterruptible);
+#ifdef CONFIG_SMP
+	P(cpu_capacity_orig);
+	P(cpu_capacity);
+#ifdef CONFIG_SCHED_WALT
+	P(window_start);
+	P(curr_runnable_sum);
+	P(prev_runnable_sum);
+#endif
+#endif
 	PN(next_balance);
 	SEQ_printf(m, "  .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
 	PN(clock);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39c826d..182dcd3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,7 @@
 #include <trace/events/sched.h>
 
 #include "sched.h"
+#include "walt.h"
 
 /*
  * Targeted preemption latency for CPU-bound tasks:
@@ -2885,6 +2886,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 
 	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
+		unsigned long util = cpu_walt_util(rq);
 
 		/*
 		 * There are a few boundary cases this might miss but it should
@@ -2902,8 +2904,8 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 		 *
 		 * See cpu_util().
 		 */
-		cpufreq_update_util(rq_clock(rq),
-				    min(cfs_rq->avg.util_avg, max), max);
+
+		cpufreq_update_util(rq_clock(rq), min(util, max), max);
 	}
 }
 
@@ -6205,7 +6207,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 
 	p->on_rq = TASK_ON_RQ_MIGRATING;
 	deactivate_task(env->src_rq, p, 0);
+	walt_prepare_migrate(p, env->src_rq, true);
 	set_task_cpu(p, env->dst_cpu);
+	/* update WALT later under the dest rq's lock */
 }
 
 /*
@@ -6337,6 +6341,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 	lockdep_assert_held(&rq->lock);
 
 	BUG_ON(task_rq(p) != rq);
+	walt_finish_migrate(p, rq, true);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
 	check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b7..130040c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,6 +8,8 @@
 #include <linux/slab.h>
 #include <linux/irq_work.h>
 
+#include "walt.h"
+
 int sched_rr_timeslice = RR_TIMESLICE;
 
 static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
@@ -1843,7 +1845,9 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	walt_prepare_migrate(next_task, rq, true);
 	set_task_cpu(next_task, lowest_rq->cpu);
+	walt_finish_migrate(next_task, lowest_rq, true);
 	activate_task(lowest_rq, next_task, 0);
 	ret = 1;
 
@@ -2097,7 +2101,9 @@ static void pull_rt_task(struct rq *this_rq)
 			resched = true;
 
 			deactivate_task(src_rq, p, 0);
+			walt_prepare_migrate(p, src_rq, true);
 			set_task_cpu(p, this_cpu);
+			walt_finish_migrate(p, this_rq, true);
 			activate_task(this_rq, p, 0);
 			/*
 			 * We continue with the search, just in
-- 
TheMan


* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:10 [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Vikram Mulukutla
                   ` (2 preceding siblings ...)
  2016-10-28  7:10 ` [RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes Vikram Mulukutla
@ 2016-10-28  7:29 ` Peter Zijlstra
  2016-10-28  7:55   ` Vikram Mulukutla
  2016-10-28  7:49 ` Peter Zijlstra
  4 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:29 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> 
> We propose Window-Assisted Load Tracking (WALT) as an alternative or additional
> load tracking scheme in lieu of or along with PELT, one that in our estimation
> better tracks task demand and CPU utilization especially for use cases on
> mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better
> perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations,
> (specifically on Qualcomm Snapdragon chipsets running Android) and its metrics
> have been used to guide task placement and p-state selection (load balancing
> in CFS still uses PELT statistics). WALT is now present in the android-common
> kernel as well. 

And how come we only learn of it after it's already shipping? Isn't that
arse backwards?

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
@ 2016-10-28  7:38   ` Peter Zijlstra
  2016-10-28  7:43   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:38 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +static int cpu_is_waiting_on_io(struct rq *rq)
> +{
> +	if (!walt_io_is_busy)
> +		return 0;
> +
> +	return atomic_read(&rq->nr_iowait);
> +}

This is just drug induced nonsense. The per-cpu nr_iowait number is
completely without meaning.
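
For reference, the counter in question is maintained roughly like this
(abridged from the 4.8-era kernel/sched/core.c): the increment is charged
to whichever CPU the task happened to block on, which says nothing about
where the pending I/O completion will eventually be consumed.

	long __sched io_schedule_timeout(long timeout)
	{
		struct rq *rq = raw_rq();	/* rq of the CPU we block on */
		long ret;

		current->in_iowait = 1;
		atomic_inc(&rq->nr_iowait);	/* charged to the sleeping CPU */
		ret = schedule_timeout(timeout);
		atomic_dec(&rq->nr_iowait);	/* the task may resume elsewhere */
		current->in_iowait = 0;

		return ret;
	}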

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
  2016-10-28  7:38   ` Peter Zijlstra
@ 2016-10-28  7:43   ` Peter Zijlstra
  2016-10-28  7:58     ` Vikram Mulukutla
  2016-10-28  7:46   ` Peter Zijlstra
  2016-10-28  7:52   ` Peter Zijlstra
  3 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:43 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +u64 walt_ktime_clock(void)
> +{
> +	if (unlikely(walt_ktime_suspended))
> +		return ktime_to_ns(ktime_last);
> +	return ktime_get_ns();
> +}

> +static int walt_suspend(void)
> +{
> +	ktime_last = ktime_get();
> +	walt_ktime_suspended = true;
> +	return 0;
> +}

No, ktime_get() will not be used in the scheduler. Imagine the joy if
that thing ends up being the HPET.

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
  2016-10-28  7:38   ` Peter Zijlstra
  2016-10-28  7:43   ` Peter Zijlstra
@ 2016-10-28  7:46   ` Peter Zijlstra
  2016-10-28  8:03     ` Vikram Mulukutla
  2016-10-28  7:52   ` Peter Zijlstra
  3 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:46 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
> +{
> +	u64 wallclock;
> +	unsigned long flags;
> +
> +	if (!p->on_rq && p->state != TASK_WAKING)
> +		return;
> +
> +	if (locked == false)
> +		raw_spin_lock_irqsave(&dest_rq->lock, flags);
> +

> +
> +	if (locked == false)
> +		raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
> +}
> +
> +void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
> +{
> +	u64 wallclock;
> +	unsigned long flags;
> +
> +	if (!p->on_rq && p->state != TASK_WAKING)
> +		return;
> +
> +	if (locked == false)
> +		raw_spin_lock_irqsave(&src_rq->lock, flags);
> +

> +
> +	if (locked == false)
> +		raw_spin_unlock_irqrestore(&src_rq->lock, flags);
> +}

Seriously bad style that. Please, less bonghits before writing code.

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:10 [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Vikram Mulukutla
                   ` (3 preceding siblings ...)
  2016-10-28  7:29 ` [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Peter Zijlstra
@ 2016-10-28  7:49 ` Peter Zijlstra
  2016-10-28  7:57   ` Vikram Mulukutla
  4 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:49 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> This RFC patch has been tested on live X86 machines with the following sanity
> and benchmark results (thanks to Juri Lelli, Dietmar Eggemann, Patrick Bellasi
> for initial code reviews):
> 
> (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics,
> with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
> baseline. WALT window size was 10ms. Only deltas above 3% are considered
> non-noise. Power measured with Intel RAPL counters)

Was this comparison done using the use_walt_metric sysctl knob?

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:10 ` [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking Vikram Mulukutla
                     ` (2 preceding siblings ...)
  2016-10-28  7:46   ` Peter Zijlstra
@ 2016-10-28  7:52   ` Peter Zijlstra
  3 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:52 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +static int account_cpu_busy_time(struct rq *rq, struct task_struct *p,
> +				     u64 irqtime, int event)
> +{
> +	if (is_idle_task(p)) {
> +		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
> +		if (event == PICK_NEXT_TASK)
> +			return 0;
> +
> +		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
> +		return irqtime || cpu_is_waiting_on_io(rq);
> +	}
> +
> +	if (event == TASK_WAKE)
> +		return 0;
> +
> +	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
> +					 event == TASK_UPDATE)
> +		return 1;
> +
> +	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
> +	return walt_freq_account_wait_time;
> +}

switch() wasn't an option?
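
For illustration, the event classification above written as a switch, a
sketch of the suggested style only (the is_idle_task() branch would stay
as-is):

	switch (event) {
	case TASK_WAKE:
		return 0;
	case PUT_PREV_TASK:
	case IRQ_UPDATE:
	case TASK_UPDATE:
		return 1;
	case TASK_MIGRATE:
	case PICK_NEXT_TASK:
	default:
		return walt_freq_account_wait_time;
	}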

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:29 ` [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking Peter Zijlstra
@ 2016-10-28  7:55   ` Vikram Mulukutla
  2016-10-28  9:23     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On 2016-10-28 00:29, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>> 
>> We propose Window-Assisted Load Tracking (WALT) as an alternative or
>> additional load tracking scheme in lieu of or along with PELT, one that
>> in our estimation better tracks task demand and CPU utilization
>> especially for use cases on mobile devices. WALT was conceived by
>> Srivatsa Vaddagiri to provide better perf-per-watt numbers on asymmetric
>> CPU (frequency and/or IPC) implementations, (specifically on Qualcomm
>> Snapdragon chipsets running Android) and its metrics have been used to
>> guide task placement and p-state selection (load balancing in CFS still
>> uses PELT statistics). WALT is now present in the android-common
>> kernel as well.
> 
> And how come we only learn of it after it's already shipping? Isn't that
> arse backwards?

Yes, but also we were not confident that it would be close to being
acceptable upstream since it was intricately tied to our HMP scheduler.
However, now that more parties including the folks at ARM are interested,
and given that EAS exists and schedutil was merged into mainline, we felt
it was the right time to try and introduce the concept. In general we are
seriously trying to get more things upstream and converge.

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:49 ` Peter Zijlstra
@ 2016-10-28  7:57   ` Vikram Mulukutla
  2016-10-28  8:49     ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On 2016-10-28 00:49, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>> This RFC patch has been tested on live X86 machines with the following
>> sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggemann,
>> Patrick Bellasi for initial code reviews):
>> 
>> (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>> graphics, with the same frequency list as above. Running Ubuntu 16.04 on
>> a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3% are
>> considered non-noise. Power measured with Intel RAPL counters)
> 
> Was this comparison done using the use_walt_metric sysctl knob?

Yes, it was. You will want to see numbers against a pure 4.8.2 without
any of the WALT code, correct?

* Re: [RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes
  2016-10-28  7:10 ` [RFC PATCH 3/3] sched: Introduce WALT hooks into core and scheduling classes Vikram Mulukutla
@ 2016-10-28  7:57   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  7:57 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:10:42AM -0700, Vikram Mulukutla wrote:
> @@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
> +	raw_spin_lock(&task_rq(p)->lock);
> +	walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0);
> +	raw_spin_unlock(&task_rq(p)->lock);
> +

Lol. funny that.

No.
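
For the sake of discussion, one hedged alternative: perform the TASK_WAKE
window update at a point where the wakeup path already holds the
destination rq lock, e.g. in ttwu_do_activate(), instead of re-taking the
remote task_rq lock (purely illustrative, arguments abridged; whether the
accounting may run that late is exactly the open question):

	static void
	ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
	{
		lockdep_assert_held(&rq->lock);

		/* WALT wakeup accounting under a lock we already hold: */
		walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0);

		ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
		ttwu_do_wakeup(rq, p, wake_flags);
	}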

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:43   ` Peter Zijlstra
@ 2016-10-28  7:58     ` Vikram Mulukutla
  0 siblings, 0 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  7:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On 2016-10-28 00:43, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
>> +u64 walt_ktime_clock(void)
>> +{
>> +	if (unlikely(walt_ktime_suspended))
>> +		return ktime_to_ns(ktime_last);
>> +	return ktime_get_ns();
>> +}
> 
>> +static int walt_suspend(void)
>> +{
>> +	ktime_last = ktime_get();
>> +	walt_ktime_suspended = true;
>> +	return 0;
>> +}
> 
> No, ktime_get() will not be used in the scheduler. Imagine the joy if
> that thing ends up being the HPET.

Agreed, this is an artifact from the full implementation that feeds into
the interactive governor, and thus both needed to use the same time
source. It shall go away.
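
A minimal sketch of that direction, assuming the governor side can live
with sched_clock-based time (illustrative, not part of the posted series):

	/* Use the scheduler's own per-rq clock instead of ktime: */
	static inline u64 walt_sched_clock(struct rq *rq)
	{
		lockdep_assert_held(&rq->lock);	/* rq->clock is updated under it */
		return rq_clock(rq);		/* ns, driven by sched_clock() */
	}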

* Re: [RFC PATCH 2/3] sched: Introduce Window-Assisted CPU utilization Tracking
  2016-10-28  7:46   ` Peter Zijlstra
@ 2016-10-28  8:03     ` Vikram Mulukutla
  0 siblings, 0 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-28  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On 2016-10-28 00:46, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
>> +void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
>> +{
>> +	u64 wallclock;
>> +	unsigned long flags;
>> +
>> +	if (!p->on_rq && p->state != TASK_WAKING)
>> +		return;
>> +
>> +	if (locked == false)
>> +		raw_spin_lock_irqsave(&dest_rq->lock, flags);
>> +
> 
>> +
>> +	if (locked == false)
>> +		raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
>> +}
>> +
>> +void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
>> +{
>> +	u64 wallclock;
>> +	unsigned long flags;
>> +
>> +	if (!p->on_rq && p->state != TASK_WAKING)
>> +		return;
>> +
>> +	if (locked == false)
>> +		raw_spin_lock_irqsave(&src_rq->lock, flags);
>> +
> 
>> +
>> +	if (locked == false)
>> +		raw_spin_unlock_irqrestore(&src_rq->lock, flags);
>> +}
> 
> Seriously bad style that. Please, less bonghits before writing code.

This was my bad personal attempt at eliminating double-locking from the
original code. This was pointed out earlier and shall go away once I can
come up with a way to merge this into enqueue/dequeue sans bonghits :-)
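
For what it's worth, the shape being aimed for is roughly the following:
drop the 'locked' flag, make the rq lock a caller-side precondition, and
let lockdep document it (a sketch of the intent, not the eventual patch):

	void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq)
	{
		lockdep_assert_held(&src_rq->lock);

		if (!p->on_rq && p->state != TASK_WAKING)
			return;

		/* window accounting against src_rq, as in the posted code */
	}

	void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq)
	{
		lockdep_assert_held(&dest_rq->lock);

		if (!p->on_rq && p->state != TASK_WAKING)
			return;

		/* window accounting against dest_rq, as in the posted code */
	}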

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:57   ` Vikram Mulukutla
@ 2016-10-28  8:49     ` Peter Zijlstra
  2016-10-31 15:07       ` Vikram Mulukutla
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  8:49 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:
> On 2016-10-28 00:49, Peter Zijlstra wrote:
> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> >>This RFC patch has been tested on live X86 machines with the following
> >>sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggemann,
> >>Patrick Bellasi for initial code reviews):
> >>
> >>(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
> >>graphics, with the same frequency list as above. Running Ubuntu 16.04
> >>on a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3%
> >>are considered non-noise. Power measured with Intel RAPL counters)
> >
> >Was this comparison done using the use_walt_metric sysctl knob?
> 
> Yes, it was. You will want to see numbers against a pure 4.8.2 without any
> of the WALT code, correct?

Yep, because with the sysctl we still run all the accounting code. So
esp things like the hackbench run are meaningless (note that even the
CONFIG thing doesn't take out everything).

Also, I think it makes sense to always (also) compare against the
"performance" governor. That way you can see the drop in absolute
performance etc..

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  7:55   ` Vikram Mulukutla
@ 2016-10-28  9:23     ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2016-10-28  9:23 UTC (permalink / raw)
  To: Vikram Mulukutla
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On Fri, Oct 28, 2016 at 12:55:52AM -0700, Vikram Mulukutla wrote:
> On 2016-10-28 00:29, Peter Zijlstra wrote:
> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> >>
> >>We propose Window-Assisted Load Tracking (WALT) as an alternative or
> >>additional load tracking scheme in lieu of or along with PELT, one
> >>that in our estimation better tracks task demand and CPU utilization
> >>especially for use cases on mobile devices. WALT was conceived by
> >>Srivatsa Vaddagiri to provide better perf-per-watt numbers on
> >>asymmetric CPU (frequency and/or IPC) implementations, (specifically
> >>on Qualcomm Snapdragon chipsets running Android) and its metrics
> >>have been used to guide task placement and p-state selection (load
> >>balancing in CFS still uses PELT statistics). WALT is now present in
> >>the android-common kernel as well.
> >
> >And how come we only learn of it after it's already shipping? Isn't that
> >arse backwards?
> 
> Yes, but also we were not confident that it would be close to being
> acceptable upstream since it was intricately tied to our HMP scheduler.
> However, now that more parties including the folks at ARM are interested,
> and given that EAS exists and schedutil was merged into mainline, we felt
> it was the right time to try and introduce the concept. In general we are
> seriously trying to get more things upstream and converge.

That's good to hear, great!

That said, I would much prefer to modify/augment existing PELT stuff
than to add an entirely new set of accounting on the side of it. But
yes, I (and others) are aware of the pain points you mentioned with
using the PELT signal for schedutil.

We'll have to see what we can come up with.

* Re: [RFC PATCH 0/3] sched: Introduce Window Assisted Load Tracking
  2016-10-28  8:49     ` Peter Zijlstra
@ 2016-10-31 15:07       ` Vikram Mulukutla
  0 siblings, 0 replies; 18+ messages in thread
From: Vikram Mulukutla @ 2016-10-31 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Steve Muckle,
	Olav Haugan, Syed Rameez Mustafa, Joonwoo Park,
	Pavankumar Kondeti, Saravana Kannan, Bryan Huntsman, Juri Lelli,
	Morten Rasmussen, Dietmar Eggemann, Chris Redpath,
	Robin Randhawa, Patrick Bellasi, Todd Kjos, Srinath Sridharan,
	Andres Oportus, Leo Yan, Vincent Guittot, Vikram Mulukutla

On 2016-10-28 01:49, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:
>> On 2016-10-28 00:49, Peter Zijlstra wrote:
>> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>> >>This RFC patch has been tested on live X86 machines with the following
>> >>sanity and benchmark results (thanks to Juri Lelli, Dietmar Eggemann,
>> >>Patrick Bellasi for initial code reviews):
>> >>
>> >>(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>> >>graphics, with the same frequency list as above. Running Ubuntu 16.04
>> >>on a v4.8.2 baseline. WALT window size was 10ms. Only deltas above 3%
>> >>are considered non-noise. Power measured with Intel RAPL counters)
>> >
>> >Was this comparison done using the use_walt_metric sysctl knob?
>> 
>> Yes, it was. You will want to see numbers against a pure 4.8.2 without
>> any of the WALT code, correct?
> 
> Yep, because with the sysctl we still run all the accounting code. So
> esp things like the hackbench run are meaningless (note that even the
> CONFIG thing doesn't take out everything).
> 
> Also, I think it makes sense to always (also) compare against the
> "performance" governor. That way you can see the drop in absolute
> performance etc..

Ok, here are some numbers. I should be able to get the rest during the
week. The averages are pretty close, so I figured I would include some
percentile numbers. PELT and WALT numbers are with schedutil. On average
it seems we're introducing about 0.5% overhead with the current
additional accounting.

Test: perf bench sched messaging -g 50 -l 5000
+-------------+-------------+----------+----------+-----------+
| 4.8.2-walt  | Performance | Ondemand | PELT     | WALT(10ms)|
+-------------+-------------+----------+----------+-----------+
|             |             |          |          |           |
|  90th       | 17.077      | 17.088   | 17.088   | 17.159    |
|             |             |          |          |           |
|  96th       | 17.117      | 17.421   | 17.198   | 17.343    |
|             |             |          |          |           |
|  Average    | 16.910      | 16.916   | 16.937   | 16.962    |
|             |             |          |          |           |
+-------------+-------------+----------+----------+-----------+
| 4.8.2-raw   | Performance | Ondemand | PELT     | WALT(10ms)|
+-------------+-------------+----------+----------+-----------+
|             |             |          |          |           |
|  90th       | 16.919      | 17.1694  | 16.986   | 0         |
|             |             |          |          |           |
|  96th       | 16.980      | 17.364   | 17.052   | 0         |
|             |             |          |          |           |
|  Average    | 16.841      | 16.902   | 16.860   | 0         |
+-------------+-------------+----------+----------+-----------+
