[RFC 0/4] IDLE gating in presence of latency-sensitive tasks

From: Parth Shah <parth@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, mingo@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	qais.yousef@arm.com, chris.hyser@oracle.com,
	pkondeti@codeaurora.org, valentin.schneider@arm.com,
	rjw@rjwysocki.net
Subject: [RFC 0/4] IDLE gating in presence of latency-sensitive tasks
Date: Thu,  7 May 2020 19:07:19 +0530
Message-ID: <20200507133723.18325-1-parth@linux.ibm.com>

Abstract:
=========
IDLE states provide a way to save power at the cost of wakeup latency.
The deeper the IDLE state selected by the CPUIDLE governor, the higher the
exit latency of that state. This exit latency adds to the task's wakeup
latency. Hence, choosing the best trade-off between power saving and the
increase in observable latency is what the CPUIDLE governor focuses on.
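
To illustrate the trade-off, here is a minimal userspace sketch (not kernel
code) that picks the deepest state whose exit latency fits a latency budget.
The state names, the latency values, and pick_idle_state() are hypothetical;
real governors also weigh the predicted idle duration.

```c
#include <stddef.h>

/* Hypothetical idle-state table; names and numbers are illustrative only. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;
};

/* States are ordered from shallowest to deepest. */
static const struct idle_state states[] = {
	{ "poll",      0 },
	{ "shallow",   2 },
	{ "deep",    100 },
};

/* Return the index of the deepest state whose exit latency fits the budget. */
size_t pick_idle_state(unsigned int latency_budget_us)
{
	size_t i, best = 0;

	for (i = 0; i < sizeof(states) / sizeof(states[0]); i++)
		if (states[i].exit_latency_us <= latency_budget_us)
			best = i;

	return best;
}
```

A budget of 0 us leaves only the polling state, which is roughly the
situation a latency-sensitive task wants on its CPU.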

The CPUIDLE governor is generic by design and tries to provide the best of
both worlds. However, it cannot distinguish between the latency
sensitivity of the tasks queued on the runqueue.

With the introduction of the latency_nice attribute, which conveys latency
requirements from userspace, the CPUIDLE governor can identify
latency-sensitive tasks.

Hence, this patch-set prevents a CPU running latency-sensitive tasks from
entering any IDLE state, thus minimizing the impact of exit latency on such
tasks. The expected result is better power saving than, and performance
comparable to, disabling all IDLE states.

- Why not use PM_QoS?
PM_QoS provides ways to assist the CPUIDLE governor's decisions with
per-device (CPU, network, etc.) attributes. Because this assistance works
at the "per device" level, it cannot provide both power saving and latency
minimization for specific tasks only. Hence, a PM_QoS-like feature can be
combined with the scheduler to retrieve the latency_nice information of a
task and make a better decision at the per-task level. This gives better
performance only on those CPUs that have latency-sensitive tasks queued,
while still saving power on the other CPUs.

Implementation:
===============
- Latency-sensitive tasks are the tasks marked with latency_nice == -20.
- Use a per-CPU variable to keep track of the latency-sensitive tasks that
  are queued (or about to be queued) on that CPU.
- The CPUIDLE governor does not choose a non-polling idle state on such
  marked CPUs until the per-CPU counter goes back to zero.
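
The accounting above can be modelled in userspace roughly as follows. This
is only a sketch of the idea, not the code in the patches: NR_CPUS, the
helper names, and the plain array standing in for a per-CPU variable are
assumptions; only the nr_lat_sensitive name and the latency_nice == -20
rule come from this series.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS          4       /* hypothetical machine size */
#define LATENCY_NICE_MIN (-20)   /* value that marks a task latency sensitive */

struct task {
	int latency_nice;        /* same range as nice: [-20, 19] */
};

/* Stand-in for a per-CPU variable: one counter per CPU. */
int nr_lat_sensitive[NR_CPUS];

bool task_is_lat_sensitive(const struct task *p)
{
	return p->latency_nice == LATENCY_NICE_MIN;
}

/* Called when a task is queued, or about to be queued, on @cpu. */
void lat_sensitive_enqueue(int cpu, const struct task *p)
{
	if (task_is_lat_sensitive(p))
		nr_lat_sensitive[cpu]++;
}

/* Called when @cpu stops tracking the task (e.g. it migrates away or exits). */
void lat_sensitive_dequeue(int cpu, const struct task *p)
{
	if (!task_is_lat_sensitive(p))
		return;

	/* Debug check: an underflow means the accounting is unbalanced. */
	assert(nr_lat_sensitive[cpu] > 0);
	nr_lat_sensitive[cpu]--;
}

/* Idle-loop decision: keep polling while any such task is tracked. */
bool cpu_may_enter_idle_state(int cpu)
{
	return nr_lat_sensitive[cpu] == 0;
}
```

Only the CPU whose counter is non-zero is kept out of the non-polling
states; every other CPU remains free to save power.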

This strategy solves many latency-related problems for tasks showing a
sleep-wake-sleep pattern (most GPU workloads, schbench, RT-app, database
workloads, and many more).

This series is based on the latency_nice patch-set
PATCH v5 https://lkml.org/lkml/2020/2/28/166

One may use either of the tools below to set the latency_nice value:
- lnice.c: lnice -l -20 <workload>
  https://github.com/parthsl/tools/blob/master/misc_scripts/lnice.c
or
- relnice.c: relnice -p <PID> -l 19
  https://github.com/parthsl/tools/blob/master/misc_scripts/relnice.c

Results:
========
# Baseline = tip/sched/core + latency_nice patches
# w/ patch = Baseline + this patch-set + workload's latency_nice = -20

=> Schbench (Lower is better)
-----------------------------
- Baseline:
$> schbench -r 30
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  |      371 | 21                      | 22        |
| 90                  |      729 | 33                      | 37        |
| 99                  |      889 | 40 (-95%)               | 48 (-94%) |
| max                 |     2197 | 59                      | 655       |
| Avg. Energy (Watts) |       73 | 96 (+31%)               | 88 (+20%) |
+---------------------+----------+-------------------------+-----------+

$> schbench -r 10 -m 2 -t 1
+---------------------+----------+-------------------------+-----------+
| %ile Latency (in us)| Baseline | cpupower idle-set -D 10 | w/ patch  |
+---------------------+----------+-------------------------+-----------+
| 50                  |      336 | 5                       | 4         |
| 90                  |      464 | 7                       | 6         |
| 99                  |      579 | 10 (-98%)               | 11 (-98%) |
| max                 |      691 | 12                      | 489       |
| Avg. Energy (Watts) |       28 | 40 (+42%)               | 33 (+17%) |
+---------------------+----------+-------------------------+-----------+

=> PostgreSQL (lower is better):
----------------------------------
- 44 Clients running in parallel
$> pgbench -T 30 -S -n -R 10  -c 44
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 |   w/ patch   |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   |    2.028 | 0.424 (-80%)            | 1.202 (-40%) |
| latency stddev      |    3.149 | 0.473                   | 0.234        |
| trans. completed    |      294 | 304 (+3%)               | 300 (+2%)    |
| Avg. Energy (Watts) |     23.6 | 42.5 (+80%)             | 26.5 (+20%)  |
+---------------------+----------+-------------------------+--------------+

- 1 Client running
$> pgbench -T 30 -S -n -R 10 -c 1
+---------------------+----------+-------------------------+--------------+
|                     | Baseline | cpupower idle-set -D 10 |   w/ patch   |
+---------------------+----------+-------------------------+--------------+
| latency avg. (ms)   |    1.292 | 0.282 (-78%)            | 0.237 (-81%) |
| latency stddev      |    0.572 | 0.126                   | 0.116        |
| trans. completed    |      294 | 268 (-8%)               | 315 (+7%)    |
| Avg. Energy (Watts) |      9.8 | 29.6 (+202%)            | 27.7 (+182%) |
+---------------------+----------+-------------------------+--------------+
*trans. completed = Total transactions processed (Higher is better)

Parth Shah (4):
  sched/core: Introduce per_cpu counter to track latency sensitive tasks
  sched/core: Set nr_lat_sensitive counter at various scheduler
    entry/exit points
  sched/idle: Disable idle call on least latency requirements
  sched/idle: Add debugging bits to validate inconsistency in latency
    sensitive task calculations

 kernel/sched/core.c  | 32 ++++++++++++++++++++++++++++++--
 kernel/sched/idle.c  |  8 +++++++-
 kernel/sched/sched.h |  7 +++++++
 3 files changed, 44 insertions(+), 3 deletions(-)

-- 
2.17.2