* [RFC PATCH 0/3] sched: delayed thread migration
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

This patch series proposes a tweak to the thread placement strategy of CFS
to make better use of cores running at a high frequency. It is the result
of work published at ATC'20, available here:
https://www.usenix.org/conference/atc20/presentation/gouicern

We address the frequency inversion problem that stems from the new DVFS
capabilities of CPUs, where the frequency can differ between cores of the
same chip. In applications with a large number of fork/wait patterns, we
see that CFS tends to place new threads on idle cores running at a low
frequency, while the core where the parent is running is at a high
frequency and becomes idle immediately after. This means that a low
frequency core is used while a high frequency core becomes idle. A more
detailed analysis of this behavior is available in the paper, but here is
a small example with 2 cores c0 and c1 and 2 threads A and B.

Legend: - means high frequency
        . means low frequency
        nothing means nothing is running.

c0:   A------fork()-wait()
                  \
c1:                B................--------

c0 becomes idle while at a high frequency and c1 becomes busy and executes
B at a low frequency for a long time.

The fix we propose is to delay the migration of threads in order to see
whether the parent's core becomes available quickly, before using another
core. When choosing a core for a newly created or waking thread (in
select_task_rq_fair()), we check whether the destination core is idle or
busy frequency-wise. If it is busy, we keep the current behavior and move
the thread to this new core. If it is idle, we place the thread locally
(parent's core on fork, waker's core on wakeup). We also arm a
high-resolution timer that will trigger a migration of this thread to the
core chosen by the original CFS algorithm. If the thread is able to run
quickly on the local core (e.g. because the parent thread calls the wait()
syscall), we cancel the timer. Otherwise, the timer expires and the thread
is moved to the core chosen by CFS, where it can run immediately. When the
parent thread waits, the previous example becomes:

c0:   A------fork()-wait()-B---------------
                  \/
c1:

If the parent thread keeps running, the example becomes:

c0:   A------fork()---------timer------------
                  \/           \
c1:                             B.............--------

In the first case, we use a high frequency core (c0) and let c1 stay idle.
In the second case, we lose a few microseconds of execution time for B.
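
For reference, here is a condensed sketch of the placement decision, in
simplified pseudo-C. It mirrors the logic of patch 3 of this series (the
helpers and fields shown are the ones introduced there):

    /* in select_task_rq_fair(), after CFS has picked new_cpu as usual */
    if (is_cpu_low_freq(new_cpu) && new_cpu != task_cpu(p) &&
        (sd_flag & (SD_BALANCE_FORK | SD_BALANCE_WAKE | SD_BALANCE_EXEC))) {
            /* remember CFS's choice and run locally for now */
            p->delay_placement_cpu = new_cpu;
            new_cpu = task_cpu(current);
            /* if the thread is still waiting after the delay, migrate it */
            hrtimer_start(&p->delay_placement_timer,
                          ktime_set(0, sysctl_sched_delayed_placement),
                          HRTIMER_MODE_REL);
    }
    /* the timer is cancelled when the thread is picked to run (patch 3) */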

There are two configurable parameters in this patch:
     - the frequency threshold above which we consider a core busy
       frequency-wise. By default, the threshold is set to 0, which
       disables the delayed migrations. On our test server (an Intel Xeon
       E7-8870 v4), an efficient value was the minimal frequency of the
       CPU. On another test machine (AMD Ryzen 5 3400G), it was a frequency
       a little bit higher than the minimal one.
     - the delay of the high-resolution timer. We choose a default value of
       50 us since it is the mean delay we observed between a fork and a
       wait during a kernel build.
Both parameters can be modified through files in /proc/sys/kernel/. Since
this is still a work in progress, we are looking for better ways to choose
the threshold and the delay, as both values probably depend on the
machine.
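
For example, on a machine whose minimal frequency is 1.2 GHz, the feature
could be enabled with something like the following (the values are purely
illustrative; sched_lowfreq is in kHz and sched_delayed_placement in
nanoseconds):

    # echo 1200000 > /proc/sys/kernel/sched_lowfreq
    # echo 50000 > /proc/sys/kernel/sched_delayed_placement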

In terms of performance, we benchmarked 60 applications on a 5.4 kernel at
the time of the paper's publication. The detailed results are available in
the paper but overall, we improve the performance of 23 applications, with
a best improvement of 56%. Only 3 applications suffer a performance
degradation larger than 5%, and even for those the degradation remains
limited (at most 8%). We also think there might be energy savings since we
try to avoid waking idle cores. We did see small improvements on our
machines, but nothing as large as the performance gains (at most a 5%
improvement).

On the current development kernel (on the sched/core branch, commit
f166b111e049), we run a few of these benchmarks to confirm our results on
an up-to-date kernel. We run the kernel build (complete build and
scheduler-only build), hackbench, 2 applications from the NAS benchmark
suite, and openssl. We select these applications because they had either
very good or
very bad results on the 5.4 kernel. We present the mean of 10 runs, with
the powersave scaling governor and the intel_pstate driver in active mode.

 Benchmark    | unit | 5.9-f166b111e049 |          patched
 -------------+------+------------------+------------------
                      LOWER IS BETTER
 -------------+------+------------------+------------------
 hackbench    |    s |            5.661 |    5.612 (- 0.9%)
 kbuild-sched |    s |            7.370 |    6.309 (-14.4%)
 kbuild       |    s |           34.798 |   34.788 (- 0.0%)
 nas_ft.C     |    s |           10.080 |   10.012 (- 0.7%)
 nas_sp.B     |    s |            6.653 |    7.033 (+ 5.7%)
 -------------+------+------------------+------------------
                      HIGHER IS BETTER
 -------------+------+------------------+------------------
 openssl      |sign/s|         15086.89 | 15071.08 (- 0.1%)

On this small subset, the only negatively impacted application is an HPC
workload (nas_sp.B), which is less of a problem since HPC users usually
bypass the scheduler altogether. We still see a large beneficial impact on
the scheduler-only kernel build, which prominently exhibits the fork/wait
pattern and frequency inversions.



The first patch of the series is not specific to scheduling. It allows us
(or anyone else) to use the cpufreq infrastructure at a different sampling
rate without compromising the cpufreq subsystem and applications that
depend on it.

The main idea behind this patch series is to bring to light the frequency
inversion problem that will become more and more prominent with new CPUs
that feature per-core DVFS. The solution proposed is a first idea for
solving this problem that still needs to be tested across more CPUs and
with more applications.


Redha Gouicem (3):
  cpufreq: x86: allow external frequency measures
  sched: core: x86: query frequency at each tick
  sched/fair: delay thread migration on fork/wakeup/exec

 arch/x86/kernel/cpu/aperfmperf.c | 31 ++++++++++++++++++++---
 drivers/cpufreq/cpufreq.c        |  5 ++++
 include/linux/cpufreq.h          |  1 +
 include/linux/sched.h            |  4 +++
 include/linux/sched/sysctl.h     |  3 +++
 kernel/sched/core.c              | 43 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              | 42 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |  7 ++++++
 kernel/sysctl.c                  | 14 +++++++++++
 9 files changed, 145 insertions(+), 5 deletions(-)

-- 
2.28.0


* [PATCH 1/3] cpufreq: x86: allow external frequency measures
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Guilherme G. Piccoli, Al Viro, Qais Yousef,
	linux-kernel, linux-pm, linux-fsdevel

Allow other subsystems to query the current frequency from the CPU without
affecting the cached value of cpufreq. The subsystems doing this will need
to maintain their own version of struct aperfmperf_sample.

This is useful if you need to query the frequency more often than once per
APERFMPERF_CACHE_THRESHOLD_MS but don't want to interfere with the current
caching behavior.

Even though querying too frequently may render the measurements
inaccurate, it is still useful if your subsystem tolerates this inaccuracy.
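
As an illustration, a subsystem keeps its own per-cpu sample and passes it
to the new helper. This is only a sketch of the intended usage (patch 2 of
this series does exactly this in the scheduler); the structure below
simply mirrors the layout of struct aperfmperf_sample:

    /* caller-owned copy, laid out like struct aperfmperf_sample */
    struct my_freq_sample {
            unsigned int    khz;
            ktime_t         time;
            u64             aperf;
            u64             mperf;
    };
    static DEFINE_PER_CPU(struct my_freq_sample, my_freq_sample);

    /* then, e.g. from the caller's own tick/poll path: */
    unsigned int khz = arch_freq_get_on_cpu_from_sample(cpu,
                                    per_cpu_ptr(&my_freq_sample, cpu));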

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 arch/x86/kernel/cpu/aperfmperf.c | 31 +++++++++++++++++++++++++++----
 drivers/cpufreq/cpufreq.c        |  5 +++++
 include/linux/cpufreq.h          |  1 +
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
index e2f319dc992d..c3be81d689f4 100644
--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -36,11 +36,11 @@ static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
  * unless we already did it within 10ms
  * calculate kHz, save snapshot
  */
-static void aperfmperf_snapshot_khz(void *dummy)
+static void aperfmperf_snapshot_khz(void *prev_sample)
 {
 	u64 aperf, aperf_delta;
 	u64 mperf, mperf_delta;
-	struct aperfmperf_sample *s = this_cpu_ptr(&samples);
+	struct aperfmperf_sample *s = prev_sample;
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -72,7 +72,8 @@ static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait)
 	if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
 		return true;
 
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait);
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz,
+				 per_cpu_ptr(&samples, cpu), wait);
 
 	/* Return false if the previous iteration was too long ago. */
 	return time_delta <= APERFMPERF_STALE_THRESHOLD_MS;
@@ -131,7 +132,29 @@ unsigned int arch_freq_get_on_cpu(int cpu)
 		return per_cpu(samples.khz, cpu);
 
 	msleep(APERFMPERF_REFRESH_DELAY_MS);
-	smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz,
+				 per_cpu_ptr(&samples, cpu), 1);
 
 	return per_cpu(samples.khz, cpu);
 }
+
+unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample)
+{
+	struct aperfmperf_sample *s = sample;
+
+	if (!sample)
+		return 0;
+
+	if (!cpu_khz)
+		return 0;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return 0;
+
+	if (!housekeeping_cpu(cpu, HK_FLAG_MISC))
+		return 0;
+
+	smp_call_function_single(cpu, aperfmperf_snapshot_khz, s, 1);
+
+	return s->khz;
+}
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 02ab56b2a0d8..36e6dbd87317 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -695,6 +695,11 @@ __weak unsigned int arch_freq_get_on_cpu(int cpu)
 	return 0;
 }
 
+__weak unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample)
+{
+	return 0;
+}
+
 static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
 {
 	ssize_t ret;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 8f141d4c859c..129083684ca0 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1005,6 +1005,7 @@ static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
 
 extern void arch_freq_prepare_all(void);
 extern unsigned int arch_freq_get_on_cpu(int cpu);
+extern unsigned int arch_freq_get_on_cpu_from_sample(int cpu, void *sample);
 
 extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
 				unsigned long max_freq);
-- 
2.28.0


* [PATCH 2/3] sched: core: x86: query frequency at each tick
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Andrew Morton, Qais Yousef, Al Viro,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

Query the current frequency of the core during the scheduler tick.
The scheduler subsystem maintains its own copy of the aperf/mperf sample
structure because it queries the frequency more frequently than the
cpufreq subsystem does. This can lead to inaccurate measurements, but that
is not a problem here.

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 kernel/sched/core.c  | 11 +++++++++++
 kernel/sched/sched.h |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c36dc1ae58be..d6d27a6fc23c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3976,6 +3976,15 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+struct freq_sample {
+	unsigned int  khz;
+	ktime_t       time;
+	u64           aperf;
+	u64           mperf;
+};
+
+DEFINE_PER_CPU(struct freq_sample, freq_sample);
+
 /*
  * This function gets called by the timer code, with HZ frequency.
  * We call it with interrupts disabled.
@@ -3996,6 +4005,8 @@ void scheduler_tick(void)
 	update_rq_clock(rq);
 	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
+	rq->freq = arch_freq_get_on_cpu_from_sample(cpu,
+						    this_cpu_ptr(&freq_sample));
 	curr->sched_class->task_tick(rq, curr, 0);
 	calc_global_load_tick(rq);
 	psi_task_tick(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 28709f6b0975..7d794ab756d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1048,6 +1048,10 @@ struct rq {
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state	*idle_state;
 #endif
+
+	/* Frequency measured at the last tick */
+	unsigned int            freq;
+
 };
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
2.28.0


* [PATCH 3/3] sched/fair: delay thread migration on fork/wakeup/exec
From: Redha Gouicem @ 2020-10-20 15:44 UTC (permalink / raw)
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Redha Gouicem, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, x86, H. Peter Anvin, Rafael J. Wysocki,
	Viresh Kumar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Andrey Ignatov, Qais Yousef, Al Viro,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On CPUs that implement per-core frequency scaling (and not per-socket), it
is beneficial to try to use cores running at a higher frequency. One way to
do this on fork/exec/wakeup is to keep the targeted thread on its
parent/previous/waker core, since it should already be running at a high
frequency.

This is how it works:
   - choose the new_cpu as usual in select_task_rq_fair()
   - if we are in a fork/exec/wakeup and new_cpu runs at a low frequency,
     start a timer in 50us and place the
     thread on its parent/previous/waker core
   - if the thread is scheduled, cancel the timer
   - if the timer expires, migrate the thread to new_cpu

This way, if the parent/previous/waker cpu is too busy to be used, the
thread will end up on another cpu when the timer expires. This is
particularly useful in fork/wait patterns where the child thread is placed
on an idle core (low frequency) while the parent thread waits and leaves
its core idle (high frequency). This patch avoids using a low frequency
core if a higher frequency one is available.

There are two configuration parameters for this feature:
   - the frequency threshold under which we consider the core to be running
     at a low frequency, in kHz (/proc/sys/kernel/sched_lowfreq). By
     default, this is set to 0, which disables the delayed thread migration
     feature.
   - the delay of the timer, in ns
     (/proc/sys/kernel/sched_delayed_placement)

Co-developed-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Damien Carver <carverdamien@gmail.com>
Signed-off-by: Redha Gouicem <redha.gouicem@gmail.com>
---
 include/linux/sched.h        |  4 ++++
 include/linux/sched/sysctl.h |  3 +++
 kernel/sched/core.c          | 32 +++++++++++++++++++++++++++
 kernel/sched/fair.c          | 42 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  3 +++
 kernel/sysctl.c              | 14 ++++++++++++
 6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2bf0af19a62a..ae823d458f94 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -662,6 +662,10 @@ struct task_struct {
 	unsigned long			wakee_flip_decay_ts;
 	struct task_struct		*last_wakee;
 
+	/* Delayed placement */
+	struct hrtimer                  delay_placement_timer;
+	int                             delay_placement_cpu;
+
 	/*
 	 * recent_used_cpu is initially set as the last CPU used by a task
 	 * that wakes affine another task. Waker/wakee relationships can
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 3c31ba88aca5..97a1f4489910 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -52,6 +52,9 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos);
 #endif
 
+extern __read_mostly unsigned int sysctl_sched_delayed_placement;
+extern __read_mostly unsigned int sysctl_sched_lowfreq;
+
 /*
  *  control realtime throttling:
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d6d27a6fc23c..9958b38a5b6f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3217,6 +3217,33 @@ int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
 static inline void init_schedstats(void) {}
 #endif /* CONFIG_SCHEDSTATS */
 
+static enum hrtimer_restart delayed_placement_fn(struct hrtimer *data)
+{
+	struct task_struct *p = container_of(data, struct task_struct,
+					     delay_placement_timer);
+	struct rq *rq;
+	struct rq_flags rf;
+	bool queued, running;
+
+	/*
+	 * If, by chance, p was already migrated to this cpu, no need to do
+	 * anything. This can happen because of load balancing for example.
+	 */
+	if (task_cpu(p) == p->delay_placement_cpu)
+		return HRTIMER_NORESTART;
+
+	rq = task_rq_lock(p, &rf);
+
+	queued = task_on_rq_queued(p);
+	running = task_current(rq, p);
+	if (queued && !running)
+		rq = __migrate_task(rq, &rf, p, p->delay_placement_cpu);
+
+	task_rq_unlock(rq, p, &rf);
+
+	return HRTIMER_NORESTART;
+}
+
 /*
  * fork()/clone()-time setup:
  */
@@ -3299,6 +3326,11 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	plist_node_init(&p->pushable_tasks, MAX_PRIO);
 	RB_CLEAR_NODE(&p->pushable_dl_tasks);
 #endif
+
+	hrtimer_init(&p->delay_placement_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL);
+	p->delay_placement_timer.function = delayed_placement_fn;
+
 	return 0;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33699db27ed5..99c42c215477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -84,6 +84,15 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
+/*
+ * After fork, exec or wakeup, thread placement is delayed. This option gives
+ * the delay in nanoseconds.
+ *
+ * (default: 50us)
+ */
+unsigned int sysctl_sched_delayed_placement = 50000;
+unsigned int sysctl_sched_lowfreq;
+
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -6656,6 +6665,13 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return -1;
 }
 
+static bool is_cpu_low_freq(int cpu)
+{
+	if (!sysctl_sched_lowfreq)
+		return false;
+	return cpu_rq(cpu)->freq <= sysctl_sched_lowfreq;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6683,7 +6699,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (sched_energy_enabled()) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
-				return new_cpu;
+				goto local;
 			new_cpu = prev_cpu;
 		}
 
@@ -6724,6 +6740,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 	rcu_read_unlock();
 
+local:
+	if (!is_cpu_low_freq(new_cpu))
+		goto end;
+	/*
+	 * If fork/wake/exec, trigger an interrupt in 50us (default) to eventually steal the thread
+	 * and place the thread locally.
+	 */
+	if (new_cpu == task_cpu(p))
+		goto end;
+
+	if (sd_flag & (SD_BALANCE_FORK | SD_BALANCE_WAKE | SD_BALANCE_EXEC)) {
+		p->delay_placement_cpu = new_cpu;
+		new_cpu = task_cpu(current);
+
+		/* Arm timer in 50us */
+		hrtimer_start(&p->delay_placement_timer,
+			      ktime_set(0,
+					sysctl_sched_delayed_placement),
+			      HRTIMER_MODE_REL);
+	}
+
+end:
 	return new_cpu;
 }
 
@@ -7085,6 +7123,8 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	hrtimer_try_to_cancel(&p->delay_placement_timer);
+
 	update_misfit_status(p, rq);
 
 	return p;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7d794ab756d2..02da9ca69b4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2018,6 +2018,9 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
 extern const_debug unsigned int sysctl_sched_nr_migrate;
 extern const_debug unsigned int sysctl_sched_migration_cost;
 
+extern unsigned int sysctl_sched_delayed_placement;
+extern unsigned int sysctl_sched_lowfreq;
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 287862f91717..e8cc36624330 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1712,6 +1712,20 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname       = "sched_delayed_placement",
+		.data           = &sysctl_sched_delayed_placement,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{
+		.procname       = "sched_lowfreq",
+		.data           = &sysctl_sched_lowfreq,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.28.0


* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Peter Zijlstra @ 2020-10-21  7:26 UTC (permalink / raw)
  To: Redha Gouicem
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:
> 
> The first patch of the series is not specific to scheduling. It allows us
> (or anyone else) to use the cpufreq infrastructure at a different sampling
> rate without compromising the cpufreq subsystem and applications that
> depend on it.

It's also completely redundant as the scheduler already reads aperf/mperf
on every tick. Clearly you didn't do your homework ;-)

> The main idea behind this patch series is to bring to light the frequency
> inversion problem that will become more and more prominent with new CPUs
> that feature per-core DVFS. The solution proposed is a first idea for
> solving this problem that still needs to be tested across more CPUs and
> with more applications.

Which is why schedutil (the only cpufreq gov anybody should be using) is
integrated with the scheduler and closes the loop and tells the CPU
about the expected load.


* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Redha @ 2020-10-21 10:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On 21/10/2020 09:26, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:
>> The first patch of the series is not specific to scheduling. It allows us
>> (or anyone else) to use the cpufreq infrastructure at a different sampling
>> rate without compromising the cpufreq subsystem and applications that
>> depend on it.
> It's also completely redundant as the scheduler already reads aperf/mperf
> on every tick. Clearly you didn't do your homework ;-)
My bad. I worked on this a year ago, just never got the time to submit it
to the LKML, and I should have re-done my homework more thoroughly before
submitting. The paper was submitted at approximately the same time as the
patch introducing support for frequency invariance and frequency reading
at every tick (1 week apart!).
Again, my bad.

>
>> The main idea behind this patch series is to bring to light the frequency
>> inversion problem that will become more and more prominent with new CPUs
>> that feature per-core DVFS. The solution proposed is a first idea for
>> solving this problem that still needs to be tested across more CPUs and
>> with more applications.
> Which is why schedutil (the only cpufreq gov anybody should be using) is
> integrated with the scheduler and closes the loop and tells the CPU
> about the expected load.
>
While I agree that schedutil is probably a good option, I'm not sure we
treat exactly the same problem. schedutil aims at mapping the frequency of
the CPU to the actual load. What I'm saying is that since it takes some
time for the frequency to match the load, why not account for the frequency
when making placement/migration decisions. I know that with the frequency
invariance code, capacity accounts for frequency, which means that thread
placement decisions do account for frequency indirectly. However, we still
have performance improvements with our patch for the workloads with
fork/wait patterns. I really believe that we can still gain performance if
we make decisions while accounting for the frequency more directly.

* Re: [RFC PATCH 0/3] sched: delayed thread migration
From: Peter Zijlstra @ 2020-10-21 11:48 UTC (permalink / raw)
  To: Redha
  Cc: julien.sopena, julia.lawall, gilles.muller, carverdamien,
	jean-pierre.lozi, baptiste.lepers, nicolas.palix,
	willy.zwaenepoel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86, H. Peter Anvin, Rafael J. Wysocki, Viresh Kumar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Qais Yousef, Al Viro, Andrey Ignatov,
	Guilherme G. Piccoli, linux-kernel, linux-pm, linux-fsdevel

On Wed, Oct 21, 2020 at 12:40:44PM +0200, Redha wrote:

> >> The main idea behind this patch series is to bring to light the frequency
> >> inversion problem that will become more and more prominent with new CPUs
> >> that feature per-core DVFS. The solution proposed is a first idea for
> >> solving this problem that still needs to be tested across more CPUs and
> >> with more applications.
> > Which is why schedutil (the only cpufreq gov anybody should be using) is
> > integrated with the scheduler and closes the loop and tells the CPU
> > about the expected load.
> >
> While I agree that schedutil is probably a good option, I'm not sure we
> treat exactly the same problem. schedutil aims at mapping the frequency of
> the CPU to the actual load. What I'm saying is that since it takes some
> time for the frequency to match the load, why not account for the frequency
> when making placement/migration decisions.

Because overhead, mostly :/ EAS does some of that. Mostly wakeup CPU
selection is already a bottle-neck for some applications (see the fight
over select_idle_sibling()).

Programming a timer is out of budget for high rate wakeup workloads.
Worse, you also don't prime the CPU to ramp up during the enforced
delay.

Also, two new config knobs :-(

> I know that with the frequency invariance code, capacity accounts for
> frequency, which means that thread placement decisions do account for
> frequency indirectly. However, we still have performance improvements
> with our patch for the workloads with fork/wait patterns. I really
> believe that we can still gain performance if we make decisions while
> accounting for the frequency more directly.

So I don't think that's fundamentally a DVFS problem though, just
something that's exacerbated by it. There's a long history with trying
to detect this pattern, see for example WF_SYNC and wake_affine().

(we even had some code in the early CFS days that measured overlap
between tasks, to account for the period between waking up the recipient
and blocking on the answer, but that never worked reliably either, so
that went the way of the dodo)

The classical micro-benchmark is pipe-bench, which ping-pongs a single
byte between two tasks over a pipe. If you run that on a single CPU it
is _much_ faster than when the tasks get split up. DVFS is just like
caches here, yet another reason.

schedutil does solve the problem where, when we migrate a task, the CPU
would have to individually re-learn the DVFS state. By using the
scheduler statistics we can program the DVFS state up-front, on
migration. Instead of waiting for it.

So from that PoV, schedutil fundamentally solves the individual DVFS
problem as best is possible. It closes the control loop; we no longer
have individually operating control loops that are unaware of one
another.
