[PATCH 00/13] High performance balancing logic for big.LITTLE

* [PATCH 00/13] High performance balancing logic for big.LITTLE
@ 2015-11-06 12:02 Arseniy Krasnov
  2015-11-06 12:02 ` [PATCH 01/13] hperf_hmp: add new config for arm and arm64 Arseniy Krasnov
                   ` (13 more replies)
  0 siblings, 14 replies; 17+ messages in thread
From: Arseniy Krasnov @ 2015-11-06 12:02 UTC (permalink / raw)
  To: linux, mingo, peterz; +Cc: a.krasnov, v.tyrtov, s.rogachev, linux-kernel

				Prologue

The patch set introduces an extension to the default Linux CPU scheduler (CFS).
The main purpose of the extension is utilization of a big.LITTLE CPU for maximum
performance. Such solution may be useful for users of OdroidXU-3 board
(supporting 8 cores) who doesn't care about power efficiency.

Maximum utilization was reached using the following policies:

1) A15 cores must be utilized as much as possible e.g. idle A15 cores always
   pull some task from A7 core.

2) After execution of a task on A7 core for some period of time it should be
   swapped with an appropriate task from A15 cluster in order to achieve
   fairness.

3) Load of big and little clusters is balanced according to frequency and
   A15/A7 slowdown coefficient.

			Approach Description

The scheduler creates a hierarchy of two domains: MC and HMP. The MC domain is a
default domain for SCHED_MC config. The HMP domain contains two clusters: A15
and A7 CPUs. Balancing between HMP domains is performed by the new logic, in MC
domains, in turn, balancing is done by the default logic of the 'load_balance()'
function.

To perform balancing between HMP domains, the load of each cluster is calculated
in scheduler's softirq handler. Then, this value is scaled according to each
cluster's frequency and slowdown coefficient which is a ratio of busy-loop
performance on A15 and A7. There are three ways of migration between two
clusters: from A15 cluster to A7 cluster (if load on A15 cluster is too high),
from A7 cluster to A15 cluster (otherwise) and task swapping when load on both
clusters is the same. To migrate some task from one cluster to another firstly
this task should be selected. To find a task suitable for migration the
scheduler uses a special per-task metric called 'druntime'. It is based on CFS's
vruntime metric but its grow direction depends on a core where the task is
executed: for A15 core it grows up, for A7 core, in turn, it goes down. So,
being the druntime value close to zero means that the task is executed on both
clusters for the same amount of time. As a result, to get a task for migration
it scans each runqueue to find a task with highest/lowest druntime depending on
which cluster is scanned; after, when the task is found, it is moved to another
cluster. These balancing steps are performed in each scheduler balancing
operation executed by softirq.

To get maximum performance A15 cores must be fully utilized; this means that
idle A15 cores are always able to pull tasks from A7 cores while A7 cores cannot
do that from A15 cores.

An finally, let's look to fairness - it is provided by swapping of tasks during
every softirq balancing: when balance is broken it tries to repair the balance
moving tasks from one cluster to another, then when the clusters are balanced,
the tasks are swapped during each softirq balancing. In addition to this logic,
'select_task_rq_fair' was modified in order to place woken tasks to least loaded
CPU, because it won't break the balance between A15 and A7 cores.

				Test results

Several test kits were used for performance measurement of the solution.
All comparision is done against the Linaro MP scheduler.

The first test case is a parsec benchmark suite. It contains different types of
tasks like cluster searching or pattern recognition in order to test scheduler
performance. Results of some benchmarks are listed in the text below (in
seconds):

Streamcluster:

Developed by Princeton University and solves the online clustering problem.
Streamcluster was included in the PARSEC benchmark suite because of the
importance of data mining algorithms and the prevalence of problems with
streaming characteristics.

	Threads		HPERF_HMP	Linaro MP
	1		27,333		27,422
	2		14,162		14,197
	3		10,099		10,168
	4		8,227		8,332
	5		10,922		23,349
	6		10,85		22,507
	7		11,39		22,041
	8		12,307		21,181
	9		20,339		22,115
	10		21,33		23,746
	11		23,289		24,831
	12		25,363		26,699
	13		34,091		34,84
	14		34,758		38,661
	15		35,743		38,688
	16		38,1		44,735
	17		41,165		77,098
	18		44,223		102,633
	19		46,177		113,748
	20		48,22		119,146
	21		52,372		135,499
	22		54,319		136,454
	23		56,218		141,924
	24		57,843		145,727
	25		61,759		158,754
	26		63,179		163,915
	27		64,987		167,559
	28		67,329		171,203
	29		70,489		185,171
	30		73,084		189,303
	31		75,264		192,487
	32		77,015		197,27
	avg		40,373		87,543

Bodytrack:

This computer vision application is an Intel RMS workload which tracks a human
body with multiple cameras through an image sequence. This benchmark was
included due to the increasing significance of computer vision algorithms in
areas such as video surveillance, character animation and computer interfaces.

	Threads		HPERF_HMP	Linaro MP
	1		15,884		16,632
	2		8,536		9,42
	3		6,037		7,257
	4		4,84		6,076
	5		8,835		5,739
	6		4,437		5,513
	7		4,119		5,474
	8		3,992		5,115
	9		3,854		5,164
	10		3,92		4,911
	11		3,854		4,932
	12		3,83		4,816
	13		3,839		5,643
	14		3,861		4,816
	15		3,889		4,896
	16		3,845		4,854
	17		3,872		4,837
	18		3,852		4,876
	19		4,304		4,868
	20		3,915		4,928
	21		3,87		4,841
	22		3,858		4,995
	23		3,881		4,97
	24		3,876		4,899
	25		3,854		4,96
	26		3,869		4,902
	27		3,874		4,979
	28		3,88		4,928
	29		3,914		5,008
	30		3,889		5,216
	31		3,898		5,242
	32		3,894		5,199
	avg		4,689		5,653

Blackscholes:

This application is an Intel RMS benchmark. It calculates the prices for a
portfolio of European options analytically with the Black-Scholes partial
differential equation. There is no closed-form expression for the blackscholes
equation and as such it must be computed numerically.

	Threads		HPERF_HMP	Linaro MP
	1		7,293		6,807
	2		3,886		4,044
	3		2,906		2,911
	4		2,429		2,427
	5		2,58		2,985
	6		2,401		2,672
	7		2,205		2,411
	8		2,132		2,293
	9		2,074		2,41
	10		2,067		2,264
	11		2,054		2,205
	12		2,091		2,222
	13		2,042		2,28
	14		2,035		2,222
	15		2,026		2,25
	16		2,024		2,177
	17		2,021		2,173
	18		2,033		2,09
	19		2,03		2,05
	20		2,024		2,158
	21		2,002		2,175
	22		2,026		2,179
	23		2,017		2,134
	24		2,01		2,156
	25		2,009		2,155
	26		2,013		2,179
	27		2,017		2,177
	28		2,019		2,189
	29		2,013		2,158
	30		2,002		2,162
	31		2,016		2,16
	32		2,012		2,159
	avg		2,328		2,469

Also, well known Antutu benchmark was executed on Exynos 5433 board:

					HPERF_HMP	Linaro MP
	Integral benchmark result  	42400		36860 
	Result: hperf_hmp is 15% better.

Arseniy Krasnov (13):
  hperf_hmp: add new config for arm and arm64.
  hperf_hmp: introduce hew domain flag.
  hperf_hmp: add sched domains initialization.
  hperf_hmp: scheduler initialization routines.
  hperf_hmp: introduce druntime metric.
  hperf_hmp: is_hmp_imbalance introduced.
  hperf_hmp: migration auxiliary functions.
  hperf_hmp: swap tasks function.
  hperf_hmp: one way balancing function.
  hperf_hmp: idle pull function.
  hperf_hmp: task CPU selection logic.
  hperf_hmp: rest of logic.
  hperf_hmp: cpufreq routines.

 arch/arm/Kconfig           |   21 +
 arch/arm/kernel/topology.c |    6 +-
 arch/arm64/Kconfig         |   21 +
 include/linux/sched.h      |   17 +
 kernel/sched/core.c        |   65 +-
 kernel/sched/fair.c        | 1553 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h       |   16 +
 7 files changed, 1586 insertions(+), 113 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 17+ messages in thread