* [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue
@ 2017-09-26  0:02 Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes the original capacity of CPUs into account
when providing 'hints' for the select_idle_sibling code path to return an
idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Furthermore, these code paths are only aware of the
original capacity and not of the capacity stolen by IRQ/RT activity.

This patch set introduces capacity awareness in the scheduler (CAS), which
avoids CPUs whose capacity may be reduced (due to IRQ/RT activity) when
trying to schedule threads (on the push side) in the system. This
awareness has been added to the fair scheduling class.

It does so using the following algorithm:
1) The scaled capacities are already calculated, based on rt_avg.

2) Any CPU running below 80% of its original capacity is considered to be
running low on capacity (see the sketch after this list).

3) During the idle CPU search, if a CPU is found to be running low on
capacity, it is skipped if better CPUs are available.

4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
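
For reference, the 80% test in steps 2-4 boils down to the helper
introduced in patch 1 below (capacity_of() is the IRQ/RT-scaled capacity
and capacity_orig_of() is the original one):

    /*
     * A CPU is treated as having full capacity when its scaled capacity
     * is at least ~80% (819/1024) of its original capacity.
     */
    static inline bool full_capacity(int cpu)
    {
            return capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10);
    }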

The performance numbers:
---------------------------------------------------------------------------
CAS shows up to a 1.5% improvement on x86 when running a 'SELECT' database
workload.

I also used barrier.c (OpenMP code) as a micro-benchmark. It runs a number
of iterations and does a barrier sync at the end of each for loop.
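
The gist of that pattern is roughly the following (a hypothetical sketch
only; the actual program is linked further below and its details differ):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
            const int iters = 100000;
            double start = omp_get_wtime();

    #pragma omp parallel
            for (int i = 0; i < iters; i++) {
                    /* per-iteration work goes here */

                    /* all threads synchronize at the end of each iteration */
    #pragma omp barrier
            }

            printf("%.1f iterations/sec\n", iters / (omp_get_wtime() - start));
            return 0;
    }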

I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'
(a flood ping with a preload of 10000 packets, quiet output, and a 10-byte
payload)

The results below should be read as follows:

* 'Baseline without ping' is how the workload would have behaved if there
  had been no IRQ activity.

* Compare 'Baseline with ping' against 'Baseline without ping' to see the
  effect of ping.

* Compare 'Baseline with ping' against 'CAS with ping' to see the
  improvement CAS can give over the baseline.

The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

The following are the iterations-per-second results for this
micro-benchmark (higher is better), on a 20-core x86 machine:

+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+

The following are the runtimes in seconds with hackbench and the ping
activity described above (lower is better), on a 20-core x86 machine:

+---------------+------+--------+--------+
|Num. Groups    |CAS   |Baseline|Baseline|
|(40 tasks each)|with  |with    |without |
|               |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+
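
(The hackbench numbers above are for the given number of groups with the
default group size of 40 tasks each, i.e. 20 sender/receiver pairs; the
exact command line was not posted, but an invocation along the lines of
'hackbench -g <num_groups>' is assumed here.)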

Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation so that global state can be
  avoided.

v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
  paths.

v3->v4:
* Rebased onto peterz's tree (apologies for using the wrong tree for v3)

Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
https://lists.linaro.org/pipermail/eas-dev/2017-August/000933.html

Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
    code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling
    code path
  ignore_this_patch: Fixing compilation error on Peter's tree

 kernel/sched/fair.c      | 81 +++++++++++++++++++++++++++++++++++++++---------
 kernel/time/tick-sched.c |  1 +
 2 files changed, 68 insertions(+), 14 deletions(-)

-- 
2.7.4


* [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-09-26  2:51   ` joelaf
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for idle CPUs for a waking task, we should also account
for the delays caused by the bandwidth reduction due to RT/IRQ tasks.

This patch does that by trying to find a higher-capacity CPU with
minimum wake-up latency.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca6a57..afb701f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }
 
+static inline bool full_capacity(int cpu)
+{
+	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
+	unsigned int backup_cap = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
+	int shallowest_idle_cpu_backup = -1;
 	int i;
 
 	/* Check if we have any choice: */
@@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
@@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+		shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.7.4


* [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-09-26  6:53   ` Joel Fernandes
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for CPUs to place running tasks on, the scheduler
completely ignores the capacity stolen away by RT/IRQ tasks.

This patch fixes that.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index afb701f..19ff2c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	int core, cpu, rcpu, rcpu_backup;
+	unsigned int backup_cap = 0;
+
+	rcpu = rcpu_backup = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+
+			if (full_capacity(cpu)) {
+				rcpu = cpu;
+			} else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
+				backup_cap = capacity_of(cpu);
+				rcpu_backup = cpu;
+			}
 		}
 
-		if (idle)
-			return core;
+		if (idle) {
+			if (rcpu == -1)
+				return (rcpu_backup != -1 ? rcpu_backup : core);
+			return rcpu;
+		}
 	}
 
 	/*
@@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
-	return -1;
+	return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6117,6 +6137,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6147,10 +6169,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -6167,13 +6198,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i;
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
 		return target;
 
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
 		return prev;
 
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-- 
2.7.4


* [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-10-01  0:32   ` kbuild test robot
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

Addendum, please ignore.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/time/tick-sched.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index eb0e975..ede1add 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -19,6 +19,7 @@
 #include <linux/percpu.h>
 #include <linux/nmi.h>
 #include <linux/profile.h>
+#include <linux/vmstat.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
-- 
2.7.4


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
@ 2017-09-26  2:51   ` joelaf
  2017-09-26  4:40     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: joelaf @ 2017-09-26  2:51 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel, eas-dev
  Cc: peterz, mingo, atish.patra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

Hi Rohit,

Just some comments:

On 09/25/2017 05:02 PM, Rohit Jain wrote:
> While looking for idle CPUs for a waking task, we should also account
> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
> 
> This patch does that by trying to find a higher capacity CPU with
> minimum wake up latency.
> 
> 
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca6a57..afb701f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>  	return cpu_rq(cpu)->cpu_capacity_orig;
>  }
>  
> +static inline bool full_capacity(int cpu)
> +{
> +	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));

Wouldn't 768 be better for the multiplication? gcc then converts the expression to shifts and adds.

> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  	unsigned long load, min_load = ULONG_MAX;
>  	unsigned int min_exit_latency = UINT_MAX;
>  	u64 latest_idle_timestamp = 0;
> +	unsigned int backup_cap = 0;
>  	int least_loaded_cpu = this_cpu;
>  	int shallowest_idle_cpu = -1;
> +	int shallowest_idle_cpu_backup = -1;
>  	int i;
>  
>  	/* Check if we have any choice: */
> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  				 */
>  				min_exit_latency = idle->exit_latency;
>  				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				if (full_capacity(i)) {
> +					shallowest_idle_cpu = i;
> +				} else if (capacity_of(i) > backup_cap) {
> +					shallowest_idle_cpu_backup = i;
> +					backup_cap = capacity_of(i);
> +				}

I'm a bit skeptical about this - if the CPU is idle, then is it likely that the capacity of the CPU is reduced due to RT pressure? I can see that it can matter, but I am wondering if you have any data for your use case to show that it does (that is, if you didn't consider RT pressure for idle CPUs, are you still seeing a big enough performance improvement to warrant the change?)

>  			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>  				   rq->idle_stamp > latest_idle_timestamp) {
>  				/*
> @@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  				 * a warmer cache.
>  				 */
>  				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				if (full_capacity(i)) {
> +					shallowest_idle_cpu = i;
> +				} else if (capacity_of(i) > backup_cap) {
> +					shallowest_idle_cpu_backup = i;
> +					backup_cap = capacity_of(i);
> +				}
>  			}
>  		} else if (shallowest_idle_cpu == -1) {
>  			load = weighted_cpuload(cpu_rq(i));
> @@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  		}
>  	}
>  
> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
> +	if (shallowest_idle_cpu != -1)
> +		return shallowest_idle_cpu;
> +
> +	return (shallowest_idle_cpu_backup != -1 ?
> +		shallowest_idle_cpu_backup : least_loaded_cpu);
>  }
>  
>  #ifdef CONFIG_SCHED_SMT
> 

I see code duplication here which can be reduced by 7 lines compared to your original patch:

---
 kernel/sched/fair.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..72fc8d18b251 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5528,6 +5528,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
 		if (idle_cpu(i)) {
+			int idle_candidate = -1;
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
@@ -5538,7 +5539,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5547,7 +5548,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
+			}
+
+			if (idle_candidate != -1) {
+				if (full_capacity(idle_candidate)) {
+					shallowest_idle_cpu = idle_candidate;
+				} else if (capacity_of(idle_candidate) > backup_cap) {
+					shallowest_idle_cpu_backup = idle_candidate;
+					backup_cap = capacity_of(idle_candidate);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(i);
@@ -5558,7 +5568,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+			shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.14.1.821.g8fa685d3b7-goog


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  2:51   ` joelaf
@ 2017-09-26  4:40     ` Rohit Jain
  2017-09-26  6:59       ` Joel Fernandes
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  4:40 UTC (permalink / raw)
  To: joelaf, linux-kernel, eas-dev
  Cc: peterz, mingo, atish.patra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

On 09/25/2017 07:51 PM, joelaf wrote:
> Hi Rohit,
>
> Just some comments:

Hi Joel,

Thanks for the comments

>
> On 09/25/2017 05:02 PM, Rohit Jain wrote:
>> While looking for idle CPUs for a waking task, we should also account
>> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
>>
>> This patch does that by trying to find a higher capacity CPU with
>> minimum wake up latency.
>>
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>>   1 file changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eca6a57..afb701f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>>   	return cpu_rq(cpu)->cpu_capacity_orig;
>>   }
>>   
>> +static inline bool full_capacity(int cpu)
>> +{
>> +	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
> Wouldn't 768 be better for multiplication? gcc converts the expression to shifts and adds then.

While 768 is easier to convert to shifts and adds, 819/1024 gets you
very close to 80%, which is what I was trying to achieve.
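(For reference: 819/1024 ≈ 79.98%, whereas 768/1024 is exactly 75%, i.e. a
noticeably lower threshold.)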

>
>> +}
>> +
>>   static unsigned long cpu_avg_load_per_task(int cpu)
>>   {
>>   	struct rq *rq = cpu_rq(cpu);
>> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   	unsigned long load, min_load = ULONG_MAX;
>>   	unsigned int min_exit_latency = UINT_MAX;
>>   	u64 latest_idle_timestamp = 0;
>> +	unsigned int backup_cap = 0;
>>   	int least_loaded_cpu = this_cpu;
>>   	int shallowest_idle_cpu = -1;
>> +	int shallowest_idle_cpu_backup = -1;
>>   	int i;
>>   
>>   	/* Check if we have any choice: */
>> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   				 */
>>   				min_exit_latency = idle->exit_latency;
>>   				latest_idle_timestamp = rq->idle_stamp;
>> -				shallowest_idle_cpu = i;
>> +				if (full_capacity(i)) {
>> +					shallowest_idle_cpu = i;
>> +				} else if (capacity_of(i) > backup_cap) {
>> +					shallowest_idle_cpu_backup = i;
>> +					backup_cap = capacity_of(i);
>> +				}
> I'm a bit skeptical about this - if the CPU is idle, then is it likely that the capacity of the CPU is reduced due to RT pressure?

What has idleness got to do with RT pressure?

This is an instantaneous view where the scheduler is looking to place
threads. In this case, if we know historically the capacity of the CPU
is reduced (due to RT/IRQ/Thermal Throttling or whatever it may be) we
should avoid that CPU if we have a choice.

> I can see that it can matter, but I am wondering if you have any data for your usecase to show that it does (that is if you didn't consider RT pressure for idle CPUs, are you still seeing a big enough performance improvement to warrant the change?

I tested this with OLTP, which has a mix of both IRQ and RT threads.
I also tested with micro-benchmarks which have IRQ and fair threads. I
haven't seen what happens with RT activity alone. This comes back to the
question: "Why reduce capacities when an RT thread is running?" Frankly,
I don't know the answer; however, from a capacity standpoint that is
taken into account.

It makes sense to me, however I don't know the reasoning.
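
(For context: in kernels of this vintage the reduction comes from
update_cpu_capacity() scaling the original capacity by the share of recent
time *not* consumed by RT/IRQ work, as tracked via rq->rt_avg. Roughly, as
a simplified sketch rather than the literal kernel code:

    static void update_cpu_capacity_sketch(int cpu)
    {
            unsigned long capacity = capacity_orig_of(cpu);

            /* scale_rt_capacity() returns SCHED_CAPACITY_SCALE minus the
             * portion recently consumed by RT/IRQ time, from rq->rt_avg. */
            capacity *= scale_rt_capacity(cpu);
            capacity >>= SCHED_CAPACITY_SHIFT;

            cpu_rq(cpu)->cpu_capacity = capacity;  /* what capacity_of() reads */
    }

so an otherwise idle CPU can still report a reduced capacity_of() if it has
recently spent a lot of time in IRQ or RT context.)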

>>   			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>>   				   rq->idle_stamp > latest_idle_timestamp) {
>>   				/*
>> @@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   				 * a warmer cache.
>>   				 */
>>   				latest_idle_timestamp = rq->idle_stamp;
>> -				shallowest_idle_cpu = i;
>> +				if (full_capacity(i)) {
>> +					shallowest_idle_cpu = i;
>> +				} else if (capacity_of(i) > backup_cap) {
>> +					shallowest_idle_cpu_backup = i;
>> +					backup_cap = capacity_of(i);
>> +				}
>>   			}
>>   		} else if (shallowest_idle_cpu == -1) {
>>   			load = weighted_cpuload(cpu_rq(i));
>> @@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   		}
>>   	}
>>   
>> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>> +	if (shallowest_idle_cpu != -1)
>> +		return shallowest_idle_cpu;
>> +
>> +	return (shallowest_idle_cpu_backup != -1 ?
>> +		shallowest_idle_cpu_backup : least_loaded_cpu);
>>   }
>>   
>>   #ifdef CONFIG_SCHED_SMT
>>
> I see code duplication here which can be reduced by 7 lines compared to your original patch:

This does look better and I will try to incorporate this.

Thanks,
Rohit

>
> ---
>   kernel/sched/fair.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..72fc8d18b251 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5528,6 +5528,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   	/* Traverse only the allowed CPUs */
>   	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
>   		if (idle_cpu(i)) {
> +			int idle_candidate = -1;
>   			struct rq *rq = cpu_rq(i);
>   			struct cpuidle_state *idle = idle_get_state(rq);
>   			if (idle && idle->exit_latency < min_exit_latency) {
> @@ -5538,7 +5539,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   				 */
>   				min_exit_latency = idle->exit_latency;
>   				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				idle_candidate = i;
>   			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>   				   rq->idle_stamp > latest_idle_timestamp) {
>   				/*
> @@ -5547,7 +5548,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   				 * a warmer cache.
>   				 */
>   				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				idle_candidate = i;
> +			}
> +
> +			if (idle_candidate != -1) {
> +				if (full_capacity(idle_candidate)) {
> +					shallowest_idle_cpu = idle_candidate;
> +				} else if (capacity_of(idle_candidate) > backup_cap) {
> +					shallowest_idle_cpu_backup = idle_candidate;
> +					backup_cap = capacity_of(idle_candidate);
> +				}
>   			}
>   		} else if (shallowest_idle_cpu == -1) {
>   			load = weighted_cpuload(i);
> @@ -5558,7 +5568,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   		}
>   	}
>   
> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
> +	if (shallowest_idle_cpu != -1)
> +		return shallowest_idle_cpu;
> +
> +	return (shallowest_idle_cpu_backup != -1 ?
> +			shallowest_idle_cpu_backup : least_loaded_cpu);
>   }
>   
>   #ifdef CONFIG_SCHED_SMT


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
@ 2017-09-26  6:53   ` Joel Fernandes
  2017-09-26 19:48     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2017-09-26  6:53 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

Just some comments:

On Mon, Sep 25, 2017 at 5:02 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> While looking for CPUs to place running tasks on, the scheduler
> completely ignores the capacity stolen away by RT/IRQ tasks.
>
> This patch fixes that.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 43 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index afb701f..19ff2c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
>  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int core, cpu;
> +       int core, cpu, rcpu, rcpu_backup;

I would call rcpu_backup 'backup_cpu'.

> +       unsigned int backup_cap = 0;
> +
> +       rcpu = rcpu_backup = -1;
>
>         if (!static_branch_likely(&sched_smt_present))
>                 return -1;
> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>                         cpumask_clear_cpu(cpu, cpus);
>                         if (!idle_cpu(cpu))
>                                 idle = false;
> +
> +                       if (full_capacity(cpu)) {
> +                               rcpu = cpu;
> +                       } else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
> +                               backup_cap = capacity_of(cpu);
> +                               rcpu_backup = cpu;
> +                       }

Here you are comparing the capacities of different SMT threads.

>                 }
>
> -               if (idle)
> -                       return core;
> +               if (idle) {
> +                       if (rcpu == -1)
> +                               return (rcpu_backup != -1 ? rcpu_backup : core);
> +                       return rcpu;
> +               }


This didn't make much sense to me: here you are returning either an
SMT thread or a core. That doesn't make much of a difference because
SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
what you want to do is find out the capacity of a 'core', not an SMT
thread, compare the capacities of different cores, and consider the
one which has the least RT/IRQ interference.

>         }
>
>         /*
> @@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   */
>  static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> -       int cpu;
> +       int cpu, backup_cpu = -1;
> +       unsigned int backup_cap = 0;
>
>         if (!static_branch_likely(&sched_smt_present))
>                 return -1;
> @@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>         for_each_cpu(cpu, cpu_smt_mask(target)) {
>                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>                         continue;
> -               if (idle_cpu(cpu))
> -                       return cpu;
> +               if (idle_cpu(cpu)) {
> +                       if (full_capacity(cpu))
> +                               return cpu;
> +                       if (capacity_of(cpu) > backup_cap) {
> +                               backup_cap = capacity_of(cpu);
> +                               backup_cpu = cpu;
> +                       }
> +               }

Same thing here: since SMT threads share the same underlying capacity,
is there any point in comparing the capacities of each SMT thread?

thanks,

- Joel

[...]


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  4:40     ` Rohit Jain
@ 2017-09-26  6:59       ` Joel Fernandes
  0 siblings, 0 replies; 17+ messages in thread
From: Joel Fernandes @ 2017-09-26  6:59 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Mon, Sep 25, 2017 at 9:40 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> On 09/25/2017 07:51 PM, joelaf wrote:
[...]
>>
>> On 09/25/2017 05:02 PM, Rohit Jain wrote:
>>>
>>> While looking for idle CPUs for a waking task, we should also account
>>> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
>>>
>>> This patch does that by trying to find a higher capacity CPU with
>>> minimum wake up latency.
>>>
>>>
>>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>>> ---
>>>   kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>>>   1 file changed, 24 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index eca6a57..afb701f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>         return cpu_rq(cpu)->cpu_capacity_orig;
>>>   }
>>>   +static inline bool full_capacity(int cpu)
>>> +{
>>> +       return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
>>
>> Wouldn't 768 be better for multiplication? gcc converts the expression to
>> shifts and adds then.
>
>
> While 768 is easier to convert to shifts and adds, 819/1024 gets you
> very close to 80% which is what I was trying to achieve.

Yeah, I guess if it's not too hard, you could check whether 768 gets you a
similar result, but I would defer to the maintainers on what they are
OK with.

>>> +}
>>> +
>>>   static unsigned long cpu_avg_load_per_task(int cpu)
>>>   {
>>>         struct rq *rq = cpu_rq(cpu);
>>> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct
>>> task_struct *p, int this_cpu)
>>>         unsigned long load, min_load = ULONG_MAX;
>>>         unsigned int min_exit_latency = UINT_MAX;
>>>         u64 latest_idle_timestamp = 0;
>>> +       unsigned int backup_cap = 0;
>>>         int least_loaded_cpu = this_cpu;
>>>         int shallowest_idle_cpu = -1;
>>> +       int shallowest_idle_cpu_backup = -1;
>>>         int i;
>>>         /* Check if we have any choice: */
>>> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct
>>> task_struct *p, int this_cpu)
>>>                                  */
>>>                                 min_exit_latency = idle->exit_latency;
>>>                                 latest_idle_timestamp = rq->idle_stamp;
>>> -                               shallowest_idle_cpu = i;
>>> +                               if (full_capacity(i)) {
>>> +                                       shallowest_idle_cpu = i;
>>> +                               } else if (capacity_of(i) > backup_cap) {
>>> +                                       shallowest_idle_cpu_backup = i;
>>> +                                       backup_cap = capacity_of(i);
>>> +                               }
>>
>> I'm a bit skeptical about this - if the CPU is idle, then is it likely
>> that the capacity of the CPU is reduced due to RT pressure?
>
>
> What has idleness got to do with RT pressure?
>
> This is an instantaneous view where the scheduler is looking to place
> threads. In this case, if we know historically the capacity of the CPU
> is reduced (due to RT/IRQ/Thermal Throttling or whatever it may be) we
> should avoid that CPU if we have a choice.

Yeah, OK, that's a fair point, I don't dispute this fact. I was just
trying to understand your patch.

thanks,

- Joel

[...]


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  6:53   ` Joel Fernandes
@ 2017-09-26 19:48     ` Rohit Jain
  2017-09-28 10:53       ` joelaf
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26 19:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

On 09/25/2017 11:53 PM, Joel Fernandes wrote:
> Hi Rohit,
>
> Just some comments:

Hi Joel,

Thanks for the comments.

> On Mon, Sep 25, 2017 at 5:02 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
>> While looking for CPUs to place running tasks on, the scheduler
>> completely ignores the capacity stolen away by RT/IRQ tasks.
>>
>> This patch fixes that.
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 43 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index afb701f..19ff2c3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
>>   static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>>   {
>>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> -       int core, cpu;
>> +       int core, cpu, rcpu, rcpu_backup;
> I would call rcpu_backup as backup_cpu.

OK

>
>> +       unsigned int backup_cap = 0;
>> +
>> +       rcpu = rcpu_backup = -1;
>>
>>          if (!static_branch_likely(&sched_smt_present))
>>                  return -1;
>> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>                          cpumask_clear_cpu(cpu, cpus);
>>                          if (!idle_cpu(cpu))
>>                                  idle = false;
>> +
>> +                       if (full_capacity(cpu)) {
>> +                               rcpu = cpu;
>> +                       } else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
>> +                               backup_cap = capacity_of(cpu);
>> +                               rcpu_backup = cpu;
>> +                       }
> Here you comparing capacity of different SMT threads.
>
>>                  }
>>
>> -               if (idle)
>> -                       return core;
>> +               if (idle) {
>> +                       if (rcpu == -1)
>> +                               return (rcpu_backup != -1 ? rcpu_backup : core);
>> +                       return rcpu;
>> +               }
>
> This didn't make much sense to me, here you are returning either an
> SMT thread or a core. That doesn't make much of a difference because
> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
> what you want to do is find out the capacity of a 'core', not an SMT
> thread, and compare the capacity of different cores and consider the
> one which has least RT/IRQ interference.

IIUC the capacity of each strand is scaled by IRQ and 'rt_avg' for that
'rq'. Now, if the strand is idle and gets an interrupt in the future,
the 'core' would look like:

    +----+----+
    | I  |    |
    | T  |    |
    +----+----+

(I -> Interrupt, T-> Thread we are trying to schedule).

whereas if the other strand on the core was taking interrupt the core
would look like:

    +----+----+
    | I  | T  |
    |    |    |
    +----+----+

In this case, because we know from the past average that one of the
strands is running low on capacity, I am trying to return a better strand
for the thread to start on.

>
>>          }
>>
>>          /*
>> @@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>    */
>>   static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>>   {
>> -       int cpu;
>> +       int cpu, backup_cpu = -1;
>> +       unsigned int backup_cap = 0;
>>
>>          if (!static_branch_likely(&sched_smt_present))
>>                  return -1;
>> @@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>>          for_each_cpu(cpu, cpu_smt_mask(target)) {
>>                  if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>                          continue;
>> -               if (idle_cpu(cpu))
>> -                       return cpu;
>> +               if (idle_cpu(cpu)) {
>> +                       if (full_capacity(cpu))
>> +                               return cpu;
>> +                       if (capacity_of(cpu) > backup_cap) {
>> +                               backup_cap = capacity_of(cpu);
>> +                               backup_cpu = cpu;
>> +                       }
>> +               }
> Same thing here, since SMT threads share the same underlying capacity,
> is there any point in comparing the capacities of each SMT thread?

See above

Thanks,
Rohit

>
> thanks,
>
> - Joel
>
> [...]


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26 19:48     ` Rohit Jain
@ 2017-09-28 10:53       ` joelaf
  2017-09-28 15:09         ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: joelaf @ 2017-09-28 10:53 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Tue, Sep 26, 2017 at 12:48 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
[...]
>>> +       unsigned int backup_cap = 0;
>>> +
>>> +       rcpu = rcpu_backup = -1;
>>>
>>>          if (!static_branch_likely(&sched_smt_present))
>>>                  return -1;
>>> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct
>>> *p, struct sched_domain *sd, int
>>>                          cpumask_clear_cpu(cpu, cpus);
>>>                          if (!idle_cpu(cpu))
>>>                                  idle = false;
>>> +
>>> +                       if (full_capacity(cpu)) {
>>> +                               rcpu = cpu;
>>> +                       } else if ((rcpu == -1) && (capacity_of(cpu) >
>>> backup_cap)) {
>>> +                               backup_cap = capacity_of(cpu);
>>> +                               rcpu_backup = cpu;
>>> +                       }
>>
>> Here you comparing capacity of different SMT threads.
>>
>>>                  }
>>>
>>> -               if (idle)
>>> -                       return core;
>>> +               if (idle) {
>>> +                       if (rcpu == -1)
>>> +                               return (rcpu_backup != -1 ? rcpu_backup :
>>> core);
>>> +                       return rcpu;
>>> +               }
>>
>>
>> This didn't make much sense to me, here you are returning either an
>> SMT thread or a core. That doesn't make much of a difference because
>> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
>> what you want to do is find out the capacity of a 'core', not an SMT
>> thread, and compare the capacity of different cores and consider the
>> one which has least RT/IRQ interference.
>
>
> IIUC the capacities of each strand is scaled by IRQ and 'rt_avg' for that
> 'rq'. Now if the strand is idle now and gets an interrupt in the future,
> the 'core' would look like:
>
>    +----+----+
>    | I  |    |
>    | T  |    |
>    +----+----+
>
> (I -> Interrupt, T-> Thread we are trying to schedule).
>
> whereas if the other strand on the core was taking interrupt the core
> would look like:
>
>    +----+----+
>    | I  | T  |
>    |    |    |
>    +----+----+
>
> With this case, because we know from the past avg, one of the strands is
> running low on capacity, I am trying to return a better strand for the
> thread to start on.
>

I know what you're trying to do, but the way you've retrofitted it into the
core looks weird (to me) and makes the code unreadable and ugly IMO.

Why not do something simpler, like skipping the core if any SMT thread has
been running at lesser capacity? I'm not sure if this works great or if the
maintainers will prefer your approach or my approach below, but I find the
diff below much cleaner for the select_idle_core bit. It also makes more
sense since resources are shared at the SMT level, so to me it makes sense
to skip the core altogether for this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ee7242dbe0a..f324a84e29f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
+		bool full_cap = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+			if (!full_capacity(cpu))
+				full_cap = false;
 		}
 
-		if (idle)
+		if (idle && full_cap)
 			return core;
 	}
 


thanks,

- Joel


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-28 10:53       ` joelaf
@ 2017-09-28 15:09         ` Rohit Jain
  2017-10-03  4:52           ` Joel Fernandes
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-28 15:09 UTC (permalink / raw)
  To: joelaf
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Joel,

On 09/28/2017 05:53 AM, joelaf wrote:
> Hi Rohit,
>
> On Tue, Sep 26, 2017 at 12:48 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> [...]
>
<snip>
>>>>                   }
>>>>
>>>> -               if (idle)
>>>> -                       return core;
>>>> +               if (idle) {
>>>> +                       if (rcpu == -1)
>>>> +                               return (rcpu_backup != -1 ? rcpu_backup :
>>>> core);
>>>> +                       return rcpu;
>>>> +               }
>>>
>>> This didn't make much sense to me, here you are returning either an
>>> SMT thread or a core. That doesn't make much of a difference because
>>> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
>>> what you want to do is find out the capacity of a 'core', not an SMT
>>> thread, and compare the capacity of different cores and consider the
>>> one which has least RT/IRQ interference.
>>
>> IIUC the capacities of each strand is scaled by IRQ and 'rt_avg' for that
>> 'rq'. Now if the strand is idle now and gets an interrupt in the future,
>> the 'core' would look like:
>>
>>     +----+----+
>>     | I  |    |
>>     | T  |    |
>>     +----+----+
>>
>> (I -> Interrupt, T-> Thread we are trying to schedule).
>>
>> whereas if the other strand on the core was taking interrupt the core
>> would look like:
>>
>>     +----+----+
>>     | I  | T  |
>>     |    |    |
>>     +----+----+
>>
>> With this case, because we know from the past avg, one of the strands is
>> running low on capacity, I am trying to return a better strand for the
>> thread to start on.
>>
> I know what you're trying to do but they way you've retrofitted it into the
> core looks weird (to me) and makes the code unreadable and ugly IMO.
>
> Why not do something simpler like skip the core if any SMT thread has been
> running at lesser capacity? I'm not sure if this works great or if the maintainers
> will prefer your or my below approach, but I find the below diff much cleaner
> for the select_idle_core bit. It also makes more sense since resources are
> shared at SMT level so makes sense to me to skip the core altogether for this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6ee7242dbe0a..f324a84e29f1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   
>   	for_each_cpu_wrap(core, cpus, target) {
>   		bool idle = true;
> +		bool full_cap = true;
>   
>   		for_each_cpu(cpu, cpu_smt_mask(core)) {
>   			cpumask_clear_cpu(cpu, cpus);
>   			if (!idle_cpu(cpu))
>   				idle = false;
> +			if (!full_capacity(cpu))
> +				full_cap = false;
>   		}
>   
> -		if (idle)
> +		if (idle && full_cap)
>   			return core;
>   	}
>   


Well, with your changes you will skip over fully idle cores, which is not
an ideal thing either. I see that you were advocating for selecting the
idle + lowest capacity core, whereas I was stopping at the first idle core.

Since the whole philosophy till now in this patch is "Don't spare an
idle CPU", I think the following diff might look better to you. Please
note this is only for discussion's sake; I haven't fully tested it yet.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec15e5f..c2933eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6040,7 +6040,9 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
  {
      struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-    int core, cpu;
+    int core, cpu, rcpu, backup_core;
+
+    rcpu = backup_core = -1;

      if (!static_branch_likely(&sched_smt_present))
          return -1;
@@ -6052,15 +6054,34 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

      for_each_cpu_wrap(core, cpus, target) {
          bool idle = true;
+        bool full_cap = true;

          for_each_cpu(cpu, cpu_smt_mask(core)) {
              cpumask_clear_cpu(cpu, cpus);
              if (!idle_cpu(cpu))
                  idle = false;
+
+            if (!full_capacity(cpu)) {
+                full_cap = false;
+            }
          }

-        if (idle)
+        if (idle && full_cap)
              return core;
+        else if (idle && backup_core == -1)
+            backup_core = core;
+    }
+
+    if (backup_core != -1) {
+        for_each_cpu(cpu, cpu_smt_mask(backup_core)) {
+            if (full_capacity(cpu))
+                return cpu;
+            else if ((rcpu == -1) ||
+                 (capacity_of(cpu) > capacity_of(rcpu)))
+                rcpu = cpu;
+        }
+
+        return rcpu;
      }


Do let me know what you think.

Thanks,
Rohit

>
> thanks,
>
> - Joel
>


* Re: [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
@ 2017-10-01  0:32   ` kbuild test robot
  0 siblings, 0 replies; 17+ messages in thread
From: kbuild test robot @ 2017-10-01  0:32 UTC (permalink / raw)
  To: Rohit Jain
  Cc: kbuild-all, linux-kernel, eas-dev, peterz, mingo, joelaf,
	atish.patra, vincent.guittot, dietmar.eggemann, morten.rasmussen


Hi Rohit,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.14-rc2 next-20170929]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rohit-Jain/sched-fair-Introduce-scaled-capacity-awareness-in-find_idlest_cpu-code-path/20170929-060222
config: blackfin-BF533-EZKIT_defconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All errors (new ones prefixed by >>):

   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h: In function '__inc_zone_page_state':
   include/linux/vmstat.h:294:19: error: implicit declaration of function 'page_zone' [-Werror=implicit-function-declaration]
     __inc_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   include/linux/vmstat.h:294:19: warning: passing argument 1 of '__inc_zone_state' makes pointer from integer without a cast [-Wint-conversion]
   include/linux/vmstat.h:267:20: note: expected 'struct zone *' but argument is of type 'int'
    static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__inc_node_page_state':
   include/linux/vmstat.h:300:19: error: implicit declaration of function 'page_pgdat' [-Werror=implicit-function-declaration]
     __inc_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   include/linux/vmstat.h:300:19: warning: passing argument 1 of '__inc_node_state' makes pointer from integer without a cast [-Wint-conversion]
   include/linux/vmstat.h:273:20: note: expected 'struct pglist_data *' but argument is of type 'int'
    static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__dec_zone_page_state':
   include/linux/vmstat.h:307:19: warning: passing argument 1 of '__dec_zone_state' makes pointer from integer without a cast [-Wint-conversion]
     __dec_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   include/linux/vmstat.h:279:20: note: expected 'struct zone *' but argument is of type 'int'
    static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__dec_node_page_state':
   include/linux/vmstat.h:313:19: warning: passing argument 1 of '__dec_node_state' makes pointer from integer without a cast [-Wint-conversion]
     __dec_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   include/linux/vmstat.h:285:20: note: expected 'struct pglist_data *' but argument is of type 'int'
    static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
                       ^~~~~~~~~~~~~~~~
   In file included from arch/blackfin/include/asm/uaccess.h:15:0,
                    from include/linux/uaccess.h:13,
                    from include/linux/poll.h:11,
                    from include/linux/rtc.h:51,
                    from include/linux/alarmtimer.h:7,
                    from include/linux/posix-timers.h:8,
                    from kernel/time/tick-sched.c:29:
   include/linux/mm.h: At top level:
>> include/linux/mm.h:962:28: error: conflicting types for 'page_zone'
    static inline struct zone *page_zone(const struct page *page)
                               ^~~~~~~~~
   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h:294:19: note: previous implicit declaration of 'page_zone' was here
     __inc_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   In file included from arch/blackfin/include/asm/uaccess.h:15:0,
                    from include/linux/uaccess.h:13,
                    from include/linux/poll.h:11,
                    from include/linux/rtc.h:51,
                    from include/linux/alarmtimer.h:7,
                    from include/linux/posix-timers.h:8,
                    from kernel/time/tick-sched.c:29:
>> include/linux/mm.h:967:26: error: conflicting types for 'page_pgdat'
    static inline pg_data_t *page_pgdat(const struct page *page)
                             ^~~~~~~~~~
   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h:300:19: note: previous implicit declaration of 'page_pgdat' was here
     __inc_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/page_zone +962 include/linux/mm.h

57e0a030 Mel Gorman        2012-11-12  961  
33dd4e0e Ian Campbell      2011-07-25 @962  static inline struct zone *page_zone(const struct page *page)
89689ae7 Christoph Lameter 2006-12-06  963  {
89689ae7 Christoph Lameter 2006-12-06  964  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
89689ae7 Christoph Lameter 2006-12-06  965  }
89689ae7 Christoph Lameter 2006-12-06  966  
75ef7184 Mel Gorman        2016-07-28 @967  static inline pg_data_t *page_pgdat(const struct page *page)
75ef7184 Mel Gorman        2016-07-28  968  {
75ef7184 Mel Gorman        2016-07-28  969  	return NODE_DATA(page_to_nid(page));
75ef7184 Mel Gorman        2016-07-28  970  }
75ef7184 Mel Gorman        2016-07-28  971  

:::::: The code at line 962 was first introduced by commit
:::::: 33dd4e0ec91138c3d80e790c08a3db47426c81f2 mm: make some struct page's const

:::::: TO: Ian Campbell <ian.campbell@citrix.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>
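
The conflict the log reports appears to come from include ordering: on this config, include/linux/vmstat.h ends up using page_zone()/page_pgdat() before include/linux/mm.h declares them, so gnu89 C invents implicit 'int' declarations that later clash with the real prototypes. A minimal standalone sketch of that failure mode (hypothetical names, not kernel code):

/*
 * Minimal sketch of the failure pattern above.
 * Build with:  gcc -std=gnu89 -Wall -c repro.c
 */
struct zone { int nr_pages; };

static void dec_zone_state(struct zone *z)
{
	(void)z;
}

void dec_zone_page_state(const void *page)
{
	/*
	 * page_zone() has not been declared yet, so gnu89 C assumes an
	 * implicit 'int page_zone()', and passing its result on triggers
	 * "makes pointer from integer without a cast".
	 */
	dec_zone_state(page_zone(page));
}

/*
 * The real prototype, seen later in the translation unit, then clashes
 * with the implicit declaration:
 * "error: conflicting types for 'page_zone'".
 */
struct zone *page_zone(const void *page)
{
	(void)page;
	return 0;
}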

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11120 bytes --]

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-28 15:09         ` Rohit Jain
@ 2017-10-03  4:52           ` Joel Fernandes
  2017-10-04  0:21             ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2017-10-03  4:52 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Thu, Sep 28, 2017 at 8:09 AM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
[..]
>>>
>>> In this case, because we know from the past avg that one of the strands
>>> is running low on capacity, I am trying to return a better strand for
>>> the thread to start on.
>>>
>> I know what you're trying to do, but the way you've retrofitted it into
>> the core looks weird (to me) and makes the code unreadable and ugly IMO.
>>
>> Why not do something simpler like skipping the core if any SMT thread has
>> been running at lesser capacity? I'm not sure if this works great or if
>> the maintainers will prefer your approach or mine, but I find the below
>> diff much cleaner for the select_idle_core bit. It also makes more sense
>> since resources are shared at the SMT level, so skipping the core
>> altogether seems right to me here:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6ee7242dbe0a..f324a84e29f1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>         for_each_cpu_wrap(core, cpus, target) {
>>                 bool idle = true;
>> +               bool full_cap = true;
>>                 for_each_cpu(cpu, cpu_smt_mask(core)) {
>>                         cpumask_clear_cpu(cpu, cpus);
>>                         if (!idle_cpu(cpu))
>>                                 idle = false;
>> +                       if (!full_capacity(cpu))
>> +                               full_cap = false;
>>                 }
>>
>> -               if (idle)
>> +               if (idle && full_cap)
>>                         return core;
>>         }
>>
>
>
>
> Well, with your changes you will skip over fully idle cores, which is not
> ideal either. I see that you were advocating for select idle + lowest
> capacity core, whereas I was stopping at the first idle core.
>
> Since the whole philosophy of this patch so far has been "Don't spare an
> idle CPU", I think the following diff might look better to you. Please
> note this is only for discussion's sake; I haven't fully tested it yet.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ec15e5f..c2933eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6040,7 +6040,9 @@ void __update_idle_core(struct rq *rq)
>  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>      struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -    int core, cpu;
> +    int core, cpu, rcpu, backup_core;
> +
> +    rcpu = backup_core = -1;
>
>      if (!static_branch_likely(&sched_smt_present))
>          return -1;
> @@ -6052,15 +6054,34 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>
>      for_each_cpu_wrap(core, cpus, target) {
>          bool idle = true;
> +        bool full_cap = true;
>
>          for_each_cpu(cpu, cpu_smt_mask(core)) {
>              cpumask_clear_cpu(cpu, cpus);
>              if (!idle_cpu(cpu))
>                  idle = false;
> +
> +            if (!full_capacity(cpu)) {
> +                full_cap = false;
> +            }
>          }
>
> -        if (idle)
> +        if (idle && full_cap)
>              return core;
> +        else if (idle && backup_core == -1)
> +            backup_core = core;
> +    }
> +
> +    if (backup_core != -1) {
> +        for_each_cpu(cpu, cpu_smt_mask(backup_core)) {
> +            if (full_capacity(cpu))
> +                return cpu;
> +            else if ((rcpu == -1) ||
> +                 (capacity_of(cpu) > capacity_of(rcpu)))
> +                rcpu = cpu;
> +        }
> +
> +        return rcpu;
>      }
>
>
> Do let me know what you think.

I think that if your tests show no benefit from the above over the
simpler approach, then I prefer the simpler approach, especially since
there's no point in complicating the code for select_idle_core.

thanks,

- Joel
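
For reference, both diffs above test full_capacity(), which is not quoted anywhere in this sub-thread; it is introduced earlier in the series. Going by the cover letter's "below 80% capacity" rule, a minimal sketch of the helper would look something like the following (the exact margin and the use of capacity_orig_of() are assumptions, not code quoted from the patches):

/*
 * Sketch only: full_capacity() as implied by the cover letter's ~80% rule.
 * capacity_of() is the rt_avg-scaled capacity of the CPU,
 * capacity_orig_of() its original capacity; 819/1024 is roughly 80%.
 */
static inline bool full_capacity(int cpu)
{
	return capacity_of(cpu) >= ((capacity_orig_of(cpu) * 819) >> 10);
}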

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-03  4:52           ` Joel Fernandes
@ 2017-10-04  0:21             ` Rohit Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Rohit Jain @ 2017-10-04  0:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Joel,


On 10/02/2017 09:52 PM, Joel Fernandes wrote:
> Hi Rohit,
>
> [..]
> I think that if your tests show no benefit from the above over the
> simpler approach, then I prefer the simpler approach, especially since
> there's no point in complicating the code for select_idle_core.

Fair enough!

If there are no more concerns with this version, I will go ahead,
incorporate everything discussed here, and send an updated version.
Please let me know if there is any other feedback.

Thanks,
Rohit

>
> thanks,
>
> - Joel

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-10 15:54   ` Atish Patra
@ 2017-10-10 18:02     ` Rohit Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Rohit Jain @ 2017-10-10 18:02 UTC (permalink / raw)
  To: Atish Patra, linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

Hi Atish,

Thanks for the comments.

On 10/10/2017 08:54 AM, Atish Patra wrote:
> <snip>
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>>   1 file changed, 28 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eaede50..5b1f7b9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>             for_each_cpu(cpu, cpu_smt_mask(core)) {
>>               cpumask_clear_cpu(cpu, cpus);
>> -            if (!idle_cpu(cpu))
>> +            if (!idle_cpu(cpu) || !full_capacity(cpu))
> Do we need to skip the entire core just because the first CPU in the core
> doesn't have full capacity?
> Let's say that is the only idle core available. The search will then fall
> through to select_idle_cpu() to find the idlest CPU.
> Is it worth spending extra time searching for an idle CPU with full
> capacity when idle cores are available?

This has been previously discussed:
https://lkml.org/lkml/2017/10/3/1001

Returning the best CPU within the idle core did not result in a
statistically significant performance benefit, hence I went with Joel's
suggestion to keep the code simple.

Thanks,
Rohit

<snip>
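
For context on the fallback Atish mentions: in this kernel generation, select_idle_sibling() tries the idle-core scan first and, only if that fails, falls back to the idle-CPU and idle-SMT scans, so a core skipped by select_idle_core() can still be picked up later. Paraphrased from memory (not part of the patch), the tail of select_idle_sibling() is roughly:

	/* roughly the existing fallback order in select_idle_sibling() */
	i = select_idle_core(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;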

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-07 23:48 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path Rohit Jain
@ 2017-10-10 15:54   ` Atish Patra
  2017-10-10 18:02     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Atish Patra @ 2017-10-10 15:54 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, vincent.guittot, dietmar.eggemann,
	morten.rasmussen


Minor nit: the version number is missing from the subject line.

On 10/07/2017 06:48 PM, Rohit Jain wrote:
> While looking for CPUs to place running tasks on, the scheduler
> completely ignores the capacity stolen away by RT/IRQ tasks. This patch
> changes that behavior to also take the scaled capacity into account.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>   kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>   1 file changed, 28 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eaede50..5b1f7b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   
>   		for_each_cpu(cpu, cpu_smt_mask(core)) {
>   			cpumask_clear_cpu(cpu, cpus);
> -			if (!idle_cpu(cpu))
> +			if (!idle_cpu(cpu) || !full_capacity(cpu))
Do we need to skip the entire core just because the first CPU in the core
doesn't have full capacity?
Let's say that is the only idle core available. The search will then fall
through to select_idle_cpu() to find the idlest CPU.
Is it worth spending extra time searching for an idle CPU with full
capacity when idle cores are available?

Regards,
Atish
>   				idle = false;
>   		}
>   
> @@ -6025,7 +6025,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>    */
>   static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>   {
> -	int cpu;
> +	int cpu, backup_cpu = -1;
> +	unsigned int backup_cap = 0;
>   
>   	if (!static_branch_likely(&sched_smt_present))
>   		return -1;
> @@ -6033,11 +6034,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>   	for_each_cpu(cpu, cpu_smt_mask(target)) {
>   		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>   			continue;
> -		if (idle_cpu(cpu))
> -			return cpu;
> +		if (idle_cpu(cpu)) {
> +			if (full_capacity(cpu))
> +				return cpu;
> +			if (capacity_of(cpu) > backup_cap) {
> +				backup_cap = capacity_of(cpu);
> +				backup_cpu = cpu;
> +			}
> +		}
>   	}
>   
> -	return -1;
> +	return backup_cpu;
>   }
>   
>   #else /* CONFIG_SCHED_SMT */
> @@ -6066,6 +6073,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>   	u64 time, cost;
>   	s64 delta;
>   	int cpu, nr = INT_MAX;
> +	int backup_cpu = -1;
> +	unsigned int backup_cap = 0;
>   
>   	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>   	if (!this_sd)
> @@ -6096,10 +6105,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>   			return -1;
>   		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>   			continue;
> -		if (idle_cpu(cpu))
> -			break;
> +		if (idle_cpu(cpu)) {
> +			if (full_capacity(cpu)) {
> +				backup_cpu = -1;
> +				break;
> +			} else if (capacity_of(cpu) > backup_cap) {
> +				backup_cap = capacity_of(cpu);
> +				backup_cpu = cpu;
> +			}
> +		}
>   	}
>   
> +	if (backup_cpu >= 0)
> +		cpu = backup_cpu;
>   	time = local_clock() - time;
>   	cost = this_sd->avg_scan_cost;
>   	delta = (s64)(time - cost) / 8;
> @@ -6116,13 +6134,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   	struct sched_domain *sd;
>   	int i;
>   
> -	if (idle_cpu(target))
> +	if (idle_cpu(target) && full_capacity(target))
>   		return target;
>   
>   	/*
>   	 * If the previous cpu is cache affine and idle, don't be stupid.
>   	 */
> -	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> +	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
> +	    && full_capacity(prev))
>   		return prev;
>   
>   	sd = rcu_dereference(per_cpu(sd_llc, target));

* [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-07 23:48 [PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
@ 2017-10-07 23:48 ` Rohit Jain
  2017-10-10 15:54   ` Atish Patra
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-10-07 23:48 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for CPUs to place running tasks on, the scheduler
completely ignores the capacity stolen away by RT/IRQ tasks. This patch
changes that behavior to also take the scaled capacity into account.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaede50..5b1f7b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			cpumask_clear_cpu(cpu, cpus);
-			if (!idle_cpu(cpu))
+			if (!idle_cpu(cpu) || !full_capacity(cpu))
 				idle = false;
 		}
 
@@ -6025,7 +6025,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6033,11 +6034,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
-	return -1;
+	return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6066,6 +6073,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6096,10 +6105,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -6116,13 +6134,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i;
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
 		return target;
 
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
 		return prev;
 
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-- 
2.7.4
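
To make the policy of the select_idle_cpu() hunk above concrete, here is a small user-space sketch with made-up capacities; full_capacity(), the 819/1024 margin, and all of the data are illustrative assumptions rather than code from the patch:

/*
 * Standalone illustration of the selection policy: return the first idle
 * CPU that still has full capacity, otherwise fall back to the idle CPU
 * with the most remaining capacity.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Pretend CPUs 1 and 2 are idle but pressured by IRQ/RT activity. */
static const unsigned int capacity[NR_CPUS]      = { 1024, 600, 700, 1024 };
static const unsigned int capacity_orig[NR_CPUS] = { 1024, 1024, 1024, 1024 };
static const bool cpu_idle[NR_CPUS]              = { false, true, true, false };

static bool full_capacity(int cpu)
{
	/* ~80% margin, as described in the cover letter (819/1024) */
	return capacity[cpu] >= ((capacity_orig[cpu] * 819) >> 10);
}

static int pick_idle_cpu(void)
{
	int cpu, backup_cpu = -1;
	unsigned int backup_cap = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_idle[cpu])
			continue;
		if (full_capacity(cpu))
			return cpu;		/* idle and unpressured: done */
		if (capacity[cpu] > backup_cap) {
			backup_cap = capacity[cpu];
			backup_cpu = cpu;	/* best pressured idle CPU so far */
		}
	}
	return backup_cpu;			/* -1 if nothing was idle */
}

int main(void)
{
	/*
	 * Neither idle CPU clears the 80% bar (819), so the better backup
	 * (CPU 2, capacity 700) is picked.
	 */
	printf("picked CPU %d\n", pick_idle_cpu());
	return 0;
}

This mirrors the patch's "Don't spare an idle CPU" philosophy: an idle CPU that is pressured by IRQ/RT activity is still preferred over no idle CPU at all.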

Thread overview: 17+ messages
2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
2017-09-26  2:51   ` joelaf
2017-09-26  4:40     ` Rohit Jain
2017-09-26  6:59       ` Joel Fernandes
2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
2017-09-26  6:53   ` Joel Fernandes
2017-09-26 19:48     ` Rohit Jain
2017-09-28 10:53       ` joelaf
2017-09-28 15:09         ` Rohit Jain
2017-10-03  4:52           ` Joel Fernandes
2017-10-04  0:21             ` Rohit Jain
2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
2017-10-01  0:32   ` kbuild test robot
2017-10-07 23:48 [PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
2017-10-07 23:48 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path Rohit Jain
2017-10-10 15:54   ` Atish Patra
2017-10-10 18:02     ` Rohit Jain
