* [RFC 0/2] sched: Make idle_balance smarter about topology
@ 2018-02-08 22:19 Rohit Jain
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  0 siblings, 2 replies; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

The current idle_balance() compares the average idle time of a CPU
against a single fixed migration cost. In reality there is a huge
difference in migration cost between CPUs on the same core, on
different cores, and on different sockets. Since sched_domain already
captures these architectural relationships, this patch encapsulates the
migration cost in the topology of the machine.
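
To make the approach concrete, here is a condensed sketch of the
mechanism patch 1 implements (names match the patch; the surrounding
idle_balance() code is elided):

        /*
         * Each sched_domain carries its own migration cost, assigned in
         * sd_init() based on the topology level.  idle_balance() can then
         * bail out per domain instead of using one global constant.
         */
        for_each_domain(this_cpu, sd) {
                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                /*
                 * Skip this domain if the expected idle time cannot cover
                 * the balancing overhead plus the topology-based penalty.
                 */
                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
                    sd->sched_migration_cost) {
                        update_next_balance(sd, &next_balance);
                        break;
                }

                /* ... otherwise attempt load_balance() in this domain ... */
        }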

Test Results:

* Wins:

1) hackbench results on a 44-core (22 cores per socket), 2-socket Intel x86
machine (lower is better):

+-------+----+-------+-------------------+--------------------------+
|       |    |       | Without patch     | With patch               |
+-------+----+-------+---------+---------+----------------+---------+
|Loops  |FD  |Groups | Average |%Std Dev |Average         |%Std Dev |
+-------+----+-------+---------+---------+----------------+---------+
|100000 |40  |4      | 9.701   |0.78     |9.623  (+0.81%) |3.67     |
|100000 |40  |8      | 17.186  |0.77     |17.068 (+0.68%) |1.89     |
|100000 |40  |16     | 30.378  |0.55     |30.072 (+1.52%) |0.46     |
|100000 |40  |32     | 54.712  |0.54     |53.588 (+2.28%) |0.21     |
+-------+----+-------+---------+---------+----------------+---------+

Note: I start with 4 groups because the standard deviation for groups 1
and 2 was very high.

2) sysbench MySQL results on a 2-socket, 44-core, 88-thread Intel x86
machine (higher is better):

+-----------+--------------------------------+-------------------------------+
|           | Without Patch                  | With Patch                    |
+-----------+--------+------------+----------+--------------------+----------+
|Approx.    | Num    |  Average   |          |  Average           |          |
|Utilization| Threads| throughput | %Std Dev | throughput         | %Std Dev |
+-----------+--------+------------+----------+--------------------+----------+
|10.00%     |    8   |  133658.2  |  0.66    |  135071.6 (+1.06%) | 1.39     |
|20.00%     |    16  |  266540    |  0.48    |  268417.4 (+0.70%) | 0.88     |
|40.00%     |    32  |  466315.6  |  0.15    |  468289.0 (+0.42%) | 0.45     |
|75.00%     |    64  |  720039.4  |  0.23    |  726244.2 (+0.86%) | 0.03     |
|82.00%     |    72  |  757284.4  |  0.25    |  769020.6 (+1.55%) | 0.18     |
|90.00%     |    80  |  807955.6  |  0.22    |  818989.4 (+1.37%) | 0.22     |
|98.00%     |    88  |  863173.8  |  0.25    |  876121.8 (+1.50%) | 0.28     |
|100.00%    |    96  |  882950.8  |  0.32    |  890678.8 (+0.88%) | 0.51     |
|100.00%    |    128 |  895112.6  |  0.13    |  899149.6 (+0.47%) | 0.44     |
+-----------+--------+------------+----------+--------------------+----------+

* No change:

3) tbench sample results on a 2-socket, 44-core, 88-thread Intel x86
machine:

With Patch:

Throughput 555.834 MB/sec  2 clients    2 procs  max_latency=0.330 ms
Throughput 1388.19 MB/sec  5 clients    5 procs  max_latency=3.666 ms
Throughput 2737.96 MB/sec  10 clients  10 procs  max_latency=1.646 ms
Throughput 5220.17 MB/sec  20 clients  20 procs  max_latency=3.666 ms
Throughput 8324.46 MB/sec  40 clients  40 procs  max_latency=0.732 ms

Without patch:

Throughput 557.142 MB/sec  2 clients    2 procs  max_latency=0.264 ms
Throughput 1381.59 MB/sec  5 clients    5 procs  max_latency=0.335 ms
Throughput 2726.84 MB/sec  10 clients  10 procs  max_latency=0.352 ms
Throughput 5230.12 MB/sec  20 clients  20 procs  max_latency=1.632 ms
Throughput 8474.5 MB/sec  40 clients   40 procs  max_latency=7.756 ms

Note: high variation in max_latency was observed across runs.

Rohit Jain (2):
  sched: reduce migration cost between faster caches for idle_balance
  Introduce sysctl(s) for the migration costs

 include/linux/sched/sysctl.h   |  2 ++
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 10 ++++++----
 kernel/sched/topology.c        |  5 +++++
 kernel/sysctl.c                | 14 ++++++++++++++
 5 files changed, 28 insertions(+), 4 deletions(-)

-- 
2.7.4


* [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-08 22:19 [RFC 0/2] sched: Make idle_balance smarter about topology Rohit Jain
@ 2018-02-08 22:19 ` Rohit Jain
  2018-02-09  3:42   ` Mike Galbraith
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

This patch makes idle_balance more dynamic, as the sched_migration_cost
is now accounted at the sched_domain level. The per-domain values are
set in sd_init, where the topology relationships are known.

For introduction's sake, the cost of migration within the same core is
set to 0, across cores to 50 usec, and across sockets to 500 usec.
sysctls for these variables are introduced in patch 2.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 6 +++---
 kernel/sched/topology.c        | 5 +++++
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..bcb4db2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -104,6 +104,7 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long next_decay_max_lb_cost;
 
+	u64 sched_migration_cost;
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fe3aa8..61d3508 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	if (!this_rq->rd->overload) {
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
@@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
+		    sd->sched_migration_cost) {
 			update_next_balance(sd, &next_balance);
 			break;
 		}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..bcd8c64 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
+		sd->sched_migration_cost = 0;
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
+		sd->sched_migration_cost = 500000UL;
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
@@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 2;
 
 		sd->flags |= SD_SERIALIZE;
+		sd->sched_migration_cost = 5000000UL;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
@@ -1174,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
+		sd->sched_migration_cost = 5000000UL;
 	}
 
 	/*
@@ -1622,6 +1626,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 		}
 
 	}
+
 	set_domain_attribute(sd, attr);
 
 	return sd;
-- 
2.7.4


* [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 [RFC 0/2] sched: Make idle_balance smarter about topology Rohit Jain
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
@ 2018-02-08 22:19 ` Rohit Jain
  2018-02-09  3:54   ` Mike Galbraith
  2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 2 replies; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

This patch introduces sysctls for the sched_domain based migration
costs. These can be used for performance tuning of workloads.
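
For illustration, a minimal userspace sketch of tuning one of these
knobs; the proc path follows the kern_table entries added below, and
the value written is an example only:

        /* Example only: set the cross-core migration cost to 50 usec
         * (50000 ns) via the new sched_core_migration_cost_ns sysctl. */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/sys/kernel/sched_core_migration_cost_ns", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fprintf(f, "%u\n", 50000U);
                fclose(f);
                return 0;
        }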

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c          |  4 +++-
 kernel/sched/topology.c      |  8 ++++----
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..d597f6c 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_core_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_thread_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
 extern __read_mostly unsigned int sysctl_sched_time_avg;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d3508..f395adc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -99,7 +99,9 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
 unsigned int sysctl_sched_wakeup_granularity		= 1000000UL;
 unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
-const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
+const_debug unsigned int sysctl_sched_migration_cost		= 500000UL;
+const_debug unsigned int sysctl_sched_core_migration_cost	= 500000UL;
+const_debug unsigned int sysctl_sched_thread_migration_cost	= 0UL;
 
 #ifdef CONFIG_SMP
 /*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bcd8c64..fc147db 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,14 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->sched_migration_cost = 0;
+		sd->sched_migration_cost = sysctl_sched_thread_migration_cost;
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
-		sd->sched_migration_cost = 500000UL;
+		sd->sched_migration_cost = sysctl_sched_core_migration_cost;
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
@@ -1164,7 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 2;
 
 		sd->flags |= SD_SERIALIZE;
-		sd->sched_migration_cost = 5000000UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
@@ -1177,7 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
-		sd->sched_migration_cost = 5000000UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d467..0920795 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -356,6 +356,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_core_migration_cost_ns",
+		.data		= &sysctl_sched_core_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_thread_migration_cost_ns",
+		.data		= &sysctl_sched_thread_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_nr_migrate",
 		.data		= &sysctl_sched_nr_migrate,
 		.maxlen		= sizeof(unsigned int),
-- 
2.7.4


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
@ 2018-02-09  3:42   ` Mike Galbraith
  2018-02-09 16:08     ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09  3:42 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani

On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch makes idle_balance more dynamic, as the sched_migration_cost
> is now accounted at the sched_domain level. The per-domain values are
> set in sd_init, where the topology relationships are known.
> 
> For introduction's sake, the cost of migration within the same core is
> set to 0, across cores to 50 usec, and across sockets to 500 usec.
> sysctls for these variables are introduced in patch 2.
> 
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  include/linux/sched/topology.h | 1 +
>  kernel/sched/fair.c            | 6 +++---
>  kernel/sched/topology.c        | 5 +++++
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index cf257c2..bcb4db2 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -104,6 +104,7 @@ struct sched_domain {
>  	u64 max_newidle_lb_cost;
>  	unsigned long next_decay_max_lb_cost;
>  
> +	u64 sched_migration_cost;
>  	u64 avg_scan_cost;		/* select_idle_sibling */
>  
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2fe3aa8..61d3508 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	 */
>  	rq_unpin_lock(this_rq, rf);
>  
> -	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -	    !this_rq->rd->overload) {
> +	if (!this_rq->rd->overload) {
>  		rcu_read_lock();
>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  		if (sd)

Unexplained/unrelated change.

> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		if (!(sd->flags & SD_LOAD_BALANCE))
>  			continue;
>  
> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> +		    sd->sched_migration_cost) {
>  			update_next_balance(sd, &next_balance);
>  			break;
>  		}

Ditto.

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 034cbed..bcd8c64 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 110;
>  		sd->smt_gain = 1178; /* ~15% */
> +		sd->sched_migration_cost = 0;
>  
>  	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>  		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 117;
>  		sd->cache_nice_tries = 1;
>  		sd->busy_idx = 2;
> +		sd->sched_migration_cost = 500000UL;
>  
>  #ifdef CONFIG_NUMA
>  	} else if (sd->flags & SD_NUMA) {
> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->idle_idx = 2;
>  
>  		sd->flags |= SD_SERIALIZE;
> +		sd->sched_migration_cost = 5000000UL;

That's not 500us.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
@ 2018-02-09  3:54   ` Mike Galbraith
  2018-02-09 16:10     ` Steven Sistare
  2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09  3:54 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani

On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch introduces sysctls for the sched_domain based migration
> costs. These can be used for performance tuning of workloads.

With this patch, we trade 1 completely bogus constant (cost is really
highly variable) for 3, twiddling of which has zero effect unless you
trigger a domain rebuild afterward, which is neither mentioned in the
changelog, nor documented.

bogo-numbers++ is kinda hard to love.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-09  3:42   ` Mike Galbraith
@ 2018-02-09 16:08     ` Steven Sistare
  2018-02-10  6:37       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 16:08 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/8/2018 10:42 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch makes idle_balance more dynamic, as the sched_migration_cost
>> is now accounted at the sched_domain level. The per-domain values are
>> set in sd_init, where the topology relationships are known.
>>
>> For introduction's sake, the cost of migration within the same core is
>> set to 0, across cores to 50 usec, and across sockets to 500 usec.
>> sysctls for these variables are introduced in patch 2.
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>  include/linux/sched/topology.h | 1 +
>>  kernel/sched/fair.c            | 6 +++---
>>  kernel/sched/topology.c        | 5 +++++
>>  3 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index cf257c2..bcb4db2 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -104,6 +104,7 @@ struct sched_domain {
>>  	u64 max_newidle_lb_cost;
>>  	unsigned long next_decay_max_lb_cost;
>>  
>> +	u64 sched_migration_cost;
>>  	u64 avg_scan_cost;		/* select_idle_sibling */
>>  
>>  #ifdef CONFIG_SCHEDSTATS
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2fe3aa8..61d3508 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	 */
>>  	rq_unpin_lock(this_rq, rf);
>>  
>> -	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> -	    !this_rq->rd->overload) {
>> +	if (!this_rq->rd->overload) {
>>  		rcu_read_lock();
>>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  		if (sd)
> 
> Unexplained/unrelated change.

This could be explained better in the cover letter, but it is related; this and the
change below are the meat of the patch.  The deleted test "this_rq->avg_idle <
sysctl_sched_migration_cost" formerly bailed based on a single global notion of 
migration cost, independent of sd.  Now the cost is per-sd, evaluated in the sd loop 
below.  The other condition to bail early, "!this_rq->rd->overload" is still relevant 
and remains.
 
>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>  			continue;
>>  
>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>> +		    sd->sched_migration_cost) {
>>  			update_next_balance(sd, &next_balance);
>>  			break;
>>  		}
> 
> Ditto.

The old code did not migrate if the expected costs exceeded the expected idle
time.  The new code just adds the sd-specific penalty (essentially loss of cache
footprint) to the costs.  The for_each_domain loop visits smallest to largest
sd's, hence smallest to largest migration costs (though the tunables do
not enforce an ordering), and bails at the first sd where the total cost is a loss.
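
A worked example, assuming the costs the commit message intends
(0 / 50 usec / 500 usec for SMT / core / socket) and negligible
measured balancing costs; the numbers are illustrative only:

        /*
         * Suppose this_rq->avg_idle = 300 usec, with curr_cost and
         * max_newidle_lb_cost both negligible:
         *
         *   SMT  domain: 300 usec >= 0 usec    -> keep balancing
         *   core domain: 300 usec >= 50 usec   -> keep balancing
         *   NUMA domain: 300 usec <  500 usec  -> update_next_balance(); break
         *
         * So this CPU may pull work from within its core and package,
         * but not from a remote socket.
         */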

>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 034cbed..bcd8c64 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>>  		sd->flags |= SD_PREFER_SIBLING;
>>  		sd->imbalance_pct = 110;
>>  		sd->smt_gain = 1178; /* ~15% */
>> +		sd->sched_migration_cost = 0;
>>  
>>  	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>>  		sd->flags |= SD_PREFER_SIBLING;
>>  		sd->imbalance_pct = 117;
>>  		sd->cache_nice_tries = 1;
>>  		sd->busy_idx = 2;
>> +		sd->sched_migration_cost = 500000UL;
>>  
>>  #ifdef CONFIG_NUMA
>>  	} else if (sd->flags & SD_NUMA) {
>> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>  		sd->idle_idx = 2;
>>  
>>  		sd->flags |= SD_SERIALIZE;
>> +		sd->sched_migration_cost = 5000000UL;
> 
> That's not 500us.

Good catch, thanks.  It's 5000us but should be 500. The latest version of Rohit's patch 
lost a little performance vs the previous version, and this might explain why. 
Re-testing may bring good news.

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09  3:54   ` Mike Galbraith
@ 2018-02-09 16:10     ` Steven Sistare
  2018-02-09 17:08       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 16:10 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch introduces sysctls for the sched_domain based migration
>> costs. These can be used for performance tuning of workloads.
> 
> With this patch, we trade 1 completely bogus constant (cost is really
> highly variable) for 3, twiddling of which has zero effect unless you
> trigger a domain rebuild afterward, which is neither mentioned in the
> changelog, nor documented.
> 
> bogo-numbers++ is kinda hard to love.

Yup, the domain rebuild is missing.

I am no fan of tunables, the fewer the better, but one of the several flaws
of the single figure for migration cost is that it ignores the very large
difference in cost when migrating between near vs far levels of the cache hierarchy.
Migration between CPUs of the same core should be free, as they share L1 cache.
Rohit defined a tunable for it, but IMO it could be hard coded to 0. Migration 
between CPUs in different sockets is the most expensive and is represented by
the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
the same core cluster, or in the same socket, is somewhere in between, as
they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
sysctl_sched_migration_cost / 10.
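
A minimal sketch of that alternative, reusing the sd_init() branches
from Rohit's patch (a hypothetical variant, not what the RFC
implements):

        /* Hypothetical sd_init() fragment: derive every level's cost
         * from the one existing tunable instead of adding new ones. */
        if (sd->flags & SD_SHARE_CPUCAPACITY) {
                /* SMT siblings share L1: treat the cache-loss cost as zero. */
                sd->sched_migration_cost = 0;
        } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
                /* Same package, shared L2/L3: a fraction of the full cost. */
                sd->sched_migration_cost = sysctl_sched_migration_cost / 10;
        } else {
                /* Cross-socket / NUMA: the existing full cost. */
                sd->sched_migration_cost = sysctl_sched_migration_cost;
        }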

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 16:10     ` Steven Sistare
@ 2018-02-09 17:08       ` Mike Galbraith
  2018-02-09 17:33         ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09 17:08 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> > On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> >> This patch introduces sysctls for the sched_domain based migration
> >> costs. These can be used for performance tuning of workloads.
> > 
> > With this patch, we trade 1 completely bogus constant (cost is really
> > highly variable) for 3, twiddling of which has zero effect unless you
> > trigger a domain rebuild afterward, which is neither mentioned in the
> > changelog, nor documented.
> > 
> > bogo-numbers++ is kinda hard to love.
> 
> Yup, the domain rebuild is missing.
> 
> I am no fan of tunables, the fewer the better, but one of the several flaws
> of the single figure for migration cost is that it ignores the very large
> difference in cost when migrating between near vs far levels of the cache hierarchy.
> Migration between CPUs of the same core should be free, as they share L1 cache.
> Rohit defined a tunable for it, but IMO it could be hard coded to 0.

That cost is never really 0 in the context of load balancing, as the
load balancing machinery is non-free.  When the idle_balance() throttle
was added, that was done to mitigate the (at that time) quite high cost
to high frequency cross core scheduling ala localhost communication.

>  Migration 
> between CPUs in different sockets is the most expensive and is represented by
> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
> the same core cluster, or in the same socket, is somewhere in between, as
> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
> sysctl_sched_migration_cost / 10.

Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
number was generated via measurement, but the end result was just as
bogus as a number pulled out of the ether.  How much bandwidth you have
when blasting data to/from wherever says nothing about misses you avoid
vs those you generate.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 17:08       ` Mike Galbraith
@ 2018-02-09 17:33         ` Steven Sistare
  2018-02-09 17:50           ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 17:33 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
>> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
>>> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>>>> This patch introduces sysctls for the sched_domain based migration
>>>> costs. These can be used for performance tuning of workloads.
>>>
>>> With this patch, we trade 1 completely bogus constant (cost is really
>>> highly variable) for 3, twiddling of which has zero effect unless you
>>> trigger a domain rebuild afterward, which is neither mentioned in the
>>> changelog, nor documented.
>>>
>>> bogo-numbers++ is kinda hard to love.
>>
>> Yup, the domain rebuild is missing.
>>
>> I am no fan of tunables, the fewer the better, but one of the several flaws
>> of the single figure for migration cost is that it ignores the very large
>> difference in cost when migrating between near vs far levels of the cache hierarchy.
>> Migration between CPUs of the same core should be free, as they share L1 cache.
>> Rohit defined a tunable for it, but IMO it could be hard coded to 0.
> 
> That cost is never really 0 in the context of load balancing, as the
> load balancing machinery is non-free.  When the idle_balance() throttle
> was added, that was done to mitigate the (at that time) quite high cost
> to high frequency cross core scheduling ala localhost communication.

I was imprecise.  The cache-loss component of cost as represented by 
sched_migration_cost should be 0 in this case.  The cost of the machinery
is non-zero and remains in the code, and can still prevent migration.
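
Concretely, with sd->sched_migration_cost == 0 the per-domain test from
patch 1 degenerates to the existing overhead check, so the measured
cost of the machinery alone can still stop a newidle balance:

        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
            sd->sched_migration_cost /* == 0 at the SMT level */) {
                update_next_balance(sd, &next_balance);
                break;
        }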

>>  Migration 
>> between CPUs in different sockets is the most expensive and is represented by
>> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
>> the same core cluster, or in the same socket, is somewhere in between, as
>> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
>> sysctl_sched_migration_cost / 10.
> 
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> number was generated via measurement, but the end result was just as
> bogus as a number pulled out of the ether.  How much bandwidth you have
> when blasting data to/from wherever says nothing about misses you avoid
> vs those you generate.

Yes, yes and yes. I cannot make the original tunable less bogus.  Using a smaller
cost for closer caches still makes logical sense and is supported by the data.

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 17:33         ` Steven Sistare
@ 2018-02-09 17:50           ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09 17:50 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 12:33 -0500, Steven Sistare wrote:
> On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> 
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> > number was generated via measurement, but the end result was just as
> > bogus as a number pulled out of the ether.  How much bandwidth you have
> > when blasting data to/from wherever says nothing about misses you avoid
> > vs those you generate.
> 
> Yes, yes and yes. I cannot make the original tunable less bogus.  Using a smaller
> cost for closer caches still makes logical sense and is supported by the data.

You forgot to write "microscopic" before "data" :)  I'm mostly agnostic
about this, but don't like yet more knobs that 99.99% won't touch.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-09 16:08     ` Steven Sistare
@ 2018-02-10  6:37       ` Mike Galbraith
  2018-02-15 16:35         ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-10  6:37 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> >> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> >>  		if (!(sd->flags & SD_LOAD_BALANCE))
> >>  			continue;
> >>  
> >> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> >> +		    sd->sched_migration_cost) {
> >>  			update_next_balance(sd, &next_balance);
> >>  			break;
> >>  		}
> > 
> > Ditto.
> 
> The old code did not migrate if the expected costs exceeded the expected idle
> time.  The new code just adds the sd-specific penalty (essentially loss of cache
> footprint) to the costs.  The for_each_domain loop visits smallest to largest
> sd's, hence smallest to largest migration costs (though the tunables do
> not enforce an ordering), and bails at the first sd where the total cost is a loss.

Hrm..

You're now adding a hypothetical cost to the measured cost of running
the LB machinery, which implies that the measurement is insufficient,
but you still don't say why it is insufficient.  What happens if you
don't do that?  I ask, because when I removed the...

   this_rq->avg_idle < sysctl_sched_migration_cost

...bits to check removal effect for Peter, the original reason for it
being added did not re-materialize, making me wonder why you need to
make this cutoff more aggressive.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  2018-02-09  3:54   ` Mike Galbraith
@ 2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-02-12 15:28 UTC (permalink / raw)
  To: Rohit Jain
  Cc: linux-kernel, mingo, steven.sistare, joelaf, jbacik, riel,
	juri.lelli, dhaval.giani, efault

On Thu, Feb 08, 2018 at 02:19:55PM -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

Smells like a bad attempt to (again) revive commit:

  0437e109e184 ("sched: zap the migration init / cache-hot balancing code")

Yes, the migration cost would ideally be per domain; in practice it all
sucks because more tunables means more confusion. And as that commit
states, runtime measurements suck too: they cause run-to-run variation,
which causes repeatability issues, and degrade boot times.

Static numbers suck worse, because they'll be wrong for everyone.


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-10  6:37       ` Mike Galbraith
@ 2018-02-15 16:35         ` Steven Sistare
  2018-02-15 18:07           ` Mike Galbraith
  2018-02-15 18:07           ` Rohit Jain
  0 siblings, 2 replies; 18+ messages in thread
From: Steven Sistare @ 2018-02-15 16:35 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>  			continue;
>>>>  
>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>> +		    sd->sched_migration_cost) {
>>>>  			update_next_balance(sd, &next_balance);
>>>>  			break;
>>>>  		}
>>>
>>> Ditto.
>>
>> The old code did not migrate if the expected costs exceeded the expected idle
>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>> sd's, hence smallest to largest migration costs (though the tunables do
>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
> 
> Hrm..
> 
> You're now adding a hypothetical cost to the measured cost of running
> the LB machinery, which implies that the measurement is insufficient,
> but you still don't say why it is insufficient.  What happens if you
> don't do that?  I ask, because when I removed the...
> 
>    this_rq->avg_idle < sysctl_sched_migration_cost
> 
> ...bits to check removal effect for Peter, the original reason for it
> being added did not re-materialize, making me wonder why you need to
> make this cutoff more aggressive.

The current code with sysctl_sched_migration_cost discourages migration
too much, per our test results.  Deleting it entirely from idle_balance()
may be the right solution, or it may allow too much migration and
cause regressions due to loss of cache warmth on some workloads.
Rohit's patch deletes it and adds the sd->sched_migration_cost term
to allow a migration rate that is somewhere in the middle, and is
logically sound.  It discourages but does not prevent migration between
nodes, and encourages but does not always allow migration between cores.
By contrast, setting relax_domain_level to disable SD_BALANCE_NEWIDLE
at the SD_NUMA level is a big hammer.

I would be perfectly happy if deleting sysctl_sched_migration_cost from
idle_balance does the trick.  Last week in a different thread you mentioned
it did not hurt tbench:

>> Mike, do you remember what comes apart when we take
>> out the sysctl_sched_migration_cost test in idle_balance()?
>
> Used to be anything scheduling cross-core heftily suffered, ie pretty
> much any localhost communication heavy load.  I just tried disabling it
> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
> difference.  I presume that's due to the meanwhile added
> this_rq->rd->overload and/or curr_cost checks.

Can you provide more details on the sysbench oltp test that motivated you
to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
   1b9508f6 sched: Rate-limit newidle
   Rate limit newidle to migration_cost. It's a win for all stages of
   sysbench oltp tests.

Rohit is running more tests: one set with a patch that deletes
sysctl_sched_migration_cost from idle_balance, and one set with his
patch with the 5000 usec mistake corrected back to 500 usec.  So far
both give improvements over the baseline, but for different cases, so
we need to try more workloads before we draw any conclusions.

Rohit, can you share your data so far?

- Steve


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 16:35         ` Steven Sistare
@ 2018-02-15 18:07           ` Mike Galbraith
  2018-02-15 18:21             ` Steven Sistare
  2018-02-15 18:07           ` Rohit Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-15 18:07 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> > On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> >>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> >>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
> >>>>  			continue;
> >>>>  
> >>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> >>>> +		    sd->sched_migration_cost) {
> >>>>  			update_next_balance(sd, &next_balance);
> >>>>  			break;
> >>>>  		}
> >>>
> >>> Ditto.
> >>
> >> The old code did not migrate if the expected costs exceeded the expected idle
> >> time.  The new code just adds the sd-specific penalty (essentially loss of cache
> >> footprint) to the costs.  The for_each_domain loop visits smallest to largest
> >> sd's, hence smallest to largest migration costs (though the tunables do
> >> not enforce an ordering), and bails at the first sd where the total cost is a loss.
> > 
> > Hrm..
> > 
> > You're now adding a hypothetical cost to the measured cost of running
> > the LB machinery, which implies that the measurement is insufficient,
> > but you still don't say why it is insufficient.  What happens if you
> > don't do that?  I ask, because when I removed the...
> > 
> >    this_rq->avg_idle < sysctl_sched_migration_cost
> > 
> > ...bits to check removal effect for Peter, the original reason for it
> > being added did not re-materialize, making me wonder why you need to
> > make this cutoff more aggressive.
> 
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.

That's why I asked you what happens if you only whack the _apparently_
(but maybe not) obsolete old throttle; it appeared likely that your win
came from allowing a bit more migration than the simple throttle
allowed, which, if true, would obviate the need for anything more.

> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?

The problem at that time was the cycle overhead of entering that LB
path at high frequency.  Dirt simple.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 16:35         ` Steven Sistare
  2018-02-15 18:07           ` Mike Galbraith
@ 2018-02-15 18:07           ` Rohit Jain
  2018-02-16  4:53             ` Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Rohit Jain @ 2018-02-15 18:07 UTC (permalink / raw)
  To: Steven Sistare, Mike Galbraith, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani



On 02/15/2018 08:35 AM, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>   		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>   			continue;
>>>>>   
>>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>> +		    sd->sched_migration_cost) {
>>>>>   			update_next_balance(sd, &next_balance);
>>>>>   			break;
>>>>>   		}
>>>> Ditto.
>>> The old code did not migrate if the expected costs exceeded the expected idle
>>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>>> sd's, hence smallest to largest migration costs (though the tunables do
>>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
>> Hrm..
>>
>> You're now adding a hypothetical cost to the measured cost of running
>> the LB machinery, which implies that the measurement is insufficient,
>> but you still don't say why it is insufficient.  What happens if you
>> don't do that?  I ask, because when I removed the...
>>
>>     this_rq->avg_idle < sysctl_sched_migration_cost
>>
>> ...bits to check removal effect for Peter, the original reason for it
>> being added did not re-materialize, making me wonder why you need to
>> make this cutoff more aggressive.
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.  Deleting it entirely from idle_balance()
> may be the right solution, or it may allow too much migration and
> cause regressions due to loss of cache warmth on some workloads.
> Rohit's patch deletes it and adds the sd->sched_migration_cost term
> to allow a migration rate that is somewhere in the middle, and is
> logically sound.  It discourages but does not prevent migration between
> nodes, and encourages but does not always allow migration between cores.
> By contrast, setting relax_domain_level to disable SD_BALANCE_NEWIDLE
> at the SD_NUMA level is a big hammer.
>
> I would be perfectly happy if deleting sysctl_sched_migration_cost from
> idle_balance does the trick.  Last week in a different thread you mentioned
> it did not hurt tbench:
>
>>> Mike, do you remember what comes apart when we take
>>> out the sysctl_sched_migration_cost test in idle_balance()?
>> Used to be anything scheduling cross-core heftily suffered, ie pretty
>> much any localhost communication heavy load.  I just tried disabling it
>> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
>> difference.  I presume that's due to the meanwhile added
>> this_rq->rd->overload and/or curr_cost checks.
> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
>     1b9508f6 sched: Rate-limit newidle
>     Rate limit newidle to migration_cost. It's a win for all stages of
>     sysbench oltp tests.
>
> Rohit is running more tests: one set with a patch that deletes
> sysctl_sched_migration_cost from idle_balance, and one set with his
> patch with the 5000 usec mistake corrected back to 500 usec.  So far
> both give improvements over the baseline, but for different cases, so
> we need to try more workloads before we draw any conclusions.
>
> Rohit, can you share your data so far?

Results:

In the following results, the "Domain based" approach is the one sent
out in this RFC, with the values fixed (as pointed out by Mike). "No
check" is the patch that simply removes the check against
sysctl_sched_migration_cost (sketched below).
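
A sketch of that change, assuming it is exactly the removal of the
global-constant test from idle_balance():

-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	if (!this_rq->rd->overload) {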

1) Hackbench results on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):

+--------------+-----------------+--------------------------+-------------------------+
|              | Without Patch   | Domain Based             | No Check                |
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|Loops | Groups|Average |%Std Dev|Average          |%Std Dev|Average         |%Std Dev|
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|100000| 4     |9.701   |0.78    |7.971  (+17.84%) | 1.34   |8.919  (+8.07%) |1.07    |
|100000| 8     |17.186  |0.77    |16.712 (+2.76%)  | 0.87   |17.043 (+0.83%) |0.83    |
|100000| 16    |30.378  |0.55    |29.780 (+1.97%)  | 0.38   |29.565 (+2.67%) |0.29    |
|100000| 32    |54.712  |0.54    |53.001 (+3.13%)  | 0.19   |52.158 (+4.67%) |0.22    |
+------+-------+--------+--------+-----------------+--------+----------------+--------+

2) Sysbench MySQL results on a 2-socket, 44-core, 88-thread Intel x86
machine (higher is better):

+-------+--------------------+----------------------------+----------------------------+
|       | Without Patch      | Domain based               | No check                   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|Num    | Average   |        | Average           |        | Average           |        |
|Threads| throughput|%Std Dev| throughput        |%Std Dev| throughput        |%Std Dev|
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|    8  | 133658.2  | 0.66   | 134909.4 (+0.94%) | 0.94   | 134232.2 (+0.43%) | 1.29   |
|   16  | 266540    | 0.48   | 268253.4 (+0.64%) | 0.64   | 268584.6 (+0.77%) | 0.37   |
|   32  | 466315.6  | 0.15   | 465903.6 (-0.09%) | 0.28   | 468594.2 (+0.49%) | 0.23   |
|   64  | 720039.4  | 0.23   | 725663.8 (+0.78%) | 0.42   | 717253.8 (-0.39%) | 0.36   |
|   72  | 757284.4  | 0.25   | 770693.4 (+1.77%) | 0.29   | 764984.0 (+1.02%) | 0.38   |
|   80  | 807955.6  | 0.22   | 818446.0 (+1.30%) | 0.24   | 831372.2 (+2.90%) | 0.10   |
|   88  | 863173.8  | 0.25   | 870520.4 (+0.85%) | 0.23   | 887049.0 (+2.77%) | 0.56   |
|   96  | 882950.8  | 0.32   | 890775.4 (+0.89%) | 0.40   | 892913.8 (+1.13%) | 0.41   |
|  128  | 895112.6  | 0.13   | 898524.2 (+0.38%) | 0.16   | 901195.0 (+0.68%) | 0.28   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+

Thanks,
Rohit


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:07           ` Mike Galbraith
@ 2018-02-15 18:21             ` Steven Sistare
  2018-02-15 18:39               ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-15 18:21 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/15/2018 1:07 PM, Mike Galbraith wrote:
> On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
>> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>>  			continue;
>>>>>>  
>>>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>>> +		    sd->sched_migration_cost) {
>>>>>>  			update_next_balance(sd, &next_balance);
>>>>>>  			break;
>>>>>>  		}
>>>>>
>>>>> Ditto.
>>>>
>>>> The old code did not migrate if the expected costs exceeded the expected idle
>>>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>>>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>>>> sd's, hence smallest to largest migration costs (though the tunables do
>>>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
>>>
>>> Hrm..
>>>
>>> You're now adding a hypothetical cost to the measured cost of running
>>> the LB machinery, which implies that the measurement is insufficient,
>>> but you still don't say why it is insufficient.  What happens if you
>>> don't do that?  I ask, because when I removed the...
>>>
>>>    this_rq->avg_idle < sysctl_sched_migration_cost
>>>
>>> ...bits to check removal effect for Peter, the original reason for it
>>> being added did not re-materialize, making me wonder why you need to
>>> make this cutoff more aggressive.
>>
>> The current code with sysctl_sched_migration_cost discourages migration
>> too much, per our test results.
> 
> That's why I asked you what happens if you only whack the _apparently_
> (but maybe not) obsolete old throttle; it appeared likely that your win
> came from allowing a bit more migration than the simple throttle
> allowed, which, if true, would obviate the need for anything more.
> 
>> Can you provide more details on the sysbench oltp test that motivated you
>> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
> 
> The problem at that time was the cycle overhead of entering that LB
> path at high frequency.  Dirt simple.

I get that. I meant: please provide details on the test parameters and
config, if you remember them.

- Steve


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:21             ` Steven Sistare
@ 2018-02-15 18:39               ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-15 18:39 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 13:21 -0500, Steven Sistare wrote:
> On 2/15/2018 1:07 PM, Mike Galbraith wrote:
> 
> >> Can you provide more details on the sysbench oltp test that motivated you
> >> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
> > 
> > The problem at that time was the cycle overhead of entering that LB
> > path at high frequency.  Dirt simple.
> 
> I get that. I meant please provide details on test parameters and config if
> you remember them.

Nope.  I doubt it would be relevant to here/now anyway.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:07           ` Rohit Jain
@ 2018-02-16  4:53             ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-16  4:53 UTC (permalink / raw)
  To: Rohit Jain, Steven Sistare, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 10:07 -0800, Rohit Jain wrote:
> 
> > Rohit is running more tests: one set with a patch that deletes
> > sysctl_sched_migration_cost from idle_balance, and one set with his
> > patch with the 5000 usec mistake corrected back to 500 usec.  So far
> > both give improvements over the baseline, but for different cases, so
> > we need to try more workloads before we draw any conclusions.
> >
> > Rohit, can you share your data so far?
> 
> Results:
> 
> In the following results, "Domain based" approach is as mentioned in the
> RFC sent out with the values fixed (As pointed out by Mike). "No check" is
> the patch where I just remove the check against sysctl_sched_migration_cost
> 
> 1) Hackbench results on a 2-socket, 44-core, 88-thread Intel x86 machine
> (lower is better):
> 
> +--------------+-----------------+--------------------------+-------------------------+
> |              | Without Patch   |Domain Based              |No Check                 |
> +------+-------+--------+--------+-----------------+--------+----------------+--------+
> |Loops | Groups|Average |%Std Dev|Average          |%Std Dev|Average         |%Std Dev|
> +------+-------+--------+--------+-----------------+--------+----------------+--------+
> |100000| 4     |9.701   |0.78    |7.971  (+17.84%) | 1.34   |8.919  (+8.07%) |1.07    |
> |100000| 8     |17.186  |0.77    |16.712 (+2.76%)  | 0.87   |17.043 (+0.83%) |0.83    |
> |100000| 16    |30.378  |0.55    |29.780 (+1.97%)  | 0.38   |29.565 (+2.67%) |0.29    |
> |100000| 32    |54.712  |0.54    |53.001 (+3.13%)  | 0.19   |52.158 (+4.67%) |0.22    |
> +------+-------+--------+--------+-----------------+--------+----------------+--------+

Previous numbers:

+-------+----+-------+-------------------+--------------------------+
|       |    |       | Without patch     |With patch                |
+-------+----+-------+---------+---------+----------------+---------+
|Loops  |FD  |Groups | Average |%Std Dev |Average         |%Std Dev |
+-------+----+-------+---------+---------+----------------+---------+
|100000 |40  |4      | 9.701   |0.78     |9.623  (+0.81%) |3.67     |
|100000 |40  |8      | 17.186  |0.77     |17.068 (+0.68%) |1.89     |
|100000 |40  |16     | 30.378  |0.55     |30.072 (+1.52%) |0.46     |
|100000 |40  |32     | 54.712  |0.54     |53.588 (+2.28%) |0.21     |
+-------+----+-------+---------+---------+----------------+---------+

My take on this (not that you have to sell it to me, you don't) when I
squint at these together is: submit the one-liner, and take the rest
back to the drawing board.  You've got nothing but high std dev numbers
in (imo) way too finicky/unrealistic hackbench to sell these
not-so-pretty patches.

I bet you can easily sell that one-liner, because it removes an old
wart (me stealing migration_cost in the first place), instead of making
the wart a whole lot harder to intentionally not notice.

	-Mike

