[RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path

From: subhra mazumdar @ 2018-04-24  0:41 UTC
  To: linux-kernel
  Cc: peterz, mingo, daniel.lezcano, steven.sistare, dhaval.giani,
	rohit.k.jain, subhra.mazumdar

Currently, select_idle_sibling() first tries to find a fully idle core using
select_idle_core(), which can potentially search all cores. If that fails, it
looks for any idle cpu using select_idle_cpu(), which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and will
only get worse as core counts grow.
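
To make the scan cost concrete, here is a minimal user-space sketch of the
two-stage search described above. It is not the kernel code: the flat idle[]
array, the fixed two-threads-per-core layout, the scanned counter and the
names find_idle_core(), find_idle_cpu() and pick_cpu() are all assumptions
made for illustration.

```c
#include <stdbool.h>

#define NR_CPUS   8   /* hypothetical LLC size */
#define SMT_WIDTH 2   /* hypothetical: two hardware threads per core */

static int scanned;   /* number of cpus examined, to expose the scan cost */

/* Stage 1: return the first cpu of a core whose threads are all idle, or -1. */
static int find_idle_core(const bool idle[NR_CPUS])
{
	for (int core = 0; core < NR_CPUS / SMT_WIDTH; core++) {
		bool all_idle = true;

		for (int t = 0; t < SMT_WIDTH; t++) {
			scanned++;
			if (!idle[core * SMT_WIDTH + t])
				all_idle = false;
		}
		if (all_idle)
			return core * SMT_WIDTH;
	}
	return -1;
}

/* Stage 2: return the first idle cpu in the domain, or -1. */
static int find_idle_cpu(const bool idle[NR_CPUS])
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		scanned++;
		if (idle[cpu])
			return cpu;
	}
	return -1;
}

/* Try for a fully idle core first, then fall back to any idle cpu. */
static int pick_cpu(const bool idle[NR_CPUS])
{
	int cpu = find_idle_core(idle);

	return cpu >= 0 ? cpu : find_idle_cpu(idle);
}
```

With only the last cpu idle, this sketch examines every cpu twice before
returning it, which is the kind of worst case the paragraph above refers to.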

This patch series addresses the scalability problem by:
- Removing select_idle_core(), which can potentially scan the full LLC domain
  even if there is only one idle core, and so doesn't scale
- Lowering the lower limit of the nr variable in select_idle_cpu() and also
  setting an upper limit, to restrict search time
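
The second point can be sketched as a clamped scan budget. This is only an
illustration under stated assumptions: the cover letter does not give the
actual limits, so SCAN_MIN and SCAN_MAX are placeholder values, and
clamp_scan_budget() and bounded_scan() are hypothetical names rather than
kernel functions.

```c
#include <stdbool.h>

#define SCAN_MIN 2   /* hypothetical lower limit on cpus examined */
#define SCAN_MAX 4   /* hypothetical upper limit on cpus examined */

/* Clamp a proposed scan budget into [SCAN_MIN, SCAN_MAX]. */
static int clamp_scan_budget(int proposed)
{
	if (proposed < SCAN_MIN)
		return SCAN_MIN;
	if (proposed > SCAN_MAX)
		return SCAN_MAX;
	return proposed;
}

/*
 * Examine at most nr cpus, starting at start and wrapping around.
 * Returns the first idle cpu seen, or -1 if the budget runs out.
 */
static int bounded_scan(const bool *idle, int nr_cpus, int start, int nr)
{
	for (int i = 0; i < nr && i < nr_cpus; i++) {
		int cpu = (start + i) % nr_cpus;

		if (idle[cpu])
			return cpu;
	}
	return -1;
}
```

The upper limit caps the worst-case cost of one wakeup regardless of LLC
size, at the price of sometimes missing an idle cpu that lies outside the
scanned range.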

It also introduces a new per-cpu variable, next_cpu, that records where the
last search stopped, so that each search starts from where the previous one
ended. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found under high load.
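
A minimal user-space sketch of the rotating window, assuming a plain array as
a stand-in for the per-cpu variable and a fixed window size. WINDOW, the
array form of next_cpu and rotating_search() are illustrative assumptions,
and the exact resume point used by the patches may differ.

```c
#include <stdbool.h>

#define NR_CPUS 8   /* hypothetical LLC size */
#define WINDOW  2   /* hypothetical number of cpus examined per search */

/* Stand-in for the per-cpu variable: one resume point per searching cpu. */
static int next_cpu[NR_CPUS];

/*
 * Examine up to WINDOW cpus starting at this cpu's saved resume point.
 * The resume point is then advanced past the cpus just examined, so
 * successive searches sweep the whole domain.
 */
static int rotating_search(int this_cpu, const bool idle[NR_CPUS])
{
	int start = next_cpu[this_cpu];
	int examined = 0;
	int found = -1;

	while (examined < WINDOW) {
		int cpu = (start + examined) % NR_CPUS;

		examined++;
		if (idle[cpu]) {
			found = cpu;
			break;
		}
	}
	next_cpu[this_cpu] = (start + examined) % NR_CPUS;
	return found;
}
```

If only cpu 5 is idle and the resume point starts at 0, the first two
searches cover cpus 0-1 and 2-3 and fail, and the third covers cpus 4-5 and
finds it. No single search examines more than WINDOW cpus.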

Following are the performance numbers with various benchmarks.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5742    21.13   0.5334 (7.10%)   5.2
2       0.5776    7.87    0.5393 (6.63%)   6.39
4       0.9578    1.12    0.9537 (0.43%)   1.08
8       1.7018    1.35    1.682 (1.16%)    1.33
16      2.9955    1.36    2.9849 (0.35%)   0.96
32      5.4354    0.59    5.3308 (1.92%)   0.60
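
The percentages in parentheses appear to be relative changes against the
baseline, signed so that a positive value is an improvement. The formula is
not stated in the posting, so the helper below is an assumption that happens
to match the listed rows; improvement_pct() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Relative change of patched versus baseline, in percent, signed so that
 * positive means "better" for the given metric direction.
 */
static double improvement_pct(double baseline, double patched,
			      bool lower_is_better)
{
	double delta = lower_is_better ? baseline - patched
				       : patched - baseline;

	return 100.0 * delta / baseline;
}
```

For the 2-group hackbench row (lower is better),
improvement_pct(0.5776, 0.5393, true) gives about 6.63, matching the table.
For a higher-is-better metric the arguments are the same with
lower_is_better set to false.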

Sysbench MySQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline  patch
2       49.53     49.83 (0.61%)
4       89.07     90 (1.05%)
8       149       154 (3.31%)
16      240       246 (2.56%)
32      357       351 (-1.69%)
64      428       428 (-0.03%)
128     473       469 (-0.92%)

Sysbench PostgreSQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline	  patch
2	68.35 		  70.07 (2.51%)
4	93.53		  92.54 (-1.05%)
8	125		  127 (1.16%)
16	145		  146 (0.92%)
32	158		  156 (-1.24%)
64	160		  160 (0.47%)

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline	%stdev  patch   	 %stdev
20	1		1.35	1.0075 (0.75%)	 0.71
40	1		0.42	0.9971 (-0.29%)	 0.26
60	1		1.54	0.9955 (-0.45%)	 0.83
80	1		0.58	1.0059 (0.59%)	 0.59
100	1		0.77	1.0201 (2.01%)	 0.39
120	1		0.35	1.0145 (1.45%)	 1.41
140	1		0.19	1.0325 (3.25%)	 0.77
160	1		0.09	1.0277 (2.77%)	 0.57
180	1		0.99	1.0249 (2.49%)	 0.79
200	1		1.03	1.0133 (1.33%)	 0.77
220	1		1.69	1.0317 (3.17%)	 1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline	%stdev  patch   	 %stdev
8	49.47		0.35	50.96 (3.02%)	 0.12
16	95.28		0.77	99.01 (3.92%)	 0.14
32	156.77		1.17	180.64 (15.23%)	 1.05
48	193.24		0.22	214.7 (11.1%)	 1
64	216.21		9.33	252.81 (16.93%)	 1.68
128	379.62		10.29	397.47 (4.75%)	 0.41

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline	patch
1	627.62		629.14 (0.24%)
2	1153.45		1179.9 (2.29%)
4	2060.29		2051.62 (-0.42%)
8	2724.41		2609.4 (-4.22%)
16	2987.56		2891.54 (-3.21%)
32	2375.82		2345.29 (-1.29%)
64	1963.31		1903.61 (-3.04%)
128	1546.01		1513.17 (-2.12%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):                                  
clients baseline	patch
1	279.33		285.154 (2.08%)
2	545.961		572.538 (4.87%)
4	1081.06		1126.51 (4.2%)
8	2158.47		2234.78 (3.53%)
16	4223.78		4358.11 (3.18%)
32	7117.08		8022.19 (12.72%)
64	8947.28		10719.7 (19.81%)
128	15976.7		17531.2 (9.73%)

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 256 (higher is better):
clients  baseline 	 %stdev  patch		%stdev
1	 2699		 4.86	 2697 (-0.1%)	3.74
10	 18832		 0 	 18830 (0%)	0.01
100	 18830           0.05    18827 (0%)     0.08

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1K (higher is better):
clients	 baseline	 %stdev  patch 	        %stdev
1	 9414		 0.02	 9414 (0%)	0.01
10	 18832		 0	 18832 (0%)	0
100	 18830		 0.05	 18829 (0%)	0.04

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 4K (higher is better):
clients  baseline	 %stdev  patch    	 %stdev
1	 9414		 0.01	 9414 (0%)	 0
10	 18832		 0	 18832 (0%)	 0
100	 18829		 0.04	 18833 (0%)	 0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 64K (higher is better):
clients  baseline	 %stdev  patch  	 %stdev
1	 9415		 0.01	 9415 (0%)	 0
10	 18832		 0	 18832 (0%)	 0
100	 18830		 0.04	 18833 (0%)	 0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1M (higher is better):
clients  baseline  	 %stdev  patch 		 %stdev
1	 9415		 0.01	 9415 (0%)	 0.01
10	 18832		 0	 18832 (0%)	 0
100	 18830		 0.04	 18819 (-0.1%)	 0.13

JBB on 2 socket, 28 core and 56 threads Intel x86 machine
(higher is better):
		baseline	%stdev	 patch		%stdev
jops		60049		0.65	 60191 (0.2%)	0.99
critical jops	29689		0.76	 29044 (-2.2%)	1.46

Schbench on 2 socket, 24 core and 48 threads Intel x86 machine with 24
tasks (lower is better):
percentile	baseline        %stdev   patch          %stdev
50		5007		0.16	 5003 (0.1%)	0.12
75		10000		0	 10000 (0%)	0
90		16992		0	 16998 (0%)	0.12
95		21984		0	 22043 (-0.3%)	0.83
99		34229		1.2	 34069 (0.5%)	0.87
99.5		39147		1.1	 38741 (1%)	1.1
99.9		49568		1.59	 49579 (0%)	1.78

Ebizzy on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads		baseline        %stdev   patch          %stdev
1		26477		2.66	 26646 (0.6%)	2.81
2		52303		1.72	 52987 (1.3%)	1.59
4		100854		2.48	 101824 (1%)	2.42
8		188059		6.91	 189149 (0.6%)	1.75
16		328055		3.42	 333963 (1.8%)	2.03
32		504419		2.23	 492650 (-2.3%)	1.76
88		534999		5.35	 569326 (6.4%)	3.07
156		541703		2.42	 544463 (0.5%)	2.17

NAS: The whole suite of NAS benchmarks was run on a 2 socket, 36 core and 72
threads Intel x86 machine, with no statistically significant regressions
and with improvements in some cases. The results are not listed here because
there are too many data points.

subhra mazumdar (3):
  sched: remove select_idle_core() for scalability
  sched: introduce per-cpu var next_cpu to track search limit
  sched: limit cpu search and rotate search window for scalability

 include/linux/sched/topology.h |   1 -
 kernel/sched/core.c            |   2 +
 kernel/sched/fair.c            | 116 +++++------------------------------------
 kernel/sched/idle.c            |   1 -
 kernel/sched/sched.h           |  11 +---
 5 files changed, 17 insertions(+), 114 deletions(-)

-- 
2.9.3
