* [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path
@ 2018-06-28 22:44 subhra mazumdar
  2018-06-28 22:44 ` [PATCH 1/5] sched: limit cpu search in select_idle_cpu subhra mazumdar
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Currently select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores; if that fails, it
looks for any idle cpu using select_idle_cpu, which can potentially search
all cpus in the LLC domain. This doesn't scale for large LLC domains and
will only get worse as core counts grow.

This patch set solves the scalability problem by:
 - Setting upper and lower limits on the idle cpu search in select_idle_cpu
   to keep the search time low and bounded
 - Adding a new sched feature SIS_CORE to disable select_idle_core

It also introduces a new per-cpu variable, next_cpu, to track where the
previous search ended, so that each search starts where the last one left
off. This rotating search window over the cpus in the LLC domain ensures
that idle cpus are eventually found even under high load.
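
For reference, a condensed sketch of the resulting select_idle_cpu() scan
logic once all five patches are applied (illustrative only -- min_t is used
here for brevity; the exact changes are in the individual patches below):

	if (sched_feat(SIS_PROP)) {
		u64 span_avg = sd->span_weight * avg_idle;

		/* Patch 1: scan at least one core's worth of cpus, at most two. */
		floor = cpumask_weight(topology_sibling_cpumask(target));
		if (floor < 2)
			floor = 2;
		limit = 2*floor;
		if (span_avg > floor*avg_cost)
			nr = min_t(int, div_u64(span_avg, avg_cost), limit);
		else
			nr = floor;
	}

	/* Patches 2-3: resume the scan where the previous one left off. */
	target_tmp = (per_cpu(next_cpu, target) != -1) ?
			per_cpu(next_cpu, target) : target;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
		per_cpu(next_cpu, target) = cpu;
		if (!--nr)
			return -1;
		/* ... skip cpus not in p->cpus_allowed, stop at an idle cpu ... */
	}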

The following are performance numbers for various benchmarks with SIS_CORE
true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline           %stdev  patch           %stdev
1       0.5816             8.94    0.5903 (-1.5%)  11.28 
2       0.6428             10.64   0.5843 (9.1%)   4.93
4       1.0152             1.99    0.9965 (1.84%)  1.83
8       1.8128             1.4     1.7921 (1.14%)  1.76
16      3.1666             0.8     3.1345 (1.01%)  0.81
32      5.6084             0.83    5.5677 (0.73%)  0.8 

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline     %stdev     patch              %stdev
8       2095.45      1.82       2102.6 (0.34%)     2.11
16      4218.45      0.06       4221.35 (0.07%)    0.38
32      7531.36      0.49       7607.18 (1.01%)    0.25
48      10206.42     0.21       10324.26 (1.15%)   0.13
64      12053.73     0.1        12158.3 (0.87%)    0.24
128     14810.33     0.04       14840.4 (0.2%)     0.38

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch            %stdev
20      1               0.9     1.0068 (0.68%)   0.27
40      1               0.8     1.0103 (1.03%)   1.24
60      1               0.34    1.0178 (1.78%)   0.49
80      1               0.53    1.0092 (0.92%)   1.5
100     1               0.79    1.0090 (0.9%)    0.88
120     1               0.06    1.0048 (0.48%)   0.72
140     1               0.22    1.0116 (1.16%)   0.05
160     1               0.57    1.0264 (2.64%)   0.67
180     1               0.81    1.0194 (1.94%)   0.91
200     1               0.44    1.028 (2.8%)     3.09
220     1               1.74    1.0229 (2.29%)   0.21

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev
8       45.36           0.43    46.28 (2.01%)    0.29
16      87.81           0.82    89.67 (2.12%)    0.38
32      151.19          0.02    153.5 (1.53%)    0.41
48      190.2           0.21    194.79 (2.41%)   0.07
64      190.42          0.35    202.9 (6.55%)    1.66
128     323.86          0.28    343.56 (6.08%)   1.34 

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       629.8           603.83 (-4.12%)
2       1159.65         1155.75 (-0.34%)
4       2121.61         2093.99 (-1.3%)
8       2620.52         2641.51 (0.8%)
16      2879.31         2897.6 (0.64%)
32      2791.24         2936.47 (5.2%)
64      1853.07         1894.74 (2.25%)
128     1484.95         1494.29 (0.63%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       256.41          255.8 (-0.24%)
2       509.89          504.52 (-1.05%)
4       999.44          1003.74 (0.43%)
8       1982.7          1976.42 (-0.32%)
16      3891.51         3916.04 (0.63%)
32      6819.24         6845.06 (0.38%)
64      8542.95         8568.28 (0.3%)
128     15277.6         15754.6 (3.12%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline      %stdev   patch             %stdev
50              94            2.82     92 (2.13%)        2.17
75              124           2.13     122 (1.61%)       1.42
90              152           1.74     151 (0.66%)       0.66 
95              171           2.11     170 (0.58%)       0
99              512.67        104.96   208.33 (59.36%)   1.2
99.5            2296          82.55    3674.66 (-60.05%) 22.19
99.9            12517.33      2.38     12784 (-2.13%)    0.66

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):  
groups  baseline           %stdev  patch             %stdev 
1       1.3085             6.65    1.2213 (6.66%)    10.32 
2       1.4559             8.55    1.5048 (-3.36%)   4.72
4       2.6271             1.74    2.5532 (2.81%)    2.02
8       4.7089             3.01    4.5118 (4.19%)    2.74
16      8.7406             2.25    8.6801 (0.69%)    4.78
32      17.7835            1.01    16.759 (5.76%)    1.38
64      36.1901            0.65    34.6652 (4.21%)   1.24
128     72.6585            0.51    70.9762 (2.32%)   0.9

The following are performance numbers for the same benchmarks with SIS_CORE
false (idle core search disabled). This improves throughput for certain
workloads but increases latency for others.
                                                                           
Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine 
(lower is better):
groups  baseline           %stdev  patch            %stdev
1       0.5816             8.94    0.5835 (-0.33%)  8.21  
2       0.6428             10.64   0.5752 (10.52%)  4.05  
4       1.0152             1.99    0.9946 (2.03%)   2.56  
8       1.8128             1.4     1.7619 (2.81%)   1.88  
16      3.1666             0.8     3.1275 (1.23%)   0.42  
32      5.6084             0.83    5.5856 (0.41%)   0.89   
 
Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine 
(higher is better):
threads baseline     %stdev     patch              %stdev 
8       2095.45      1.82       2084.72 (-0.51%)   1.65
16      4218.45      0.06       4179.69 (-0.92%)   0.18
32      7531.36      0.49       7623.18 (1.22%)    0.39
48      10206.42     0.21       10159.16 (-0.46%)  0.21
64      12053.73     0.1        12087.21 (0.28%)   0.19
128     14810.33     0.04       14894.08 (0.57%)   0.08

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine            
(normalized, higher is better):                                            
users   baseline        %stdev  patch            %stdev                    
20      1               0.9     1.0056 (0.56%)   0.34
40      1               0.8     1.0173 (1.73%)   0.13
60      1               0.34    0.9995 (-0.05%)  0.85
80      1               0.53    1.0175 (1.75%)   1.56
100     1               0.79    1.0151 (1.51%)   1.31
120     1               0.06    1.0244 (2.44%)   0.5 
140     1               0.22    1.034 (3.4%)     0.66
160     1               0.57    1.0362 (3.62%)   0.07
180     1               0.81    1.041 (4.1%)     0.8
200     1               0.44    1.0233 (2.33%)   1.4  
220     1               1.74    1.0125 (1.25%)   1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with  
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev 
8       45.36           0.43    46.94 (3.48%)    0.2  
16      87.81           0.82    91.75 (4.49%)    0.43 
32      151.19          0.02    167.74 (10.95%)  1.29 
48      190.2           0.21    200.57 (5.45%)   0.89 
64      190.42          0.35    226.74 (19.07%)  1.79 
128     323.86          0.28    348.12 (7.49%)   0.77 

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine               
(higher is better):         
clients baseline        patch 
1       629.8           600.19 (-4.7%)  
2       1159.65         1162.07 (0.21%) 
4       2121.61         2112.27 (-0.44%)
8       2620.52         2645.55 (0.96%) 
16      2879.31         2828.87 (-1.75%)
32      2791.24         2760.97 (-1.08%)
64      1853.07         1747.66 (-5.69%)
128     1484.95         1459.81 (-1.69%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine               
(higher is better):
clients baseline        patch
1       256.41          258.11 (0.67%)  
2       509.89          509.13 (-0.15%) 
4       999.44          1016.58 (1.72%) 
8       1982.7          2006.53 (1.2%)  
16      3891.51         3964.43 (1.87%) 
32      6819.24         7376.92 (8.18%) 
64      8542.95         9660.45 (13.08%) 
128     15277.6         15438.4 (1.05%) 

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44     
tasks (lower is better):
percentile      baseline      %stdev   patch                %stdev
50              94            2.82     94.67 (-0.71%)       2.2  
75              124           2.13     124.67 (-0.54%)      1.67 
90              152           1.74     154.33 (-1.54%)      0.75 
95              171           2.11     176.67 (-3.31%)      0.86 
99              512.67        104.96   4130.33 (-705.65%)   79.41
99.5            2296          82.55    10066.67 (-338.44%)  26.15
99.9            12517.33      2.38     12869.33 (-2.81%)    0.8 
 
Hackbench process on 2 socket, 16 core and 128 threads SPARC machine       
(lower is better): 
groups  baseline           %stdev  patch             %stdev
1       1.3085             6.65    1.2514 (4.36%)    11.1
2       1.4559             8.55    1.5433 (-6%)      3.05
4       2.6271             1.74    2.5626 (2.5%)     2.69
8       4.7089             3.01    4.5316 (3.77%)    2.95
16      8.7406             2.25    8.6585 (0.94%)    2.91
32      17.7835            1.01    17.175 (3.42%)    1.38
64      36.1901            0.65    35.5294 (1.83%)   1.02
128     72.6585            0.51    71.8821 (1.07%)   1.05

The following are the schbench numbers with both SIS_CORE and SIS_PROP
false. This recovers the latency increase caused by having SIS_CORE
false.

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44     
tasks (lower is better):
percentile      baseline      %stdev   patch               %stdev
50              94            2.82     93.33 (0.71%)       1.24  
75              124           2.13     122.67 (1.08%)      1.7   
90              152           1.74     149.33 (1.75%)      2.35  
95              171           2.11     167 (2.34%)         2.74
99              512.67        104.96   206 (59.82%)        8.86
99.5            2296          82.55    3121.67 (-35.96%)   97.37 
99.9            12517.33      2.38     12592 (-0.6%)       1.67

Changes since v1
 - Compute the upper and lower limit based on number of cpus in a core
 - Split up the search limit and search window rotation into separate
   patches
 - Add new sched feature to have option of disabling idle core search

subhra mazumdar (5):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search

 kernel/sched/core.c     |  2 ++
 kernel/sched/fair.c     | 31 +++++++++++++++++++++++--------
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 27 insertions(+), 8 deletions(-)

-- 
2.9.3



* [PATCH 1/5] sched: limit cpu search in select_idle_cpu
  2018-06-28 22:44 [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
@ 2018-06-28 22:44 ` subhra mazumdar
  2018-06-28 22:44 ` [PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Put upper and lower limits on the cpu search in select_idle_cpu. The lower
limit is the number of cpus in a core, while the upper limit is twice that.
This ensures that on any architecture we will usually search beyond a
single core. The upper limit also keeps the search cost low and bounded.
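
As a worked example: on the SMT-2 x86 machines benchmarked in the cover
letter, floor = 2 and limit = 4, so between 2 and 4 cpus are scanned per
wakeup; on the SMT-8 SPARC machine (16 cores, 128 threads), floor = 8 and
limit = 16.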

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..7243146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, limit, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		limit = 2*floor;
+		if (span_avg > floor*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > limit)
+				nr = limit;
+		} else {
+			nr = floor;
+		}
 	}
 
 	time = local_clock();
-- 
2.9.3



* [PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit
  2018-06-28 22:44 [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
  2018-06-28 22:44 ` [PATCH 1/5] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2018-06-28 22:44 ` subhra mazumdar
  2018-06-28 22:44 ` [PATCH 3/5] sched: rotate the cpu search window for better spread subhra mazumdar
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Introduce a per-cpu variable to track the cpu up to which the idle cpu
search was done in select_idle_cpu(). The next search can then start from
there, which is necessary for rotating the search window over the entire
LLC domain. The variable is initialized to -1, meaning no previous search
has been recorded.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b25..b3e4ec1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -20,6 +20,7 @@
 #include <trace/events/sched.h>
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
@@ -5996,6 +5997,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4..eb12b50 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -912,6 +912,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3



* [PATCH 3/5] sched: rotate the cpu search window for better spread
  2018-06-28 22:44 [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
  2018-06-28 22:44 ` [PATCH 1/5] sched: limit cpu search in select_idle_cpu subhra mazumdar
  2018-06-28 22:44 ` [PATCH 2/5] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2018-06-28 22:44 ` subhra mazumdar
  2018-06-28 22:44 ` [PATCH 4/5] sched: add sched feature to disable idle core search subhra mazumdar
  2018-06-28 22:44 ` [PATCH 5/5] sched: SIS_CORE " subhra mazumdar
  4 siblings, 0 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Rotate the cpu search window across the LLC domain for a better spread of
threads. This ensures an idle cpu is found quickly if one exists.
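
As an illustration: with an LLC spanning cpus 0-7 and nr = 4, a first
search targeting cpu 0 visits cpus 0-3 and leaves next_cpu = 3; the next
search for that target then starts from cpu 3 and wraps around, so
successive searches sweep the whole domain instead of repeatedly
rescanning the cpus nearest the target.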

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7243146..574cb14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, limit, floor, nr = INT_MAX;
+	int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6403,9 +6403,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



* [PATCH 4/5] sched: add sched feature to disable idle core search
  2018-06-28 22:44 [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
                   ` (2 preceding siblings ...)
  2018-06-28 22:44 ` [PATCH 3/5] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2018-06-28 22:44 ` subhra mazumdar
  2018-06-28 22:44 ` [PATCH 5/5] sched: SIS_CORE " subhra mazumdar
  4 siblings, 0 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Add a new sched feature, SIS_CORE, to provide an option to disable idle
core search (select_idle_core).
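
With CONFIG_SCHED_DEBUG enabled, the feature can then be toggled at
runtime like any other sched feature (usage sketch, assuming debugfs is
mounted at /sys/kernel/debug):

	# disable idle core search
	echo NO_SIS_CORE > /sys/kernel/debug/sched_features
	# re-enable it (the default)
	echo SIS_CORE > /sys/kernel/debug/sched_features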

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..de15733 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



* [PATCH 5/5] sched: SIS_CORE to disable idle core search
  2018-06-28 22:44 [RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
                   ` (3 preceding siblings ...)
  2018-06-28 22:44 ` [PATCH 4/5] sched: add sched feature to disable idle core search subhra mazumdar
@ 2018-06-28 22:44 ` subhra mazumdar
  4 siblings, 0 replies; 9+ messages in thread
From: subhra mazumdar @ 2018-06-28 22:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it hurts latency, so
an option is needed.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 574cb14..33fbc47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6464,9 +6464,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



* Re: [PATCH 1/5] sched: limit cpu search in select_idle_cpu
  2018-06-12 20:33   ` kbuild test robot
@ 2018-06-12 22:12     ` Subhra Mazumdar
  0 siblings, 0 replies; 9+ messages in thread
From: Subhra Mazumdar @ 2018-06-12 22:12 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, peterz, mingo, steven.sistare,
	dhaval.giani, rohit.k.jain, daniel.lezcano



On 06/12/2018 01:33 PM, kbuild test robot wrote:
> Hi subhra,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on v4.17 next-20180612]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:    https://github.com/0day-ci/linux/commits/subhra-mazumdar/Improve-scheduler-scalability-for-fast-path/20180613-015158
> config: i386-randconfig-x070-201823 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>          # save the attached .config to linux build tree
>          make ARCH=i386
>
> All warnings (new ones prefixed by >>):
>
>     kernel/sched/fair.c: In function 'select_idle_cpu':
>>> kernel/sched/fair.c:6396:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
>        u64 span_avg = sd->span_weight * avg_idle;
>        ^~~
>
>
<snip>

I fixed this patch (the u64 span_avg declaration now comes first in the
SIS_PROP block, so there is no longer a declaration after a statement);
please try the following:

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..7243146 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, limit, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6390,10 +6390,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		limit = 2*floor;
+		if (span_avg > floor*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > limit)
+				nr = limit;
+		} else {
+			nr = floor;
+		}
 	}
 
 	time = local_clock();



* Re: [PATCH 1/5] sched: limit cpu search in select_idle_cpu
  2018-06-12 17:49 ` [PATCH 1/5] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2018-06-12 20:33   ` kbuild test robot
  2018-06-12 22:12     ` Subhra Mazumdar
  0 siblings, 1 reply; 9+ messages in thread
From: kbuild test robot @ 2018-06-12 20:33 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: kbuild-all, linux-kernel, peterz, mingo, steven.sistare,
	dhaval.giani, rohit.k.jain, daniel.lezcano


Hi subhra,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on v4.17 next-20180612]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/subhra-mazumdar/Improve-scheduler-scalability-for-fast-path/20180613-015158
config: i386-randconfig-x070-201823 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   kernel/sched/fair.c: In function 'select_idle_cpu':
>> kernel/sched/fair.c:6396:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
      u64 span_avg = sd->span_weight * avg_idle;
      ^~~

vim +6396 kernel/sched/fair.c

10e2f1acd Peter Zijlstra  2016-05-09  6363  
10e2f1acd Peter Zijlstra  2016-05-09  6364  /*
10e2f1acd Peter Zijlstra  2016-05-09  6365   * Scan the LLC domain for idle CPUs; this is dynamically regulated by
10e2f1acd Peter Zijlstra  2016-05-09  6366   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
10e2f1acd Peter Zijlstra  2016-05-09  6367   * average idle time for this rq (as found in rq->avg_idle).
10e2f1acd Peter Zijlstra  2016-05-09  6368   */
10e2f1acd Peter Zijlstra  2016-05-09  6369  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
10e2f1acd Peter Zijlstra  2016-05-09  6370  {
9cfb38a7b Wanpeng Li      2016-10-09  6371  	struct sched_domain *this_sd;
1ad3aaf3f Peter Zijlstra  2017-05-17  6372  	u64 avg_cost, avg_idle;
10e2f1acd Peter Zijlstra  2016-05-09  6373  	u64 time, cost;
10e2f1acd Peter Zijlstra  2016-05-09  6374  	s64 delta;
8dd662615 subhra mazumdar 2018-06-12  6375  	int cpu, limit, floor, nr = INT_MAX;
10e2f1acd Peter Zijlstra  2016-05-09  6376  
9cfb38a7b Wanpeng Li      2016-10-09  6377  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
9cfb38a7b Wanpeng Li      2016-10-09  6378  	if (!this_sd)
9cfb38a7b Wanpeng Li      2016-10-09  6379  		return -1;
9cfb38a7b Wanpeng Li      2016-10-09  6380  
10e2f1acd Peter Zijlstra  2016-05-09  6381  	/*
10e2f1acd Peter Zijlstra  2016-05-09  6382  	 * Due to large variance we need a large fuzz factor; hackbench in
10e2f1acd Peter Zijlstra  2016-05-09  6383  	 * particularly is sensitive here.
10e2f1acd Peter Zijlstra  2016-05-09  6384  	 */
1ad3aaf3f Peter Zijlstra  2017-05-17  6385  	avg_idle = this_rq()->avg_idle / 512;
1ad3aaf3f Peter Zijlstra  2017-05-17  6386  	avg_cost = this_sd->avg_scan_cost + 1;
1ad3aaf3f Peter Zijlstra  2017-05-17  6387  
1ad3aaf3f Peter Zijlstra  2017-05-17  6388  	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
10e2f1acd Peter Zijlstra  2016-05-09  6389  		return -1;
10e2f1acd Peter Zijlstra  2016-05-09  6390  
1ad3aaf3f Peter Zijlstra  2017-05-17  6391  	if (sched_feat(SIS_PROP)) {
8dd662615 subhra mazumdar 2018-06-12  6392  		floor = cpumask_weight(topology_sibling_cpumask(target));
8dd662615 subhra mazumdar 2018-06-12  6393  		if (floor < 2)
8dd662615 subhra mazumdar 2018-06-12  6394  			floor = 2;
8dd662615 subhra mazumdar 2018-06-12  6395  		limit = 2*floor;
1ad3aaf3f Peter Zijlstra  2017-05-17 @6396  		u64 span_avg = sd->span_weight * avg_idle;
8dd662615 subhra mazumdar 2018-06-12  6397  		if (span_avg > floor*avg_cost) {
1ad3aaf3f Peter Zijlstra  2017-05-17  6398  			nr = div_u64(span_avg, avg_cost);
8dd662615 subhra mazumdar 2018-06-12  6399  			if (nr > limit)
8dd662615 subhra mazumdar 2018-06-12  6400  				nr = limit;
8dd662615 subhra mazumdar 2018-06-12  6401  		} else {
8dd662615 subhra mazumdar 2018-06-12  6402  			nr = floor;
8dd662615 subhra mazumdar 2018-06-12  6403  		}
1ad3aaf3f Peter Zijlstra  2017-05-17  6404  	}
1ad3aaf3f Peter Zijlstra  2017-05-17  6405  
10e2f1acd Peter Zijlstra  2016-05-09  6406  	time = local_clock();
10e2f1acd Peter Zijlstra  2016-05-09  6407  
c743f0a5c Peter Zijlstra  2017-04-14  6408  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
1ad3aaf3f Peter Zijlstra  2017-05-17  6409  		if (!--nr)
1ad3aaf3f Peter Zijlstra  2017-05-17  6410  			return -1;
0c98d344f Ingo Molnar     2017-02-05  6411  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
10e2f1acd Peter Zijlstra  2016-05-09  6412  			continue;
943d355d7 Rohit Jain      2018-05-09  6413  		if (available_idle_cpu(cpu))
10e2f1acd Peter Zijlstra  2016-05-09  6414  			break;
10e2f1acd Peter Zijlstra  2016-05-09  6415  	}
10e2f1acd Peter Zijlstra  2016-05-09  6416  
10e2f1acd Peter Zijlstra  2016-05-09  6417  	time = local_clock() - time;
10e2f1acd Peter Zijlstra  2016-05-09  6418  	cost = this_sd->avg_scan_cost;
10e2f1acd Peter Zijlstra  2016-05-09  6419  	delta = (s64)(time - cost) / 8;
10e2f1acd Peter Zijlstra  2016-05-09  6420  	this_sd->avg_scan_cost += delta;
10e2f1acd Peter Zijlstra  2016-05-09  6421  
10e2f1acd Peter Zijlstra  2016-05-09  6422  	return cpu;
10e2f1acd Peter Zijlstra  2016-05-09  6423  }
10e2f1acd Peter Zijlstra  2016-05-09  6424  

:::::: The code at line 6396 was first introduced by commit
:::::: 1ad3aaf3fcd2444406628a19a9b9e0922b95e2d4 sched/core: Implement new approach to scale select_idle_cpu()

:::::: TO: Peter Zijlstra <peterz@infradead.org>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* [PATCH 1/5] sched: limit cpu search in select_idle_cpu
  2018-06-12 17:49 [RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path subhra mazumdar
@ 2018-06-12 17:49 ` subhra mazumdar
  2018-06-12 20:33   ` kbuild test robot
  0 siblings, 1 reply; 9+ messages in thread
From: subhra mazumdar @ 2018-06-12 17:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, dhaval.giani, rohit.k.jain,
	daniel.lezcano

Put upper and lower limits on the cpu search in select_idle_cpu. The lower
limit is the number of cpus in a core, while the upper limit is twice that.
This ensures that on any architecture we will usually search beyond a
single core. The upper limit also keeps the search cost low and bounded.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05..9a6d28d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6372,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, limit, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6389,11 +6389,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		return -1;
 
 	if (sched_feat(SIS_PROP)) {
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		limit = 2*floor;
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > floor*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > limit)
+				nr = limit;
+		} else {
+			nr = floor;
+		}
 	}
 
 	time = local_clock();
-- 
2.9.3


