* [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
@ 2019-06-27  1:29 subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Hi,

Resending this patchset; it would be good to get some feedback. Any
suggestions that would make it more acceptable are welcome. We have been
shipping this with the Unbreakable Enterprise Kernel in Oracle Linux.

Currently select_idle_sibling first tries to find a fully idle core using
select_idle_core, which can potentially search all cores, and if that fails
it looks for any idle cpu using select_idle_cpu, which can potentially
search all cpus in the LLC domain. This doesn't scale for large LLC domains
and will only get worse with more cores in the future.

This patchset addresses the scalability problem by:
 - Setting upper and lower limits on the idle cpu search in select_idle_cpu
   to keep the search time low and constant
 - Adding a new sched feature SIS_CORE to disable select_idle_core

It also introduces a new per-cpu variable, next_cpu, to track where the
previous search ended so that the next search starts from there. This
rotating search window over the cpus in the LLC domain ensures that idle
cpus are eventually found under high load.
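
For reference, the combined effect on select_idle_cpu() can be condensed as
below. This is only an illustrative sketch pieced together from patches 1,
3 and 7 (the SIS_PROP feature check, locals and the existing
cpus_allowed/idle checks are elided), not a standalone compilable function:

	/* lower bound: one core's worth of cpus, at least 2 */
	floor = topology_sibling_weight(target);
	if (floor < 2)
		floor = 2;
	/* upper bound: two cores' worth of cpus */
	limit = floor << 1;

	span_avg = sd->span_weight * avg_idle;
	if (span_avg > floor*avg_cost) {
		nr = div_u64(span_avg, avg_cost);
		if (nr > limit)
			nr = limit;
	} else {
		nr = floor;
	}

	/* rotate: resume from where the last search from 'target' stopped */
	if (per_cpu(next_cpu, target) != -1)
		target_tmp = per_cpu(next_cpu, target);
	else
		target_tmp = target;

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
		per_cpu(next_cpu, target) = cpu;	/* remember the boundary */
		if (!--nr)
			return -1;			/* search budget exhausted */
		/* ... unchanged cpus_allowed and idle checks ... */
	}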

The following are performance numbers for various benchmarks with SIS_CORE
true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline           %stdev  patch           %stdev
1       0.5816             8.94    0.5903 (-1.5%)  11.28 
2       0.6428             10.64   0.5843 (9.1%)   4.93
4       1.0152             1.99    0.9965 (1.84%)  1.83
8       1.8128             1.4     1.7921 (1.14%)  1.76
16      3.1666             0.8     3.1345 (1.01%)  0.81
32      5.6084             0.83    5.5677 (0.73%)  0.8 

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline     %stdev     patch              %stdev
8       2095.45      1.82       2102.6 (0.34%)     2.11
16      4218.45      0.06       4221.35 (0.07%)    0.38
32      7531.36      0.49       7607.18 (1.01%)    0.25
48      10206.42     0.21       10324.26 (1.15%)   0.13
64      12053.73     0.1        12158.3 (0.87%)    0.24
128     14810.33     0.04       14840.4 (0.2%)     0.38

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch            %stdev
20      1               0.9     1.0068 (0.68%)   0.27
40      1               0.8     1.0103 (1.03%)   1.24
60      1               0.34    1.0178 (1.78%)   0.49
80      1               0.53    1.0092 (0.92%)   1.5
100     1               0.79    1.0090 (0.9%)    0.88
120     1               0.06    1.0048 (0.48%)   0.72
140     1               0.22    1.0116 (1.16%)   0.05
160     1               0.57    1.0264 (2.64%)   0.67
180     1               0.81    1.0194 (1.94%)   0.91
200     1               0.44    1.028 (2.8%)     3.09
220     1               1.74    1.0229 (2.29%)   0.21

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev
8       45.36           0.43    46.28 (2.01%)    0.29
16      87.81           0.82    89.67 (2.12%)    0.38
32      151.19          0.02    153.5 (1.53%)    0.41
48      190.2           0.21    194.79 (2.41%)   0.07
64      190.42          0.35    202.9 (6.55%)    1.66
128     323.86          0.28    343.56 (6.08%)   1.34 

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       629.8           603.83 (-4.12%)
2       1159.65         1155.75 (-0.34%)
4       2121.61         2093.99 (-1.3%)
8       2620.52         2641.51 (0.8%)
16      2879.31         2897.6 (0.64%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       256.41          255.8 (-0.24%)
2       509.89          504.52 (-1.05%)
4       999.44          1003.74 (0.43%)
8       1982.7          1976.42 (-0.32%)
16      3891.51         3916.04 (0.63%)
32      6819.24         6845.06 (0.38%)
64      8542.95         8568.28 (0.3%)
128     15277.6         15754.6 (3.12%)

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline      %stdev   patch             %stdev
50              94            2.82     92 (2.13%)        2.17
75              124           2.13     122 (1.61%)       1.42
90              152           1.74     151 (0.66%)       0.66 
95              171           2.11     170 (0.58%)       0
99              512.67        104.96   208.33 (59.36%)   1.2
99.5            2296          82.55    3674.66 (-60.05%) 22.19
99.9            12517.33      2.38     12784 (-2.13%)    0.66

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline           %stdev  patch             %stdev
1       1.3085             6.65    1.2213 (6.66%)    10.32 
2       1.4559             8.55    1.5048 (-3.36%)   4.72
4       2.6271             1.74    2.5532 (2.81%)    2.02
8       4.7089             3.01    4.5118 (4.19%)    2.74
16      8.7406             2.25    8.6801 (0.69%)    4.78
32      17.7835            1.01    16.759 (5.76%)    1.38
64      36.1901            0.65    34.6652 (4.21%)   1.24
128     72.6585            0.51    70.9762 (2.32%)   0.9

The following are performance numbers for various benchmarks with SIS_CORE
false (idle core search disabled). This improves the throughput of certain
workloads but increases the latency of others.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline           %stdev  patch            %stdev
1       0.5816             8.94    0.5835 (-0.33%)  8.21  
2       0.6428             10.64   0.5752 (10.52%)  4.05  
4       1.0152             1.99    0.9946 (2.03%)   2.56  
8       1.8128             1.4     1.7619 (2.81%)   1.88  
16      3.1666             0.8     3.1275 (1.23%)   0.42  
32      5.6084             0.83    5.5856 (0.41%)   0.89   

Sysbench MySQL on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads baseline     %stdev     patch              %stdev
8       2095.45      1.82       2084.72 (-0.51%)   1.65
16      4218.45      0.06       4179.69 (-0.92%)   0.18
32      7531.36      0.49       7623.18 (1.22%)    0.39
48      10206.42     0.21       10159.16 (-0.46%)  0.21
64      12053.73     0.1        12087.21 (0.28%)   0.19
128     14810.33     0.04       14894.08 (0.57%)   0.08

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch            %stdev
20      1               0.9     1.0056 (0.56%)   0.34
40      1               0.8     1.0173 (1.73%)   0.13
60      1               0.34    0.9995 (-0.05%)  0.85
80      1               0.53    1.0175 (1.75%)   1.56
100     1               0.79    1.0151 (1.51%)   1.31
120     1               0.06    1.0244 (2.44%)   0.5 
140     1               0.22    1.034 (3.4%)     0.66
160     1               0.57    1.0362 (3.62%)   0.07
180     1               0.81    1.041 (4.1%)     0.8
200     1               0.44    1.0233 (2.33%)   1.4  
220     1               1.74    1.0125 (1.25%)   1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev
8       45.36           0.43    46.94 (3.48%)    0.2  
16      87.81           0.82    91.75 (4.49%)    0.43 
32      151.19          0.02    167.74 (10.95%)  1.29 
48      190.2           0.21    200.57 (5.45%)   0.89 
64      190.42          0.35    226.74 (19.07%)  1.79 
128     323.86          0.28    348.12 (7.49%)   0.77 

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       629.8           600.19 (-4.7%)  
2       1159.65         1162.07 (0.21%) 
4       2121.61         2112.27 (-0.44%)
8       2620.52         2645.55 (0.96%) 
16      2879.31         2828.87 (-1.75%)
32      2791.24         2760.97 (-1.08%)
64      1853.07         1747.66 (-5.69%)
128     1484.95         1459.81 (-1.69%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline        patch
1       256.41          258.11 (0.67%)  
2       509.89          509.13 (-0.15%) 
4       999.44          1016.58 (1.72%) 
8       1982.7          2006.53 (1.2%)  
16      3891.51         3964.43 (1.87%) 
32      6819.24         7376.92 (8.18%) 
64      8542.95         9660.45 (13.08%) 
128     15277.6         15438.4 (1.05%) 

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline      %stdev   patch                %stdev
50              94            2.82     94.67 (-0.71%)       2.2  
75              124           2.13     124.67 (-0.54%)      1.67 
90              152           1.74     154.33 (-1.54%)      0.75 
95              171           2.11     176.67 (-3.31%)      0.86 
99              512.67        104.96   4130.33 (-705.65%)   79.41
99.5            2296          82.55    10066.67 (-338.44%)  26.15
99.9            12517.33      2.38     12869.33 (-2.81%)    0.8 

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline           %stdev  patch             %stdev
1       1.3085             6.65    1.2514 (4.36%)    11.1
2       1.4559             8.55    1.5433 (-6%)      3.05
4       2.6271             1.74    2.5626 (2.5%)     2.69
8       4.7089             3.01    4.5316 (3.77%)    2.95
16      8.7406             2.25    8.6585 (0.94%)    2.91
32      17.7835            1.01    17.175 (3.42%)    1.38
64      36.1901            0.65    35.5294 (1.83%)   1.02
128     72.6585            0.51    71.8821 (1.07%)   1.05

The following are the schbench performance numbers with both SIS_CORE and
SIS_PROP false. This recovers the latency increase caused by disabling
SIS_CORE.
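For reference, assuming a kernel with CONFIG_SCHED_DEBUG and the standard
sched_features debugfs interface, both settings can be toggled at runtime
once the series is applied by writing NO_SIS_CORE and NO_SIS_PROP to
/sys/kernel/debug/sched_features.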

Schbench on 2 socket, 44 core and 88 threads Intel x86 machine with 44
tasks (lower is better):
percentile      baseline      %stdev   patch               %stdev
50              94            2.82     93.33 (0.71%)       1.24  
75              124           2.13     122.67 (1.08%)      1.7   
90              152           1.74     149.33 (1.75%)      2.35  
95              171           2.11     167 (2.34%)         2.74
99              512.67        104.96   206 (59.82%)        8.86
99.5            2296          82.55    3121.67 (-35.96%)   97.37 
99.9            12517.33      2.38     12592 (-0.6%)       1.67

Changes from v2->v3:
- Use a shift instead of a multiplication to compute the limit
- Use a per-CPU variable to precompute the number of SMT siblings for x86

subhra mazumdar (7):
  sched: limit cpu search in select_idle_cpu
  sched: introduce per-cpu var next_cpu to track search limit
  sched: rotate the cpu search window for better spread
  sched: add sched feature to disable idle core search
  sched: SIS_CORE to disable idle core search
  x86/smpboot: introduce per-cpu variable for HT siblings
  sched: use per-cpu variable cpumask_weight_sibling

 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/topology.h        |  4 ++++
 kernel/sched/core.c             |  2 ++
 kernel/sched/fair.c             | 31 +++++++++++++++++++++++--------
 kernel/sched/features.h         |  1 +
 kernel/sched/sched.h            |  1 +
 8 files changed, 49 insertions(+), 9 deletions(-)

-- 
2.9.3



* [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-28 18:47   ` Parth Shah
  2019-06-27  1:29 ` [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Put upper and lower limits on the cpu search in select_idle_cpu. The lower
limit is the number of cpus in a core while the upper limit is twice that.
This ensures that for any architecture we will usually search beyond a core.
The upper limit also helps in keeping the search cost low and constant.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b58f08f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, limit, floor, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		floor = cpumask_weight(topology_sibling_cpumask(target));
+		if (floor < 2)
+			floor = 2;
+		limit = floor << 1;
+		if (span_avg > floor*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > limit)
+				nr = limit;
+		} else {
+			nr = floor;
+		}
 	}
 
 	time = local_clock();
-- 
2.9.3



* [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This allows the next search to start
from there. This is necessary for rotating the search window over the
entire LLC domain.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..80657fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,6 +24,7 @@
 #include <trace/events/sched.h>
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
 /*
@@ -5966,6 +5967,7 @@ void __init sched_init(void)
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
+		per_cpu(next_cpu, i) = -1;
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..4cecfa2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,6 +994,7 @@ static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3



* [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-28 11:54   ` Srikar Dronamraju
  2019-06-28 18:36   ` Parth Shah
  2019-06-27  1:29 ` [PATCH v3 4/7] sched: add sched feature to disable idle core search subhra mazumdar
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Rotate the cpu search window for better spread of threads. This will ensure
an idle cpu will quickly be found if one exists.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58f08f..c1ca88e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, limit, floor, nr = INT_MAX;
+	int cpu, limit, floor, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 		}
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3



* [PATCH v3 4/7] sched: add sched feature to disable idle core search
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
                   ` (2 preceding siblings ...)
  2019-06-27  1:29 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-27  1:29 ` [PATCH v3 5/7] sched: SIS_CORE " subhra mazumdar
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Add a new sched feature SIS_CORE to have an option to disable idle core
search (select_idle_core).

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/features.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 858589b..de4d506 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_CORE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.9.3



* [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
                   ` (3 preceding siblings ...)
  2019-06-27  1:29 ` [PATCH v3 4/7] sched: add sched feature to disable idle core search subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-28 19:01   ` Parth Shah
  2019-06-27  1:29 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Use SIS_CORE to disable idle core search. For some workloads
select_idle_core becomes a scalability bottleneck, and removing it improves
throughput. There are also workloads where disabling it can hurt latency,
so an option is needed.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1ca88e..6a74808 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+	if (sched_feat(SIS_CORE)) {
+		i = select_idle_core(p, sd, target);
+		if ((unsigned)i < nr_cpumask_bits)
+			return i;
+	}
 
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
-- 
2.9.3



* [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
                   ` (4 preceding siblings ...)
  2019-06-27  1:29 ` [PATCH v3 5/7] sched: SIS_CORE " subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-06-27  6:51   ` Thomas Gleixner
  2019-06-27  1:29 ` [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling subhra mazumdar
  2019-07-01  9:02 ` [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path Peter Zijlstra
  7 siblings, 1 reply; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
This will be used for quick lookup in select_idle_cpu to determine the
limits of search. This patch does it only for x86.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 arch/x86/include/asm/smp.h      |  1 +
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/smpboot.c       | 17 ++++++++++++++++-
 include/linux/topology.h        |  4 ++++
 4 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index da545df..1e90cbd 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -22,6 +22,7 @@ extern int smp_num_siblings;
 extern unsigned int num_processors;
 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 453cf38..dd19c71 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -111,6 +111,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
 #ifdef CONFIG_SMP
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
+#define topology_sibling_weight(cpu)	(per_cpu(cpumask_weight_sibling, cpu))
 
 extern unsigned int __max_logical_packages;
 #define topology_max_packages()			(__max_logical_packages)
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 362dd89..20bf676 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -85,6 +85,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
+/* representing number of HT siblings of each CPU */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
+EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
+
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
@@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
 
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+		per_cpu(cpumask_weight_sibling, cpu) =
+		    cpumask_weight(topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		c->booted_cores = 1;
@@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
 	for_each_cpu(i, cpu_sibling_setup_mask) {
 		o = &cpu_data(i);
 
-		if ((i == cpu) || (has_smt && match_smt(c, o)))
+		if ((i == cpu) || (has_smt && match_smt(c, o))) {
 			link_mask(topology_sibling_cpumask, cpu, i);
+			threads = cpumask_weight(topology_sibling_cpumask(cpu));
+			per_cpu(cpumask_weight_sibling, cpu) = threads;
+			per_cpu(cpumask_weight_sibling, i) = threads;
+		}
 
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
@@ -1173,6 +1183,8 @@ static __init void disable_smp(void)
 	else
 		physid_set_mask_of_physid(0, &phys_cpu_present_map);
 	cpumask_set_cpu(0, topology_sibling_cpumask(0));
+	per_cpu(cpumask_weight_sibling, 0) =
+	    cpumask_weight(topology_sibling_cpumask(0));
 	cpumask_set_cpu(0, topology_core_cpumask(0));
 }
 
@@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
 
 	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
 		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
+		per_cpu(cpumask_weight_sibling, sibling) =
+		    cpumask_weight(topology_sibling_cpumask(sibling));
 		/*/
 		 * last thread sibling in this cpu core going down
 		 */
@@ -1495,6 +1509,7 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
+	per_cpu(cpumask_weight_sibling, cpu) = 0;
 	cpumask_clear(topology_core_cpumask(cpu));
 	c->cpu_core_id = 0;
 	c->booted_cores = 0;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index cb0775e..a85aea1 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -190,6 +190,10 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_sibling_cpumask
 #define topology_sibling_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_sibling_weight
+#define topology_sibling_weight(cpu)	\
+		cpumask_weight(topology_sibling_cpumask(cpu))
+#endif
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
-- 
2.9.3



* [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
                   ` (5 preceding siblings ...)
  2019-06-27  1:29 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
@ 2019-06-27  1:29 ` subhra mazumdar
  2019-07-01  9:02 ` [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path Peter Zijlstra
  7 siblings, 0 replies; 33+ messages in thread
From: subhra mazumdar @ 2019-06-27  1:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Use the per-cpu variable cpumask_weight_sibling for quick lookup in
select_idle_cpu. This is the fast path of the scheduler and every cycle is
worth saving. Using cpumask_weight here can mean iterating over the cpumask
on every call.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a74808..878f11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6206,7 +6206,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		floor = cpumask_weight(topology_sibling_cpumask(target));
+		floor = topology_sibling_weight(target);
 		if (floor < 2)
 			floor = 2;
 		limit = floor << 1;
-- 
2.9.3



* Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-27  1:29 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
@ 2019-06-27  6:51   ` Thomas Gleixner
  2019-06-27  6:54     ` Thomas Gleixner
  2019-06-28  1:02     ` Subhra Mazumdar
  0 siblings, 2 replies; 33+ messages in thread
From: Thomas Gleixner @ 2019-06-27  6:51 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

On Wed, 26 Jun 2019, subhra mazumdar wrote:

> Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
> This will be used for quick lookup in select_idle_cpu to determine the
> limits of search.

Why? The number of siblings is constant at least today unless you play
silly cpu hotplug games. A bit more justification for adding yet another
random storage would be appreciated.

> This patch does it only for x86.

# grep 'This patch' Documentation/process/submitting-patches.rst

IOW, we all know already that this is a patch and from the subject prefix
and the diffstat it's pretty obvious that this is x86 only.

So instead of documenting the obvious, please add proper context to justify
the change.
 
> +/* representing number of HT siblings of each CPU */
> +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
> +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);

Why does this need an export? No module has any reason to access this.

>  /* representing HT and core siblings of each logical CPU */
>  DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
> @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
>  
>  	if (!has_mp) {
>  		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
> +		per_cpu(cpumask_weight_sibling, cpu) =
> +		    cpumask_weight(topology_sibling_cpumask(cpu));
>  		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
>  		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
>  		c->booted_cores = 1;
> @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
>  	for_each_cpu(i, cpu_sibling_setup_mask) {
>  		o = &cpu_data(i);
>  
> -		if ((i == cpu) || (has_smt && match_smt(c, o)))
> +		if ((i == cpu) || (has_smt && match_smt(c, o))) {
>  			link_mask(topology_sibling_cpumask, cpu, i);
> +			threads = cpumask_weight(topology_sibling_cpumask(cpu));
> +			per_cpu(cpumask_weight_sibling, cpu) = threads;
> +			per_cpu(cpumask_weight_sibling, i) = threads;

This only works for SMT=2, but fails to update the rest for SMT=4.

> @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
>  
>  	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
>  		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
> +		per_cpu(cpumask_weight_sibling, sibling) =
> +		    cpumask_weight(topology_sibling_cpumask(sibling));

While remove does the right thing.

Thanks,

	tglx


* Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-27  6:51   ` Thomas Gleixner
@ 2019-06-27  6:54     ` Thomas Gleixner
  2019-06-28  1:06       ` Subhra Mazumdar
  2019-06-28  1:02     ` Subhra Mazumdar
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2019-06-27  6:54 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

On Thu, 27 Jun 2019, Thomas Gleixner wrote:

> On Wed, 26 Jun 2019, subhra mazumdar wrote:
> 
> > Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
> > This will be used for quick lookup in select_idle_cpu to determine the
> > limits of search.
> 
> Why? The number of siblings is constant at least today unless you play
> silly cpu hotplug games. A bit more justification for adding yet another
> random storage would be appreciated.
> 
> > This patch does it only for x86.
> 
> # grep 'This patch' Documentation/process/submitting-patches.rst
> 
> IOW, we all know already that this is a patch and from the subject prefix
> and the diffstat it's pretty obvious that this is x86 only.
> 
> So instead of documenting the obvious, please add proper context to justify
> the change.

Aside from that, the right ordering is to introduce the default fallback in
a separate patch, which explains the reasoning, and then add the x86
optimized version in the next one.

Thanks,

	tglx


* Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-27  6:51   ` Thomas Gleixner
  2019-06-27  6:54     ` Thomas Gleixner
@ 2019-06-28  1:02     ` Subhra Mazumdar
  1 sibling, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28  1:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, mingo, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/26/19 11:51 PM, Thomas Gleixner wrote:
> On Wed, 26 Jun 2019, subhra mazumdar wrote:
>
>> Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
>> This will be used for quick lookup in select_idle_cpu to determine the
>> limits of search.
> Why? The number of siblings is constant at least today unless you play
> silly cpu hotplug games. A bit more justification for adding yet another
> random storage would be appreciated.
Using cpumask_weight every time in select_idle_cpu to compute the number of
SMT siblings can be costly, as cpumask_weight may not be O(1) for systems
with a large number of CPUs (e.g. 8 sockets, each socket having lots of
cores). Over 512 CPUs the bitmask will span multiple cache lines, and
touching multiple cache lines in the fast path of the scheduler can cost
more than we save from this optimization. Even within a single cache line
it loops over longs. We want to touch O(1) cache lines and do O(1)
operations, hence we pre-compute it in a per-CPU variable.
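
Roughly, the difference being described is a word-by-word scan of the
cpumask versus a single per-cpu load (an illustrative sketch only, not the
exact kernel implementation):

	/* cpumask_weight(): roughly one hweight_long() per word of the mask */
	static inline unsigned int weight_by_scan(const struct cpumask *mask)
	{
		unsigned int i, w = 0;

		for (i = 0; i < BITS_TO_LONGS(nr_cpumask_bits); i++)
			w += hweight_long(cpumask_bits(mask)[i]);
		return w;
	}

	/* proposed lookup: a single read of a read-mostly per-cpu variable */
	static inline unsigned int weight_precomputed(int cpu)
	{
		return per_cpu(cpumask_weight_sibling, cpu);
	}
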
>
>> This patch does it only for x86.
> # grep 'This patch' Documentation/process/submitting-patches.rst
>
> IOW, we all know already that this is a patch and from the subject prefix
> and the diffstat it's pretty obvious that this is x86 only.
>
> So instead of documenting the obvious, please add proper context to justify
> the change.
Ok. The extra per-CPU optimization was done only for x86 as we cared about
it the most and wanted to make it future proof. I will add it for other
architectures.
>   
>> +/* representing number of HT siblings of each CPU */
>> +DEFINE_PER_CPU_READ_MOSTLY(unsigned int, cpumask_weight_sibling);
>> +EXPORT_PER_CPU_SYMBOL(cpumask_weight_sibling);
> Why does this need an export? No module has any reason to access this.
I will remove it
>
>>   /* representing HT and core siblings of each logical CPU */
>>   DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
>>   EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>> @@ -520,6 +524,8 @@ void set_cpu_sibling_map(int cpu)
>>   
>>   	if (!has_mp) {
>>   		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
>> +		per_cpu(cpumask_weight_sibling, cpu) =
>> +		    cpumask_weight(topology_sibling_cpumask(cpu));
>>   		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
>>   		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
>>   		c->booted_cores = 1;
>> @@ -529,8 +535,12 @@ void set_cpu_sibling_map(int cpu)
>>   	for_each_cpu(i, cpu_sibling_setup_mask) {
>>   		o = &cpu_data(i);
>>   
>> -		if ((i == cpu) || (has_smt && match_smt(c, o)))
>> +		if ((i == cpu) || (has_smt && match_smt(c, o))) {
>>   			link_mask(topology_sibling_cpumask, cpu, i);
>> +			threads = cpumask_weight(topology_sibling_cpumask(cpu));
>> +			per_cpu(cpumask_weight_sibling, cpu) = threads;
>> +			per_cpu(cpumask_weight_sibling, i) = threads;
> This only works for SMT=2, but fails to update the rest for SMT=4.

I guess I assumed that x86 would always be SMT2; I will fix this.
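
One possible way to keep the weights correct for SMT4 as well (an untested
sketch, not part of the posted series) would be to refresh every thread in
the updated sibling mask after link_mask():

	if ((i == cpu) || (has_smt && match_smt(c, o))) {
		int sib;

		link_mask(topology_sibling_cpumask, cpu, i);
		threads = cpumask_weight(topology_sibling_cpumask(cpu));
		/* the new weight applies to all threads in the mask */
		for_each_cpu(sib, topology_sibling_cpumask(cpu))
			per_cpu(cpumask_weight_sibling, sib) = threads;
	}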

Thanks,
Subhra

>
>> @@ -1482,6 +1494,8 @@ static void remove_siblinginfo(int cpu)
>>   
>>   	for_each_cpu(sibling, topology_core_cpumask(cpu)) {
>>   		cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
>> +		per_cpu(cpumask_weight_sibling, sibling) =
>> +		    cpumask_weight(topology_sibling_cpumask(sibling));
> While remove does the right thing.
>
> Thanks,
>
> 	tglx


* Re: [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings
  2019-06-27  6:54     ` Thomas Gleixner
@ 2019-06-28  1:06       ` Subhra Mazumdar
  0 siblings, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28  1:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, peterz, mingo, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/26/19 11:54 PM, Thomas Gleixner wrote:
> On Thu, 27 Jun 2019, Thomas Gleixner wrote:
>
>> On Wed, 26 Jun 2019, subhra mazumdar wrote:
>>
>>> Introduce a per-cpu variable to keep the number of HT siblings of a cpu.
>>> This will be used for quick lookup in select_idle_cpu to determine the
>>> limits of search.
>> Why? The number of siblings is constant at least today unless you play
>> silly cpu hotplug games. A bit more justification for adding yet another
>> random storage would be appreciated.
>>
>>> This patch does it only for x86.
>> # grep 'This patch' Documentation/process/submitting-patches.rst
>>
>> IOW, we all know already that this is a patch and from the subject prefix
>> and the diffstat it's pretty obvious that this is x86 only.
>>
>> So instead of documenting the obvious, please add proper context to justify
>> the change.
> Aside of that the right ordering is to introduce the default fallback in a
> separate patch, which explains the reasoning and then in the next one add
> the x86 optimized version.
OK. I will also add the extra optimization for other architectures.

Thanks,
Subhra
>
> Thanks,
>
> 	tglx


* Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-27  1:29 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
@ 2019-06-28 11:54   ` Srikar Dronamraju
  2019-06-28 22:34     ` Subhra Mazumdar
  2019-06-28 18:36   ` Parth Shah
  1 sibling, 1 reply; 33+ messages in thread
From: Srikar Dronamraju @ 2019-06-28 11:54 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

* subhra mazumdar <subhra.mazumdar@oracle.com> [2019-06-26 18:29:15]:

> Rotate the cpu search window for better spread of threads. This will ensure
> an idle cpu will quickly be found if one exists.

While rotating the cpu search window is good, I am not sure this can find
an idle cpu quickly. The probability of finding an idle cpu should still
remain the same. No?

> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		}
>  	}
>  
> +	if (per_cpu(next_cpu, target) != -1)
> +		target_tmp = per_cpu(next_cpu, target);
> +	else
> +		target_tmp = target;
> +
>  	time = local_clock();
>  
> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +		per_cpu(next_cpu, target) = cpu;

Shouldn't this assignment be outside the for loop?
With the current code,
1. We keep reassigning multiple times.
2. The last assignment happens for an idle cpu, but sometimes the
assignment is for a non-idle cpu.

>  		if (!--nr)
>  			return -1;
>  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> -- 
> 2.9.3
> 

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-27  1:29 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
  2019-06-28 11:54   ` Srikar Dronamraju
@ 2019-06-28 18:36   ` Parth Shah
  2019-06-28 22:14     ` Subhra Mazumdar
  1 sibling, 1 reply; 33+ messages in thread
From: Parth Shah @ 2019-06-28 18:36 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman

Hi Subhra,

I ran your patch series on IBM POWER systems and this is what I have observed.

On 6/27/19 6:59 AM, subhra mazumdar wrote:
> Rotate the cpu search window for better spread of threads. This will ensure
> an idle cpu will quickly be found if one exists.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b58f08f..c1ca88e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, limit, floor, nr = INT_MAX;
> +	int cpu, limit, floor, target_tmp, nr = INT_MAX;
> 
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		}
>  	}
> 
> +	if (per_cpu(next_cpu, target) != -1)
> +		target_tmp = per_cpu(next_cpu, target);
> +	else
> +		target_tmp = target;
> +
>  	time = local_clock();
> 
> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +		per_cpu(next_cpu, target) = cpu;

This leads to a problem of cache hotness.
AFAIU, in most cases, `target = prev_cpu` of the task being woken up, and
selecting an idle CPU nearest to the prev_cpu is favorable.
But since this doesn't keep track of the last idle cpu per task, it fails
to find the nearest possible idle CPU in cases where the task is woken up
after another task has been scheduled.

Consider below scenario:
=======================
- System: 44 cores, 88 CPUs
- 44 CPU intensive tasks pinned to any CPU in each core. This makes 'select_idle_core' return -1;
- Consider below shown timeline:
- Task T1 runs for time 0-5 on CPU0
- Then task T2 runs for time 6-10 on CPU0
- T1 wakes at time 7, with target=0, and setting
  per_cpu(next_cpu,0)= 4 (let's say cpu 0-3 are busy at the time)
- So T1 runs for time 7-12 on CPU4.
- Meanwhile, T2 wakes at time 11, with target=0, but per_cpu(next_cpu, 0) is 4. So it starts
  searching from CPU4 and ends up at CPU 8 or so, even though CPU0 is free at that time.
- This moves further and further away from the prev_cpu on each such iteration unless it wraps around after 44 CPUs.


              ^T1           T1$   ^T2          T2$
CPU 0          |             |     |            |      
       -----------------------------------------------------------------------------
               0             5     6            10                       time----->

                                       ^T1                T1$
CPU 4                                   |                  |
       -----------------------------------------------------------------------------
                                        7                  12            time----->

                                                      ^T2          T2$
CPU 8                                                  |            |
       -----------------------------------------------------------------------------
                                                       11                time------>

Symbols: ^Tn: Task Tn wake-up,  Tn$: task Tn sleeps

The above example indicates that both tasks T1 and T2 suffer from cache hotness in further iterations.


>  		if (!--nr)
>  			return -1;
>  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> 


Best
Parth



* Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
@ 2019-06-28 18:47   ` Parth Shah
  2019-06-28 22:21     ` Subhra Mazumdar
  0 siblings, 1 reply; 33+ messages in thread
From: Parth Shah @ 2019-06-28 18:47 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman



On 6/27/19 6:59 AM, subhra mazumdar wrote:
> Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
> is amount of cpus in a core while upper limit is twice that. This ensures
> for any architecture we will usually search beyond a core. The upper limit
> also helps in keeping the search cost low and constant.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b58f08f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	u64 avg_cost, avg_idle;
>  	u64 time, cost;
>  	s64 delta;
> -	int cpu, nr = INT_MAX;
> +	int cpu, limit, floor, nr = INT_MAX;
> 
>  	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>  	if (!this_sd)
> @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> 
>  	if (sched_feat(SIS_PROP)) {
>  		u64 span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> +		floor = cpumask_weight(topology_sibling_cpumask(target));
> +		if (floor < 2)
> +			floor = 2;
> +		limit = floor << 1;

Is the upper limit an experimental value only, or does it have any
arch-specific significance? Because, AFAIU, systems like POWER9 might
benefit from searching across 4 cores due to their different cache model.
It could then be tuned for arch-specific builds.

Also variable names can be changed for better readability.
floor -> weight_clamp_min
limit -> weight_clamp_max
or something similar


> +		if (span_avg > floor*avg_cost) {
>  			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> +			if (nr > limit)
> +				nr = limit;
> +		} else {
> +			nr = floor;
> +		}
>  	}
> 
>  	time = local_clock();
> 


Best,
Parth



* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-06-27  1:29 ` [PATCH v3 5/7] sched: SIS_CORE " subhra mazumdar
@ 2019-06-28 19:01   ` Parth Shah
  2019-06-28 22:29     ` Subhra Mazumdar
  0 siblings, 1 reply; 33+ messages in thread
From: Parth Shah @ 2019-06-28 19:01 UTC (permalink / raw)
  To: subhra mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman



On 6/27/19 6:59 AM, subhra mazumdar wrote:
> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
> 
> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
> ---
>  kernel/sched/fair.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c1ca88e..6a74808 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if (!sd)
>  		return target;
> 
> -	i = select_idle_core(p, sd, target);
> -	if ((unsigned)i < nr_cpumask_bits)
> -		return i;
> +	if (sched_feat(SIS_CORE)) {
> +		i = select_idle_core(p, sd, target);
> +		if ((unsigned)i < nr_cpumask_bits)
> +			return i;
> +	}

This can cause a significant performance loss if disabled. select_idle_core spreads
workloads quickly across the cores, hence disabling it leaves much of the work to
be offloaded to the load balancer to move tasks across the cores. Latency-sensitive
and long-running multi-threaded workloads should see a regression under these conditions.

Also, systems like POWER9 have sd_llc as a pair of cores only. So they
won't benefit from the limits, and hence hiding your code in select_idle_cpu
behind static keys would be much preferred.

> 
>  	i = select_idle_cpu(p, sd, target);
>  	if ((unsigned)i < nr_cpumask_bits)
> 


Best,
Parth



* Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-28 18:36   ` Parth Shah
@ 2019-06-28 22:14     ` Subhra Mazumdar
  0 siblings, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28 22:14 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/28/19 11:36 AM, Parth Shah wrote:
> Hi Subhra,
>
> I ran your patch series on IBM POWER systems and this is what I have observed.
>
> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>> Rotate the cpu search window for better spread of threads. This will ensure
>> an idle cpu will quickly be found if one exists.
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   kernel/sched/fair.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b58f08f..c1ca88e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>   	u64 avg_cost, avg_idle;
>>   	u64 time, cost;
>>   	s64 delta;
>> -	int cpu, limit, floor, nr = INT_MAX;
>> +	int cpu, limit, floor, target_tmp, nr = INT_MAX;
>>
>>   	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>   	if (!this_sd)
>> @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>   		}
>>   	}
>>
>> +	if (per_cpu(next_cpu, target) != -1)
>> +		target_tmp = per_cpu(next_cpu, target);
>> +	else
>> +		target_tmp = target;
>> +
>>   	time = local_clock();
>>
>> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
>> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
>> +		per_cpu(next_cpu, target) = cpu;
> This leads to a problem of cache hotness.
> AFAIU, in most cases, `target = prev_cpu` of the task being woken up and
> selecting an idle CPU nearest to the prev_cpu is favorable.
> But since this doesn't keep track of last idle cpu per task, it fails to find the
> nearest possible idle CPU in cases when the task is being woken up after other
> scheduled task.
>
I had tested hackbench on SPARC SMT8 (see numbers in the cover letter) and
it showed improvement with this. Firstly, it's a tradeoff between cache
effects and the time spent searching for an idle CPU, and both the x86 and
SPARC numbers showed the tradeoff is worth it. Secondly, there is a lot of
cache-affinity logic at the beginning of select_idle_sibling. If
select_idle_cpu is still called, that means we are past that and want any
idle cpu. I don't think waking up close to the prev cpu is the intention of
starting the search from there; rather it is to spread threads across all
cpus so that no cpu gets victimized, as there is no atomicity. The prev cpu
just acts as a good seed for the spreading.

Thanks,
Subhra


* Re: [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu
  2019-06-28 18:47   ` Parth Shah
@ 2019-06-28 22:21     ` Subhra Mazumdar
  0 siblings, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28 22:21 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/28/19 11:47 AM, Parth Shah wrote:
>
> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>> Put upper and lower limit on cpu search of select_idle_cpu. The lower limit
>> is amount of cpus in a core while upper limit is twice that. This ensures
>> for any architecture we will usually search beyond a core. The upper limit
>> also helps in keeping the search cost low and constant.
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   kernel/sched/fair.c | 15 +++++++++++----
>>   1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f35930f..b58f08f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6188,7 +6188,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>   	u64 avg_cost, avg_idle;
>>   	u64 time, cost;
>>   	s64 delta;
>> -	int cpu, nr = INT_MAX;
>> +	int cpu, limit, floor, nr = INT_MAX;
>>
>>   	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>>   	if (!this_sd)
>> @@ -6206,10 +6206,17 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>
>>   	if (sched_feat(SIS_PROP)) {
>>   		u64 span_avg = sd->span_weight * avg_idle;
>> -		if (span_avg > 4*avg_cost)
>> +		floor = cpumask_weight(topology_sibling_cpumask(target));
>> +		if (floor < 2)
>> +			floor = 2;
>> +		limit = floor << 1;
> Is upper limit an experimental value only or it has any arch specific significance?
> Because, AFAIU, systems like POWER9 might have benefit for searching for 4-cores
> due to its different cache model. So it can be tuned for arch specific builds then.
The lower bound and upper bound were 1 core and 2 cores respectively. That
was done so as to search beyond one core while at the same time not
searching too much. It is a heuristic that, coupled with the moving window
mechanism, seemed to work well on all archs. Does 4 vs 2 make any
difference on your POWER9? AFAIR it didn't on SPARC SMT8.
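(Concretely, with the series the search window works out to 2-4 cpus on
SMT2 and 8-16 cpus on SMT8.)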
>
> Also variable names can be changed for better readability.
> floor -> weight_clamp_min
> limit -> weight_clamp_max
> or something similar
OK.

Thanks,
Subhra
>
>
>> +		if (span_avg > floor*avg_cost) {
>>   			nr = div_u64(span_avg, avg_cost);
>> -		else
>> -			nr = 4;
>> +			if (nr > limit)
>> +				nr = limit;
>> +		} else {
>> +			nr = floor;
>> +		}
>>   	}
>>
>>   	time = local_clock();
>>
>
> Best,
> Parth
>


* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-06-28 19:01   ` Parth Shah
@ 2019-06-28 22:29     ` Subhra Mazumdar
  2019-07-01  9:57       ` Parth Shah
  0 siblings, 1 reply; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28 22:29 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/28/19 12:01 PM, Parth Shah wrote:
>
> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>> Use SIS_CORE to disable idle core search. For some workloads
>> select_idle_core becomes a scalability bottleneck, removing it improves
>> throughput. Also there are workloads where disabling it can hurt latency,
>> so need to have an option.
>>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   kernel/sched/fair.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c1ca88e..6a74808 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>   	if (!sd)
>>   		return target;
>>
>> -	i = select_idle_core(p, sd, target);
>> -	if ((unsigned)i < nr_cpumask_bits)
>> -		return i;
>> +	if (sched_feat(SIS_CORE)) {
>> +		i = select_idle_core(p, sd, target);
>> +		if ((unsigned)i < nr_cpumask_bits)
>> +			return i;
>> +	}
> This can cause significant performance loss if disabled. select_idle_core spreads
> workloads quickly across the cores, hence disabling it leaves much of the work to
> be offloaded to the load balancer to move tasks across the cores. Latency sensitive
> and long running multi-threaded workloads should see a regression under these conditions.
Yes in case of SPARC SMT8 I did notice that (see cover letter). That's why
it is a feature that is ON by default, but can be turned OFF for specific
workloads on x86 SMT2 that can benefit from it.
> Also, systems like POWER9 have sd_llc as a pair of cores only. So they
> won't benefit from the limits, and hence hiding your code in select_idle_cpu
> behind static keys would be much preferred.
If it doesn't hurt then I don't see the point.

Thanks,
Subhra


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 3/7] sched: rotate the cpu search window for better spread
  2019-06-28 11:54   ` Srikar Dronamraju
@ 2019-06-28 22:34     ` Subhra Mazumdar
  0 siblings, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-06-28 22:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: linux-kernel, peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 6/28/19 4:54 AM, Srikar Dronamraju wrote:
> * subhra mazumdar <subhra.mazumdar@oracle.com> [2019-06-26 18:29:15]:
>
>> Rotate the cpu search window for better spread of threads. This will ensure
>> an idle cpu will quickly be found if one exists.
> While rotating the cpu search window is good, I'm not sure this can find an
> idle cpu quickly. The probability of finding an idle cpu should still remain
> the same. No?
>
>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>> ---
>>   kernel/sched/fair.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> @@ -6219,9 +6219,15 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>   		}
>>   	}
>>   
>> +	if (per_cpu(next_cpu, target) != -1)
>> +		target_tmp = per_cpu(next_cpu, target);
>> +	else
>> +		target_tmp = target;
>> +
>>   	time = local_clock();
>>   
>> -	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
>> +	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
>> +		per_cpu(next_cpu, target) = cpu;
> Shouldn't this assignment be outside the for loop?
> With the current code,
> 1. We keep reassigning multiple times.
> 2. The last assignment happens for an idle cpu and sometimes the
> assignment is for a non-idle cpu.
We want the last assignment irrespective of whether it was an idle cpu or not, since
in both cases we want to track the boundary of the search.
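
To make the boundary tracking above concrete, here is a self-contained
user-space sketch (not the kernel code): each target cpu remembers where its
last scan stopped and the next scan resumes from there, so repeated short
scans sweep the whole LLC over time. The cpu count, scan budget and idle map
are invented for the example.

#include <stdio.h>

#define NR_CPUS 8

static int next_cpu[NR_CPUS] = { [0 ... NR_CPUS - 1] = -1 };

static int scan_llc(int target, const int idle[NR_CPUS], int nr)
{
        int start = next_cpu[target] >= 0 ? next_cpu[target] : target;

        for (int i = 0; i < NR_CPUS && nr; i++, nr--) {
                int cpu = (start + i) % NR_CPUS;

                next_cpu[target] = cpu;         /* remember the boundary */
                if (idle[cpu])
                        return cpu;
        }
        return -1;
}

int main(void)
{
        int idle[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

        /* repeated short scans (nr = 2) from target 0: the window slides
         * forward on every attempt and eventually reaches the idle cpus */
        for (int attempt = 1; attempt <= 4; attempt++)
                printf("attempt %d -> cpu %d\n", attempt, scan_llc(0, idle, 2));
        return 0;
}
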

Thanks,
Subhra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
                   ` (6 preceding siblings ...)
  2019-06-27  1:29 ` [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling subhra mazumdar
@ 2019-07-01  9:02 ` Peter Zijlstra
  2019-07-01 13:55   ` Patrick Bellasi
  7 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-01  9:02 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner, riel, morten.rasmussen

On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> Hi,
> 
> Resending this patchset, will be good to get some feedback. Any suggestions
> that will make it more acceptable are welcome. We have been shipping this
> with Unbreakable Enterprise Kernel in Oracle Linux.
> 
> Current select_idle_sibling first tries to find a fully idle core using
> select_idle_core which can potentially search all cores and if it fails it
> finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
> search all cpus in the llc domain. This doesn't scale for large llc domains
> and will only get worse with more cores in future.
> 
> This patch solves the scalability problem by:
>  - Setting an upper and lower limit of idle cpu search in select_idle_cpu
>    to keep search time low and constant
>  - Adding a new sched feature SIS_CORE to disable select_idle_core
> 
> Additionally it also introduces a new per-cpu variable next_cpu to track
> the limit of search so that every time search starts from where it ended.
> This rotating search window over cpus in LLC domain ensures that idle
> cpus are eventually found in case of high load.

Right, so we had a wee conversation about this patch series at OSPM, and
I don't see any of that reflected here :-(

Specifically, given that some people _really_ want the whole L3 mask
scanned to reduce tail latency over raw throughput, while you guys
prefer the other way around, it was proposed to extend the task model.

Specifically something like a latency-nice was mentioned (IIRC) where a
task can give a bias but not specify specific behaviour. This is very
important since we don't want to be ABI tied to specific behaviour.

Some of the things we could tie to this would be:

  - select_idle_siblings; -nice would scan more than +nice,

  - wakeup preemption; when the wakee has a relative smaller
    latency-nice value than the current running task, it might preempt
    sooner and the other way around of course.

  - pack-vs-spread; +nice would pack more with like tasks (since we
    already spread by default [0] I don't think -nice would affect much
    here).


Hmmm?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-06-28 22:29     ` Subhra Mazumdar
@ 2019-07-01  9:57       ` Parth Shah
  2019-07-01 20:37         ` Subhra Mazumdar
  0 siblings, 1 reply; 33+ messages in thread
From: Parth Shah @ 2019-07-01  9:57 UTC (permalink / raw)
  To: Subhra Mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman



On 6/29/19 3:59 AM, Subhra Mazumdar wrote:
> 
> On 6/28/19 12:01 PM, Parth Shah wrote:
>>
>> On 6/27/19 6:59 AM, subhra mazumdar wrote:
>>> Use SIS_CORE to disable idle core search. For some workloads
>>> select_idle_core becomes a scalability bottleneck, removing it improves
>>> throughput. Also there are workloads where disabling it can hurt latency,
>>> so need to have an option.
>>>
>>> Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
>>> ---
>>>   kernel/sched/fair.c | 8 +++++---
>>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c1ca88e..6a74808 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6280,9 +6280,11 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>       if (!sd)
>>>           return target;
>>>
>>> -    i = select_idle_core(p, sd, target);
>>> -    if ((unsigned)i < nr_cpumask_bits)
>>> -        return i;
>>> +    if (sched_feat(SIS_CORE)) {
>>> +        i = select_idle_core(p, sd, target);
>>> +        if ((unsigned)i < nr_cpumask_bits)
>>> +            return i;
>>> +    }
>> This can have significant performance loss if disabled. The select_idle_core spreads
>> workloads quickly across the cores, hence disabling this leaves much of the work to
>> be offloaded to load balancer to move task across the cores. Latency sensitive
>> and long running multi-threaded workload should see the regression under this conditions.
> Yes in case of SPARC SMT8 I did notice that (see cover letter). That's why
> it is a feature that is ON by default, but can be turned OFF for specific
> workloads on x86 SMT2 that can benefit from it.
>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>> behind static keys will be much preferred.
> If it doesn't hurt then I don't see the point.
> 

So these are the results from a POWER9 system with your patches:
System configuration: 2 Socket, 44 cores, 176 CPUs

Experiment setup: 
===========
=> Setup 1:
- 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
- perf bench sched messaging -g 1 -l 1000000
+-----------+--------+--------------+--------+
| Baseline  | stddev |    Patch     | stddev |
+-----------+--------+--------------+--------+
|       135 |   3.21 | 158(-17.03%) |   4.69 |
+-----------+--------+--------------+--------+

=> Setup 2:
- schbench -m44 -t 1
+=======+==========+=========+=========+==========+
| %ile  | Baseline | stddev  |  patch  |  stddev  |
+=======+==========+=========+=========+==========+
|    50 |       10 |    3.49 |      10 |     2.29 |
+-------+----------+---------+---------+----------+
|    95 |      467 |    4.47 |     469 |     0.81 |
+-------+----------+---------+---------+----------+
|    99 |      571 |   21.32 |     584 |    18.69 |
+-------+----------+---------+---------+----------+
|  99.5 |      629 |   30.05 |     641 |    20.95 |
+-------+----------+---------+---------+----------+
|  99.9 |      780 |   40.38 |     773 |     44.2 |
+-------+----------+---------+---------+----------+

I guess it doesn't make much difference in schbench results but hackbench (perf bench)
seems to have an observable regression.


Best,
Parth


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-01  9:02 ` [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path Peter Zijlstra
@ 2019-07-01 13:55   ` Patrick Bellasi
  2019-07-01 14:04     ` Peter Zijlstra
                       ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Patrick Bellasi @ 2019-07-01 13:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: subhra mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen

On 01-Jul 11:02, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > Hi,
> > 
> > Resending this patchset, will be good to get some feedback. Any suggestions
> > that will make it more acceptable are welcome. We have been shipping this
> > with Unbreakable Enterprise Kernel in Oracle Linux.
> > 
> > Current select_idle_sibling first tries to find a fully idle core using
> > select_idle_core which can potentially search all cores and if it fails it
> > finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
> > search all cpus in the llc domain. This doesn't scale for large llc domains
> > and will only get worse with more cores in future.
> > 
> > This patch solves the scalability problem by:
> >  - Setting an upper and lower limit of idle cpu search in select_idle_cpu
> >    to keep search time low and constant
> >  - Adding a new sched feature SIS_CORE to disable select_idle_core
> > 
> > Additionally it also introduces a new per-cpu variable next_cpu to track
> > the limit of search so that every time search starts from where it ended.
> > This rotating search window over cpus in LLC domain ensures that idle
> > cpus are eventually found in case of high load.
> 
> Right, so we had a wee conversation about this patch series at OSPM, and
> I don't see any of that reflected here :-(
> 
> Specifically, given that some people _really_ want the whole L3 mask
> scanned to reduce tail latency over raw throughput, while you guys
> prefer the other way around, it was proposed to extend the task model.
> 
> Specifically something like a latency-nice was mentioned (IIRC) where a

Right, AFAIR PaulT suggested to add support for the concept of a task
being "latency tolerant": meaning we can spend more time to search for
a CPU and/or avoid preempting the current task.

> task can give a bias but not specify specific behaviour. This is very
> important since we don't want to be ABI tied to specific behaviour.

I like the idea of biasing, especially considering we are still in the
domain of the FAIR scheduler. If something more mandatory should be
required there are other classes which are likely more appropriate.

> Some of the things we could tie to this would be:
> 
>   - select_idle_siblings; -nice would scan more than +nice,

Just to be sure, you are not proposing to use the nice value we
already have, i.e.
  p->{static,normal}_prio
but instead a new similar concept, right?

Otherwise, the pro would be that we don't touch userspace, but as a con we
would have side effects, i.e. on bandwidth allocation.
And I think we don't want to mix "specific behaviors" with "biases".

>   - wakeup preemption; when the wakee has a relative smaller
>     latency-nice value than the current running task, it might preempt
>     sooner and the other way around of course.

I think we currently have a single system-wide parameter for that:

   sched_wakeup_granularity_ns ==> sysctl_sched_min_granularity

which is used in:

   wakeup_gran()                for the wakeup path
   check_preempt_tick()         for the periodic tick

that's where it should be possible to extend the heuristics with some
biasing based on the latency-nice attribute of a task, right?
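
As a rough illustration of that kind of biasing (all assumptions: the
latency_nice range of [-20, 19] and the linear scaling are invented for the
sketch, and this is user-space code, not wakeup_gran() itself):

#include <stdio.h>

#define NSEC_PER_MSEC 1000000ULL

/* -20 roughly halves the granularity (preempt sooner),
 * +19 roughly doubles it (let the current task run longer) */
static unsigned long long biased_wakeup_gran(unsigned long long base_gran_ns,
                                             int latency_nice)
{
        long long delta = (long long)base_gran_ns * latency_nice / 40;

        return base_gran_ns + delta;
}

int main(void)
{
        unsigned long long base = 4 * NSEC_PER_MSEC;    /* example base granularity */

        for (int ln = -20; ln <= 19; ln += 13)
                printf("latency_nice %3d -> gran %llu ns\n",
                       ln, biased_wakeup_gran(base, ln));
        return 0;
}
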

>   - pack-vs-spread; +nice would pack more with like tasks (since we
>     already spread by default [0] I don't think -nice would affect much
>     here).

That will be very useful for the Android case too.
In Android we used to call it "prefer_idle"; that's probably not
the best name, but it is conceptually similar.

In Android we would use a latency-nice concept to go for either the
fast (select_idle_siblings) or the slow (energy aware) path.

> Hmmm?

Just one more requirement I think is worth considering from the
beginning: CGroups support.

That would be a very welcome interface, simply because it is so much more
convenient (and safe) to set these biases on a group of tasks depending
on their role in the system.

Do you have any idea how we could expose such a "latency-nice"
property via CGroups? It's very similar to cpu.shares but it does not
represent a resource which can be partitioned.

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-01 13:55   ` Patrick Bellasi
@ 2019-07-01 14:04     ` Peter Zijlstra
  2019-07-08 22:32       ` Tim Chen
  2019-07-01 14:06     ` Peter Zijlstra
  2019-07-02  0:01     ` Subhra Mazumdar
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-01 14:04 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: subhra mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen

On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:

> > Some of the things we could tie to this would be:
> > 
> >   - select_idle_siblings; -nice would scan more than +nice,
> 
> Just to be sure, you are not proposing to use the nice value we
> already have, i.e.
>   p->{static,normal}_prio
> but instead a new similar concept, right?

Correct; a new sched_attr::sched_latency_nice value, which is like
sched_nice, but controls a different dimension of behaviour.
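
From user space, that might look roughly like the sketch below. The struct
layout mirrors the existing sched_attr fields, but the sched_latency_nice
member is hypothetical; the call is expected to fail on a mainline kernel and
is shown only for the shape of the interface:

#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr_sketch {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
        int32_t  sched_latency_nice;    /* hypothetical new field */
};

int main(void)
{
        struct sched_attr_sketch attr = {
                .size = sizeof(attr),
                .sched_policy = 0,              /* SCHED_NORMAL */
                .sched_latency_nice = -10,      /* "I care about wakeup latency" */
        };

        /* pid 0 == current thread; fails today since the field is unknown */
        long ret = syscall(SYS_sched_setattr, 0, &attr, 0);

        printf("sched_setattr -> %ld\n", ret);
        return 0;
}
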

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-01 13:55   ` Patrick Bellasi
  2019-07-01 14:04     ` Peter Zijlstra
@ 2019-07-01 14:06     ` Peter Zijlstra
  2019-07-02  0:01     ` Subhra Mazumdar
  2 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2019-07-01 14:06 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: subhra mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen

On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:

> > Hmmm?
> 
> Just one more requirement I think it's worth to consider since the
> beginning: CGroups support
> 
> That would be very welcome interface. Just because is so much more
> convenient (and safe) to set these bias on a group of tasks depending
> on their role in the system.
> 
> Do you have any idea on how we can expose such a "latency-nice"
> property via CGroups? It's very similar to cpu.shares but it does not
> represent a resource which can be partitioned.

If the latency_nice idea lives; exactly like the normal nice? That is;
IIRC cgroupv2 has a nice value interface (see cpu_weight_nice_*()).

But yes, this isn't a resource per se; just a shared attribute like
thing.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-07-01  9:57       ` Parth Shah
@ 2019-07-01 20:37         ` Subhra Mazumdar
  2019-07-04 12:34           ` Parth Shah
  0 siblings, 1 reply; 33+ messages in thread
From: Subhra Mazumdar @ 2019-07-01 20:37 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


>>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>>> behind static keys will be much preferred.
>> If it doesn't hurt then I don't see the point.
>>
> So these is the result from POWER9 system with your patches:
> System configuration: 2 Socket, 44 cores, 176 CPUs
>
> Experiment setup:
> ===========
> => Setup 1:
> - 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
> - perf bench sched messaging -g 1 -l 1000000
> +-----------+--------+--------------+--------+
> | Baseline  | stddev |    Patch     | stddev |
> +-----------+--------+--------------+--------+
> |       135 |   3.21 | 158(-17.03%) |   4.69 |
> +-----------+--------+--------------+--------+
>
> => Setup 2:
> - schbench -m44 -t 1
> +=======+==========+=========+=========+==========+
> | %ile  | Baseline | stddev  |  patch  |  stddev  |
> +=======+==========+=========+=========+==========+
> |    50 |       10 |    3.49 |      10 |     2.29 |
> +-------+----------+---------+---------+----------+
> |    95 |      467 |    4.47 |     469 |     0.81 |
> +-------+----------+---------+---------+----------+
> |    99 |      571 |   21.32 |     584 |    18.69 |
> +-------+----------+---------+---------+----------+
> |  99.5 |      629 |   30.05 |     641 |    20.95 |
> +-------+----------+---------+---------+----------+
> |  99.9 |      780 |   40.38 |     773 |     44.2 |
> +-------+----------+---------+---------+----------+
>
> I guess it doesn't make much difference in schbench results but hackbench (perf bench)
> seems to have an observable regression.
>
>
> Best,
> Parth
>
If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
the select_idle_cpu changes, as the limits are 1 and 2 cores. Previously the
lower bound was 4 cpus and the upper bound was calculated proportionally. Now it
is 1 core (4 cpus on SMT4) and the upper bound is 2 cores. Could it be the extra
computation of cpumask_weight causing the regression rather than the
sliding window itself (one way to check this would be to hardcode 4 in place
of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit
surprised because SPARC SMT8, which has more cores in sd_llc and an L1 cache per
core, showed improvement with Hackbench.
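
The experiment suggested above might look roughly like this on top of the
select_idle_cpu() hunk quoted earlier (illustration only; later patches in the
series replace the cpumask_weight() call with a per-cpu weight, so the exact
line differs there):

-		floor = cpumask_weight(topology_sibling_cpumask(target));
-		if (floor < 2)
-			floor = 2;
+		floor = 4;	/* hardcoded, to rule out the weight-lookup cost */
 		limit = floor << 1;
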

Thanks,
Subhra

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-01 13:55   ` Patrick Bellasi
  2019-07-01 14:04     ` Peter Zijlstra
  2019-07-01 14:06     ` Peter Zijlstra
@ 2019-07-02  0:01     ` Subhra Mazumdar
  2019-07-02  8:54       ` Patrick Bellasi
  2 siblings, 1 reply; 33+ messages in thread
From: Subhra Mazumdar @ 2019-07-02  0:01 UTC (permalink / raw)
  To: Patrick Bellasi, Peter Zijlstra
  Cc: linux-kernel, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman, Paul Turner, riel, morten.rasmussen


On 7/1/19 6:55 AM, Patrick Bellasi wrote:
> On 01-Jul 11:02, Peter Zijlstra wrote:
>> On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
>>> Hi,
>>>
>>> Resending this patchset, will be good to get some feedback. Any suggestions
>>> that will make it more acceptable are welcome. We have been shipping this
>>> with Unbreakable Enterprise Kernel in Oracle Linux.
>>>
>>> Current select_idle_sibling first tries to find a fully idle core using
>>> select_idle_core which can potentially search all cores and if it fails it
>>> finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
>>> search all cpus in the llc domain. This doesn't scale for large llc domains
>>> and will only get worse with more cores in future.
>>>
>>> This patch solves the scalability problem by:
>>>   - Setting an upper and lower limit of idle cpu search in select_idle_cpu
>>>     to keep search time low and constant
>>>   - Adding a new sched feature SIS_CORE to disable select_idle_core
>>>
>>> Additionally it also introduces a new per-cpu variable next_cpu to track
>>> the limit of search so that every time search starts from where it ended.
>>> This rotating search window over cpus in LLC domain ensures that idle
>>> cpus are eventually found in case of high load.
>> Right, so we had a wee conversation about this patch series at OSPM, and
>> I don't see any of that reflected here :-(
>>
>> Specifically, given that some people _really_ want the whole L3 mask
>> scanned to reduce tail latency over raw throughput, while you guys
>> prefer the other way around, it was proposed to extend the task model.
>>
>> Specifically something like a latency-nice was mentioned (IIRC) where a
> Right, AFAIR PaulT suggested to add support for the concept of a task
> being "latency tolerant": meaning we can spend more time to search for
> a CPU and/or avoid preempting the current task.
>
Wondering if the searching and preempting needs will ever conflict?
Otherwise this sounds like a good direction to me. For the searching aspect, can
we map latency-nice values to the % of cores we search in select_idle_cpu?
That way the search cost can be controlled by the latency-nice value. But the issue
is that if more latency tolerant workloads are set to search less, we still need
some mechanism to achieve a good spread of threads. Can we keep the sliding
window mechanism in that case? Also, will latency-nice do anything for
select_idle_core and select_idle_smt?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-02  0:01     ` Subhra Mazumdar
@ 2019-07-02  8:54       ` Patrick Bellasi
  2019-07-03  3:52         ` Subhra Mazumdar
  0 siblings, 1 reply; 33+ messages in thread
From: Patrick Bellasi @ 2019-07-02  8:54 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: Peter Zijlstra, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen

On 01-Jul 17:01, Subhra Mazumdar wrote:
> 
> On 7/1/19 6:55 AM, Patrick Bellasi wrote:
> > On 01-Jul 11:02, Peter Zijlstra wrote:
> > > On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > > > Hi,
> > > > 
> > > > Resending this patchset, will be good to get some feedback. Any suggestions
> > > > that will make it more acceptable are welcome. We have been shipping this
> > > > with Unbreakable Enterprise Kernel in Oracle Linux.
> > > > 
> > > > Current select_idle_sibling first tries to find a fully idle core using
> > > > select_idle_core which can potentially search all cores and if it fails it
> > > > finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
> > > > search all cpus in the llc domain. This doesn't scale for large llc domains
> > > > and will only get worse with more cores in future.
> > > > 
> > > > This patch solves the scalability problem by:
> > > >   - Setting an upper and lower limit of idle cpu search in select_idle_cpu
> > > >     to keep search time low and constant
> > > >   - Adding a new sched feature SIS_CORE to disable select_idle_core
> > > > 
> > > > Additionally it also introduces a new per-cpu variable next_cpu to track
> > > > the limit of search so that every time search starts from where it ended.
> > > > This rotating search window over cpus in LLC domain ensures that idle
> > > > cpus are eventually found in case of high load.
> > > Right, so we had a wee conversation about this patch series at OSPM, and
> > > I don't see any of that reflected here :-(
> > > 
> > > Specifically, given that some people _really_ want the whole L3 mask
> > > scanned to reduce tail latency over raw throughput, while you guys
> > > prefer the other way around, it was proposed to extend the task model.
> > > 
> > > Specifically something like a latency-nice was mentioned (IIRC) where a
> > Right, AFAIR PaulT suggested to add support for the concept of a task
> > being "latency tolerant": meaning we can spend more time to search for
> > a CPU and/or avoid preempting the current task.
> > 
> Wondering if searching and preempting needs will ever be conflicting?

I guess the winning point is that we don't commit behaviors to
userspace, but just abstract concepts which are turned into biases.

I don't see conflicts right now: if you are latency tolerant that
means you can spend more time to try finding a better CPU (e.g. we can
use the energy model to compare multiple CPUs) _and/or_ give the
current task a better chance to complete by delaying its preemption.

> Otherwise sounds like a good direction to me. For the searching aspect, can
> we map latency nice values to the % of cores we search in select_idle_cpu?
> Thus the search cost can be controlled by latency nice value.

I guess that's worth a try; the only caveat I see is that it turns the
bias into something very platform specific. Meaning, the same
latency-nice value on different machines can have very different
results.

Would it not be better to try to find a more platform-independent mapping?

Maybe something time bounded, e.g. the higher the latency-nice the more
time we can spend looking for CPUs?

> But the issue is if more latency tolerant workloads set to less
> search, we still need some mechanism to achieve good spread of
> threads.

I don't get this example: why more latency tolerant workloads should
require less search?

> Can we keep the sliding window mechanism in that case?

Which one? Sorry, I did not go through the patches; can you briefly
summarize the idea?

> Also will latency nice do anything for select_idle_core and
> select_idle_smt?

I guess in principle the same bias can be used at different levels, maybe
with different mappings.

In the mobile world use-case we will likely use it only to switch from
select_idle_sibling to the energy aware slow path. And perhaps to see
if we can bias the wakeup preemption granularity.

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-02  8:54       ` Patrick Bellasi
@ 2019-07-03  3:52         ` Subhra Mazumdar
  2019-07-04 11:35           ` Parth Shah
  0 siblings, 1 reply; 33+ messages in thread
From: Subhra Mazumdar @ 2019-07-03  3:52 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen


On 7/2/19 1:54 AM, Patrick Bellasi wrote:
> Wondering if searching and preempting needs will ever be conflicting?
> I guess the winning point is that we don't commit behaviors to
> userspace, but just abstract concepts which are turned into biases.
>
> I don't see conflicts right now: if you are latency tolerant that
> means you can spend more time to try finding a better CPU (e.g. we can
> use the energy model to compare multiple CPUs) _and/or_ give the
> current task a better chance to complete by delaying its preemption.
OK
>
>> Otherwise sounds like a good direction to me. For the searching aspect, can
>> we map latency nice values to the % of cores we search in select_idle_cpu?
>> Thus the search cost can be controlled by latency nice value.
> I guess that's worth a try, only caveat I see is that it's turning the
> bias into something very platform specific. Meaning, the same
> latency-nice value on different machines can have very different
> results.
>
> Would not be better to try finding a more platform independent mapping?
>
> Maybe something time bounded, e.g. the higher the latency-nice the more
> time we can spend looking for CPUs?
The issue I see is that if we have a range of latency-nice values, it
should cover the entire range of search (one core to all cores). As Peter
said, some workloads will want to search the LLC fully. If we use an absolute
time, the mapping from the latency-nice value range to those times will be
arbitrary. If you have something in mind let me know, maybe I am thinking
about this differently.
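
One way to picture the mapping being debated, following the "-nice scans more
than +nice" direction mentioned earlier in the thread (everything below is
invented for illustration: the [-20, 19] range, the linear interpolation and
the machine sizes; nothing here is a proposed ABI):

#include <stdio.h>

/* map [-20, 19] linearly onto [whole LLC, one core]:
 * -20 (latency sensitive) scans everything, +19 scans a single core */
static int cpus_to_scan(int latency_nice, int llc_cpus, int core_cpus)
{
        int span = llc_cpus - core_cpus;
        int nr = llc_cpus - (latency_nice + 20) * span / 39;

        return nr < core_cpus ? core_cpus : nr;
}

int main(void)
{
        /* e.g. an 88-cpu LLC built from SMT2 cores */
        for (int ln = -20; ln <= 19; ln += 13)
                printf("latency_nice %3d -> scan %d cpus\n",
                       ln, cpus_to_scan(ln, 88, 2));
        return 0;
}
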
>
>> But the issue is if more latency tolerant workloads set to less
>> search, we still need some mechanism to achieve good spread of
>> threads.
> I don't get this example: why more latency tolerant workloads should
> require less search?
I guess I got the definition of "latency tolerant" backwards.
>
>> Can we keep the sliding window mechanism in that case?
> Which one? Sorry did not went through the patches, can you briefly
> resume the idea?
If a workload has set a low latency tolerance, then the search will be
smaller. That can lead to localization of threads on a few CPUs, as we are not
searching the entire LLC even if there are idle CPUs available. For this
I had introduced a per-CPU variable (for the target CPU) to track the
boundary of the search so that every scan starts from that boundary, thus
sliding the window. So even if we search very little, the search
window keeps shifting and gives us a good spread. This is orthogonal to the
latency-nice thing.
>
>> Also will latency nice do anything for select_idle_core and
>> select_idle_smt?
> I guess principle the same bias can be used at different levels, maybe
> with different mappings.
Doing it for select_idle_core will have the issue that the dynamic flag
(whether an idle core is present or not) can only be updated by threads
which are doing the full search.
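
A small user-space sketch of that constraint; it loosely mirrors the idea
behind the kernel's has-idle-cores hint (test_idle_cores()/set_idle_cores()),
but the data structures and numbers are invented and this is not fair.c code:

#include <stdbool.h>
#include <stdio.h>

#define NR_CORES 4

static bool has_idle_cores = true;      /* the cached hint */

static int scan_for_idle_core(const bool core_idle[NR_CORES], int max_cores)
{
        bool scanned_all = (max_cores >= NR_CORES);

        for (int c = 0; c < NR_CORES && max_cores; c++, max_cores--)
                if (core_idle[c])
                        return c;

        /* only a full scan may clear the hint: a truncated scan cannot
         * prove that no idle core exists */
        if (scanned_all)
                has_idle_cores = false;
        return -1;
}

int main(void)
{
        bool core_idle[NR_CORES] = { false, false, false, true };
        int core;

        core = scan_for_idle_core(core_idle, 2);        /* truncated scan */
        printf("partial scan: core %d, hint %d\n", core, has_idle_cores);

        core = scan_for_idle_core(core_idle, NR_CORES); /* full scan */
        printf("full scan:    core %d, hint %d\n", core, has_idle_cores);
        return 0;
}
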

Thanks,
Subhra

> In the mobile world use-case we will likely use it only to switch from
> select_idle_sibling to the energy aware slow path. And perhaps to see
> if we can bias the wakeup preemption granularity.
>
> Best,
> Patrick
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-03  3:52         ` Subhra Mazumdar
@ 2019-07-04 11:35           ` Parth Shah
  0 siblings, 0 replies; 33+ messages in thread
From: Parth Shah @ 2019-07-04 11:35 UTC (permalink / raw)
  To: Subhra Mazumdar, Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	tim.c.chen, mgorman, Paul Turner, riel, morten.rasmussen

Hi,

On 7/3/19 9:22 AM, Subhra Mazumdar wrote:
> 
> On 7/2/19 1:54 AM, Patrick Bellasi wrote:
>> Wondering if searching and preempting needs will ever be conflicting?
>> I guess the winning point is that we don't commit behaviors to
>> userspace, but just abstract concepts which are turned into biases.
>>
>> I don't see conflicts right now: if you are latency tolerant that
>> means you can spend more time to try finding a better CPU (e.g. we can
>> use the energy model to compare multiple CPUs) _and/or_ give the
>> current task a better chance to complete by delaying its preemption.
> OK
>>
>>> Otherwise sounds like a good direction to me. For the searching aspect, can
>>> we map latency nice values to the % of cores we search in select_idle_cpu?
>>> Thus the search cost can be controlled by latency nice value.
>> I guess that's worth a try, only caveat I see is that it's turning the
>> bias into something very platform specific. Meaning, the same
>> latency-nice value on different machines can have very different
>> results.
>>
>> Would not be better to try finding a more platform independent mapping?
>>
>> Maybe something time bounded, e.g. the higher the latency-nice the more
>> time we can spend looking for CPUs?
> The issue I see is suppose we have a range of latency-nice values, then it
> should cover the entire range of search (one core to all cores). As Peter
> said some workloads will want to search the LLC fully. If we have absolute
> time, the map of latency-nice values range to them will be arbitrary. If
> you have something in mind let me know, may be I am thinking differently.
>>
>>> But the issue is if more latency tolerant workloads set to less
>>> search, we still need some mechanism to achieve good spread of
>>> threads.
>> I don't get this example: why more latency tolerant workloads should
>> require less search?
> I guess I got the definition of "latency tolerant" backwards.
>>
>>> Can we keep the sliding window mechanism in that case?
>> Which one? Sorry did not went through the patches, can you briefly
>> resume the idea?
> If a workload has set it to low latency tolerant, then the search will be
> less. That can lead to localization of threads on a few CPUs as we are not
> searching the entire LLC even if there are idle CPUs available. For this
> I had introduced a per-CPU variable (for the target CPU) to track the
> boundary of search so that every time it will start from the boundary, thus
> sliding the window. So even if we are searching very little the search
> window keeps shifting and gives us a good spread. This is orthogonal to the
> latency-nice thing.

Could it be done something like turning off the search for an idle core if the wakee
task is latency tolerant (more latency-nice)? We search for an idle core to get faster
resource allocation, so such tasks don't need to find an idle core and can
jump directly to finding idle CPUs.
This can include the sliding window mechanism as well, but as I commented previously, it
introduces a task ping-pong problem as the sliding window moves away from target_cpu. So maybe
we can first search the core containing target_cpu and, if no idle CPUs are found, fall back
to the sliding window mechanism.
Just a thought.

Best,
Parth


>>
>>> Also will latency nice do anything for select_idle_core and
>>> select_idle_smt?
>> I guess principle the same bias can be used at different levels, maybe
>> with different mappings.
> Doing it for select_idle_core will have the issue that the dynamic flag
> (whether an idle core is present or not) can only be updated by threads
> which are doing the full search.
> 
> Thanks,
> Subhra
> 
>> In the mobile world use-case we will likely use it only to switch from
>> select_idle_sibling to the energy aware slow path. And perhaps to see
>> if we can bias the wakeup preemption granularity.
>>
>> Best,
>> Patrick
>>
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-07-01 20:37         ` Subhra Mazumdar
@ 2019-07-04 12:34           ` Parth Shah
  2019-07-14  1:16             ` Subhra Mazumdar
  0 siblings, 1 reply; 33+ messages in thread
From: Parth Shah @ 2019-07-04 12:34 UTC (permalink / raw)
  To: Subhra Mazumdar, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman



On 7/2/19 2:07 AM, Subhra Mazumdar wrote:
> 
>>>> Also, systems like POWER9 has sd_llc as a pair of core only. So it
>>>> won't benefit from the limits and hence also hiding your code in select_idle_cpu
>>>> behind static keys will be much preferred.
>>> If it doesn't hurt then I don't see the point.
>>>
>> So these is the result from POWER9 system with your patches:
>> System configuration: 2 Socket, 44 cores, 176 CPUs
>>
>> Experiment setup:
>> ===========
>> => Setup 1:
>> - 44 tasks doing just while(1), this is to make select_idle_core return -1 most times
>> - perf bench sched messaging -g 1 -l 1000000
>> +-----------+--------+--------------+--------+
>> | Baseline  | stddev |    Patch     | stddev |
>> +-----------+--------+--------------+--------+
>> |       135 |   3.21 | 158(-17.03%) |   4.69 |
>> +-----------+--------+--------------+--------+
>>
>> => Setup 2:
>> - schbench -m44 -t 1
>> +=======+==========+=========+=========+==========+
>> | %ile  | Baseline | stddev  |  patch  |  stddev  |
>> +=======+==========+=========+=========+==========+
>> |    50 |       10 |    3.49 |      10 |     2.29 |
>> +-------+----------+---------+---------+----------+
>> |    95 |      467 |    4.47 |     469 |     0.81 |
>> +-------+----------+---------+---------+----------+
>> |    99 |      571 |   21.32 |     584 |    18.69 |
>> +-------+----------+---------+---------+----------+
>> |  99.5 |      629 |   30.05 |     641 |    20.95 |
>> +-------+----------+---------+---------+----------+
>> |  99.9 |      780 |   40.38 |     773 |     44.2 |
>> +-------+----------+---------+---------+----------+
>>
>> I guess it doesn't make much difference in schbench results but hackbench (perf bench)
>> seems to have an observable regression.
>>
>>
>> Best,
>> Parth
>>
> If POWER9 sd_llc has only 2 cores, the behavior shouldn't change much with
> the select_idle_cpu changes as the limits are 1 and 2 core. Previously the
> lower bound was 4 cpus and upper bound calculated by the prop. Now it is
> 1 core (4 cpus on SMT4) and upper bound 2 cores. Could it be the extra
> computation of cpumask_weight causing the regression rather than the
> sliding window itself (one way to check this would be hardcode 4 in place
> of topology_sibling_weight)? Or is it the L1 cache coherency? I am a bit
> suprised because SPARC SMT8 which has more cores in sd_llc and L1 cache per
> core showed improvement with Hackbench.
> 

The same experiment with hackbench, analysed with perf, shows an increase in the L1
cache miss rate with these patches
(Lower is better)
                          Baseline(%)   Patch(%)   
 ----------------------- ------------- ----------- 
  Total Cache miss rate         17.01   19(-11%)   
  L1 icache miss rate            5.45   6.7(-22%) 



So is it possible for the idle cpu search to try checking target_cpu's core first and
then go to the sliding window if nothing idle is found?
The diff below works as expected on an IBM POWER9 system and resolves the problem
of far wakeups to a large extent.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff2e9b5c3ac5..fae035ce1162 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
        u64 time, cost;
        s64 delta;
        int cpu, limit, floor, target_tmp, nr = INT_MAX;
+       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
 
        this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
        if (!this_sd)
@@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
        time = local_clock();
 
-       for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+       cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
+       for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
+               __cpumask_clear_cpu(cpu, cpus);
+               if (available_idle_cpu(cpu))
+                       goto idle_cpu_exit;
+       }
+
+       for_each_cpu_wrap(cpu, cpus, target_tmp) {
                per_cpu(next_cpu, target) = cpu;
                if (!--nr)
                        return -1;
-               if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-                       continue;
                if (available_idle_cpu(cpu))
                        break;
        }
 
+idle_cpu_exit:
        time = local_clock() - time;
        cost = this_sd->avg_scan_cost;
        delta = (s64)(time - cost) / 8;



Best,
Parth


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path
  2019-07-01 14:04     ` Peter Zijlstra
@ 2019-07-08 22:32       ` Tim Chen
  0 siblings, 0 replies; 33+ messages in thread
From: Tim Chen @ 2019-07-08 22:32 UTC (permalink / raw)
  To: Peter Zijlstra, Patrick Bellasi
  Cc: subhra mazumdar, linux-kernel, mingo, tglx, steven.sistare,
	dhaval.giani, daniel.lezcano, vincent.guittot, viresh.kumar,
	mgorman, Paul Turner, riel, morten.rasmussen, Aubrey Li

On 7/1/19 7:04 AM, Peter Zijlstra wrote:
> On Mon, Jul 01, 2019 at 02:55:52PM +0100, Patrick Bellasi wrote:
>> On 01-Jul 11:02, Peter Zijlstra wrote:
> 
>>> Some of the things we could tie to this would be:
>>>
>>>   - select_idle_siblings; -nice would scan more than +nice,
>>
>> Just to be sure, you are not proposing to use the nice value we
>> already have, i.e.
>>   p->{static,normal}_prio
>> but instead a new similar concept, right?
> 
> Correct; a new sched_attr::sched_latency_nice value, which is like
> sched_nice, but controls a different dimmension of behaviour.
> 

I think the sched_latency_nice value could also be useful for AVX512 on x86.
Running an AVX512 task can drop the frequency of the cpu, including the sibling
hardware thread.  So scheduling tasks that don't mind latency on the
sibling while an AVX512 task runs would make sense.

Tim

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v3 5/7] sched: SIS_CORE to disable idle core search
  2019-07-04 12:34           ` Parth Shah
@ 2019-07-14  1:16             ` Subhra Mazumdar
  0 siblings, 0 replies; 33+ messages in thread
From: Subhra Mazumdar @ 2019-07-14  1:16 UTC (permalink / raw)
  To: Parth Shah, linux-kernel
  Cc: peterz, mingo, tglx, steven.sistare, dhaval.giani,
	daniel.lezcano, vincent.guittot, viresh.kumar, tim.c.chen,
	mgorman


On 7/4/19 6:04 PM, Parth Shah wrote:
> Same experiment with hackbench and with perf analysis shows increase in L1
> cache miss rate with these patches
> (Lower is better)
>                            Baseline(%)   Patch(%)
>   ----------------------- ------------- -----------
>    Total Cache miss rate         17.01   19(-11%)
>    L1 icache miss rate            5.45   6.7(-22%)
>
>
>
> So is it possible for the idle cpu search to try checking target_cpu's core first and
> then go to the sliding window if nothing idle is found?
> The diff below works as expected on an IBM POWER9 system and resolves the problem
> of far wakeups to a large extent.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff2e9b5c3ac5..fae035ce1162 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6161,6 +6161,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>          u64 time, cost;
>          s64 delta;
>          int cpu, limit, floor, target_tmp, nr = INT_MAX;
> +       struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>   
>          this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>          if (!this_sd)
> @@ -6198,16 +6199,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>   
>          time = local_clock();
>   
> -       for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
> +       cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
> +       for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
> +               __cpumask_clear_cpu(cpu, cpus);
> +               if (available_idle_cpu(cpu))
> +                       goto idle_cpu_exit;
> +       }
> +
> +       for_each_cpu_wrap(cpu, cpus, target_tmp) {
>                  per_cpu(next_cpu, target) = cpu;
>                  if (!--nr)
>                          return -1;
> -               if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> -                       continue;
>                  if (available_idle_cpu(cpu))
>                          break;
>          }
>   
> +idle_cpu_exit:
>          time = local_clock() - time;
>          cost = this_sd->avg_scan_cost;
>          delta = (s64)(time - cost) / 8;
>
>
>
> Best,
> Parth
How about calling select_idle_smt before select_idle_cpu from
select_idle_sibling? That should have the same effect.
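
A user-space sketch of that ordering; the helper names only echo the kernel
ones, the topology and idle map are invented, and this is not the fair.c
implementation:

#include <stdio.h>

#define NR_CPUS   8
#define SMT_WIDTH 4     /* e.g. an SMT4 core */

static int idle[NR_CPUS] = { 0, 0, 1, 0, 0, 1, 0, 0 };

/* look only at the target's own SMT siblings */
static int idle_smt_sketch(int target)
{
        int first = target - target % SMT_WIDTH;

        for (int cpu = first; cpu < first + SMT_WIDTH; cpu++)
                if (idle[cpu])
                        return cpu;
        return -1;
}

/* wider scan over the whole LLC span */
static int idle_cpu_sketch(int target)
{
        for (int i = 0; i < NR_CPUS; i++) {
                int cpu = (target + i) % NR_CPUS;

                if (idle[cpu])
                        return cpu;
        }
        return -1;
}

int main(void)
{
        int target = 1;
        int cpu = idle_smt_sketch(target);      /* own core first */

        if (cpu < 0)
                cpu = idle_cpu_sketch(target);  /* then the LLC */
        printf("wake on cpu %d\n", cpu);
        return 0;
}
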

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2019-07-14  1:18 UTC | newest]

Thread overview: 33+ messages
2019-06-27  1:29 [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path subhra mazumdar
2019-06-27  1:29 ` [PATCH v3 1/7] sched: limit cpu search in select_idle_cpu subhra mazumdar
2019-06-28 18:47   ` Parth Shah
2019-06-28 22:21     ` Subhra Mazumdar
2019-06-27  1:29 ` [PATCH v3 2/7] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
2019-06-27  1:29 ` [PATCH v3 3/7] sched: rotate the cpu search window for better spread subhra mazumdar
2019-06-28 11:54   ` Srikar Dronamraju
2019-06-28 22:34     ` Subhra Mazumdar
2019-06-28 18:36   ` Parth Shah
2019-06-28 22:14     ` Subhra Mazumdar
2019-06-27  1:29 ` [PATCH v3 4/7] sched: add sched feature to disable idle core search subhra mazumdar
2019-06-27  1:29 ` [PATCH v3 5/7] sched: SIS_CORE " subhra mazumdar
2019-06-28 19:01   ` Parth Shah
2019-06-28 22:29     ` Subhra Mazumdar
2019-07-01  9:57       ` Parth Shah
2019-07-01 20:37         ` Subhra Mazumdar
2019-07-04 12:34           ` Parth Shah
2019-07-14  1:16             ` Subhra Mazumdar
2019-06-27  1:29 ` [PATCH v3 6/7] x86/smpboot: introduce per-cpu variable for HT siblings subhra mazumdar
2019-06-27  6:51   ` Thomas Gleixner
2019-06-27  6:54     ` Thomas Gleixner
2019-06-28  1:06       ` Subhra Mazumdar
2019-06-28  1:02     ` Subhra Mazumdar
2019-06-27  1:29 ` [PATCH v3 7/7] sched: use per-cpu variable cpumask_weight_sibling subhra mazumdar
2019-07-01  9:02 ` [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path Peter Zijlstra
2019-07-01 13:55   ` Patrick Bellasi
2019-07-01 14:04     ` Peter Zijlstra
2019-07-08 22:32       ` Tim Chen
2019-07-01 14:06     ` Peter Zijlstra
2019-07-02  0:01     ` Subhra Mazumdar
2019-07-02  8:54       ` Patrick Bellasi
2019-07-03  3:52         ` Subhra Mazumdar
2019-07-04 11:35           ` Parth Shah
