* [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path
@ 2018-04-24  0:41 subhra mazumdar
  2018-04-24  0:41 ` [PATCH 1/3] sched: remove select_idle_core() for scalability subhra mazumdar
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: subhra mazumdar @ 2018-04-24  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, daniel.lezcano, steven.sistare, dhaval.giani,
	rohit.k.jain, subhra.mazumdar

The current select_idle_sibling() first tries to find a fully idle core using
select_idle_core(), which can potentially search all cores. If that fails, it
looks for any idle cpu using select_idle_cpu(), which in turn can potentially
search all cpus in the LLC domain. This doesn't scale for large LLC domains
and will only get worse as core counts grow.

This patch series solves the scalability problem by:
-Removing select_idle_core(), as it can potentially scan the full LLC domain
 even if there is only one idle core, which doesn't scale
-Lowering the lower limit of the nr variable in select_idle_cpu() and also
 setting an upper limit, to restrict the search time

It also introduces a new per-cpu variable, next_cpu, to track the limit of
the search so that each search starts from where the previous one ended.
This rotating search window over the cpus in the LLC domain ensures that
idle cpus are eventually found even under high load.
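As a rough user-space sketch of this rotating window (simplified and not the
actual kernel code: the LLC cpumask becomes a plain array, and NR_CPUS,
cpu_is_idle and select_idle_cpu_sketch() are stand-ins made up for the
example), the search behaves like this:

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical stand-ins for the kernel's per-cpu state and idle test. */
static int next_cpu[NR_CPUS];   /* -1 until the first search from that target */
static int cpu_is_idle[NR_CPUS];

static void init_next_cpu(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		next_cpu[i] = -1;
}

/*
 * Simplified model of the patched select_idle_cpu(): scan at most 'nr'
 * cpus, starting where the previous search from 'target' left off, and
 * remember the last cpu visited so that the window rotates.
 */
static int select_idle_cpu_sketch(int target, int nr)
{
	int start = (next_cpu[target] != -1) ? next_cpu[target] : target;

	for (int i = 0; i < NR_CPUS; i++) {
		int cpu = (start + i) % NR_CPUS;

		next_cpu[target] = cpu;		/* window advances even on failure */
		if (!--nr)
			return -1;
		if (cpu_is_idle[cpu])
			return cpu;
	}
	return -1;
}
```

Successive failed searches advance next_cpu, so later searches probe cpus
that earlier ones never reached.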

Following are the performance numbers with various benchmarks.

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch           %stdev
1       0.5742    21.13   0.5334 (7.10%)  5.2
2       0.5776    7.87    0.5393 (6.63%)  6.39
4       0.9578    1.12    0.9537 (0.43%)  1.08
8       1.7018    1.35    1.682 (1.16%)   1.33
16      2.9955    1.36    2.9849 (0.35%)  0.96
32      5.4354    0.59    5.3308 (1.92%)  0.60

Sysbench MySQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline    	  patch
2       49.53             49.83 (0.61%)
4       89.07		  90 (1.05%)
8	149		  154 (3.31%) 
16	240		  246 (2.56%)
32	357		  351 (-1.69%)
64	428		  428 (-0.03%)
128	473		  469 (-0.92%)

Sysbench PostgreSQL on 1 socket, 6 core and 12 threads Intel x86 machine
(higher is better):
threads baseline	  patch
2	68.35 		  70.07 (2.51%)
4	93.53		  92.54 (-1.05%)
8	125		  127 (1.16%)
16	145		  146 (0.92%)
32	158		  156 (-1.24%)
64	160		  160 (0.47%)

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline	%stdev  patch   	 %stdev
20	1		1.35	1.0075 (0.75%)	 0.71
40	1		0.42	0.9971 (-0.29%)	 0.26
60	1		1.54	0.9955 (-0.45%)	 0.83
80	1		0.58	1.0059 (0.59%)	 0.59
100	1		0.77	1.0201 (2.01%)	 0.39
120	1		0.35	1.0145 (1.45%)	 1.41
140	1		0.19	1.0325 (3.25%)	 0.77
160	1		0.09	1.0277 (2.77%)	 0.57
180	1		0.99	1.0249 (2.49%)	 0.79
200	1		1.03	1.0133 (1.33%)	 0.77
220	1		1.69	1.0317 (3.17%)	 1.41

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline	%stdev  patch   	 %stdev
8	49.47		0.35	50.96 (3.02%)	 0.12
16	95.28		0.77	99.01 (3.92%)	 0.14
32	156.77		1.17	180.64 (15.23%)	 1.05
48	193.24		0.22	214.7 (11.1%)	 1
64	216.21		9.33	252.81 (16.93%)	 1.68
128	379.62		10.29	397.47 (4.7%)	 0.41

Dbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
clients baseline	patch
1	627.62		629.14 (0.24%)
2	1153.45		1179.9 (2.29%)
4	2060.29		2051.62 (-0.42%)
8	2724.41		2609.4 (-4.22%)
16	2987.56		2891.54 (-3.21%)
32	2375.82		2345.29 (-1.29%)
64	1963.31		1903.61 (-3.04%)
128	1546.01		1513.17 (-2.12%)

Tbench on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):                                  
clients baseline	patch
1	279.33		285.154 (2.08%)
2	545.961		572.538 (4.87%)
4	1081.06		1126.51 (4.2%)
8	2158.47		2234.78 (3.53%)
16	4223.78		4358.11 (3.18%)
32	7117.08		8022.19 (12.72%)
64	8947.28		10719.7 (19.81%)
128	15976.7		17531.2 (9.73%)

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 256 (higher is better):
clients  baseline 	 %stdev  patch		%stdev
1	 2699		 4.86	 2697 (-0.1%)	3.74
10	 18832		 0 	 18830 (0%)	0.01
100	 18830           0.05    18827 (0%)     0.08

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1K (higher is better):
clients	 baseline	 %stdev  patch 	        %stdev
1	 9414		 0.02	 9414 (0%)	0.01
10	 18832		 0	 18832 (0%)	0
100	 18830		 0.05	 18829 (0%)	0.04

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 4K (higher is better):
clients  baseline	 %stdev  patch    	 %stdev
1	 9414		 0.01	 9414 (0%)	 0
10	 18832		 0	 18832 (0%)	 0
100	 18829		 0.04	 18833 (0%)	 0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 64K (higher is better):
clients  baseline	 %stdev  patch  	 %stdev
1	 9415		 0.01	 9415 (0%)	 0
10	 18832		 0	 18832 (0%)	 0
100	 18830		 0.04	 18833 (0%)	 0

Iperf on 2 socket, 24 core and 48 threads Intel x86 machine with message
size = 1M (higher is better):
clients  baseline  	 %stdev  patch 		 %stdev
1	 9415		 0.01	 9415 (0%)	 0.01
10	 18832		 0	 18832 (0%)	 0
100	 18830		 0.04	 18819 (-0.1%)	 0.13

JBB on 2 socket, 28 core and 56 threads Intel x86 machine
(higher is better):
		baseline	%stdev	 patch		%stdev
jops		60049		0.65	 60191 (0.2%)	0.99
critical jops	29689		0.76	 29044 (-2.2%)	1.46

Schbench on 2 socket, 24 core and 48 threads Intel x86 machine with 24
tasks (lower is better):
percentile	baseline        %stdev   patch          %stdev
50		5007		0.16	 5003 (0.1%)	0.12
75		10000		0	 10000 (0%)	0
90		16992		0	 16998 (0%)	0.12
95		21984		0	 22043 (-0.3%)	0.83
99		34229		1.2	 34069 (0.5%)	0.87
99.5		39147		1.1	 38741 (1%)	1.1
99.9		49568		1.59	 49579 (0%)	1.78

Ebizzy on 2 socket, 44 core and 88 threads Intel x86 machine
(higher is better):
threads		baseline        %stdev   patch          %stdev
1		26477		2.66	 26646 (0.6%)	2.81
2		52303		1.72	 52987 (1.3%)	1.59
4		100854		2.48	 101824 (1%)	2.42
8		188059		6.91	 189149 (0.6%)	1.75
16		328055		3.42	 333963 (1.8%)	2.03
32		504419		2.23	 492650 (-2.3%)	1.76
88		534999		5.35	 569326 (6.4%)	3.07
156		541703		2.42	 544463 (0.5%)	2.17

NAS: The full NAS benchmark suite was run on a 2 socket, 36 core and 72
thread Intel x86 machine with no statistically significant regressions,
while giving improvements in some cases. I am not listing the results as
there are too many data points.

subhra mazumdar (3):
  sched: remove select_idle_core() for scalability
  sched: introduce per-cpu var next_cpu to track search limit
  sched: limit cpu search and rotate search window for scalability

 include/linux/sched/topology.h |   1 -
 kernel/sched/core.c            |   2 +
 kernel/sched/fair.c            | 116 +++++------------------------------------
 kernel/sched/idle.c            |   1 -
 kernel/sched/sched.h           |  11 +---
 5 files changed, 17 insertions(+), 114 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-24  0:41 [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path subhra mazumdar
@ 2018-04-24  0:41 ` subhra mazumdar
  2018-04-24 12:46   ` Peter Zijlstra
  2018-04-24  0:41 ` [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
  2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
  2 siblings, 1 reply; 24+ messages in thread
From: subhra mazumdar @ 2018-04-24  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, daniel.lezcano, steven.sistare, dhaval.giani,
	rohit.k.jain, subhra.mazumdar

select_idle_core() can potentially search all cpus to find a fully idle
core even if there is only one such core. Removing this is necessary to
achieve scalability in the fast path.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 include/linux/sched/topology.h |  1 -
 kernel/sched/fair.c            | 97 ------------------------------------------
 kernel/sched/idle.c            |  1 -
 kernel/sched/sched.h           | 10 -----
 4 files changed, 109 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 2634774..ac7944d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,7 +71,6 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e..d1d4769 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6239,94 +6239,6 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 
-static inline void set_idle_cores(int cpu, int val)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
-
-	return def;
-}
-
-/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
- */
-void __update_idle_core(struct rq *rq)
-{
-	int core = cpu_of(rq);
-	int cpu;
-
-	rcu_read_lock();
-	if (test_idle_cores(core, true))
-		goto unlock;
-
-	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (cpu == core)
-			continue;
-
-		if (!idle_cpu(cpu))
-			goto unlock;
-	}
-
-	set_idle_cores(core, 1);
-unlock:
-	rcu_read_unlock();
-}
-
-/*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
- */
-static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
-
-	if (!test_idle_cores(target, false))
-		return -1;
-
-	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
-
-	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
-
-		for_each_cpu(cpu, cpu_smt_mask(core)) {
-			cpumask_clear_cpu(cpu, cpus);
-			if (!idle_cpu(cpu))
-				idle = false;
-		}
-
-		if (idle)
-			return core;
-	}
-
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
-
-	return -1;
-}
-
 /*
  * Scan the local SMT mask for idle CPUs.
  */
@@ -6349,11 +6261,6 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 
 #else /* CONFIG_SCHED_SMT */
 
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
-
 static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	return -1;
@@ -6451,10 +6358,6 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
 	i = select_idle_cpu(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bd..7ca8e18 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -392,7 +392,6 @@ static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
 
 	return rq->idle;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c2..3f1874c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -899,16 +899,6 @@ static inline int cpu_of(struct rq *rq)
 
 extern struct static_key_false sched_smt_present;
 
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
-	if (static_branch_unlikely(&sched_smt_present))
-		__update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-- 
2.9.3


* [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit
  2018-04-24  0:41 [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path subhra mazumdar
  2018-04-24  0:41 ` [PATCH 1/3] sched: remove select_idle_core() for scalability subhra mazumdar
@ 2018-04-24  0:41 ` subhra mazumdar
  2018-04-24 12:47   ` Peter Zijlstra
  2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
  2 siblings, 1 reply; 24+ messages in thread
From: subhra mazumdar @ 2018-04-24  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, daniel.lezcano, steven.sistare, dhaval.giani,
	rohit.k.jain, subhra.mazumdar

Introduce a per-cpu variable to track the limit up to which the idle cpu
search was done in select_idle_cpu(). This lets the next search start from
there. This is necessary for rotating the search window over the entire
LLC domain.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/core.c  | 2 ++
 kernel/sched/sched.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aae..cd5c08d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -17,6 +17,7 @@
 #include <trace/events/sched.h>
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
 /*
@@ -6018,6 +6019,7 @@ void __init sched_init(void)
 		struct rq *rq;
 
 		rq = cpu_rq(i);
+		per_cpu(next_cpu, i) = -1;
 		raw_spin_lock_init(&rq->lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3f1874c..a2db041 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -902,6 +902,7 @@ extern struct static_key_false sched_smt_present;
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DECLARE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
 
 #define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
 #define this_rq()		this_cpu_ptr(&runqueues)
-- 
2.9.3


* [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24  0:41 [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path subhra mazumdar
  2018-04-24  0:41 ` [PATCH 1/3] sched: remove select_idle_core() for scalability subhra mazumdar
  2018-04-24  0:41 ` [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2018-04-24  0:41 ` subhra mazumdar
  2018-04-24 12:48   ` Peter Zijlstra
                     ` (2 more replies)
  2 siblings, 3 replies; 24+ messages in thread
From: subhra mazumdar @ 2018-04-24  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, daniel.lezcano, steven.sistare, dhaval.giani,
	rohit.k.jain, subhra.mazumdar

Lower the lower limit of the idle cpu search in select_idle_cpu() and also
put an upper limit on it. This helps the scalability of the search by
restricting the search window. Rotating the search window with the help of
next_cpu ensures that any idle cpu is eventually found in case of high load.

Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
---
 kernel/sched/fair.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1d4769..62d585b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6279,7 +6279,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, target_tmp, nr = INT_MAX;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > 2*avg_cost) {
 			nr = div_u64(span_avg, avg_cost);
-		else
-			nr = 4;
+			if (nr > 4)
+				nr = 4;
+		} else {
+			nr = 2;
+		}
 	}
 
+	if (per_cpu(next_cpu, target) != -1)
+		target_tmp = per_cpu(next_cpu, target);
+	else
+		target_tmp = target;
+
 	time = local_clock();
 
-	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
+	for_each_cpu_wrap(cpu, sched_domain_span(sd), target_tmp) {
+		per_cpu(next_cpu, target) = cpu;
 		if (!--nr)
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-- 
2.9.3


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-24  0:41 ` [PATCH 1/3] sched: remove select_idle_core() for scalability subhra mazumdar
@ 2018-04-24 12:46   ` Peter Zijlstra
  2018-04-24 21:45     ` Subhra Mazumdar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-24 12:46 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote:
> select_idle_core() can potentially search all cpus to find the fully idle
> core even if there is one such core. Removing this is necessary to achieve
> scalability in the fast path.

So this removes the whole core awareness from the wakeup path; this
needs far more justification.

In general running on pure cores is much faster than running on threads.
If you plot performance numbers there's almost always a fairly
significant drop in slope at the moment when we run out of cores and
start using threads.

Also, depending on cpu enumeration, your next patch might not even leave
the core when scanning for idle CPUs.

Now, typically on Intel systems, we first enumerate cores and then
siblings, but I've seen Intel systems that don't do this and enumerate
all threads together. Also other architectures are known to iterate full
cores together, both s390 and Power for example do this.

So by only doing a linear scan on CPU number you will actually fill
cores instead of equally spreading across cores. Worse still, by
limiting the scan to _4_ you only barely even get onto a next core for
SMT4 hardware, never mind SMT8.

So while I'm not averse to limiting the empty core search, I do feel it
is important to have. Overloading cores when you don't have to is not
good.
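The enumeration point above can be made concrete with a small user-space
sketch (the example enumerations and the cores_touched() helper are made up
for illustration; they are not kernel code): a linear scan of the first 4
cpu ids touches 4 distinct cores under a cores-first enumeration, but never
leaves the first core when all SMT8 siblings of a core are enumerated
together.

```c
#include <assert.h>

#define SCAN_LIMIT 4	/* the patched lower bound on the scan */

/*
 * Given a cpu-id -> core-id map, count how many distinct physical cores
 * a linear scan of the first SCAN_LIMIT cpu ids actually touches.
 */
static int cores_touched(const int *core_of, int ncpus)
{
	int seen[64] = { 0 };
	int count = 0;

	for (int cpu = 0; cpu < SCAN_LIMIT && cpu < ncpus; cpu++) {
		if (!seen[core_of[cpu]]) {
			seen[core_of[cpu]] = 1;
			count++;
		}
	}
	return count;
}
```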


* Re: [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit
  2018-04-24  0:41 ` [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
@ 2018-04-24 12:47   ` Peter Zijlstra
  2018-04-24 22:39     ` Subhra Mazumdar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-24 12:47 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote:
> @@ -17,6 +17,7 @@
>  #include <trace/events/sched.h>
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
>  
>  #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
>  /*
> @@ -6018,6 +6019,7 @@ void __init sched_init(void)
>  		struct rq *rq;
>  
>  		rq = cpu_rq(i);
> +		per_cpu(next_cpu, i) = -1;

If you leave it uninitialized it'll be 0, and we can avoid that extra
branch in the next patch, no?


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
@ 2018-04-24 12:48   ` Peter Zijlstra
  2018-04-24 22:43     ` Subhra Mazumdar
  2018-04-24 12:48   ` Peter Zijlstra
  2018-04-24 12:53   ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-24 12:48 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> +	if (per_cpu(next_cpu, target) != -1)
> +		target_tmp = per_cpu(next_cpu, target);
> +	else
> +		target_tmp = target;
> +

This one; what's the point here?


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
  2018-04-24 12:48   ` Peter Zijlstra
@ 2018-04-24 12:48   ` Peter Zijlstra
  2018-04-24 22:48     ` Subhra Mazumdar
  2018-04-24 12:53   ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-24 12:48 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
> an upper limit. This helps in scalability of the search by restricting the
> search window. Also rotating the search window with help of next_cpu
> ensures any idle cpu is eventually found in case of high load.

So this patch does 2 (possibly 3) things, that's not good.


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
  2018-04-24 12:48   ` Peter Zijlstra
  2018-04-24 12:48   ` Peter Zijlstra
@ 2018-04-24 12:53   ` Peter Zijlstra
  2018-04-25  0:10     ` Subhra Mazumdar
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-24 12:53 UTC (permalink / raw)
  To: subhra mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
> an upper limit. This helps in scalability of the search by restricting the
> search window. 

> @@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  
>  	if (sched_feat(SIS_PROP)) {
>  		u64 span_avg = sd->span_weight * avg_idle;
> -		if (span_avg > 4*avg_cost)
> +		if (span_avg > 2*avg_cost) {
>  			nr = div_u64(span_avg, avg_cost);
> -		else
> -			nr = 4;
> +			if (nr > 4)
> +				nr = 4;
> +		} else {
> +			nr = 2;
> +		}
>  	}

Why do you need to put a max on? Why isn't the proportional thing
working as is? (is the average no good because of big variance or what)

Again, why do you need to lower the min; what's wrong with 4?

The reason I picked 4 is that many laptops have 4 CPUs and desktops
really want to avoid queueing if at all possible.


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-24 12:46   ` Peter Zijlstra
@ 2018-04-24 21:45     ` Subhra Mazumdar
  2018-04-25 17:49       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-24 21:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/24/2018 05:46 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:14PM -0700, subhra mazumdar wrote:
>> select_idle_core() can potentially search all cpus to find the fully idle
>> core even if there is one such core. Removing this is necessary to achieve
>> scalability in the fast path.
> So this removes the whole core awareness from the wakeup path; this
> needs far more justification.
>
> In general running on pure cores is much faster than running on threads.
> If you plot performance numbers there's almost always a fairly
> significant drop in slope at the moment when we run out of cores and
> start using threads.
The only justification I have is that almost all the benchmarks I ran
improved, most importantly our internal Oracle DB tests, which we care about
a lot. So what you said makes sense in theory but is not borne out by real
world results. This indicates that the threads of these benchmarks care more
about running immediately on any idle cpu than about spending time finding a
fully idle core to run on.
> Also, depending on cpu enumeration, your next patch might not even leave
> the core scanning for idle CPUs.
>
> Now, typically on Intel systems, we first enumerate cores and then
> siblings, but I've seen Intel systems that don't do this and enumerate
> all threads together. Also other architectures are known to iterate full
> cores together, both s390 and Power for example do this.
>
> So by only doing a linear scan on CPU number you will actually fill
> cores instead of equally spreading across cores. Worse still, by
> limiting the scan to _4_ you only barely even get onto a next core for
> SMT4 hardware, never mind SMT8.
Again, this doesn't matter for the benchmarks I ran. Most are happy to make
the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
the scan window is rotated over all cpus, so idle cpus will be found soon.
There is also stealing by idle cpus. Also, this was an RFT, so I request
that it be tested on other architectures like SMT4/SMT8.
>
> So while I'm not adverse to limiting the empty core search; I do feel it
> is important to have. Overloading cores when you don't have to is not
> good.
Can we have a config option or some other way of enabling/disabling
select_idle_core()?


* Re: [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit
  2018-04-24 12:47   ` Peter Zijlstra
@ 2018-04-24 22:39     ` Subhra Mazumdar
  0 siblings, 0 replies; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-24 22:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/24/2018 05:47 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:15PM -0700, subhra mazumdar wrote:
>> @@ -17,6 +17,7 @@
>>   #include <trace/events/sched.h>
>>   
>>   DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>> +DEFINE_PER_CPU_SHARED_ALIGNED(int, next_cpu);
>>   
>>   #if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
>>   /*
>> @@ -6018,6 +6019,7 @@ void __init sched_init(void)
>>   		struct rq *rq;
>>   
>>   		rq = cpu_rq(i);
>> +		per_cpu(next_cpu, i) = -1;
> If you leave it uninitialized it'll be 0, and we can avoid that extra
> branch in the next patch, no?
0 can be a valid cpu id, so I wanted to distinguish the first time. The
branch predictor will be fully trained, so the branch will not have any
cost.


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24 12:48   ` Peter Zijlstra
@ 2018-04-24 22:43     ` Subhra Mazumdar
  0 siblings, 0 replies; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-24 22:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/24/2018 05:48 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> +	if (per_cpu(next_cpu, target) != -1)
>> +		target_tmp = per_cpu(next_cpu, target);
>> +	else
>> +		target_tmp = target;
>> +
> This one; what's the point here?
I want to start the search from target the first time and from next_cpu
from then onwards. If this doesn't make any difference in performance I can
change it; that will require re-running all the tests.


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24 12:48   ` Peter Zijlstra
@ 2018-04-24 22:48     ` Subhra Mazumdar
  0 siblings, 0 replies; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-24 22:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/24/2018 05:48 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
>> an upper limit. This helps in scalability of the search by restricting the
>> search window. Also rotating the search window with help of next_cpu
>> ensures any idle cpu is eventually found in case of high load.
> So this patch does 2 (possibly 3) things, that's not good.
During testing I did try only restricting the search window first. That
alone wasn't enough to give the full benefit; rotating the search window
was essential to get the best results. I will break this up in the next
version.


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-24 12:53   ` Peter Zijlstra
@ 2018-04-25  0:10     ` Subhra Mazumdar
  2018-04-25 15:36       ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-25  0:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 05:41:16PM -0700, subhra mazumdar wrote:
>> Lower the lower limit of idle cpu search in select_idle_cpu() and also put
>> an upper limit. This helps in scalability of the search by restricting the
>> search window.
>> @@ -6297,15 +6297,24 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>>   
>>   	if (sched_feat(SIS_PROP)) {
>>   		u64 span_avg = sd->span_weight * avg_idle;
>> -		if (span_avg > 4*avg_cost)
>> +		if (span_avg > 2*avg_cost) {
>>   			nr = div_u64(span_avg, avg_cost);
>> -		else
>> -			nr = 4;
>> +			if (nr > 4)
>> +				nr = 4;
>> +		} else {
>> +			nr = 2;
>> +		}
>>   	}
> Why do you need to put a max on? Why isn't the proportional thing
> working as is? (is the average no good because of big variance or what)
Firstly, the choice of 512 seems arbitrary. Secondly, the logic here is
that the enqueuing cpu should search for up to the time it would take to
get work itself. Why is that the optimal amount to search?
>
> Again, why do you need to lower the min; what's wrong with 4?
>
> The reason I picked 4 is that many laptops have 4 CPUs and desktops
> really want to avoid queueing if at all possible.
To find the optimal upper and lower limits I varied them over many
combinations. 4 and 2 gave the best results across most benchmarks.


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-25  0:10     ` Subhra Mazumdar
@ 2018-04-25 15:36       ` Peter Zijlstra
  2018-04-25 18:01         ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-25 15:36 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> On 04/24/2018 05:53 AM, Peter Zijlstra wrote:

> > Why do you need to put a max on? Why isn't the proportional thing
> > working as is? (is the average no good because of big variance or what)

> Firstly the choosing of 512 seems arbitrary.

It is; it is a crude attempt to deal with big variance. The comment says
as much.

> Secondly the logic here is that the enqueuing cpu should search up to
> the time it can get work itself.  Why is that the optimal amount to
> search?

1/512-th of the time in fact, per the above random number, but yes.
Because searching for longer than we're expecting to be idle for is
clearly bad, at that point we're inhibiting doing useful work.

But while thinking about all this, I think I've spotted a few more
issues, aside from the variance:

Firstly, while avg_idle estimates the average duration for _when_ we go
idle, it doesn't give a good measure when we do not in fact go idle. So
when we get wakeups while fully busy, avg_idle is a poor measure.

Secondly, the number of wakeups performed is also important. If we have
a lot of wakeups, we need to look at aggregate wakeup time over a
period. Not just single wakeup time.

And thirdly, we're sharing the idle duration with newidle balance.

And I think the 512 is a result of me not having recognised these
additional issues when looking at the traces, I saw variance and left it
there.


This leaves me thinking we need a better estimator for wakeups. Because
if there really is significant idle time, not looking for idle CPUs to
run on is bad. Placing that upper limit, especially such a low one, is
just an indication of failure.


I'll see if I can come up with something.


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-24 21:45     ` Subhra Mazumdar
@ 2018-04-25 17:49       ` Peter Zijlstra
  2018-04-30 23:38         ` Subhra Mazumdar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-25 17:49 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:
> So what you said makes sense in theory but is not borne out by real
> world results. This indicates that threads of these benchmarks care more
> about running immediately on any idle cpu rather than spending time to find
> fully idle core to run on.

But you only ran on Intel which enumerates siblings far apart in the
cpuid space. Which is not something we should rely on.

> > So by only doing a linear scan on CPU number you will actually fill
> > cores instead of equally spreading across cores. Worse still, by
> > limiting the scan to _4_ you only barely even get onto a next core for
> > SMT4 hardware, never mind SMT8.

> Again this doesn't matter for the benchmarks I ran. Most are happy to make
> the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
> the scan window is rotated over all cpus, so idle cpus will be found soon.

You've not been reading well. The Intel machine you tested this on most
likely doesn't suffer that problem because of the way it happens to
iterate SMT threads.

How does Sparc iterate its SMT siblings in cpuid space?

Also, your benchmarks chose an unfortunate nr of threads vs topology.
The 2^n thing chosen never hits the 100% core case (6,22 resp.).

> > So while I'm not averse to limiting the empty core search; I do feel it
> > is important to have. Overloading cores when you don't have to is not
> > good.

> Can we have a config or a way for enabling/disabling select_idle_core?

I like Rohit's suggestion of folding select_idle_core and
select_idle_cpu much better, then it stays SMT aware.

Something like the completely untested patch below.

---
 include/linux/sched/topology.h |   1 -
 kernel/sched/fair.c            | 148 +++++++++++++----------------------------
 kernel/sched/idle.c            |   1 -
 kernel/sched/sched.h           |  10 ---
 4 files changed, 47 insertions(+), 113 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..ac7944dd8bc6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,7 +71,6 @@ struct sched_group;
 struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..95fed8dcea7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6239,124 +6239,72 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 
-static inline void set_idle_cores(int cpu, int val)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		WRITE_ONCE(sds->has_idle_cores, val);
-}
-
-static inline bool test_idle_cores(int cpu, bool def)
-{
-	struct sched_domain_shared *sds;
-
-	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-	if (sds)
-		return READ_ONCE(sds->has_idle_cores);
-
-	return def;
-}
-
-/*
- * Scans the local SMT mask to see if the entire core is idle, and records this
- * information in sd_llc_shared->has_idle_cores.
- *
- * Since SMT siblings share all cache levels, inspecting this limited remote
- * state should be fairly cheap.
- */
-void __update_idle_core(struct rq *rq)
-{
-	int core = cpu_of(rq);
-	int cpu;
-
-	rcu_read_lock();
-	if (test_idle_cores(core, true))
-		goto unlock;
-
-	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (cpu == core)
-			continue;
-
-		if (!idle_cpu(cpu))
-			goto unlock;
-	}
-
-	set_idle_cores(core, 1);
-unlock:
-	rcu_read_unlock();
-}
-
 /*
- * Scan the entire LLC domain for idle cores; this dynamically switches off if
- * there are no idle cores left in the system; tracked through
- * sd_llc->shared->has_idle_cores and enabled through update_idle_core() above.
+ * Scan the LLC domain for idlest cores; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
  */
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+	int best_busy = UINT_MAX, best_cpu = -1;
+	struct sched_domain *this_sd;
+	u64 avg_cost, avg_idle;
+	int nr = INT_MAX;
+	u64 time, cost;
 	int core, cpu;
+	s64 delta;
 
-	if (!static_branch_likely(&sched_smt_present))
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd)
 		return -1;
 
-	if (!test_idle_cores(target, false))
-		return -1;
+	avg_idle = this_rq()->avg_idle / 512;
+	avg_cost = this_sd->avg_scan_cost + 1;
+
+	if (sched_feat(SIS_PROP)) {
+		u64 span_avg = sd->span_weight * avg_idle;
+		if (span_avg > 2*avg_cost)
+			nr = div_u64(span_avg, avg_cost);
+		else
+			nr = 2;
+	}
+
+	time = local_clock();
 
 	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);
 
 	for_each_cpu_wrap(core, cpus, target) {
-		bool idle = true;
+		int first_idle = -1;
+		int busy = 0;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
-				idle = false;
+				busy++;
+			else if (first_idle < 0)
+				first_idle = cpu;
 		}
 
-		if (idle)
+		if (!busy)
 			return core;
-	}
-
-	/*
-	 * Failed to find an idle core; stop looking for one.
-	 */
-	set_idle_cores(target, 0);
-
-	return -1;
-}
 
-/*
- * Scan the local SMT mask for idle CPUs.
- */
-static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	int cpu;
-
-	if (!static_branch_likely(&sched_smt_present))
-		return -1;
+		if (busy < best_busy) {
+			best_busy = busy;
+			best_cpu = cpu;
+		}
 
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
-		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
-			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (!--nr)
+			break;
 	}
 
-	return -1;
-}
-
-#else /* CONFIG_SCHED_SMT */
-
-static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
-}
+	time = local_clock() - time;
+	cost = this_sd->avg_scan_cost;
+	// XXX we should normalize @time on @nr
+	delta = (s64)(time - cost) / 8;
+	this_sd->avg_scan_cost += delta;
 
-static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
-{
-	return -1;
+	return best_cpu;
 }
 
 #endif /* CONFIG_SCHED_SMT */
@@ -6451,15 +6399,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
-	i = select_idle_core(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
-
-	i = select_idle_cpu(p, sd, target);
-	if ((unsigned)i < nr_cpumask_bits)
-		return i;
+#ifdef CONFIG_SCHED_SMT
+	if (static_branch_likely(&sched_smt_present))
+		i = select_idle_core(p, sd, target);
+	else
+#endif
+		i = select_idle_cpu(p, sd, target);
 
-	i = select_idle_smt(p, sd, target);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 1a3e9bddd17b..7ca8e18b0018 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -392,7 +392,6 @@ static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	put_prev_task(rq, prev);
-	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
 
 	return rq->idle;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..3f1874c345b1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -899,16 +899,6 @@ static inline int cpu_of(struct rq *rq)
 
 extern struct static_key_false sched_smt_present;
 
-extern void __update_idle_core(struct rq *rq);
-
-static inline void update_idle_core(struct rq *rq)
-{
-	if (static_branch_unlikely(&sched_smt_present))
-		__update_idle_core(rq);
-}
-
-#else
-static inline void update_idle_core(struct rq *rq) { }
 #endif
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);


* Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
  2018-04-25 15:36       ` Peter Zijlstra
@ 2018-04-25 18:01         ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-04-25 18:01 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Wed, Apr 25, 2018 at 05:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> > On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> 
> > > Why do you need to put a max on? Why isn't the proportional thing
> > > working as is? (is the average no good because of big variance or what)
> 
> > Firstly the choosing of 512 seems arbitrary.
> 
> It is; it is a crude attempt to deal with big variance. The comment says
> as much.
> 
> > Secondly the logic here is that the enqueuing cpu should search up to
> > the time it can get work itself.  Why is that the optimal amount to
> > search?
> 
> 1/512-th of the time in fact, per the above random number, but yes.
> Because searching for longer than we're expecting to be idle for is
> clearly bad, at that point we're inhibiting doing useful work.
> 
> But while thinking about all this, I think I've spotted a few more
> issues, aside from the variance:
> 
> Firstly, while avg_idle estimates the average duration for _when_ we go
> idle, it doesn't give a good measure when we do not in fact go idle. So
> when we get wakeups while fully busy, avg_idle is a poor measure.
> 
> Secondly, the number of wakeups performed is also important. If we have
> a lot of wakeups, we need to look at aggregate wakeup time over a
> period. Not just single wakeup time.
> 
> And thirdly, we're sharing the idle duration with newidle balance.
> 
> And I think the 512 is a result of me not having recognised these
> additional issues when looking at the traces, I saw variance and left it
> there.
> 
> 
> This leaves me thinking we need a better estimator for wakeups. Because
> if there really is significant idle time, not looking for idle CPUs to
> run on is bad. Placing that upper limit, especially such a low one, is
> just an indication of failure.
> 
> 
> I'll see if I can come up with something.

Something like so _could_ work. Again, completely untested. We give idle
time to wake_avg, we subtract select_idle_sibling 'runtime' from
wake_avg, such that when there's lots of wakeups we don't use more time
than there was reported idle time. And we age wake_avg, such that if
there hasn't been idle for a number of ticks (we've been real busy) we
also stop scanning wide.

But it could eat your granny and set your cat on fire :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..bc910e5776cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1671,6 +1671,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle / 2;
+
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -6072,6 +6075,8 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..fee31dbe15ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,7 +6369,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
+	unsigned long now = jiffies;
 	u64 avg_cost, avg_idle;
+	struct rq *this_rq;
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
@@ -6378,11 +6380,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
+	this_rq = this_rq();
+	if (sched_feat(SIS_NEW)) {
+		/* age the remaining idle time */
+		if (unlikely(this_rq->wake_stamp < now)) {
+			while (this_rq->wake_stamp < now && this_rq->wake_avg)
+				this_rq->wake_avg >>= 1;
+		}
+		avg_idle = this_rq->wake_avg;
+	} else {
+		avg_idle = this_rq->avg_idle / 512;
+	}
+
 	avg_cost = this_sd->avg_scan_cost + 1;
 
 	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6412,6 +6421,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;
 
+	/* you can only spend the time once */
+	if (this_rq->wake_avg > time)
+		this_rq->wake_avg -= time;
+	else
+		this_rq->wake_avg = 0;
+
 	return cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..f5f86a59aac4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_NEW, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c4d6ddf907b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
+	unsigned long		wake_stamp;
+	u64			wake_avg;
+
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 #endif


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-25 17:49       ` Peter Zijlstra
@ 2018-04-30 23:38         ` Subhra Mazumdar
  2018-05-01 18:03           ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Subhra Mazumdar @ 2018-04-30 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 04/25/2018 10:49 AM, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 02:45:50PM -0700, Subhra Mazumdar wrote:
>> So what you said makes sense in theory but is not borne out by real
>> world results. This indicates that threads of these benchmarks care more
>> about running immediately on any idle cpu rather than spending time to find
>> fully idle core to run on.
> But you only ran on Intel which enumerates siblings far apart in the
> cpuid space. Which is not something we should rely on.
>
>>> So by only doing a linear scan on CPU number you will actually fill
>>> cores instead of equally spreading across cores. Worse still, by
>>> limiting the scan to _4_ you only barely even get onto a next core for
>>> SMT4 hardware, never mind SMT8.
>> Again this doesn't matter for the benchmarks I ran. Most are happy to make
>> the tradeoff on x86 (SMT2). Limiting the scan is mitigated by the fact that
>> the scan window is rotated over all cpus, so idle cpus will be found soon.
> You've not been reading well. The Intel machine you tested this on most
> likely doesn't suffer that problem because of the way it happens to
> iterate SMT threads.
>
> How does Sparc iterate its SMT siblings in cpuid space?
SPARC enumerates siblings sequentially, although it still needs to be
confirmed through tests whether the non-sequential enumeration on x86 is
the reason for the improvements. I don't have a SPARC test system handy now.
>
> Also, your benchmarks chose an unfortunate nr of threads vs topology.
> The 2^n thing chosen never hits the 100% core case (6,22 resp.).
>
>>> So while I'm not averse to limiting the empty core search; I do feel it
>>> is important to have. Overloading cores when you don't have to is not
>>> good.
>> Can we have a config or a way for enabling/disabling select_idle_core?
> I like Rohit's suggestion of folding select_idle_core and
> select_idle_cpu much better, then it stays SMT aware.
>
> Something like the completely untested patch below.
I tried both the patches you suggested, the first with the merging of
select_idle_core and select_idle_cpu and the second with the new way of
calculating avg_idle, and finally both combined. I ran the following
benchmarks for each. The merge-only patch seems to give similar
improvements as my original patch for Uperf and the Oracle DB tests, but it
regresses for hackbench. If we can fix this I am OK with it. I can do a run
of other benchmarks after that.

I also noticed a possible bug later in the merge code. Shouldn't it be:

if (busy < best_busy) {
         best_busy = busy;
         best_cpu = first_idle;
}

Unfortunately I noticed it after all runs.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.5099 (11.2%) 2.24
2       0.5776         7.87    0.5385 (6.77%) 3.38
4       0.9578         1.12    1.0626 (-10.94%) 1.35
8       1.7018         1.35    1.8615 (-9.38%) 0.73
16      2.9955         1.36    3.2424 (-8.24%) 0.66
32      5.4354         0.59    5.749  (-5.77%) 0.55

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch %stdev
8       49.47           0.35    49.98 (1.03%) 1.36
16      95.28           0.77    97.46 (2.29%) 0.11
32      156.77          1.17    167.03 (6.54%) 1.98
48      193.24          0.22    230.96 (19.52%) 2.44
64      216.21          9.33    299.55 (38.54%) 4
128     379.62          10.29   357.87 (-5.73%) 0.85

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch %stdev
20      1               1.35    0.9919 (-0.81%) 0.14
40      1               0.42    0.9959 (-0.41%) 0.72
60      1               1.54    0.9872 (-1.28%) 1.27
80      1               0.58    0.9925 (-0.75%) 0.5
100     1               0.77    1.0145 (1.45%) 1.29
120     1               0.35    1.0136 (1.36%) 1.15
140     1               0.19    1.0404 (4.04%) 0.91
160     1               0.09    1.0317 (3.17%) 1.41
180     1               0.99    1.0322 (3.22%) 0.51
200     1               1.03    1.0245 (2.45%) 0.95
220     1               1.69    1.0296 (2.96%) 2.83

new avg_idle:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.5241 (8.73%) 8.26
2       0.5776         7.87    0.5436 (5.89%) 8.53
4       0.9578         1.12    0.989 (-3.26%) 1.9
8       1.7018         1.35    1.7568 (-3.23%) 1.22
16      2.9955         1.36    3.1119 (-3.89%) 0.92
32      5.4354         0.59    5.5889 (-2.82%) 0.64

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch %stdev
8       49.47           0.35    48.11 (-2.75%) 0.29
16      95.28           0.77    93.67 (-1.68%) 0.68
32      156.77          1.17    158.28 (0.96%) 0.29
48      193.24          0.22    190.04 (-1.66%) 0.34
64      216.21          9.33    189.45 (-12.38%) 2.05
128     379.62          10.29   326.59 (-13.97%) 13.07

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch %stdev
20      1               1.35    1.0026 (0.26%) 0.25
40      1               0.42    0.9857 (-1.43%) 1.47
60      1               1.54    0.9903 (-0.97%) 0.99
80      1               0.58    0.9968 (-0.32%) 1.19
100     1               0.77    0.9933 (-0.67%) 0.53
120     1               0.35    0.9919 (-0.81%) 0.9
140     1               0.19    0.9915 (-0.85%) 0.36
160     1               0.09    0.9811 (-1.89%) 1.21
180     1               0.99    1.0002 (0.02%) 0.87
200     1               1.03    1.0037 (0.37%) 2.5
220     1               1.69    0.998 (-0.2%) 0.8

merge + new avg_idle:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.6522 (-13.58%) 12.53
2       0.5776         7.87    0.7593 (-31.46%) 2.7
4       0.9578         1.12    1.0952 (-14.35%) 1.08
8       1.7018         1.35    1.8722 (-10.01%) 0.68
16      2.9955         1.36    3.2987 (-10.12%) 0.58
32      5.4354         0.59    5.7751 (-6.25%) 0.46

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch %stdev
8       49.47           0.35    51.29 (3.69%) 0.86
16      95.28           0.77    98.95 (3.85%) 0.41
32      156.77          1.17    165.76 (5.74%) 0.26
48      193.24          0.22    234.25 (21.22%) 0.63
64      216.21          9.33    306.87 (41.93%) 2.11
128     379.62          10.29   355.93 (-6.24%) 8.28

Oracle DB on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch %stdev
20      1               1.35    1.0085 (0.85%) 0.72
40      1               0.42    1.0017 (0.17%) 0.3
60      1               1.54    0.9974 (-0.26%) 1.18
80      1               0.58    1.0115 (1.15%) 0.93
100     1               0.77    0.9959 (-0.41%) 1.21
120     1               0.35    1.0034 (0.34%) 0.72
140     1               0.19    1.0123 (1.23%) 0.93
160     1               0.09    1.0057 (0.57%) 0.65
180     1               0.99    1.0195 (1.95%) 0.99
200     1               1.03    1.0474 (4.74%) 0.55
220     1               1.69    1.0392 (3.92%) 0.36


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-04-30 23:38         ` Subhra Mazumdar
@ 2018-05-01 18:03           ` Peter Zijlstra
  2018-05-02 21:58             ` Subhra Mazumdar
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-05-01 18:03 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain

On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
> I also noticed a possible bug later in the merge code. Shouldn't it be:
> 
> if (busy < best_busy) {
>         best_busy = busy;
>         best_cpu = first_idle;
> }

Uhh, quite. I did say it was completely untested, but yes.. /me dons the
brown paper bag.


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-05-01 18:03           ` Peter Zijlstra
@ 2018-05-02 21:58             ` Subhra Mazumdar
  2018-05-04 18:51               ` Subhra Mazumdar
  2018-05-29 21:36               ` Peter Zijlstra
  0 siblings, 2 replies; 24+ messages in thread
From: Subhra Mazumdar @ 2018-05-02 21:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 05/01/2018 11:03 AM, Peter Zijlstra wrote:
> On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
>> I also noticed a possible bug later in the merge code. Shouldn't it be:
>>
>> if (busy < best_busy) {
>>          best_busy = busy;
>>          best_cpu = first_idle;
>> }
> Uhh, quite. I did say it was completely untested, but yes.. /me dons the
> brown paper bag.
I re-ran the test after fixing that bug but still get similar regressions
for hackbench, while similar improvements on Uperf. I didn't re-run the
Oracle DB tests but my guess is it will show similar improvement.

merge:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.5131 (10.64%) 4.11
2       0.5776         7.87    0.5387 (6.73%) 2.39
4       0.9578         1.12    1.0549 (-10.14%) 0.85
8       1.7018         1.35    1.8516 (-8.8%) 1.56
16      2.9955         1.36    3.2466 (-8.38%) 0.42
32      5.4354         0.59    5.7738 (-6.23%) 0.38

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch %stdev
8       49.47           0.35    51.1 (3.29%) 0.13
16      95.28           0.77    98.45 (3.33%) 0.61
32      156.77          1.17    170.97 (9.06%) 5.62
48      193.24          0.22    245.89 (27.25%) 7.26
64      216.21          9.33    316.43 (46.35%) 0.37
128     379.62          10.29   337.85 (-11%) 3.68

I tried using the next_cpu technique with the merge but it didn't help. I
am open to suggestions.

merge + next_cpu:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.5107 (11.06%) 6.35
2       0.5776         7.87    0.5917 (-2.44%) 11.16
4       0.9578         1.12    1.0761 (-12.35%) 1.1
8       1.7018         1.35    1.8748 (-10.17%) 0.8
16      2.9955         1.36    3.2419 (-8.23%) 0.43
32      5.4354         0.59    5.6958 (-4.79%) 0.58

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch %stdev
8       49.47           0.35    51.65 (4.41%) 0.26
16      95.28           0.77    99.8 (4.75%) 1.1
32      156.77          1.17    168.37 (7.4%) 0.6
48      193.24          0.22    228.8 (18.4%) 1.75
64      216.21          9.33    287.11 (32.79%) 10.82
128     379.62          10.29   346.22 (-8.8%) 4.7

Finally, there was an earlier suggestion by Peter to transpose the cpu
offset in select_task_rq_fair, which I had also tried earlier but it
regressed on hackbench. Just wanted to mention that so we have closure on
that.

transpose cpu offset in select_task_rq_fair:

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline       %stdev  patch %stdev
1       0.5742         21.13   0.5251 (8.55%) 2.57
2       0.5776         7.87    0.5471 (5.28%) 11
4       0.9578         1.12    1.0148 (-5.95%) 1.97
8       1.7018         1.35    1.798 (-5.65%) 0.97
16      2.9955         1.36    3.088 (-3.09%) 2.7
32      5.4354         0.59    5.2815 (2.8%) 1.26


* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-05-02 21:58             ` Subhra Mazumdar
@ 2018-05-04 18:51               ` Subhra Mazumdar
  2018-05-29 21:36               ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Subhra Mazumdar @ 2018-05-04 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain



On 05/02/2018 02:58 PM, Subhra Mazumdar wrote:
>
>
> On 05/01/2018 11:03 AM, Peter Zijlstra wrote:
>> On Mon, Apr 30, 2018 at 04:38:42PM -0700, Subhra Mazumdar wrote:
>>> I also noticed a possible bug later in the merge code. Shouldn't it be:
>>>
>>> if (busy < best_busy) {
>>>          best_busy = busy;
>>>          best_cpu = first_idle;
>>> }
>> Uhh, quite. I did say it was completely untested, but yes.. /me dons the
>> brown paper bag.
> I re-ran the test after fixing that bug but still get similar regressions
> for hackbench, while similar improvements on Uperf. I didn't re-run the
> Oracle DB tests but my guess is it will show similar improvement.
>
> merge:
>
> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
> (lower is better):
> groups  baseline       %stdev  patch %stdev
> 1       0.5742         21.13   0.5131 (10.64%) 4.11
> 2       0.5776         7.87    0.5387 (6.73%) 2.39
> 4       0.9578         1.12    1.0549 (-10.14%) 0.85
> 8       1.7018         1.35    1.8516 (-8.8%) 1.56
> 16      2.9955         1.36    3.2466 (-8.38%) 0.42
> 32      5.4354         0.59    5.7738 (-6.23%) 0.38
>
> Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
> message size = 8k (higher is better):
> threads baseline        %stdev  patch %stdev
> 8       49.47           0.35    51.1 (3.29%) 0.13
> 16      95.28           0.77    98.45 (3.33%) 0.61
> 32      156.77          1.17    170.97 (9.06%) 5.62
> 48      193.24          0.22    245.89 (27.25%) 7.26
> 64      216.21          9.33    316.43 (46.35%) 0.37
> 128     379.62          10.29   337.85 (-11%) 3.68
>
> I tried using the next_cpu technique with the merge but didn't help. I am
> open to suggestions.
>
> merge + next_cpu:
>
> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
> (lower is better):
> groups  baseline       %stdev  patch %stdev
> 1       0.5742         21.13   0.5107 (11.06%) 6.35
> 2       0.5776         7.87    0.5917 (-2.44%) 11.16
> 4       0.9578         1.12    1.0761 (-12.35%) 1.1
> 8       1.7018         1.35    1.8748 (-10.17%) 0.8
> 16      2.9955         1.36    3.2419 (-8.23%) 0.43
> 32      5.4354         0.59    5.6958 (-4.79%) 0.58
>
> Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
> message size = 8k (higher is better):
> threads baseline        %stdev  patch %stdev
> 8       49.47           0.35    51.65 (4.41%) 0.26
> 16      95.28           0.77    99.8 (4.75%) 1.1
> 32      156.77          1.17    168.37 (7.4%) 0.6
> 48      193.24          0.22    228.8 (18.4%) 1.75
> 64      216.21          9.33    287.11 (32.79%) 10.82
> 128     379.62          10.29   346.22 (-8.8%) 4.7
>
> Finally there was earlier suggestion by Peter in select_task_rq_fair to
> transpose the cpu offset that I had tried earlier but also regressed on
> hackbench. Just wanted to mention that so we have closure on that.
>
> transpose cpu offset in select_task_rq_fair:
>
> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
> (lower is better):
> groups  baseline       %stdev  patch %stdev
> 1       0.5742         21.13   0.5251 (8.55%) 2.57
> 2       0.5776         7.87    0.5471 (5.28%) 11
> 4       0.9578         1.12    1.0148 (-5.95%) 1.97
> 8       1.7018         1.35    1.798 (-5.65%) 0.97
> 16      2.9955         1.36    3.088 (-3.09%) 2.7
> 32      5.4354         0.59    5.2815 (2.8%) 1.26
I tried a few other combinations, including setting nr=2 exactly with the
folding of select_idle_cpu and select_idle_core, but still get regressions
with hackbench. I also tried adding select_idle_smt (just for the sake of
it, since my patch retained it) but still see regressions with hackbench.
In all these tests Uperf and the Oracle DB tests gave similar improvements
as my original patch. This indicates that sequential cpu ids hopping
across cores (as on x86) are important for hackbench. In that case, can
we consciously hop cores on all archs and search a limited nr of cpus? We
can take the difference between the cpu id of the target cpu and that of
the first cpu in its smt core, and apply that offset to the first cpu id
of each smt core to get the cpu we want to check. But we need an O(1) way
of zeroing out all the cpus of an smt core from the parent mask. This
will work with both kinds of enumeration, contiguous or interleaved.
Thoughts?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-05-02 21:58             ` Subhra Mazumdar
  2018-05-04 18:51               ` Subhra Mazumdar
@ 2018-05-29 21:36               ` Peter Zijlstra
  2018-05-30 22:08                 ` Subhra Mazumdar
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2018-05-29 21:36 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain, Mike Galbraith, Matt Fleming

On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:
> I re-ran the test after fixing that bug but still get similar regressions
> for hackbench

> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
> (lower is better):
> groups  baseline       %stdev  patch %stdev
> 1       0.5742         21.13   0.5131 (10.64%) 4.11
> 2       0.5776         7.87    0.5387 (6.73%) 2.39
> 4       0.9578         1.12    1.0549 (-10.14%) 0.85
> 8       1.7018         1.35    1.8516 (-8.8%) 1.56
> 16      2.9955         1.36    3.2466 (-8.38%) 0.42
> 32      5.4354         0.59    5.7738 (-6.23%) 0.38

On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):

bench:

  perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 10000 2>&1 | grep "seconds time elapsed"

config + results:

ORIG (SIS_PROP, shift=9)

1:        0.557325175 seconds time elapsed                                          ( +-  0.83% )
2:        0.620646551 seconds time elapsed                                          ( +-  1.46% )
5:        2.313514786 seconds time elapsed                                          ( +-  2.11% )
10:        3.796233615 seconds time elapsed                                          ( +-  1.57% )
20:        6.319403172 seconds time elapsed                                          ( +-  1.61% )
40:        9.313219134 seconds time elapsed                                          ( +-  1.03% )

PROP+AGE+ONCE shift=0

1:        0.559497993 seconds time elapsed                                          ( +-  0.55% )
2:        0.631549599 seconds time elapsed                                          ( +-  1.73% )
5:        2.195464815 seconds time elapsed                                          ( +-  1.77% )
10:        3.703455811 seconds time elapsed                                          ( +-  1.30% )
20:        6.440869566 seconds time elapsed                                          ( +-  1.23% )
40:        9.537849253 seconds time elapsed                                          ( +-  2.00% )

FOLD+AGE+ONCE+PONIES shift=0

1:        0.558893325 seconds time elapsed                                          ( +-  0.98% )
2:        0.617426276 seconds time elapsed                                          ( +-  1.07% )
5:        2.342727231 seconds time elapsed                                          ( +-  1.34% )
10:        3.850449091 seconds time elapsed                                          ( +-  1.07% )
20:        6.622412262 seconds time elapsed                                          ( +-  0.85% )
40:        9.487138039 seconds time elapsed                                          ( +-  2.88% )

FOLD+AGE+ONCE+PONIES+PONIES2 shift=0

10:       3.695294317 seconds time elapsed                                          ( +-  1.21% )


Which seems to not hurt anymore.. can you confirm?

Also, I didn't run anything other than hackbench on it so far.

(sorry, the code is a right mess, it's what I ended up with after a day
of poking with no cleanups)


---
 include/linux/sched/topology.h |   7 ++
 kernel/sched/core.c            |   8 ++
 kernel/sched/fair.c            | 201 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h        |   9 ++
 kernel/sched/sched.h           |   3 +
 kernel/sched/topology.c        |  13 ++-
 6 files changed, 228 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..1a53a805547e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,8 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	unsigned long core_mask[0];
 };
 
 struct sched_domain {
@@ -162,6 +164,11 @@ static inline struct cpumask *sched_domain_span(struct sched_domain *sd)
 	return to_cpumask(sd->span);
 }
 
+static inline struct cpumask *sched_domain_cores(struct sched_domain *sd)
+{
+	return to_cpumask(sd->shared->core_mask);
+}
+
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 				    struct sched_domain_attr *dattr_new);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8d59b259af4a..2e2ee0df9e4d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1674,6 +1674,12 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle / 2;
+
+		if (sched_feat(SIS_TRACE))
+			trace_printk("delta: %Lu %Lu %Lu\n", delta, rq->avg_idle, rq->wake_avg);
+
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -6051,6 +6057,8 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..8fe1c2404092 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6361,6 +6361,16 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 
 #endif /* CONFIG_SCHED_SMT */
 
+static DEFINE_PER_CPU(int, sis_rotor);
+
+unsigned long sched_sis_shift = 9;
+unsigned long sched_sis_min = 2;
+unsigned long sched_sis_max = INT_MAX;
+
+module_param(sched_sis_shift, ulong, 0644);
+module_param(sched_sis_min, ulong, 0644);
+module_param(sched_sis_max, ulong, 0644);
+
 /*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
@@ -6372,17 +6382,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 avg_cost, avg_idle;
 	u64 time, cost;
 	s64 delta;
-	int cpu, nr = INT_MAX;
+	int cpu, best_cpu = -1, loops = 0, nr = sched_sis_max;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
+	if (sched_feat(SIS_AGE)) {
+		/* age the remaining idle time */
+		unsigned long now = jiffies;
+		struct rq *this_rq = this_rq();
+
+		if (unlikely(this_rq->wake_stamp < now)) {
+			while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+				this_rq->wake_stamp++;
+				this_rq->wake_avg >>= 1;
+			}
+		}
+
+		avg_idle = this_rq->wake_avg;
+	} else {
+		avg_idle = this_rq()->avg_idle;
+	}
+
+	avg_idle >>= sched_sis_shift;
 	avg_cost = this_sd->avg_scan_cost + 1;
 
 	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6390,29 +6413,170 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 
 	if (sched_feat(SIS_PROP)) {
 		u64 span_avg = sd->span_weight * avg_idle;
-		if (span_avg > 4*avg_cost)
+		if (span_avg > 2*sched_sis_min*avg_cost)
 			nr = div_u64(span_avg, avg_cost);
 		else
-			nr = 4;
+			nr = 2*sched_sis_min;
 	}
 
 	time = local_clock();
 
+	if (sched_feat(SIS_ROTOR))
+		target = this_cpu_read(sis_rotor);
+
 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
-		if (!--nr)
-			return -1;
+		if (loops++ >= nr)
+			break;
+		this_cpu_write(sis_rotor, cpu);
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (available_idle_cpu(cpu))
+		if (available_idle_cpu(cpu)) {
+			best_cpu = cpu;
+			break;
+		}
+	}
+
+	time = local_clock() - time;
+	time = div_u64(time, loops);
+	cost = this_sd->avg_scan_cost;
+	delta = (s64)(time - cost) / 8;
+	this_sd->avg_scan_cost += delta;
+
+	if (sched_feat(SIS_TRACE))
+		trace_printk("cpu: wake_avg: %Lu avg_idle: %Lu avg_idle: %Lu avg_cost: %Lu nr: %d loops: %d best_cpu: %d time: %Lu\n",
+			     this_rq()->wake_avg, this_rq()->avg_idle,
+			     avg_idle, avg_cost, nr, loops, best_cpu,
+			     time);
+
+	if (sched_feat(SIS_ONCE)) {
+		struct rq *this_rq = this_rq();
+
+		if (this_rq->wake_avg > time)
+			this_rq->wake_avg -= time;
+		else
+			this_rq->wake_avg = 0;
+	}
+
+	return best_cpu;
+}
+
+
+/*
+ * Scan the LLC domain for idlest cores; this is dynamically regulated by
+ * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
+ * average idle time for this rq (as found in rq->avg_idle).
+ */
+static int select_idle_core_fold(struct task_struct *p, struct sched_domain *sd, int target)
+{
+	int best_busy = UINT_MAX, best_cpu = -1;
+	struct sched_domain *this_sd;
+	u64 avg_cost, avg_idle;
+	int nr = sched_sis_max, loops = 0;
+	u64 time, cost;
+	int core, cpu;
+	s64 delta;
+	bool has_idle_cores = true;
+
+	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+	if (!this_sd)
+		return -1;
+
+	if (sched_feat(SIS_ROTOR))
+		target = this_cpu_read(sis_rotor);
+
+	if (sched_feat(SIS_PONIES))
+		has_idle_cores = test_idle_cores(target, false);
+
+	if (sched_feat(SIS_AGE)) {
+		/* age the remaining idle time */
+		unsigned long now = jiffies;
+		struct rq *this_rq = this_rq();
+
+		if (unlikely(this_rq->wake_stamp < now)) {
+			while (this_rq->wake_stamp < now && this_rq->wake_avg) {
+				this_rq->wake_stamp++;
+				this_rq->wake_avg >>= 1;
+			}
+		}
+
+		avg_idle = this_rq->wake_avg;
+	} else {
+		avg_idle = this_rq()->avg_idle;
+	}
+
+	avg_idle >>= sched_sis_shift;
+	avg_cost = this_sd->avg_scan_cost + 1;
+
+	if (sched_feat(SIS_PROP)) {
+		u64 span_avg = sd->span_weight * avg_idle;
+		if (span_avg > sched_sis_min*avg_cost)
+			nr = div_u64(span_avg, avg_cost);
+		else
+			nr = sched_sis_min;
+	}
+
+	time = local_clock();
+
+	for_each_cpu_wrap(core, sched_domain_cores(sd), target) {
+		int first_idle = -1;
+		int busy = 0;
+
+		if (loops++ >= nr)
 			break;
+
+		this_cpu_write(sis_rotor, core);
+
+		for (cpu = core; cpu < nr_cpumask_bits; cpu = cpumask_next(cpu, cpu_smt_mask(core))) {
+			if (!idle_cpu(cpu))
+				busy++;
+			else if (first_idle < 0 && cpumask_test_cpu(cpu, &p->cpus_allowed)) {
+				if (!has_idle_cores) {
+					best_cpu = cpu;
+					goto out;
+				}
+				first_idle = cpu;
+			}
+		}
+
+		if (first_idle < 0)
+			continue;
+
+		if (!busy) {
+			best_cpu = first_idle;
+			goto out;
+		}
+
+		if (busy < best_busy) {
+			best_busy = busy;
+			best_cpu = first_idle;
+		}
 	}
 
+	set_idle_cores(target, 0);
+
+out:
 	time = local_clock() - time;
+	time = div_u64(time, loops);
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;
 
-	return cpu;
+	if (sched_feat(SIS_TRACE))
+		trace_printk("fold: wake_avg: %Lu avg_idle: %Lu avg_idle: %Lu avg_cost: %Lu nr: %d loops: %d best_cpu: %d time: %Lu\n",
+			     this_rq()->wake_avg, this_rq()->avg_idle,
+			     avg_idle, avg_cost, nr, loops, best_cpu,
+			     time);
+
+	if (sched_feat(SIS_ONCE)) {
+		struct rq *this_rq = this_rq();
+
+		if (this_rq->wake_avg > time)
+			this_rq->wake_avg -= time;
+		else
+			this_rq->wake_avg = 0;
+	}
+
+	return best_cpu;
 }
 
 /*
@@ -6451,7 +6615,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (!sd)
 		return target;
 
+	if (sched_feat(SIS_FOLD)) {
+		if (sched_feat(SIS_PONIES2) && !test_idle_cores(target, false))
+			i = select_idle_cpu(p, sd, target);
+		else
+			i = select_idle_core_fold(p, sd, target);
+		goto out;
+	}
+
 	i = select_idle_core(p, sd, target);
+	if (sched_feat(SIS_TRACE))
+		trace_printk("select_idle_core: %d\n", i);
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
@@ -6460,6 +6634,9 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		return i;
 
 	i = select_idle_smt(p, sd, target);
+	if (sched_feat(SIS_TRACE))
+		trace_printk("select_idle_smt: %d\n", i);
+out:
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..bb572f949e5f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -58,6 +58,15 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
 
+SCHED_FEAT(SIS_FOLD, true)
+SCHED_FEAT(SIS_AGE, true)
+SCHED_FEAT(SIS_ONCE, true)
+SCHED_FEAT(SIS_ROTOR, false)
+SCHED_FEAT(SIS_PONIES, false)
+SCHED_FEAT(SIS_PONIES2, true)
+
+SCHED_FEAT(SIS_TRACE, false)
+
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..240c35a4c6e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
+	unsigned long		wake_stamp;
+	u64			wake_avg;
+
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..a47a6fc9796e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1189,9 +1189,20 @@ sd_init(struct sched_domain_topology_level *tl,
 	 * instance.
 	 */
 	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		int core, smt;
+
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+
+		for_each_cpu(core, sched_domain_span(sd)) {
+			for_each_cpu(smt, cpu_smt_mask(core)) {
+				if (cpumask_test_cpu(smt, sched_domain_span(sd))) {
+					__cpumask_set_cpu(smt, sched_domain_cores(sd));
+					break;
+				}
+			}
+		}
 	}
 
 	sd->private = sdd;
@@ -1537,7 +1548,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-05-29 21:36               ` Peter Zijlstra
@ 2018-05-30 22:08                 ` Subhra Mazumdar
  2018-05-31  9:26                   ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Subhra Mazumdar @ 2018-05-30 22:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain, Mike Galbraith, Matt Fleming



On 05/29/2018 02:36 PM, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 02:58:42PM -0700, Subhra Mazumdar wrote:
>> I re-ran the test after fixing that bug but still get similar regressions
>> for hackbench
>> Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
>> (lower is better):
>> groups  baseline       %stdev  patch %stdev
>> 1       0.5742         21.13   0.5131 (10.64%) 4.11
>> 2       0.5776         7.87    0.5387 (6.73%) 2.39
>> 4       0.9578         1.12    1.0549 (-10.14%) 0.85
>> 8       1.7018         1.35    1.8516 (-8.8%) 1.56
>> 16      2.9955         1.36    3.2466 (-8.38%) 0.42
>> 32      5.4354         0.59    5.7738 (-6.23%) 0.38
> On my IVB-EP (2 socket, 10 core/socket, 2 threads/core):
>
> bench:
>
>    perf stat --null --repeat 10 -- perf bench sched messaging -g $i -t -l 10000 2>&1 | grep "seconds time elapsed"
>
> config + results:
>
> ORIG (SIS_PROP, shift=9)
>
> 1:        0.557325175 seconds time elapsed                                          ( +-  0.83% )
> 2:        0.620646551 seconds time elapsed                                          ( +-  1.46% )
> 5:        2.313514786 seconds time elapsed                                          ( +-  2.11% )
> 10:        3.796233615 seconds time elapsed                                          ( +-  1.57% )
> 20:        6.319403172 seconds time elapsed                                          ( +-  1.61% )
> 40:        9.313219134 seconds time elapsed                                          ( +-  1.03% )
>
> PROP+AGE+ONCE shift=0
>
> 1:        0.559497993 seconds time elapsed                                          ( +-  0.55% )
> 2:        0.631549599 seconds time elapsed                                          ( +-  1.73% )
> 5:        2.195464815 seconds time elapsed                                          ( +-  1.77% )
> 10:        3.703455811 seconds time elapsed                                          ( +-  1.30% )
> 20:        6.440869566 seconds time elapsed                                          ( +-  1.23% )
> 40:        9.537849253 seconds time elapsed                                          ( +-  2.00% )
>
> FOLD+AGE+ONCE+PONIES shift=0
>
> 1:        0.558893325 seconds time elapsed                                          ( +-  0.98% )
> 2:        0.617426276 seconds time elapsed                                          ( +-  1.07% )
> 5:        2.342727231 seconds time elapsed                                          ( +-  1.34% )
> 10:        3.850449091 seconds time elapsed                                          ( +-  1.07% )
> 20:        6.622412262 seconds time elapsed                                          ( +-  0.85% )
> 40:        9.487138039 seconds time elapsed                                          ( +-  2.88% )
>
> FOLD+AGE+ONCE+PONIES+PONIES2 shift=0
>
> 10:       3.695294317 seconds time elapsed                                          ( +-  1.21% )
>
>
> Which seems to not hurt anymore.. can you confirm?
>
> Also, I didn't run anything other than hackbench on it so far.
>
> (sorry, the code is a right mess, it's what I ended up with after a day
> of poking with no cleanups)
>
I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some
regression for hackbench and uperf:

hackbench       BL      stdev%  test    stdev% %gain
1(40 tasks)     0.5816  8.94    0.5607  2.89 3.593535
2(80 tasks)     0.6428  10.64   0.5984  3.38 6.907280
4(160 tasks)    1.0152  1.99    1.0036  2.03 1.142631
8(320 tasks)    1.8128  1.40    1.7931  0.97 1.086716
16(640 tasks)   3.1666  0.80    3.2332  0.48 -2.103207
32(1280 tasks)  5.6084  0.83    5.8489  0.56 -4.288210

Uperf            BL      stdev%  test    stdev% %gain
8 threads       45.36   0.43    45.16   0.49 -0.433536
16 threads      87.81   0.82    88.6    0.38 0.899669
32 threads      151.18  0.01    149.98  0.04 -0.795925
48 threads      190.19  0.21    184.77  0.23 -2.849681
64 threads      190.42  0.35    183.78  0.08 -3.485217
128 threads     323.85  0.27    266.32  0.68 -17.766089

sysbench        BL              stdev%  test     stdev% %gain
8 threads       2095.44         1.82    2102.63  0.29 0.343006
16 threads      4218.44         0.06    4179.82  0.49 -0.915413
32 threads      7531.36         0.48    7744.72  0.13 2.832912
48 threads      10206.42        0.20    10144.65 0.19 -0.605163
64 threads      12053.72        0.09    11784.38 0.32 -2.234547
128 threads     14810.33        0.04    14741.78 0.16 -0.462867

I have a patch which is much smaller but seems to work well so far for
both x86 and SPARC across the benchmarks I have run. It keeps the idle
cpu search window between one and two cores' worth of cpus, and also adds
a new sched feature to enable or disable the idle core search. It can be
on by default, but for workloads (like the Oracle DB on x86) we can turn
it off. I plan to send that after some more testing.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] sched: remove select_idle_core() for scalability
  2018-05-30 22:08                 ` Subhra Mazumdar
@ 2018-05-31  9:26                   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2018-05-31  9:26 UTC (permalink / raw)
  To: Subhra Mazumdar
  Cc: linux-kernel, mingo, daniel.lezcano, steven.sistare,
	dhaval.giani, rohit.k.jain, Mike Galbraith, Matt Fleming

On Wed, May 30, 2018 at 03:08:21PM -0700, Subhra Mazumdar wrote:
> I tested with FOLD+AGE+ONCE+PONIES+PONIES2 shift=0 vs baseline but see some
> regression for hackbench and uperf:

I'm not seeing a hackbench regression myself, but I let it run a whole
lot of stuff overnight and I do indeed see some pain points, including
sysbench-mysql and some schbench results.

> I have a patch which is much smaller but seems to work well so far for both
> x86 and SPARC across benchmarks I have run so far. It keeps the idle cpu
> search between 1 core and 2 core amount of cpus and also puts a new
> sched feature of doing idle core search or not. It can be on by default but
> for workloads (like Oracle DB on x86) we can turn it off. I plan to send
> that after some more testing.

Sure thing..

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-05-31  9:26 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-24  0:41 [RFC/RFT PATCH 0/3] Improve scheduler scalability for fast path subhra mazumdar
2018-04-24  0:41 ` [PATCH 1/3] sched: remove select_idle_core() for scalability subhra mazumdar
2018-04-24 12:46   ` Peter Zijlstra
2018-04-24 21:45     ` Subhra Mazumdar
2018-04-25 17:49       ` Peter Zijlstra
2018-04-30 23:38         ` Subhra Mazumdar
2018-05-01 18:03           ` Peter Zijlstra
2018-05-02 21:58             ` Subhra Mazumdar
2018-05-04 18:51               ` Subhra Mazumdar
2018-05-29 21:36               ` Peter Zijlstra
2018-05-30 22:08                 ` Subhra Mazumdar
2018-05-31  9:26                   ` Peter Zijlstra
2018-04-24  0:41 ` [PATCH 2/3] sched: introduce per-cpu var next_cpu to track search limit subhra mazumdar
2018-04-24 12:47   ` Peter Zijlstra
2018-04-24 22:39     ` Subhra Mazumdar
2018-04-24  0:41 ` [PATCH 3/3] sched: limit cpu search and rotate search window for scalability subhra mazumdar
2018-04-24 12:48   ` Peter Zijlstra
2018-04-24 22:43     ` Subhra Mazumdar
2018-04-24 12:48   ` Peter Zijlstra
2018-04-24 22:48     ` Subhra Mazumdar
2018-04-24 12:53   ` Peter Zijlstra
2018-04-25  0:10     ` Subhra Mazumdar
2018-04-25 15:36       ` Peter Zijlstra
2018-04-25 18:01         ` Peter Zijlstra
