* [RFC 0/2] sched: Make idle_balance smarter about topology
@ 2018-02-08 22:19 Rohit Jain
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  0 siblings, 2 replies; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

The current idle_balance() compares the average idle time of a CPU
against a single fixed migration cost. In reality there is a huge
difference in migration cost between CPUs on the same core, on
different cores, and on different sockets. Since sched_domain already
captures these architectural relationships, this patch encapsulates the
migration cost in the topology of the machine.
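
To make the approach concrete, here is a condensed sketch of the
mechanism patch 1 implements (names match the patch; the surrounding
idle_balance() code is elided):

        /*
         * Each sched_domain carries its own migration cost, assigned in
         * sd_init() based on the topology level.  idle_balance() can then
         * bail out per domain instead of using one global constant.
         */
        for_each_domain(this_cpu, sd) {
                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

                /*
                 * Skip this domain if the expected idle time cannot cover
                 * the balancing overhead plus the topology-based penalty.
                 */
                if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
                    sd->sched_migration_cost) {
                        update_next_balance(sd, &next_balance);
                        break;
                }

                /* ... otherwise attempt load_balance() in this domain ... */
        }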

Test Results:

* Wins:

1) hackbench results on a 44-core (22 cores per socket), 2-socket Intel x86
machine (lower is better):

+-------+----+-------+-------------------+--------------------------+
|       |    |       | Without patch     | With patch               |
+-------+----+-------+---------+---------+----------------+---------+
|Loops  |FD  |Groups | Average |%Std Dev |Average         |%Std Dev |
+-------+----+-------+---------+---------+----------------+---------+
|100000 |40  |4      | 9.701   |0.78     |9.623  (+0.81%) |3.67     |
|100000 |40  |8      | 17.186  |0.77     |17.068 (+0.68%) |1.89     |
|100000 |40  |16     | 30.378  |0.55     |30.072 (+1.52%) |0.46     |
|100000 |40  |32     | 54.712  |0.54     |53.588 (+2.28%) |0.21     |
+-------+----+-------+---------+---------+----------------+---------+

Note: I start with 4 groups because the standard deviation for groups 1
and 2 was very high.

2) sysbench MySQL results on a 2-socket, 44-core, 88-thread Intel x86
machine (higher is better):

+-----------+--------------------------------+-------------------------------+
|           | Without Patch                  | With Patch                    |
+-----------+--------+------------+----------+--------------------+----------+
|Approx.    | Num    |  Average   |          |  Average           |          |
|Utilization| Threads| throughput | %Std Dev | throughput         | %Std Dev |
+-----------+--------+------------+----------+--------------------+----------+
|10.00%     |    8   |  133658.2  |  0.66    |  135071.6 (+1.06%) | 1.39     |
|20.00%     |    16  |  266540    |  0.48    |  268417.4 (+0.70%) | 0.88     |
|40.00%     |    32  |  466315.6  |  0.15    |  468289.0 (+0.42%) | 0.45     |
|75.00%     |    64  |  720039.4  |  0.23    |  726244.2 (+0.86%) | 0.03     |
|82.00%     |    72  |  757284.4  |  0.25    |  769020.6 (+1.55%) | 0.18     |
|90.00%     |    80  |  807955.6  |  0.22    |  818989.4 (+1.37%) | 0.22     |
|98.00%     |    88  |  863173.8  |  0.25    |  876121.8 (+1.50%) | 0.28     |
|100.00%    |    96  |  882950.8  |  0.32    |  890678.8 (+0.88%) | 0.51     |
|100.00%    |    128 |  895112.6  |  0.13    |  899149.6 (+0.47%) | 0.44     |
+-----------+--------+------------+----------+--------------------+----------+

* No change:

3) tbench sample results on a 2-socket, 44-core, 88-thread Intel x86
machine:

With Patch:

Throughput 555.834 MB/sec  2 clients    2 procs  max_latency=0.330 ms
Throughput 1388.19 MB/sec  5 clients    5 procs  max_latency=3.666 ms
Throughput 2737.96 MB/sec  10 clients  10 procs  max_latency=1.646 ms
Throughput 5220.17 MB/sec  20 clients  20 procs  max_latency=3.666 ms
Throughput 8324.46 MB/sec  40 clients  40 procs  max_latency=0.732 ms

Without patch:

Throughput 557.142 MB/sec  2 clients    2 procs  max_latency=0.264 ms
Throughput 1381.59 MB/sec  5 clients    5 procs  max_latency=0.335 ms
Throughput 2726.84 MB/sec  10 clients  10 procs  max_latency=0.352 ms
Throughput 5230.12 MB/sec  20 clients  20 procs  max_latency=1.632 ms
Throughput 8474.5 MB/sec  40 clients   40 procs  max_latency=7.756 ms

Note: high variation in max_latency was observed across runs.

Rohit Jain (2):
  sched: reduce migration cost between faster caches for idle_balance
  Introduce sysctl(s) for the migration costs

 include/linux/sched/sysctl.h   |  2 ++
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 10 ++++++----
 kernel/sched/topology.c        |  5 +++++
 kernel/sysctl.c                | 14 ++++++++++++++
 5 files changed, 28 insertions(+), 4 deletions(-)

-- 
2.7.4


* [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-08 22:19 [RFC 0/2] sched: Make idle_balance smarter about topology Rohit Jain
@ 2018-02-08 22:19 ` Rohit Jain
  2018-02-09  3:42   ` Mike Galbraith
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

This patch makes idle_balance more dynamic, as the sched_migration_cost
is now accounted at the sched_domain level. The per-domain values are
set in sd_init, where the topology relationships are known.

For introduction's sake, the cost of migration within the same core is
set to 0, across cores to 50 usec, and across sockets to 500 usec.
sysctls for these variables are introduced in patch 2.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 6 +++---
 kernel/sched/topology.c        | 5 +++++
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..bcb4db2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -104,6 +104,7 @@ struct sched_domain {
 	u64 max_newidle_lb_cost;
 	unsigned long next_decay_max_lb_cost;
 
+	u64 sched_migration_cost;
 	u64 avg_scan_cost;		/* select_idle_sibling */
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fe3aa8..61d3508 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	 */
 	rq_unpin_lock(this_rq, rf);
 
-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	if (!this_rq->rd->overload) {
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
@@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
-		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
+		    sd->sched_migration_cost) {
 			update_next_balance(sd, &next_balance);
 			break;
 		}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..bcd8c64 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
+		sd->sched_migration_cost = 0;
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
+		sd->sched_migration_cost = 500000UL;
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
@@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 2;
 
 		sd->flags |= SD_SERIALIZE;
+		sd->sched_migration_cost = 5000000UL;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
@@ -1174,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
+		sd->sched_migration_cost = 5000000UL;
 	}
 
 	/*
@@ -1622,6 +1626,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
 		}
 
 	}
+
 	set_domain_attribute(sd, attr);
 
 	return sd;
-- 
2.7.4


* [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 [RFC 0/2] sched: Make idle_balance smarter about topology Rohit Jain
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
@ 2018-02-08 22:19 ` Rohit Jain
  2018-02-09  3:54   ` Mike Galbraith
  2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 2 replies; 18+ messages in thread
From: Rohit Jain @ 2018-02-08 22:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani, efault

This patch introduces sysctls for the sched_domain based migration
costs. These can be used for performance tuning of workloads.
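
For illustration, a minimal userspace sketch of tuning one of these
knobs; the proc path follows the kern_table entries added below, and
the value written is an example only:

        /* Example only: set the cross-core migration cost to 50 usec
         * (50000 ns) via the new sched_core_migration_cost_ns sysctl. */
        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/proc/sys/kernel/sched_core_migration_cost_ns", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fprintf(f, "%u\n", 50000U);
                fclose(f);
                return 0;
        }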

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 include/linux/sched/sysctl.h |  2 ++
 kernel/sched/fair.c          |  4 +++-
 kernel/sched/topology.c      |  8 ++++----
 kernel/sysctl.c              | 14 ++++++++++++++
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 1c1a151..d597f6c 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -39,6 +39,8 @@ extern unsigned int sysctl_numa_balancing_scan_size;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern __read_mostly unsigned int sysctl_sched_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_core_migration_cost;
+extern __read_mostly unsigned int sysctl_sched_thread_migration_cost;
 extern __read_mostly unsigned int sysctl_sched_nr_migrate;
 extern __read_mostly unsigned int sysctl_sched_time_avg;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d3508..f395adc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -99,7 +99,9 @@ unsigned int sysctl_sched_child_runs_first __read_mostly;
 unsigned int sysctl_sched_wakeup_granularity		= 1000000UL;
 unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;
 
-const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
+const_debug unsigned int sysctl_sched_migration_cost		= 500000UL;
+const_debug unsigned int sysctl_sched_core_migration_cost	= 500000UL;
+const_debug unsigned int sysctl_sched_thread_migration_cost	= 0UL;
 
 #ifdef CONFIG_SMP
 /*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bcd8c64..fc147db 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,14 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->sched_migration_cost = 0;
+		sd->sched_migration_cost = sysctl_sched_thread_migration_cost;
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
-		sd->sched_migration_cost = 500000UL;
+		sd->sched_migration_cost = sysctl_sched_core_migration_cost;
 
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
@@ -1164,7 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 2;
 
 		sd->flags |= SD_SERIALIZE;
-		sd->sched_migration_cost = 5000000UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
@@ -1177,7 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
-		sd->sched_migration_cost = 5000000UL;
+		sd->sched_migration_cost = sysctl_sched_migration_cost;
 	}
 
 	/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d467..0920795 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -356,6 +356,20 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_core_migration_cost_ns",
+		.data		= &sysctl_sched_core_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_thread_migration_cost_ns",
+		.data		= &sysctl_sched_thread_migration_cost,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "sched_nr_migrate",
 		.data		= &sysctl_sched_nr_migrate,
 		.maxlen		= sizeof(unsigned int),
-- 
2.7.4


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-08 22:19 ` [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance Rohit Jain
@ 2018-02-09  3:42   ` Mike Galbraith
  2018-02-09 16:08     ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09  3:42 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani

On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch makes idle_balance more dynamic, as the sched_migration_cost
> is now accounted at the sched_domain level. The per-domain values are
> set in sd_init, where the topology relationships are known.
> 
> For introduction's sake, the cost of migration within the same core is
> set to 0, across cores to 50 usec, and across sockets to 500 usec.
> sysctls for these variables are introduced in patch 2.
> 
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  include/linux/sched/topology.h | 1 +
>  kernel/sched/fair.c            | 6 +++---
>  kernel/sched/topology.c        | 5 +++++
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index cf257c2..bcb4db2 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -104,6 +104,7 @@ struct sched_domain {
>  	u64 max_newidle_lb_cost;
>  	unsigned long next_decay_max_lb_cost;
>  
> +	u64 sched_migration_cost;
>  	u64 avg_scan_cost;		/* select_idle_sibling */
>  
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2fe3aa8..61d3508 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	 */
>  	rq_unpin_lock(this_rq, rf);
>  
> -	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -	    !this_rq->rd->overload) {
> +	if (!this_rq->rd->overload) {
>  		rcu_read_lock();
>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  		if (sd)

Unexplained/unrelated change.

> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  		if (!(sd->flags & SD_LOAD_BALANCE))
>  			continue;
>  
> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> +		    sd->sched_migration_cost) {
>  			update_next_balance(sd, &next_balance);
>  			break;
>  		}

Ditto.

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 034cbed..bcd8c64 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 110;
>  		sd->smt_gain = 1178; /* ~15% */
> +		sd->sched_migration_cost = 0;
>  
>  	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>  		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 117;
>  		sd->cache_nice_tries = 1;
>  		sd->busy_idx = 2;
> +		sd->sched_migration_cost = 500000UL;
>  
>  #ifdef CONFIG_NUMA
>  	} else if (sd->flags & SD_NUMA) {
> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->idle_idx = 2;
>  
>  		sd->flags |= SD_SERIALIZE;
> +		sd->sched_migration_cost = 5000000UL;

That's not 500us.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
@ 2018-02-09  3:54   ` Mike Galbraith
  2018-02-09 16:10     ` Steven Sistare
  2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09  3:54 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel
  Cc: peterz, mingo, steven.sistare, joelaf, jbacik, riel, juri.lelli,
	dhaval.giani

On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch introduces sysctls for the sched_domain based migration
> costs. These can be used for performance tuning of workloads.

With this patch, we trade 1 completely bogus constant (cost is really
highly variable) for 3, twiddling of which has zero effect unless you
trigger a domain rebuild afterward, which is neither mentioned in the
changelog, nor documented.

bogo-numbers++ is kinda hard to love.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-09  3:42   ` Mike Galbraith
@ 2018-02-09 16:08     ` Steven Sistare
  2018-02-10  6:37       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 16:08 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/8/2018 10:42 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch makes idle_balance more dynamic, as the sched_migration_cost
>> is now accounted at the sched_domain level. The per-domain values are
>> set in sd_init, where the topology relationships are known.
>>
>> For introduction's sake, the cost of migration within the same core is
>> set to 0, across cores to 50 usec, and across sockets to 500 usec.
>> sysctls for these variables are introduced in patch 2.
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>  include/linux/sched/topology.h | 1 +
>>  kernel/sched/fair.c            | 6 +++---
>>  kernel/sched/topology.c        | 5 +++++
>>  3 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index cf257c2..bcb4db2 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -104,6 +104,7 @@ struct sched_domain {
>>  	u64 max_newidle_lb_cost;
>>  	unsigned long next_decay_max_lb_cost;
>>  
>> +	u64 sched_migration_cost;
>>  	u64 avg_scan_cost;		/* select_idle_sibling */
>>  
>>  #ifdef CONFIG_SCHEDSTATS
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2fe3aa8..61d3508 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  	 */
>>  	rq_unpin_lock(this_rq, rf);
>>  
>> -	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> -	    !this_rq->rd->overload) {
>> +	if (!this_rq->rd->overload) {
>>  		rcu_read_lock();
>>  		sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>  		if (sd)
> 
> Unexplained/unrelated change.

This could be explained better in the cover letter, but it is related; this and the
change below are the meat of the patch.  The deleted test "this_rq->avg_idle <
sysctl_sched_migration_cost" formerly bailed based on a single global notion of 
migration cost, independent of sd.  Now the cost is per-sd, evaluated in the sd loop 
below.  The other condition to bail early, "!this_rq->rd->overload" is still relevant 
and remains.
 
>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>  			continue;
>>  
>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>> +		    sd->sched_migration_cost) {
>>  			update_next_balance(sd, &next_balance);
>>  			break;
>>  		}
> 
> Ditto.

The old code did not migrate if the expected costs exceeded the expected idle
time.  The new code just adds the sd-specific penalty (essentially loss of cache
footprint) to the costs.  The for_each_domain loop visits smallest to largest
sd's, hence smallest to largest migration costs (though the tunables do
not enforce an ordering), and bails at the first sd where the total cost is a loss.
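
A worked example, assuming the costs the commit message intends
(0 / 50 usec / 500 usec for SMT / core / socket) and negligible
measured balancing costs; the numbers are illustrative only:

        /*
         * Suppose this_rq->avg_idle = 300 usec, with curr_cost and
         * max_newidle_lb_cost both negligible:
         *
         *   SMT  domain: 300 usec >= 0 usec    -> keep balancing
         *   core domain: 300 usec >= 50 usec   -> keep balancing
         *   NUMA domain: 300 usec <  500 usec  -> update_next_balance(); break
         *
         * So this CPU may pull work from within its core and package,
         * but not from a remote socket.
         */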

>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 034cbed..bcd8c64 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>>  		sd->flags |= SD_PREFER_SIBLING;
>>  		sd->imbalance_pct = 110;
>>  		sd->smt_gain = 1178; /* ~15% */
>> +		sd->sched_migration_cost = 0;
>>  
>>  	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>>  		sd->flags |= SD_PREFER_SIBLING;
>>  		sd->imbalance_pct = 117;
>>  		sd->cache_nice_tries = 1;
>>  		sd->busy_idx = 2;
>> +		sd->sched_migration_cost = 500000UL;
>>  
>>  #ifdef CONFIG_NUMA
>>  	} else if (sd->flags & SD_NUMA) {
>> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>  		sd->idle_idx = 2;
>>  
>>  		sd->flags |= SD_SERIALIZE;
>> +		sd->sched_migration_cost = 5000000UL;
> 
> That's not 500us.

Good catch, thanks.  It's 5000us but should be 500. The latest version of Rohit's patch 
lost a little performance vs the previous version, and this might explain why. 
Re-testing may bring good news.

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09  3:54   ` Mike Galbraith
@ 2018-02-09 16:10     ` Steven Sistare
  2018-02-09 17:08       ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 16:10 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch introduces sysctls for the sched_domain based migration
>> costs. These can be used for performance tuning of workloads.
> 
> With this patch, we trade 1 completely bogus constant (cost is really
> highly variable) for 3, twiddling of which has zero effect unless you
> trigger a domain rebuild afterward, which is neither mentioned in the
> changelog, nor documented.
> 
> bogo-numbers++ is kinda hard to love.

Yup, the domain rebuild is missing.

I am no fan of tunables, the fewer the better, but one of the several flaws
of the single figure for migration cost is that it ignores the very large
difference in cost when migrating between near vs far levels of the cache hierarchy.
Migration between CPUs of the same core should be free, as they share L1 cache.
Rohit defined a tunable for it, but IMO it could be hard coded to 0. Migration 
between CPUs in different sockets is the most expensive and is represented by
the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
the same core cluster, or in the same socket, is somewhere in between, as
they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
sysctl_sched_migration_cost / 10.
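
A minimal sketch of that alternative, reusing the sd_init() branches
from Rohit's patch (a hypothetical variant, not what the RFC
implements):

        /* Hypothetical sd_init() fragment: derive every level's cost
         * from the one existing tunable instead of adding new ones. */
        if (sd->flags & SD_SHARE_CPUCAPACITY) {
                /* SMT siblings share L1: treat the cache-loss cost as zero. */
                sd->sched_migration_cost = 0;
        } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
                /* Same package, shared L2/L3: a fraction of the full cost. */
                sd->sched_migration_cost = sysctl_sched_migration_cost / 10;
        } else {
                /* Cross-socket / NUMA: the existing full cost. */
                sd->sched_migration_cost = sysctl_sched_migration_cost;
        }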

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 16:10     ` Steven Sistare
@ 2018-02-09 17:08       ` Mike Galbraith
  2018-02-09 17:33         ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09 17:08 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
> > On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> >> This patch introduces sysctls for the sched_domain based migration
> >> costs. These can be used for performance tuning of workloads.
> > 
> > With this patch, we trade 1 completely bogus constant (cost is really
> > highly variable) for 3, twiddling of which has zero effect unless you
> > trigger a domain rebuild afterward, which is neither mentioned in the
> > changelog, nor documented.
> > 
> > bogo-numbers++ is kinda hard to love.
> 
> Yup, the domain rebuild is missing.
> 
> I am no fan of tunables, the fewer the better, but one of the several flaws
> of the single figure for migration cost is that it ignores the very large
> difference in cost when migrating between near vs far levels of the cache hierarchy.
> Migration between CPUs of the same core should be free, as they share L1 cache.
> Rohit defined a tunable for it, but IMO it could be hard coded to 0.

That cost is never really 0 in the context of load balancing, as the
load balancing machinery is non-free.  When the idle_balance() throttle
was added, that was done to mitigate the (at that time) quite high cost
to high frequency cross core scheduling ala localhost communication.

>  Migration 
> between CPUs in different sockets is the most expensive and is represented by
> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
> the same core cluster, or in the same socket, is somewhere in between, as
> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
> sysctl_sched_migration_cost / 10.

Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
number was generated via measurement, but the end result was just as
bogus as a number pulled out of the ether.  How much bandwidth you have
when blasting data to/from wherever says nothing about misses you avoid
vs those you generate.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 17:08       ` Mike Galbraith
@ 2018-02-09 17:33         ` Steven Sistare
  2018-02-09 17:50           ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-09 17:33 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:10 -0500, Steven Sistare wrote:
>> On 2/8/2018 10:54 PM, Mike Galbraith wrote:
>>> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>>>> This patch introduces sysctls for the sched_domain based migration
>>>> costs. These can be used for performance tuning of workloads.
>>>
>>> With this patch, we trade 1 completely bogus constant (cost is really
>>> highly variable) for 3, twiddling of which has zero effect unless you
>>> trigger a domain rebuild afterward, which is neither mentioned in the
>>> changelog, nor documented.
>>>
>>> bogo-numbers++ is kinda hard to love.
>>
>> Yup, the domain rebuild is missing.
>>
>> I am no fan of tunables, the fewer the better, but one of the several flaws
>> of the single figure for migration cost is that it ignores the very large
>> difference in cost when migrating between near vs far levels of the cache hierarchy.
>> Migration between CPUs of the same core should be free, as they share L1 cache.
>> Rohit defined a tunable for it, but IMO it could be hard coded to 0.
> 
> That cost is never really 0 in the context of load balancing, as the
> load balancing machinery is non-free.  When the idle_balance() throttle
> was added, that was done to mitigate the (at that time) quite high cost
> to high frequency cross core scheduling ala localhost communication.

I was imprecise.  The cache-loss component of cost as represented by 
sched_migration_cost should be 0 in this case.  The cost of the machinery
is non-zero and remains in the code, and can still prevent migration.
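
Concretely, with sd->sched_migration_cost == 0 the per-domain test from
patch 1 degenerates to the existing overhead check, so the measured
cost of the machinery alone can still stop a newidle balance:

        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
            sd->sched_migration_cost /* == 0 at the SMT level */) {
                update_next_balance(sd, &next_balance);
                break;
        }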

>>  Migration 
>> between CPUs in different sockets is the most expensive and is represented by
>> the existing sysctl_sched_migration_cost tunable.  Migration between CPUs in
>> the same core cluster, or in the same socket, is somewhere in between, as
>> they share L2 or L3 cache.  We could avoid a separate tunable by setting it to
>> sysctl_sched_migration_cost / 10.
> 
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> number was generated via measurement, but the end result was just as
> bogus as a number pulled out of the ether.  How much bandwidth you have
> when blasting data to/from wherever says nothing about misses you avoid
> vs those you generate.

Yes, yes and yes. I cannot make the original tunable less bogus.  Using a smaller
cost for closer caches still makes logical sense and is supported by the data.

- Steve


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-09 17:33         ` Steven Sistare
@ 2018-02-09 17:50           ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-09 17:50 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 12:33 -0500, Steven Sistare wrote:
> On 2/9/2018 12:08 PM, Mike Galbraith wrote:
> 
> > Shrug.  It's bogus no matter what we do.  Once Upon A Time, a cost
> > number was generated via measurement, but the end result was just as
> > bogus as a number pulled out of the ether.  How much bandwidth you have
> > when blasting data to/from wherever says nothing about misses you avoid
> > vs those you generate.
> 
> Yes, yes and yes. I cannot make the original tunable less bogus.  Using a smaller
> cost for closer caches still makes logical sense and is supported by the data.

You forgot to write "microscopic" before "data" :)  I'm mostly agnostic
about this, but don't like yet more knobs that 99.99% won't touch.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-09 16:08     ` Steven Sistare
@ 2018-02-10  6:37       ` Mike Galbraith
  2018-02-15 16:35         ` Steven Sistare
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-10  6:37 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> >> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> >>  		if (!(sd->flags & SD_LOAD_BALANCE))
> >>  			continue;
> >>  
> >> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> >> +		    sd->sched_migration_cost) {
> >>  			update_next_balance(sd, &next_balance);
> >>  			break;
> >>  		}
> > 
> > Ditto.
> 
> The old code did not migrate if the expected costs exceeded the expected idle
> time.  The new code just adds the sd-specific penalty (essentially loss of cache
> footprint) to the costs.  The for_each_domain loop visits smallest to largest
> sd's, hence smallest to largest migration costs (though the tunables do
> not enforce an ordering), and bails at the first sd where the total cost is a loss.

Hrm..

You're now adding a hypothetical cost to the measured cost of running
the LB machinery, which implies that the measurement is insufficient,
but you still don't say why it is insufficient.  What happens if you
don't do that?  I ask, because when I removed the...

   this_rq->avg_idle < sysctl_sched_migration_cost

...bits to check removal effect for Peter, the original reason for it
being added did not re-materialize, making me wonder why you need to
make this cutoff more aggressive.

	-Mike


* Re: [RFC 2/2] Introduce sysctl(s) for the migration costs
  2018-02-08 22:19 ` [RFC 2/2] Introduce sysctl(s) for the migration costs Rohit Jain
  2018-02-09  3:54   ` Mike Galbraith
@ 2018-02-12 15:28   ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-02-12 15:28 UTC (permalink / raw)
  To: Rohit Jain
  Cc: linux-kernel, mingo, steven.sistare, joelaf, jbacik, riel,
	juri.lelli, dhaval.giani, efault

On Thu, Feb 08, 2018 at 02:19:55PM -0800, Rohit Jain wrote:
> This patch introduces the sysctl for sched_domain based migration costs.
> These in turn can be used for performance tuning of workloads.

Smells like a bad attempt to (again) revive commit:

  0437e109e184 ("sched: zap the migration init / cache-hot balancing code")

Yes, the migration cost would ideally be per domain; in practice it all
sucks because more tunables means more confusion. And as that commit
states, runtime measurements suck too: they cause run-to-run variation,
which causes repeatability issues, and degrade boot times.

Static numbers suck worse, because they'll be wrong for everyone.


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-10  6:37       ` Mike Galbraith
@ 2018-02-15 16:35         ` Steven Sistare
  2018-02-15 18:07           ` Mike Galbraith
  2018-02-15 18:07           ` Rohit Jain
  0 siblings, 2 replies; 18+ messages in thread
From: Steven Sistare @ 2018-02-15 16:35 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>  			continue;
>>>>  
>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>> +		    sd->sched_migration_cost) {
>>>>  			update_next_balance(sd, &next_balance);
>>>>  			break;
>>>>  		}
>>>
>>> Ditto.
>>
>> The old code did not migrate if the expected costs exceeded the expected idle
>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>> sd's, hence smallest to largest migration costs (though the tunables do
>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
> 
> Hrm..
> 
> You're now adding a hypothetical cost to the measured cost of running
> the LB machinery, which implies that the measurement is insufficient,
> but you still don't say why it is insufficient.  What happens if you
> don't do that?  I ask, because when I removed the...
> 
>    this_rq->avg_idle < sysctl_sched_migration_cost
> 
> ...bits to check removal effect for Peter, the original reason for it
> being added did not re-materialize, making me wonder why you need to
> make this cutoff more aggressive.

The current code with sysctl_sched_migration_cost discourages migration
too much, per our test results.  Deleting it entirely from idle_balance()
may be the right solution, or it may allow too much migration and
cause regressions due to loss of cache warmth on some workloads.
Rohit's patch deletes it and adds the sd->sched_migration_cost term
to allow a migration rate that is somewhere in the middle, and is
logically sound.  It discourages but does not prevent migration between
nodes, and encourages but does not always allow migration between cores.
By contrast, setting relax_domain_level to disable SD_BALANCE_NEWIDLE
at the SD_NUMA level is a big hammer.

I would be perfectly happy if deleting sysctl_sched_migration_cost from
idle_balance does the trick.  Last week in a different thread you mentioned
it did not hurt tbench:

>> Mike, do you remember what comes apart when we take
>> out the sysctl_sched_migration_cost test in idle_balance()?
>
> Used to be anything scheduling cross-core heftily suffered, ie pretty
> much any localhost communication heavy load.  I just tried disabling it
> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
> difference.  I presume that's due to the meanwhile added
> this_rq->rd->overload and/or curr_cost checks.

Can you provide more details on the sysbench oltp test that motivated you
to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
   1b9508f6 sched: Rate-limit newidle
   Rate limit newidle to migration_cost. It's a win for all stages of
   sysbench oltp tests.

Rohit is running more tests: one set with a patch that deletes
sysctl_sched_migration_cost from idle_balance, and one set with his
patch with the 5000 usec mistake corrected back to 500 usec.  So far
both give improvements over the baseline, but for different cases, so
we need to try more workloads before we draw any conclusions.

Rohit, can you share your data so far?

- Steve


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 16:35         ` Steven Sistare
@ 2018-02-15 18:07           ` Mike Galbraith
  2018-02-15 18:21             ` Steven Sistare
  2018-02-15 18:07           ` Rohit Jain
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Galbraith @ 2018-02-15 18:07 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> > On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> >>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> >>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
> >>>>  			continue;
> >>>>  
> >>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> >>>> +		    sd->sched_migration_cost) {
> >>>>  			update_next_balance(sd, &next_balance);
> >>>>  			break;
> >>>>  		}
> >>>
> >>> Ditto.
> >>
> >> The old code did not migrate if the expected costs exceeded the expected idle
> >> time.  The new code just adds the sd-specific penalty (essentially loss of cache
> >> footprint) to the costs.  The for_each_domain loop visits smallest to largest
> >> sd's, hence smallest to largest migration costs (though the tunables do
> >> not enforce an ordering), and bails at the first sd where the total cost is a loss.
> > 
> > Hrm..
> > 
> > You're now adding a hypothetical cost to the measured cost of running
> > the LB machinery, which implies that the measurement is insufficient,
> > but you still don't say why it is insufficient.  What happens if you
> > don't do that?  I ask, because when I removed the...
> > 
> >    this_rq->avg_idle < sysctl_sched_migration_cost
> > 
> > ...bits to check removal effect for Peter, the original reason for it
> > being added did not re-materialize, making me wonder why you need to
> > make this cutoff more aggressive.
> 
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.

That's why I asked you what happens if you only whack the _apparently_
(but maybe not) obsolete old throttle; it appeared likely that your win
came from allowing a bit more migration than the simple throttle
allowed, which, if true, would obviate the need for anything more.

> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?

The problem at that time was the cycle overhead of entering that LB
path at high frequency.  Dirt simple.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 16:35         ` Steven Sistare
  2018-02-15 18:07           ` Mike Galbraith
@ 2018-02-15 18:07           ` Rohit Jain
  2018-02-16  4:53             ` Mike Galbraith
  1 sibling, 1 reply; 18+ messages in thread
From: Rohit Jain @ 2018-02-15 18:07 UTC (permalink / raw)
  To: Steven Sistare, Mike Galbraith, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani



On 02/15/2018 08:35 AM, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>   		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>   			continue;
>>>>>   
>>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>> +		    sd->sched_migration_cost) {
>>>>>   			update_next_balance(sd, &next_balance);
>>>>>   			break;
>>>>>   		}
>>>> Ditto.
>>> The old code did not migrate if the expected costs exceeded the expected idle
>>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>>> sd's, hence smallest to largest migration costs (though the tunables do
>>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
>> Hrm..
>>
>> You're now adding a hypothetical cost to the measured cost of running
>> the LB machinery, which implies that the measurement is insufficient,
>> but you still don't say why it is insufficient.  What happens if you
>> don't do that?  I ask, because when I removed the...
>>
>>     this_rq->avg_idle < sysctl_sched_migration_cost
>>
>> ...bits to check removal effect for Peter, the original reason for it
>> being added did not re-materialize, making me wonder why you need to
>> make this cutoff more aggressive.
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.  Deleting it entirely from idle_balance()
> may be the right solution, or it may allow too much migration and
> cause regressions due to loss of cache warmth on some workloads.
> Rohit's patch deletes it and adds the sd->sched_migration_cost term
> to allow a migration rate that is somewhere in the middle, and is
> logically sound.  It discourages but does not prevent migration between
> nodes, and encourages but does not always allow migration between cores.
> By contrast, setting relax_domain_level to disable SD_BALANCE_NEWIDLE
> at the SD_NUMA level is a big hammer.
>
> I would be perfectly happy if deleting sysctl_sched_migration_cost from
> idle_balance does the trick.  Last week in a different thread you mentioned
> it did not hurt tbench:
>
>>> Mike, do you remember what comes apart when we take
>>> out the sysctl_sched_migration_cost test in idle_balance()?
>> Used to be anything scheduling cross-core heftily suffered, ie pretty
>> much any localhost communication heavy load.  I just tried disabling it
>> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
>> difference.  I presume that's due to the meanwhile added
>> this_rq->rd->overload and/or curr_cost checks.
> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
>     1b9508f6 sched: Rate-limit newidle
>     Rate limit newidle to migration_cost. It's a win for all stages of
>     sysbench oltp tests.
>
> Rohit is running more tests: one set with a patch that deletes
> sysctl_sched_migration_cost from idle_balance, and one set with his
> patch with the 5000 usec mistake corrected back to 500 usec.  So far
> both give improvements over the baseline, but for different cases, so
> we need to try more workloads before we draw any conclusions.
>
> Rohit, can you share your data so far?

Results:

In the following results, the "Domain based" approach is the one sent
out in this RFC, with the values fixed (as pointed out by Mike). "No
check" is the patch that simply removes the check against
sysctl_sched_migration_cost (sketched below).
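
A sketch of that change, assuming it is exactly the removal of the
global-constant test from idle_balance():

-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	if (!this_rq->rd->overload) {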

1) Hackbench results on a 2-socket, 44-core, 88-thread Intel x86 machine
(lower is better):

+--------------+-----------------+--------------------------+-------------------------+
|              | Without Patch   | Domain Based             | No Check                |
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|Loops | Groups|Average |%Std Dev|Average          |%Std Dev|Average         |%Std Dev|
+------+-------+--------+--------+-----------------+--------+----------------+--------+
|100000| 4     |9.701   |0.78    |7.971  (+17.84%) | 1.34   |8.919  (+8.07%) |1.07    |
|100000| 8     |17.186  |0.77    |16.712 (+2.76%)  | 0.87   |17.043 (+0.83%) |0.83    |
|100000| 16    |30.378  |0.55    |29.780 (+1.97%)  | 0.38   |29.565 (+2.67%) |0.29    |
|100000| 32    |54.712  |0.54    |53.001 (+3.13%)  | 0.19   |52.158 (+4.67%) |0.22    |
+------+-------+--------+--------+-----------------+--------+----------------+--------+

2) Sysbench MySQL results on a 2-socket, 44-core, 88-thread Intel x86
machine (higher is better):

+-------+--------------------+----------------------------+----------------------------+
|       | Without Patch      | Domain based               | No check                   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|Num    | Average   |        | Average           |        | Average           |        |
|Threads| throughput|%Std Dev| throughput        |%Std Dev| throughput        |%Std Dev|
+-------+-----------+--------+-------------------+--------+-------------------+--------+
|    8  | 133658.2  | 0.66   | 134909.4 (+0.94%) | 0.94   | 134232.2 (+0.43%) | 1.29   |
|   16  | 266540    | 0.48   | 268253.4 (+0.64%) | 0.64   | 268584.6 (+0.77%) | 0.37   |
|   32  | 466315.6  | 0.15   | 465903.6 (-0.09%) | 0.28   | 468594.2 (+0.49%) | 0.23   |
|   64  | 720039.4  | 0.23   | 725663.8 (+0.78%) | 0.42   | 717253.8 (-0.39%) | 0.36   |
|   72  | 757284.4  | 0.25   | 770693.4 (+1.77%) | 0.29   | 764984.0 (+1.02%) | 0.38   |
|   80  | 807955.6  | 0.22   | 818446.0 (+1.30%) | 0.24   | 831372.2 (+2.90%) | 0.10   |
|   88  | 863173.8  | 0.25   | 870520.4 (+0.85%) | 0.23   | 887049.0 (+2.77%) | 0.56   |
|   96  | 882950.8  | 0.32   | 890775.4 (+0.89%) | 0.40   | 892913.8 (+1.13%) | 0.41   |
|  128  | 895112.6  | 0.13   | 898524.2 (+0.38%) | 0.16   | 901195.0 (+0.68%) | 0.28   |
+-------+-----------+--------+-------------------+--------+-------------------+--------+

Thanks,
Rohit


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:07           ` Mike Galbraith
@ 2018-02-15 18:21             ` Steven Sistare
  2018-02-15 18:39               ` Mike Galbraith
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Sistare @ 2018-02-15 18:21 UTC (permalink / raw)
  To: Mike Galbraith, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On 2/15/2018 1:07 PM, Mike Galbraith wrote:
> On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
>> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>>  		if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>>  			continue;
>>>>>>  
>>>>>> -		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>>> +		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>>> +		    sd->sched_migration_cost) {
>>>>>>  			update_next_balance(sd, &next_balance);
>>>>>>  			break;
>>>>>>  		}
>>>>>
>>>>> Ditto.
>>>>
>>>> The old code did not migrate if the expected costs exceeded the expected idle
>>>> time.  The new code just adds the sd-specific penalty (essentially loss of cache
>>>> footprint) to the costs.  The for_each_domain loop visits smallest to largest
>>>> sd's, hence smallest to largest migration costs (though the tunables do
>>>> not enforce an ordering), and bails at the first sd where the total cost is a loss.
>>>
>>> Hrm..
>>>
>>> You're now adding a hypothetical cost to the measured cost of running
>>> the LB machinery, which implies that the measurement is insufficient,
>>> but you still don't say why it is insufficient.  What happens if you
>>> don't do that?  I ask, because when I removed the...
>>>
>>>    this_rq->avg_idle < sysctl_sched_migration_cost
>>>
>>> ...bits to check removal effect for Peter, the original reason for it
>>> being added did not re-materialize, making me wonder why you need to
>>> make this cutoff more aggressive.
>>
>> The current code with sysctl_sched_migration_cost discourages migration
>> too much, per our test results.
> 
> That's why I asked you what happens if you only whack the _apparently_
> (but maybe not) obsolete old throttle; it appeared likely that your win
> came from allowing a bit more migration than the simple throttle
> allowed, which, if true, would obviate the need for anything more.
> 
>> Can you provide more details on the sysbench oltp test that motivated you
>> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
> 
> The problem at that time was the cycle overhead of entering that LB
> path at high frequency.  Dirt simple.

I get that. I meant: please provide details on the test parameters and
config, if you remember them.

- Steve


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:21             ` Steven Sistare
@ 2018-02-15 18:39               ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-15 18:39 UTC (permalink / raw)
  To: Steven Sistare, Rohit Jain, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 13:21 -0500, Steven Sistare wrote:
> On 2/15/2018 1:07 PM, Mike Galbraith wrote:
> 
> >> Can you provide more details on the sysbench oltp test that motivated you
> >> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test it?
> > 
> > The problem at that time was the cycle overhead of entering that LB
> > path at high frequency.  Dirt simple.
> 
> I get that. I meant please provide details on test parameters and config if
> you remember them.

Nope.  I doubt it would be relevant to here/now anyway.

	-Mike


* Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
  2018-02-15 18:07           ` Rohit Jain
@ 2018-02-16  4:53             ` Mike Galbraith
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Galbraith @ 2018-02-16  4:53 UTC (permalink / raw)
  To: Rohit Jain, Steven Sistare, linux-kernel
  Cc: peterz, mingo, joelaf, jbacik, riel, juri.lelli, dhaval.giani

On Thu, 2018-02-15 at 10:07 -0800, Rohit Jain wrote:
> 
> > Rohit is running more tests: one set with a patch that deletes
> > sysctl_sched_migration_cost from idle_balance, and one set with his
> > patch with the 5000 usec mistake corrected back to 500 usec.  So far
> > both give improvements over the baseline, but for different cases, so
> > we need to try more workloads before we draw any conclusions.
> >
> > Rohit, can you share your data so far?
> 
> Results:
> 
> In the following results, "Domain based" approach is as mentioned in the
> RFC sent out with the values fixed (As pointed out by Mike). "No check" is
> the patch where I just remove the check against sysctl_sched_migration_cost
> 
> 1) Hackbench results on a 2-socket, 44-core, 88-thread Intel x86 machine
> (lower is better):
> 
> +--------------+-----------------+--------------------------+-------------------------+
> |              | Without Patch   |Domain Based              |No Check                 |
> +------+-------+--------+--------+-----------------+--------+----------------+--------+
> |Loops | Groups|Average |%Std Dev|Average          |%Std Dev|Average         |%Std Dev|
> +------+-------+--------+--------+-----------------+--------+----------------+--------+
> |100000| 4     |9.701   |0.78    |7.971  (+17.84%) | 1.34   |8.919  (+8.07%) |1.07    |
> |100000| 8     |17.186  |0.77    |16.712 (+2.76%)  | 0.87   |17.043 (+0.83%) |0.83    |
> |100000| 16    |30.378  |0.55    |29.780 (+1.97%)  | 0.38   |29.565 (+2.67%) |0.29    |
> |100000| 32    |54.712  |0.54    |53.001 (+3.13%)  | 0.19   |52.158 (+4.67%) |0.22    |
> +------+-------+--------+--------+-----------------+--------+----------------+--------+

Previous numbers:

+-------+----+-------+-------------------+--------------------------+
|       |    |       | Without patch     |With patch                |
+-------+----+-------+---------+---------+----------------+---------+
|Loops  |FD  |Groups | Average |%Std Dev |Average         |%Std Dev |
+-------+----+-------+---------+---------+----------------+---------+
|100000 |40  |4      | 9.701   |0.78     |9.623  (+0.81%) |3.67     |
|100000 |40  |8      | 17.186  |0.77     |17.068 (+0.68%) |1.89     |
|100000 |40  |16     | 30.378  |0.55     |30.072 (+1.52%) |0.46     |
|100000 |40  |32     | 54.712  |0.54     |53.588 (+2.28%) |0.21     |
+-------+----+-------+---------+---------+----------------+---------+

My take on this (not that you have to sell it to me, you don't) when I
squint at these together is: submit the one-liner, and take the rest
back to the drawing board.  You've got nothing but high std dev numbers
in (imo) way too finicky/unrealistic hackbench to sell these
not-so-pretty patches.

I bet you can easily sell that one-liner, because it removes an old
wart (me stealing migration_cost in the first place), instead of making
the wart a whole lot harder to intentionally not notice.

	-Mike

