* [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue
@ 2017-09-26  0:02 Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.

Currently, the scheduler takes the original capacity of CPUs into account
when providing 'hints' for the select_idle_sibling code path to return an
idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Furthermore, these code paths are only aware of the
original capacity and not of the capacity stolen by IRQ/RT activity.

This patch set introduces capacity awareness in the scheduler (CAS), which
avoids CPUs whose capacity may be reduced (due to IRQ/RT activity) when
trying to schedule threads (on the push side) in the system. This
awareness has been added to the fair scheduling class.

It does so using the following algorithm:
1) The scaled capacities are already calculated, based on rt_avg.

2) Any CPU running below 80% of its original capacity is considered to be
running low on capacity (see the sketch after this list).

3) During the idle CPU search, if a CPU is found to be running low on
capacity, it is skipped if better CPUs are available.

4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
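
For reference, the 80% test in steps 2-4 boils down to the helper
introduced in patch 1 below (capacity_of() is the IRQ/RT-scaled capacity
and capacity_orig_of() is the original one):

    /*
     * A CPU is treated as having full capacity when its scaled capacity
     * is at least ~80% (819/1024) of its original capacity.
     */
    static inline bool full_capacity(int cpu)
    {
            return capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10);
    }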

The performance numbers:
---------------------------------------------------------------------------
CAS shows up to a 1.5% improvement on x86 when running a 'SELECT' database
workload.

I also used barrier.c (OpenMP code) as a micro-benchmark. It runs a number
of iterations and does a barrier sync at the end of each for loop.
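
The gist of that pattern is roughly the following (a hypothetical sketch
only; the actual program is linked further below and its details differ):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
            const int iters = 100000;
            double start = omp_get_wtime();

    #pragma omp parallel
            for (int i = 0; i < iters; i++) {
                    /* per-iteration work goes here */

                    /* all threads synchronize at the end of each iteration */
    #pragma omp barrier
            }

            printf("%.1f iterations/sec\n", iters / (omp_get_wtime() - start));
            return 0;
    }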

I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'
(a flood ping with a preload of 10000 packets, quiet output, and a 10-byte
payload)

The results below should be read as follows:

* 'Baseline without ping' is how the workload would have behaved if there
  had been no IRQ activity.

* Compare 'Baseline with ping' against 'Baseline without ping' to see the
  effect of ping.

* Compare 'Baseline with ping' against 'CAS with ping' to see the
  improvement CAS can give over the baseline.

The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html

The following are the iterations-per-second results for this
micro-benchmark (higher is better), on a 20-core x86 machine:

+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+

The following are the runtimes in seconds with hackbench and the ping
activity described above (lower is better), on a 20-core x86 machine:

+---------------+------+--------+--------+
|Num. Groups    |CAS   |Baseline|Baseline|
|(40 tasks each)|with  |with    |without |
|               |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+
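
(The hackbench numbers above are for the given number of groups with the
default group size of 40 tasks each, i.e. 20 sender/receiver pairs; the
exact command line was not posted, but an invocation along the lines of
'hackbench -g <num_groups>' is assumed here.)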

Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation so that global state can be
  avoided.

v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
  paths.

v3->v4:
* Rebased onto peterz's tree (apologies for using the wrong tree for v3)

Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
https://lists.linaro.org/pipermail/eas-dev/2017-August/000933.html

Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
    code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling
    code path
  ignore_this_patch: Fixing compilation error on Peter's tree

 kernel/sched/fair.c      | 81 +++++++++++++++++++++++++++++++++++++++---------
 kernel/time/tick-sched.c |  1 +
 2 files changed, 68 insertions(+), 14 deletions(-)

-- 
2.7.4


* [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-09-26  2:51   ` joelaf
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for idle CPUs for a waking task, we should also account
for the delays caused by the bandwidth reduction due to RT/IRQ tasks.

This patch does that by trying to find a higher-capacity CPU with
minimum wake-up latency.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca6a57..afb701f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity_orig;
 }
 
+static inline bool full_capacity(int cpu)
+{
+	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
+	unsigned int backup_cap = 0;
 	int least_loaded_cpu = this_cpu;
 	int shallowest_idle_cpu = -1;
+	int shallowest_idle_cpu_backup = -1;
 	int i;
 
 	/* Check if we have any choice: */
@@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				if (full_capacity(i)) {
+					shallowest_idle_cpu = i;
+				} else if (capacity_of(i) > backup_cap) {
+					shallowest_idle_cpu_backup = i;
+					backup_cap = capacity_of(i);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(cpu_rq(i));
@@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+		shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.7.4


* [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-09-26  6:53   ` Joel Fernandes
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for CPUs to place running tasks on, the scheduler
completely ignores the capacity stolen away by RT/IRQ tasks.

This patch fixes that.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index afb701f..19ff2c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	int core, cpu, rcpu, rcpu_backup;
+	unsigned int backup_cap = 0;
+
+	rcpu = rcpu_backup = -1;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+
+			if (full_capacity(cpu)) {
+				rcpu = cpu;
+			} else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
+				backup_cap = capacity_of(cpu);
+				rcpu_backup = cpu;
+			}
 		}
 
-		if (idle)
-			return core;
+		if (idle) {
+			if (rcpu == -1)
+				return (rcpu_backup != -1 ? rcpu_backup : core);
+			return rcpu;
+		}
 	}
 
 	/*
@@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
-	return -1;
+	return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6117,6 +6137,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6147,10 +6169,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -6167,13 +6198,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i;
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
 		return target;
 
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
 		return prev;
 
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-- 
2.7.4


* [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree
  2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
@ 2017-09-26  0:02 ` Rohit Jain
  2017-10-01  0:32   ` kbuild test robot
  2 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  0:02 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

Addendum, please ignore.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/time/tick-sched.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index eb0e975..ede1add 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -19,6 +19,7 @@
 #include <linux/percpu.h>
 #include <linux/nmi.h>
 #include <linux/profile.h>
+#include <linux/vmstat.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/stat.h>
-- 
2.7.4


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
@ 2017-09-26  2:51   ` joelaf
  2017-09-26  4:40     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: joelaf @ 2017-09-26  2:51 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel, eas-dev
  Cc: peterz, mingo, atish.patra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

Hi Rohit,

Just some comments:

On 09/25/2017 05:02 PM, Rohit Jain wrote:
> While looking for idle CPUs for a waking task, we should also account
> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
> 
> This patch does that by trying to find a higher capacity CPU with
> minimum wake up latency.
> 
> 
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca6a57..afb701f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>  	return cpu_rq(cpu)->cpu_capacity_orig;
>  }
>  
> +static inline bool full_capacity(int cpu)
> +{
> +	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));

Wouldn't 768 be better for the multiplication? gcc then converts the expression to shifts and adds.

> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  	unsigned long load, min_load = ULONG_MAX;
>  	unsigned int min_exit_latency = UINT_MAX;
>  	u64 latest_idle_timestamp = 0;
> +	unsigned int backup_cap = 0;
>  	int least_loaded_cpu = this_cpu;
>  	int shallowest_idle_cpu = -1;
> +	int shallowest_idle_cpu_backup = -1;
>  	int i;
>  
>  	/* Check if we have any choice: */
> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  				 */
>  				min_exit_latency = idle->exit_latency;
>  				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				if (full_capacity(i)) {
> +					shallowest_idle_cpu = i;
> +				} else if (capacity_of(i) > backup_cap) {
> +					shallowest_idle_cpu_backup = i;
> +					backup_cap = capacity_of(i);
> +				}

I'm a bit skeptical about this - if the CPU is idle, then is it likely that the capacity of the CPU is reduced due to RT pressure? I can see that it can matter, but I am wondering if you have any data for your use case to show that it does (that is, if you didn't consider RT pressure for idle CPUs, are you still seeing a big enough performance improvement to warrant the change?)

>  			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>  				   rq->idle_stamp > latest_idle_timestamp) {
>  				/*
> @@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  				 * a warmer cache.
>  				 */
>  				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				if (full_capacity(i)) {
> +					shallowest_idle_cpu = i;
> +				} else if (capacity_of(i) > backup_cap) {
> +					shallowest_idle_cpu_backup = i;
> +					backup_cap = capacity_of(i);
> +				}
>  			}
>  		} else if (shallowest_idle_cpu == -1) {
>  			load = weighted_cpuload(cpu_rq(i));
> @@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  		}
>  	}
>  
> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
> +	if (shallowest_idle_cpu != -1)
> +		return shallowest_idle_cpu;
> +
> +	return (shallowest_idle_cpu_backup != -1 ?
> +		shallowest_idle_cpu_backup : least_loaded_cpu);
>  }
>  
>  #ifdef CONFIG_SCHED_SMT
> 

I see code duplication here which can be reduced by 7 lines compared to your original patch:

---
 kernel/sched/fair.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..72fc8d18b251 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5528,6 +5528,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
 		if (idle_cpu(i)) {
+			int idle_candidate = -1;
 			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
@@ -5538,7 +5539,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 */
 				min_exit_latency = idle->exit_latency;
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
 			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
 				   rq->idle_stamp > latest_idle_timestamp) {
 				/*
@@ -5547,7 +5548,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 				 * a warmer cache.
 				 */
 				latest_idle_timestamp = rq->idle_stamp;
-				shallowest_idle_cpu = i;
+				idle_candidate = i;
+			}
+
+			if (idle_candidate != -1) {
+				if (full_capacity(idle_candidate)) {
+					shallowest_idle_cpu = idle_candidate;
+				} else if (capacity_of(idle_candidate) > backup_cap) {
+					shallowest_idle_cpu_backup = idle_candidate;
+					backup_cap = capacity_of(idle_candidate);
+				}
 			}
 		} else if (shallowest_idle_cpu == -1) {
 			load = weighted_cpuload(i);
@@ -5558,7 +5568,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 		}
 	}
 
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+			shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
-- 
2.14.1.821.g8fa685d3b7-goog


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  2:51   ` joelaf
@ 2017-09-26  4:40     ` Rohit Jain
  2017-09-26  6:59       ` Joel Fernandes
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26  4:40 UTC (permalink / raw)
  To: joelaf, linux-kernel, eas-dev
  Cc: peterz, mingo, atish.patra, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

On 09/25/2017 07:51 PM, joelaf wrote:
> Hi Rohit,
>
> Just some comments:

Hi Joel,

Thanks for the comments

>
> On 09/25/2017 05:02 PM, Rohit Jain wrote:
>> While looking for idle CPUs for a waking task, we should also account
>> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
>>
>> This patch does that by trying to find a higher capacity CPU with
>> minimum wake up latency.
>>
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>>   1 file changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eca6a57..afb701f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>>   	return cpu_rq(cpu)->cpu_capacity_orig;
>>   }
>>   
>> +static inline bool full_capacity(int cpu)
>> +{
>> +	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
> Wouldn't 768 be better for multiplication? gcc converts the expression to shifts and adds then.

While 768 is easier to convert to shifts and adds, 819/1024 gets you
very close to 80%, which is what I was trying to achieve.
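(For reference: 819/1024 ≈ 79.98%, whereas 768/1024 is exactly 75%, i.e. a
noticeably lower threshold.)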

>
>> +}
>> +
>>   static unsigned long cpu_avg_load_per_task(int cpu)
>>   {
>>   	struct rq *rq = cpu_rq(cpu);
>> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   	unsigned long load, min_load = ULONG_MAX;
>>   	unsigned int min_exit_latency = UINT_MAX;
>>   	u64 latest_idle_timestamp = 0;
>> +	unsigned int backup_cap = 0;
>>   	int least_loaded_cpu = this_cpu;
>>   	int shallowest_idle_cpu = -1;
>> +	int shallowest_idle_cpu_backup = -1;
>>   	int i;
>>   
>>   	/* Check if we have any choice: */
>> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   				 */
>>   				min_exit_latency = idle->exit_latency;
>>   				latest_idle_timestamp = rq->idle_stamp;
>> -				shallowest_idle_cpu = i;
>> +				if (full_capacity(i)) {
>> +					shallowest_idle_cpu = i;
>> +				} else if (capacity_of(i) > backup_cap) {
>> +					shallowest_idle_cpu_backup = i;
>> +					backup_cap = capacity_of(i);
>> +				}
> I'm a bit skeptical about this - if the CPU is idle, then is it likely that the capacity of the CPU is reduced due to RT pressure?

What has idleness got to do with RT pressure?

This is an instantaneous view where the scheduler is looking to place
threads. In this case, if we know historically the capacity of the CPU
is reduced (due to RT/IRQ/Thermal Throttling or whatever it may be) we
should avoid that CPU if we have a choice.

> I can see that it can matter, but I am wondering if you have any data for your usecase to show that it does (that is if you didn't consider RT pressure for idle CPUs, are you still seeing a big enough performance improvement to warrant the change?

I tested this with OLTP, which has a mix of both IRQ and RT threads.
I also tested with micro-benchmarks which have IRQ and fair threads. I
haven't seen what happens with RT activity alone. This comes back to the
question: "Why reduce capacities when an RT thread is running?" Frankly,
I don't know the answer; however, from a capacity standpoint that is
taken into account.

It makes sense to me, however I don't know the reasoning.
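
(For context: in kernels of this vintage the reduction comes from
update_cpu_capacity() scaling the original capacity by the share of recent
time *not* consumed by RT/IRQ work, as tracked via rq->rt_avg. Roughly, as
a simplified sketch rather than the literal kernel code:

    static void update_cpu_capacity_sketch(int cpu)
    {
            unsigned long capacity = capacity_orig_of(cpu);

            /* scale_rt_capacity() returns SCHED_CAPACITY_SCALE minus the
             * portion recently consumed by RT/IRQ time, from rq->rt_avg. */
            capacity *= scale_rt_capacity(cpu);
            capacity >>= SCHED_CAPACITY_SHIFT;

            cpu_rq(cpu)->cpu_capacity = capacity;  /* what capacity_of() reads */
    }

so an otherwise idle CPU can still report a reduced capacity_of() if it has
recently spent a lot of time in IRQ or RT context.)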

>>   			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>>   				   rq->idle_stamp > latest_idle_timestamp) {
>>   				/*
>> @@ -5946,7 +5958,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   				 * a warmer cache.
>>   				 */
>>   				latest_idle_timestamp = rq->idle_stamp;
>> -				shallowest_idle_cpu = i;
>> +				if (full_capacity(i)) {
>> +					shallowest_idle_cpu = i;
>> +				} else if (capacity_of(i) > backup_cap) {
>> +					shallowest_idle_cpu_backup = i;
>> +					backup_cap = capacity_of(i);
>> +				}
>>   			}
>>   		} else if (shallowest_idle_cpu == -1) {
>>   			load = weighted_cpuload(cpu_rq(i));
>> @@ -5957,7 +5974,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>   		}
>>   	}
>>   
>> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>> +	if (shallowest_idle_cpu != -1)
>> +		return shallowest_idle_cpu;
>> +
>> +	return (shallowest_idle_cpu_backup != -1 ?
>> +		shallowest_idle_cpu_backup : least_loaded_cpu);
>>   }
>>   
>>   #ifdef CONFIG_SCHED_SMT
>>
> I see code duplication here which can be reduced by 7 lines compared to your original patch:

This does look better and I will try to incorporate this.

Thanks,
Rohit

>
> ---
>   kernel/sched/fair.c | 20 +++++++++++++++++---
>   1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c95880e216f6..72fc8d18b251 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5528,6 +5528,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   	/* Traverse only the allowed CPUs */
>   	for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
>   		if (idle_cpu(i)) {
> +			int idle_candidate = -1;
>   			struct rq *rq = cpu_rq(i);
>   			struct cpuidle_state *idle = idle_get_state(rq);
>   			if (idle && idle->exit_latency < min_exit_latency) {
> @@ -5538,7 +5539,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   				 */
>   				min_exit_latency = idle->exit_latency;
>   				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				idle_candidate = i;
>   			} else if ((!idle || idle->exit_latency == min_exit_latency) &&
>   				   rq->idle_stamp > latest_idle_timestamp) {
>   				/*
> @@ -5547,7 +5548,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   				 * a warmer cache.
>   				 */
>   				latest_idle_timestamp = rq->idle_stamp;
> -				shallowest_idle_cpu = i;
> +				idle_candidate = i;
> +			}
> +
> +			if (idle_candidate != -1) {
> +				if (full_capacity(idle_candidate)) {
> +					shallowest_idle_cpu = idle_candidate;
> +				} else if (capacity_of(idle_candidate) > backup_cap) {
> +					shallowest_idle_cpu_backup = idle_candidate;
> +					backup_cap = capacity_of(idle_candidate);
> +				}
>   			}
>   		} else if (shallowest_idle_cpu == -1) {
>   			load = weighted_cpuload(i);
> @@ -5558,7 +5568,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   		}
>   	}
>   
> -	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
> +	if (shallowest_idle_cpu != -1)
> +		return shallowest_idle_cpu;
> +
> +	return (shallowest_idle_cpu_backup != -1 ?
> +			shallowest_idle_cpu_backup : least_loaded_cpu);
>   }
>   
>   #ifdef CONFIG_SCHED_SMT


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
@ 2017-09-26  6:53   ` Joel Fernandes
  2017-09-26 19:48     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2017-09-26  6:53 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

Just some comments:

On Mon, Sep 25, 2017 at 5:02 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> While looking for CPUs to place running tasks on, the scheduler
> completely ignores the capacity stolen away by RT/IRQ tasks.
>
> This patch fixes that.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 43 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index afb701f..19ff2c3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
>  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int core, cpu;
> +       int core, cpu, rcpu, rcpu_backup;

I would call rcpu_backup 'backup_cpu'.

> +       unsigned int backup_cap = 0;
> +
> +       rcpu = rcpu_backup = -1;
>
>         if (!static_branch_likely(&sched_smt_present))
>                 return -1;
> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>                         cpumask_clear_cpu(cpu, cpus);
>                         if (!idle_cpu(cpu))
>                                 idle = false;
> +
> +                       if (full_capacity(cpu)) {
> +                               rcpu = cpu;
> +                       } else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
> +                               backup_cap = capacity_of(cpu);
> +                               rcpu_backup = cpu;
> +                       }

Here you are comparing the capacities of different SMT threads.

>                 }
>
> -               if (idle)
> -                       return core;
> +               if (idle) {
> +                       if (rcpu == -1)
> +                               return (rcpu_backup != -1 ? rcpu_backup : core);
> +                       return rcpu;
> +               }


This didn't make much sense to me: here you are returning either an
SMT thread or a core. That doesn't make much of a difference because
SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
what you want to do is find out the capacity of a 'core', not an SMT
thread, compare the capacities of different cores, and consider the
one which has the least RT/IRQ interference.

>         }
>
>         /*
> @@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   */
>  static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>  {
> -       int cpu;
> +       int cpu, backup_cpu = -1;
> +       unsigned int backup_cap = 0;
>
>         if (!static_branch_likely(&sched_smt_present))
>                 return -1;
> @@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>         for_each_cpu(cpu, cpu_smt_mask(target)) {
>                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>                         continue;
> -               if (idle_cpu(cpu))
> -                       return cpu;
> +               if (idle_cpu(cpu)) {
> +                       if (full_capacity(cpu))
> +                               return cpu;
> +                       if (capacity_of(cpu) > backup_cap) {
> +                               backup_cap = capacity_of(cpu);
> +                               backup_cpu = cpu;
> +                       }
> +               }

Same thing here: since SMT threads share the same underlying capacity,
is there any point in comparing the capacities of each SMT thread?

thanks,

- Joel

[...]


* Re: [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  2017-09-26  4:40     ` Rohit Jain
@ 2017-09-26  6:59       ` Joel Fernandes
  0 siblings, 0 replies; 17+ messages in thread
From: Joel Fernandes @ 2017-09-26  6:59 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Mon, Sep 25, 2017 at 9:40 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> On 09/25/2017 07:51 PM, joelaf wrote:
[...]
>>
>> On 09/25/2017 05:02 PM, Rohit Jain wrote:
>>>
>>> While looking for idle CPUs for a waking task, we should also account
>>> for the delays caused due to the bandwidth reduction by RT/IRQ tasks.
>>>
>>> This patch does that by trying to find a higher capacity CPU with
>>> minimum wake up latency.
>>>
>>>
>>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>>> ---
>>>   kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>>>   1 file changed, 24 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index eca6a57..afb701f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5590,6 +5590,11 @@ static unsigned long capacity_orig_of(int cpu)
>>>         return cpu_rq(cpu)->cpu_capacity_orig;
>>>   }
>>>   +static inline bool full_capacity(int cpu)
>>> +{
>>> +       return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
>>
>> Wouldn't 768 be better for multiplication? gcc converts the expression to
>> shifts and adds then.
>
>
> While 768 is easier to convert to shifts and adds, 819/1024 gets you
> very close to 80% which is what I was trying to achieve.

Yeah, I guess if it's not too hard, you could check whether 768 gets you a
similar result, but I would defer to the maintainers on what they are
OK with.

>>> +}
>>> +
>>>   static unsigned long cpu_avg_load_per_task(int cpu)
>>>   {
>>>         struct rq *rq = cpu_rq(cpu);
>>> @@ -5916,8 +5921,10 @@ find_idlest_cpu(struct sched_group *group, struct
>>> task_struct *p, int this_cpu)
>>>         unsigned long load, min_load = ULONG_MAX;
>>>         unsigned int min_exit_latency = UINT_MAX;
>>>         u64 latest_idle_timestamp = 0;
>>> +       unsigned int backup_cap = 0;
>>>         int least_loaded_cpu = this_cpu;
>>>         int shallowest_idle_cpu = -1;
>>> +       int shallowest_idle_cpu_backup = -1;
>>>         int i;
>>>         /* Check if we have any choice: */
>>> @@ -5937,7 +5944,12 @@ find_idlest_cpu(struct sched_group *group, struct
>>> task_struct *p, int this_cpu)
>>>                                  */
>>>                                 min_exit_latency = idle->exit_latency;
>>>                                 latest_idle_timestamp = rq->idle_stamp;
>>> -                               shallowest_idle_cpu = i;
>>> +                               if (full_capacity(i)) {
>>> +                                       shallowest_idle_cpu = i;
>>> +                               } else if (capacity_of(i) > backup_cap) {
>>> +                                       shallowest_idle_cpu_backup = i;
>>> +                                       backup_cap = capacity_of(i);
>>> +                               }
>>
>> I'm a bit skeptical about this - if the CPU is idle, then is it likely
>> that the capacity of the CPU is reduced due to RT pressure?
>
>
> What has idleness got to do with RT pressure?
>
> This is an instantaneous view where the scheduler is looking to place
> threads. In this case, if we know historically the capacity of the CPU
> is reduced (due to RT/IRQ/Thermal Throttling or whatever it may be) we
> should avoid that CPU if we have a choice.

Yeah, OK, that's a fair point, I don't dispute this fact. I was just
trying to understand your patch.

thanks,

- Joel

[...]


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26  6:53   ` Joel Fernandes
@ 2017-09-26 19:48     ` Rohit Jain
  2017-09-28 10:53       ` joelaf
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-26 19:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

On 09/25/2017 11:53 PM, Joel Fernandes wrote:
> Hi Rohit,
>
> Just some comments:

Hi Joel,

Thanks for the comments.

> On Mon, Sep 25, 2017 at 5:02 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
>> While looking for CPUs to place running tasks on, the scheduler
>> completely ignores the capacity stolen away by RT/IRQ tasks.
>>
>> This patch fixes that.
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 43 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index afb701f..19ff2c3 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6040,7 +6040,10 @@ void __update_idle_core(struct rq *rq)
>>   static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>>   {
>>          struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>> -       int core, cpu;
>> +       int core, cpu, rcpu, rcpu_backup;
> I would call rcpu_backup as backup_cpu.

OK

>
>> +       unsigned int backup_cap = 0;
>> +
>> +       rcpu = rcpu_backup = -1;
>>
>>          if (!static_branch_likely(&sched_smt_present))
>>                  return -1;
>> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>                          cpumask_clear_cpu(cpu, cpus);
>>                          if (!idle_cpu(cpu))
>>                                  idle = false;
>> +
>> +                       if (full_capacity(cpu)) {
>> +                               rcpu = cpu;
>> +                       } else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
>> +                               backup_cap = capacity_of(cpu);
>> +                               rcpu_backup = cpu;
>> +                       }
> Here you comparing capacity of different SMT threads.
>
>>                  }
>>
>> -               if (idle)
>> -                       return core;
>> +               if (idle) {
>> +                       if (rcpu == -1)
>> +                               return (rcpu_backup != -1 ? rcpu_backup : core);
>> +                       return rcpu;
>> +               }
>
> This didn't make much sense to me, here you are returning either an
> SMT thread or a core. That doesn't make much of a difference because
> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
> what you want to do is find out the capacity of a 'core', not an SMT
> thread, and compare the capacity of different cores and consider the
> one which has least RT/IRQ interference.

IIUC the capacity of each strand is scaled by IRQ and 'rt_avg' for that
'rq'. Now, if the strand is idle and gets an interrupt in the future,
the 'core' would look like:

    +----+----+
    | I  |    |
    | T  |    |
    +----+----+

(I -> Interrupt, T-> Thread we are trying to schedule).

whereas if the other strand on the core was taking interrupt the core
would look like:

    +----+----+
    | I  | T  |
    |    |    |
    +----+----+

In this case, because we know from the past average that one of the
strands is running low on capacity, I am trying to return a better strand
for the thread to start on.

>
>>          }
>>
>>          /*
>> @@ -6076,7 +6089,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>    */
>>   static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>>   {
>> -       int cpu;
>> +       int cpu, backup_cpu = -1;
>> +       unsigned int backup_cap = 0;
>>
>>          if (!static_branch_likely(&sched_smt_present))
>>                  return -1;
>> @@ -6084,11 +6098,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>>          for_each_cpu(cpu, cpu_smt_mask(target)) {
>>                  if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>>                          continue;
>> -               if (idle_cpu(cpu))
>> -                       return cpu;
>> +               if (idle_cpu(cpu)) {
>> +                       if (full_capacity(cpu))
>> +                               return cpu;
>> +                       if (capacity_of(cpu) > backup_cap) {
>> +                               backup_cap = capacity_of(cpu);
>> +                               backup_cpu = cpu;
>> +                       }
>> +               }
> Same thing here, since SMT threads share the same underlying capacity,
> is there any point in comparing the capacities of each SMT thread?

See above

Thanks,
Rohit

>
> thanks,
>
> - Joel
>
> [...]


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-26 19:48     ` Rohit Jain
@ 2017-09-28 10:53       ` joelaf
  2017-09-28 15:09         ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: joelaf @ 2017-09-28 10:53 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Tue, Sep 26, 2017 at 12:48 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
[...]
>>> +       unsigned int backup_cap = 0;
>>> +
>>> +       rcpu = rcpu_backup = -1;
>>>
>>>          if (!static_branch_likely(&sched_smt_present))
>>>                  return -1;
>>> @@ -6057,10 +6060,20 @@ static int select_idle_core(struct task_struct
>>> *p, struct sched_domain *sd, int
>>>                          cpumask_clear_cpu(cpu, cpus);
>>>                          if (!idle_cpu(cpu))
>>>                                  idle = false;
>>> +
>>> +                       if (full_capacity(cpu)) {
>>> +                               rcpu = cpu;
>>> +                       } else if ((rcpu == -1) && (capacity_of(cpu) >
>>> backup_cap)) {
>>> +                               backup_cap = capacity_of(cpu);
>>> +                               rcpu_backup = cpu;
>>> +                       }
>>
>> Here you comparing capacity of different SMT threads.
>>
>>>                  }
>>>
>>> -               if (idle)
>>> -                       return core;
>>> +               if (idle) {
>>> +                       if (rcpu == -1)
>>> +                               return (rcpu_backup != -1 ? rcpu_backup :
>>> core);
>>> +                       return rcpu;
>>> +               }
>>
>>
>> This didn't make much sense to me, here you are returning either an
>> SMT thread or a core. That doesn't make much of a difference because
>> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
>> what you want to do is find out the capacity of a 'core', not an SMT
>> thread, and compare the capacity of different cores and consider the
>> one which has least RT/IRQ interference.
>
>
> IIUC the capacities of each strand is scaled by IRQ and 'rt_avg' for that
> 'rq'. Now if the strand is idle now and gets an interrupt in the future,
> the 'core' would look like:
>
>    +----+----+
>    | I  |    |
>    | T  |    |
>    +----+----+
>
> (I -> Interrupt, T-> Thread we are trying to schedule).
>
> whereas if the other strand on the core was taking interrupt the core
> would look like:
>
>    +----+----+
>    | I  | T  |
>    |    |    |
>    +----+----+
>
> With this case, because we know from the past avg, one of the strands is
> running low on capacity, I am trying to return a better strand for the
> thread to start on.
>

I know what you're trying to do, but the way you've retrofitted it into the
core looks weird (to me) and makes the code unreadable and ugly IMO.

Why not do something simpler, like skipping the core if any SMT thread has
been running at lesser capacity? I'm not sure if this works great or if the
maintainers will prefer your approach or my approach below, but I find the
diff below much cleaner for the select_idle_core bit. It also makes more
sense since resources are shared at the SMT level, so to me it makes sense
to skip the core altogether for this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ee7242dbe0a..f324a84e29f1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 	for_each_cpu_wrap(core, cpus, target) {
 		bool idle = true;
+		bool full_cap = true;
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			cpumask_clear_cpu(cpu, cpus);
 			if (!idle_cpu(cpu))
 				idle = false;
+			if (!full_capacity(cpu))
+				full_cap = false;
 		}
 
-		if (idle)
+		if (idle && full_cap)
 			return core;
 	}
 


thanks,

- Joel


* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-28 10:53       ` joelaf
@ 2017-09-28 15:09         ` Rohit Jain
  2017-10-03  4:52           ` Joel Fernandes
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-09-28 15:09 UTC (permalink / raw)
  To: joelaf
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Joel,

On 09/28/2017 05:53 AM, joelaf wrote:
> Hi Rohit,
>
> On Tue, Sep 26, 2017 at 12:48 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
> [...]
>
<snip>
>>>>                   }
>>>>
>>>> -               if (idle)
>>>> -                       return core;
>>>> +               if (idle) {
>>>> +                       if (rcpu == -1)
>>>> +                               return (rcpu_backup != -1 ? rcpu_backup :
>>>> core);
>>>> +                       return rcpu;
>>>> +               }
>>>
>>> This didn't make much sense to me, here you are returning either an
>>> SMT thread or a core. That doesn't make much of a difference because
>>> SMT threads share the same capacity (SD_SHARE_CPUCAPACITY). I think
>>> what you want to do is find out the capacity of a 'core', not an SMT
>>> thread, and compare the capacity of different cores and consider the
>>> one which has least RT/IRQ interference.
>>
>> IIUC the capacities of each strand is scaled by IRQ and 'rt_avg' for that
>> 'rq'. Now if the strand is idle now and gets an interrupt in the future,
>> the 'core' would look like:
>>
>>     +----+----+
>>     | I  |    |
>>     | T  |    |
>>     +----+----+
>>
>> (I -> Interrupt, T-> Thread we are trying to schedule).
>>
>> whereas if the other strand on the core was taking interrupt the core
>> would look like:
>>
>>     +----+----+
>>     | I  | T  |
>>     |    |    |
>>     +----+----+
>>
>> With this case, because we know from the past avg, one of the strands is
>> running low on capacity, I am trying to return a better strand for the
>> thread to start on.
>>
> I know what you're trying to do but they way you've retrofitted it into the
> core looks weird (to me) and makes the code unreadable and ugly IMO.
>
> Why not do something simpler like skip the core if any SMT thread has been
> running at lesser capacity? I'm not sure if this works great or if the maintainers
> will prefer your or my below approach, but I find the below diff much cleaner
> for the select_idle_core bit. It also makes more sense since resources are
> shared at SMT level so makes sense to me to skip the core altogether for this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6ee7242dbe0a..f324a84e29f1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   
>   	for_each_cpu_wrap(core, cpus, target) {
>   		bool idle = true;
> +		bool full_cap = true;
>   
>   		for_each_cpu(cpu, cpu_smt_mask(core)) {
>   			cpumask_clear_cpu(cpu, cpus);
>   			if (!idle_cpu(cpu))
>   				idle = false;
> +			if (!full_capacity(cpu))
> +				full_cap = false;
>   		}
>   
> -		if (idle)
> +		if (idle && full_cap)
>   			return core;
>   	}
>   


Well, with your changes you will skip over fully idle cores, which is not
an ideal thing either. I see that you were advocating for selecting the
idle + lowest capacity core, whereas I was stopping at the first idle core.

Since the whole philosophy till now in this patch is "Don't spare an
idle CPU", I think the following diff might look better to you. Please
note this is only for discussion's sake; I haven't fully tested it yet.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec15e5f..c2933eb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6040,7 +6040,9 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
  {
      struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-    int core, cpu;
+    int core, cpu, rcpu, backup_core;
+
+    rcpu = backup_core = -1;

      if (!static_branch_likely(&sched_smt_present))
          return -1;
@@ -6052,15 +6054,34 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

      for_each_cpu_wrap(core, cpus, target) {
          bool idle = true;
+        bool full_cap = true;

          for_each_cpu(cpu, cpu_smt_mask(core)) {
              cpumask_clear_cpu(cpu, cpus);
              if (!idle_cpu(cpu))
                  idle = false;
+
+            if (!full_capacity(cpu)) {
+                full_cap = false;
+            }
          }

-        if (idle)
+        if (idle && full_cap)
              return core;
+        else if (idle && backup_core == -1)
+            backup_core = core;
+    }
+
+    if (backup_core != -1) {
+        for_each_cpu(cpu, cpu_smt_mask(backup_core)) {
+            if (full_capacity(cpu))
+                return cpu;
+            else if ((rcpu == -1) ||
+                 (capacity_of(cpu) > capacity_of(rcpu)))
+                rcpu = cpu;
+        }
+
+        return rcpu;
      }


Do let me know what you think.

Thanks,
Rohit

>
> thanks,
>
> - Joel
>


* Re: [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree
  2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
@ 2017-10-01  0:32   ` kbuild test robot
  0 siblings, 0 replies; 17+ messages in thread
From: kbuild test robot @ 2017-10-01  0:32 UTC (permalink / raw)
  To: Rohit Jain
  Cc: kbuild-all, linux-kernel, eas-dev, peterz, mingo, joelaf,
	atish.patra, vincent.guittot, dietmar.eggemann, morten.rasmussen


Hi Rohit,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.14-rc2 next-20170929]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rohit-Jain/sched-fair-Introduce-scaled-capacity-awareness-in-find_idlest_cpu-code-path/20170929-060222
config: blackfin-BF533-EZKIT_defconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=blackfin 

All errors (new ones prefixed by >>):

   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h: In function '__inc_zone_page_state':
   include/linux/vmstat.h:294:19: error: implicit declaration of function 'page_zone' [-Werror=implicit-function-declaration]
     __inc_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   include/linux/vmstat.h:294:19: warning: passing argument 1 of '__inc_zone_state' makes pointer from integer without a cast [-Wint-conversion]
   include/linux/vmstat.h:267:20: note: expected 'struct zone *' but argument is of type 'int'
    static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__inc_node_page_state':
   include/linux/vmstat.h:300:19: error: implicit declaration of function 'page_pgdat' [-Werror=implicit-function-declaration]
     __inc_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   include/linux/vmstat.h:300:19: warning: passing argument 1 of '__inc_node_state' makes pointer from integer without a cast [-Wint-conversion]
   include/linux/vmstat.h:273:20: note: expected 'struct pglist_data *' but argument is of type 'int'
    static inline void __inc_node_state(struct pglist_data *pgdat, enum node_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__dec_zone_page_state':
   include/linux/vmstat.h:307:19: warning: passing argument 1 of '__dec_zone_state' makes pointer from integer without a cast [-Wint-conversion]
     __dec_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   include/linux/vmstat.h:279:20: note: expected 'struct zone *' but argument is of type 'int'
    static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
                       ^~~~~~~~~~~~~~~~
   include/linux/vmstat.h: In function '__dec_node_page_state':
   include/linux/vmstat.h:313:19: warning: passing argument 1 of '__dec_node_state' makes pointer from integer without a cast [-Wint-conversion]
     __dec_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   include/linux/vmstat.h:285:20: note: expected 'struct pglist_data *' but argument is of type 'int'
    static inline void __dec_node_state(struct pglist_data *pgdat, enum node_stat_item item)
                       ^~~~~~~~~~~~~~~~
   In file included from arch/blackfin/include/asm/uaccess.h:15:0,
                    from include/linux/uaccess.h:13,
                    from include/linux/poll.h:11,
                    from include/linux/rtc.h:51,
                    from include/linux/alarmtimer.h:7,
                    from include/linux/posix-timers.h:8,
                    from kernel/time/tick-sched.c:29:
   include/linux/mm.h: At top level:
>> include/linux/mm.h:962:28: error: conflicting types for 'page_zone'
    static inline struct zone *page_zone(const struct page *page)
                               ^~~~~~~~~
   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h:294:19: note: previous implicit declaration of 'page_zone' was here
     __inc_zone_state(page_zone(page), item);
                      ^~~~~~~~~
   In file included from arch/blackfin/include/asm/uaccess.h:15:0,
                    from include/linux/uaccess.h:13,
                    from include/linux/poll.h:11,
                    from include/linux/rtc.h:51,
                    from include/linux/alarmtimer.h:7,
                    from include/linux/posix-timers.h:8,
                    from kernel/time/tick-sched.c:29:
>> include/linux/mm.h:967:26: error: conflicting types for 'page_pgdat'
    static inline pg_data_t *page_pgdat(const struct page *page)
                             ^~~~~~~~~~
   In file included from kernel/time/tick-sched.c:22:0:
   include/linux/vmstat.h:300:19: note: previous implicit declaration of 'page_pgdat' was here
     __inc_node_state(page_pgdat(page), item);
                      ^~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/page_zone +962 include/linux/mm.h

57e0a030 Mel Gorman        2012-11-12  961  
33dd4e0e Ian Campbell      2011-07-25 @962  static inline struct zone *page_zone(const struct page *page)
89689ae7 Christoph Lameter 2006-12-06  963  {
89689ae7 Christoph Lameter 2006-12-06  964  	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
89689ae7 Christoph Lameter 2006-12-06  965  }
89689ae7 Christoph Lameter 2006-12-06  966  
75ef7184 Mel Gorman        2016-07-28 @967  static inline pg_data_t *page_pgdat(const struct page *page)
75ef7184 Mel Gorman        2016-07-28  968  {
75ef7184 Mel Gorman        2016-07-28  969  	return NODE_DATA(page_to_nid(page));
75ef7184 Mel Gorman        2016-07-28  970  }
75ef7184 Mel Gorman        2016-07-28  971  

:::::: The code at line 962 was first introduced by commit
:::::: 33dd4e0ec91138c3d80e790c08a3db47426c81f2 mm: make some struct page's const

:::::: TO: Ian Campbell <ian.campbell@citrix.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>
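
The conflict the log reports appears to come from include ordering: on this config, include/linux/vmstat.h ends up using page_zone()/page_pgdat() before include/linux/mm.h declares them, so gnu89 C invents implicit 'int' declarations that later clash with the real prototypes. A minimal standalone sketch of that failure mode (hypothetical names, not kernel code):

/*
 * Minimal sketch of the failure pattern above.
 * Build with:  gcc -std=gnu89 -Wall -c repro.c
 */
struct zone { int nr_pages; };

static void dec_zone_state(struct zone *z)
{
	(void)z;
}

void dec_zone_page_state(const void *page)
{
	/*
	 * page_zone() has not been declared yet, so gnu89 C assumes an
	 * implicit 'int page_zone()', and passing its result on triggers
	 * "makes pointer from integer without a cast".
	 */
	dec_zone_state(page_zone(page));
}

/*
 * The real prototype, seen later in the translation unit, then clashes
 * with the implicit declaration:
 * "error: conflicting types for 'page_zone'".
 */
struct zone *page_zone(const void *page)
{
	(void)page;
	return 0;
}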

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11120 bytes --]

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-09-28 15:09         ` Rohit Jain
@ 2017-10-03  4:52           ` Joel Fernandes
  2017-10-04  0:21             ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Joel Fernandes @ 2017-10-03  4:52 UTC (permalink / raw)
  To: Rohit Jain
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Rohit,

On Thu, Sep 28, 2017 at 8:09 AM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
[..]
>>>
>>> In this case, because we know from the past avg that one of the strands
>>> is running low on capacity, I am trying to return a better strand for
>>> the thread to start on.
>>>
>> I know what you're trying to do, but the way you've retrofitted it into
>> the core looks weird (to me) and makes the code unreadable and ugly IMO.
>>
>> Why not do something simpler like skipping the core if any SMT thread has
>> been running at lesser capacity? I'm not sure if this works great or if
>> the maintainers will prefer your approach or mine, but I find the below
>> diff much cleaner for the select_idle_core bit. It also makes more sense
>> since resources are shared at the SMT level, so skipping the core
>> altogether seems right to me here:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6ee7242dbe0a..f324a84e29f1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5738,14 +5738,17 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>         for_each_cpu_wrap(core, cpus, target) {
>>                 bool idle = true;
>> +               bool full_cap = true;
>>                 for_each_cpu(cpu, cpu_smt_mask(core)) {
>>                         cpumask_clear_cpu(cpu, cpus);
>>                         if (!idle_cpu(cpu))
>>                                 idle = false;
>> +                       if (!full_capacity(cpu))
>> +                               full_cap = false;
>>                 }
>>
>> -               if (idle)
>> +               if (idle && full_cap)
>>                         return core;
>>         }
>>
>
>
>
> Well, with your changes you will skip over fully idle cores, which is not
> ideal either. I see that you were advocating for select idle + lowest
> capacity core, whereas I was stopping at the first idle core.
>
> Since the whole philosophy of this patch so far has been "Don't spare an
> idle CPU", I think the following diff might look better to you. Please
> note this is only for discussion's sake; I haven't fully tested it yet.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ec15e5f..c2933eb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6040,7 +6040,9 @@ void __update_idle_core(struct rq *rq)
>  static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>      struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -    int core, cpu;
> +    int core, cpu, rcpu, backup_core;
> +
> +    rcpu = backup_core = -1;
>
>      if (!static_branch_likely(&sched_smt_present))
>          return -1;
> @@ -6052,15 +6054,34 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>
>      for_each_cpu_wrap(core, cpus, target) {
>          bool idle = true;
> +        bool full_cap = true;
>
>          for_each_cpu(cpu, cpu_smt_mask(core)) {
>              cpumask_clear_cpu(cpu, cpus);
>              if (!idle_cpu(cpu))
>                  idle = false;
> +
> +            if (!full_capacity(cpu)) {
> +                full_cap = false;
> +            }
>          }
>
> -        if (idle)
> +        if (idle && full_cap)
>              return core;
> +        else if (idle && backup_core == -1)
> +            backup_core = core;
> +    }
> +
> +    if (backup_core != -1) {
> +        for_each_cpu(cpu, cpu_smt_mask(backup_core)) {
> +            if (full_capacity(cpu))
> +                return cpu;
> +            else if ((rcpu == -1) ||
> +                 (capacity_of(cpu) > capacity_of(rcpu)))
> +                rcpu = cpu;
> +        }
> +
> +        return rcpu;
>      }
>
>
> Do let me know what you think.

I think that if your tests show no benefit from the above over the
simpler approach, then I prefer the simpler approach, especially since
there's no point in complicating the code for select_idle_core.

thanks,

- Joel
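
For reference, both diffs above test full_capacity(), which is not quoted anywhere in this sub-thread; it is introduced earlier in the series. Going by the cover letter's "below 80% capacity" rule, a minimal sketch of the helper would look something like the following (the exact margin and the use of capacity_orig_of() are assumptions, not code quoted from the patches):

/*
 * Sketch only: full_capacity() as implied by the cover letter's ~80% rule.
 * capacity_of() is the rt_avg-scaled capacity of the CPU,
 * capacity_orig_of() its original capacity; 819/1024 is roughly 80%.
 */
static inline bool full_capacity(int cpu)
{
	return capacity_of(cpu) >= ((capacity_orig_of(cpu) * 819) >> 10);
}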

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-03  4:52           ` Joel Fernandes
@ 2017-10-04  0:21             ` Rohit Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Rohit Jain @ 2017-10-04  0:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: LKML, eas-dev, Peter Zijlstra, Ingo Molnar, Atish Patra,
	Vincent Guittot, Dietmar Eggemann, Morten Rasmussen

Hi Joel,


On 10/02/2017 09:52 PM, Joel Fernandes wrote:
> Hi Rohit,
>
> [..]
> I think that if your tests show no benefit from the above over the
> simpler approach, then I prefer the simpler approach, especially since
> there's no point in complicating the code for select_idle_core.

Fair enough!

If there are no more concerns with this version, I will go ahead,
incorporate everything discussed here, and send an updated version.
Please let me know if there is any other feedback.

Thanks,
Rohit

>
> thanks,
>
> - Joel

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-10 15:54   ` Atish Patra
@ 2017-10-10 18:02     ` Rohit Jain
  0 siblings, 0 replies; 17+ messages in thread
From: Rohit Jain @ 2017-10-10 18:02 UTC (permalink / raw)
  To: Atish Patra, linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, vincent.guittot, dietmar.eggemann,
	morten.rasmussen

Hi Atish,

Thanks for the comments.

On 10/10/2017 08:54 AM, Atish Patra wrote:
> <snip>
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>   kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>>   1 file changed, 28 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eaede50..5b1f7b9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>             for_each_cpu(cpu, cpu_smt_mask(core)) {
>>               cpumask_clear_cpu(cpu, cpus);
>> -            if (!idle_cpu(cpu))
>> +            if (!idle_cpu(cpu) || !full_capacity(cpu))
> Do we need to skip the entire core just because the first CPU in the core
> doesn't have full capacity?
> Let's say that is the only idle core available. The search will then fall
> through to select_idle_cpu() to find the idlest CPU.
> Is it worth spending extra time searching for an idle CPU with full
> capacity when idle cores are available?

This has been previously discussed:
https://lkml.org/lkml/2017/10/3/1001

Returning the best CPU within the idle core did not result in a
statistically significant performance benefit, hence I went with Joel's
suggestion to keep the code simple.

Thanks,
Rohit

<snip>
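
For context on the fallback Atish mentions: in this kernel generation, select_idle_sibling() tries the idle-core scan first and, only if that fails, falls back to the idle-CPU and idle-SMT scans, so a core skipped by select_idle_core() can still be picked up later. Paraphrased from memory (not part of the patch), the tail of select_idle_sibling() is roughly:

	/* roughly the existing fallback order in select_idle_sibling() */
	i = select_idle_core(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	i = select_idle_smt(p, sd, target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;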

* Re: [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-07 23:48 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path Rohit Jain
@ 2017-10-10 15:54   ` Atish Patra
  2017-10-10 18:02     ` Rohit Jain
  0 siblings, 1 reply; 17+ messages in thread
From: Atish Patra @ 2017-10-10 15:54 UTC (permalink / raw)
  To: Rohit Jain, linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, vincent.guittot, dietmar.eggemann,
	morten.rasmussen


Minor nit: the version number is missing from the subject line.

On 10/07/2017 06:48 PM, Rohit Jain wrote:
> While looking for CPUs to place running tasks on, the scheduler
> completely ignores the capacity stolen away by RT/IRQ tasks. This patch
> changes that behavior to also take the scaled capacity into account.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>   kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>   1 file changed, 28 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eaede50..5b1f7b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>   
>   		for_each_cpu(cpu, cpu_smt_mask(core)) {
>   			cpumask_clear_cpu(cpu, cpus);
> -			if (!idle_cpu(cpu))
> +			if (!idle_cpu(cpu) || !full_capacity(cpu))
Do we need to skip the entire core just because the first CPU in the core
doesn't have full capacity?
Let's say that is the only idle core available. The search will then fall
through to select_idle_cpu() to find the idlest CPU.
Is it worth spending extra time searching for an idle CPU with full
capacity when idle cores are available?

Regards,
Atish
>   				idle = false;
>   		}
>   
> @@ -6025,7 +6025,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>    */
>   static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
>   {
> -	int cpu;
> +	int cpu, backup_cpu = -1;
> +	unsigned int backup_cap = 0;
>   
>   	if (!static_branch_likely(&sched_smt_present))
>   		return -1;
> @@ -6033,11 +6034,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>   	for_each_cpu(cpu, cpu_smt_mask(target)) {
>   		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>   			continue;
> -		if (idle_cpu(cpu))
> -			return cpu;
> +		if (idle_cpu(cpu)) {
> +			if (full_capacity(cpu))
> +				return cpu;
> +			if (capacity_of(cpu) > backup_cap) {
> +				backup_cap = capacity_of(cpu);
> +				backup_cpu = cpu;
> +			}
> +		}
>   	}
>   
> -	return -1;
> +	return backup_cpu;
>   }
>   
>   #else /* CONFIG_SCHED_SMT */
> @@ -6066,6 +6073,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>   	u64 time, cost;
>   	s64 delta;
>   	int cpu, nr = INT_MAX;
> +	int backup_cpu = -1;
> +	unsigned int backup_cap = 0;
>   
>   	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
>   	if (!this_sd)
> @@ -6096,10 +6105,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>   			return -1;
>   		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>   			continue;
> -		if (idle_cpu(cpu))
> -			break;
> +		if (idle_cpu(cpu)) {
> +			if (full_capacity(cpu)) {
> +				backup_cpu = -1;
> +				break;
> +			} else if (capacity_of(cpu) > backup_cap) {
> +				backup_cap = capacity_of(cpu);
> +				backup_cpu = cpu;
> +			}
> +		}
>   	}
>   
> +	if (backup_cpu >= 0)
> +		cpu = backup_cpu;
>   	time = local_clock() - time;
>   	cost = this_sd->avg_scan_cost;
>   	delta = (s64)(time - cost) / 8;
> @@ -6116,13 +6134,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   	struct sched_domain *sd;
>   	int i;
>   
> -	if (idle_cpu(target))
> +	if (idle_cpu(target) && full_capacity(target))
>   		return target;
>   
>   	/*
>   	 * If the previous cpu is cache affine and idle, don't be stupid.
>   	 */
> -	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
> +	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
> +	    && full_capacity(prev))
>   		return prev;
>   
>   	sd = rcu_dereference(per_cpu(sd_llc, target));

* [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  2017-10-07 23:48 [PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
@ 2017-10-07 23:48 ` Rohit Jain
  2017-10-10 15:54   ` Atish Patra
  0 siblings, 1 reply; 17+ messages in thread
From: Rohit Jain @ 2017-10-07 23:48 UTC (permalink / raw)
  To: linux-kernel, eas-dev
  Cc: peterz, mingo, joelaf, atish.patra, vincent.guittot,
	dietmar.eggemann, morten.rasmussen

While looking for CPUs to place running tasks on, the scheduler
completely ignores the capacity stolen away by RT/IRQ tasks. This patch
changes that behavior to also take the scaled capacity into account.

Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaede50..5b1f7b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
 
 		for_each_cpu(cpu, cpu_smt_mask(core)) {
 			cpumask_clear_cpu(cpu, cpus);
-			if (!idle_cpu(cpu))
+			if (!idle_cpu(cpu) || !full_capacity(cpu))
 				idle = false;
 		}
 
@@ -6025,7 +6025,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
@@ -6033,11 +6034,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu(cpu, cpu_smt_mask(target)) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
-	return -1;
+	return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -6066,6 +6073,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
 
 	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
 	if (!this_sd)
@@ -6096,10 +6105,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 			return -1;
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
 	}
 
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
 	time = local_clock() - time;
 	cost = this_sd->avg_scan_cost;
 	delta = (s64)(time - cost) / 8;
@@ -6116,13 +6134,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	struct sched_domain *sd;
 	int i;
 
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
 		return target;
 
 	/*
 	 * If the previous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
 		return prev;
 
 	sd = rcu_dereference(per_cpu(sd_llc, target));
-- 
2.7.4
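
To make the policy of the select_idle_cpu() hunk above concrete, here is a small user-space sketch with made-up capacities; full_capacity(), the 819/1024 margin, and all of the data are illustrative assumptions rather than code from the patch:

/*
 * Standalone illustration of the selection policy: return the first idle
 * CPU that still has full capacity, otherwise fall back to the idle CPU
 * with the most remaining capacity.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Pretend CPUs 1 and 2 are idle but pressured by IRQ/RT activity. */
static const unsigned int capacity[NR_CPUS]      = { 1024, 600, 700, 1024 };
static const unsigned int capacity_orig[NR_CPUS] = { 1024, 1024, 1024, 1024 };
static const bool cpu_idle[NR_CPUS]              = { false, true, true, false };

static bool full_capacity(int cpu)
{
	/* ~80% margin, as described in the cover letter (819/1024) */
	return capacity[cpu] >= ((capacity_orig[cpu] * 819) >> 10);
}

static int pick_idle_cpu(void)
{
	int cpu, backup_cpu = -1;
	unsigned int backup_cap = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_idle[cpu])
			continue;
		if (full_capacity(cpu))
			return cpu;		/* idle and unpressured: done */
		if (capacity[cpu] > backup_cap) {
			backup_cap = capacity[cpu];
			backup_cpu = cpu;	/* best pressured idle CPU so far */
		}
	}
	return backup_cpu;			/* -1 if nothing was idle */
}

int main(void)
{
	/*
	 * Neither idle CPU clears the 80% bar (819), so the better backup
	 * (CPU 2, capacity 700) is picked.
	 */
	printf("picked CPU %d\n", pick_idle_cpu());
	return 0;
}

This mirrors the patch's "Don't spare an idle CPU" philosophy: an idle CPU that is pressured by IRQ/RT activity is still preferred over no idle CPU at all.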

Thread overview: 17+ messages
2017-09-26  0:02 [PATCH v4 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
2017-09-26  0:02 ` [PATCH 1/3] sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path Rohit Jain
2017-09-26  2:51   ` joelaf
2017-09-26  4:40     ` Rohit Jain
2017-09-26  6:59       ` Joel Fernandes
2017-09-26  0:02 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling " Rohit Jain
2017-09-26  6:53   ` Joel Fernandes
2017-09-26 19:48     ` Rohit Jain
2017-09-28 10:53       ` joelaf
2017-09-28 15:09         ` Rohit Jain
2017-10-03  4:52           ` Joel Fernandes
2017-10-04  0:21             ` Rohit Jain
2017-09-26  0:02 ` [PATCH 3/3] ignore_this_patch: Fixing compilation error on Peter's tree Rohit Jain
2017-10-01  0:32   ` kbuild test robot
2017-10-07 23:48 [PATCH v5 0/3] sched/fair: Introduce scaled capacity awareness in enqueue Rohit Jain
2017-10-07 23:48 ` [PATCH 2/3] sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path Rohit Jain
2017-10-10 15:54   ` Atish Patra
2017-10-10 18:02     ` Rohit Jain
