* [PATCH 0/2 v3] sched: improve spread of tasks during fork
@ 2016-12-08 16:56 Vincent Guittot
  2016-12-08 16:56 ` [PATCH 1/2 v3] sched: fix find_idlest_group for fork Vincent Guittot
  2016-12-08 16:56 ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Vincent Guittot
  0 siblings, 2 replies; 9+ messages in thread
From: Vincent Guittot @ 2016-12-08 16:56 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, matt, Morten.Rasmussen
  Cc: dietmar.eggemann, kernellwp, yuyang.du, umgwanakikbuti, Vincent Guittot

A performance regression has been reported by Matt Fleming for fork-intensive
benchmarks like hackbench [1].
Patch 1/2 skips the spare_capacity test for a forked task because its
utilization has not been initialized yet.
Patch 2/2 takes load_avg into account when selecting a CPU, for the case
where the runnable_load of the candidate CPUs is close.

Tests done by Matt Fleming show performance improvements with the patchset: [2] [3]

Changes since v2:
- renamed no_spare label to skip_spare
- changed the load_avg test condition to prefer the local cpu when loads
  are similar
- added an explanation for using an absolute margin instead of a scale
  factor when comparing runnable_load

[1] https://lkml.org/lkml/2016/10/18/206
[2] https://lkml.org/lkml/2016/12/8/260
[3] https://lkml.org/lkml/2016/12/8/260

Vincent Guittot (2):
  sched: fix find_idlest_group for fork
  sched: use load_avg for selecting idlest group

 kernel/sched/fair.c | 54 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 44 insertions(+), 10 deletions(-)

-- 
2.7.4


* [PATCH 1/2 v3] sched: fix find_idlest_group for fork
  2016-12-08 16:56 [PATCH 0/2 v3] sched: improve spread of tasks during fork Vincent Guittot
@ 2016-12-08 16:56 ` Vincent Guittot
  2016-12-09 13:18   ` Matt Fleming
  2016-12-12  6:50   ` [tip:sched/core] sched/core: Fix find_idlest_group() " tip-bot for Vincent Guittot
  2016-12-08 16:56 ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Vincent Guittot
  1 sibling, 2 replies; 9+ messages in thread
From: Vincent Guittot @ 2016-12-08 16:56 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, matt, Morten.Rasmussen
  Cc: dietmar.eggemann, kernellwp, yuyang.du, umgwanakikbuti, Vincent Guittot

During fork, the utilization of a task is initialized once the rq has been
selected, because the current utilization level of the rq is used to set
the utilization of the forked task. As the task's utilization is still
null at this step of the fork sequence, it doesn't make sense to look for
some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
because the least loaded policy is always bypassed and tasks are not
spread during fork.
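
For reference, a minimal sketch of the fork-time ordering (abridged from
wake_up_new_task() on the SMP path; locking and the surrounding code are
omitted) shows why the task's utilization is still null when
find_idlest_group() runs:

	/* sketch only, not part of this patch */
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
	/* find_idlest_group() has already run here, with p's util_avg == 0 */

	rq = __task_rq_lock(p, &rf);
	post_init_entity_util_avg(&p->se);
	/* only now is p's utilization initialized from the selected rq */

	activate_task(rq, p, 0);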

With this patch and the fix below, we are back to the same performance as
v4.8. The fix below is only a temporary one, used for the test until a
smarter solution is found, because we can't simply remove the test, which
is useful for other benchmarks:

@@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t

 	avg_cost = this_sd->avg_scan_cost;

-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	if ((avg_idle / 512) < avg_cost)
-		return -1;
-
 	time = local_clock();

 	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92cb50d..1da846b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,13 +5473,19 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	 * utilized systems if we require spare_capacity > task_util(p),
 	 * so we allow for some task stuffing by using
 	 * spare_capacity > task_util(p)/2.
+	 * spare capacity can't be used for fork because the utilization has
+	 * not been set yet as it needs a rq to init the utilization
 	 */
+	if (sd_flag & SD_BALANCE_FORK)
+		goto skip_spare;
+
 	if (this_spare > task_util(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
+skip_spare:
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
-- 
2.7.4


* [PATCH 2/2 v3] sched: use load_avg for selecting idlest group
  2016-12-08 16:56 [PATCH 0/2 v3] sched: improve spread of tasks during fork Vincent Guittot
  2016-12-08 16:56 ` [PATCH 1/2 v3] sched: fix find_idlest_group for fork Vincent Guittot
@ 2016-12-08 16:56 ` Vincent Guittot
  2016-12-09 13:22   ` Matt Fleming
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Vincent Guittot @ 2016-12-08 16:56 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, matt, Morten.Rasmussen
  Cc: dietmar.eggemann, kernellwp, yuyang.du, umgwanakikbuti, Vincent Guittot

find_idlest_group() only compares the runnable_load_avg when looking for
the least loaded group. But on fork-intensive use cases like hackbench,
where tasks block quickly after the fork, this can lead to selecting the
same CPU instead of other CPUs which have a similar runnable load but a
lower load_avg.

When the runnable_load_avg values of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There are
now 3 zones for the runnable_load of the rq:
-[0 .. (runnable_load - imbalance)] : Select the new rq which has
significantly less runnable_load
-](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
runnable loads are close so we use load_avg to choose between the 2 rqs
-[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
has significantly less runnable_load
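
The per-group decision can be summarized by the standalone sketch below
(illustrative only; it reuses the names introduced by the patch and is not
part of the patch itself):

	/* sketch: should this group replace the current idlest candidate? */
	static bool prefer_new_group(unsigned long runnable_load,
				     unsigned long avg_load,
				     unsigned long min_runnable_load,
				     unsigned long min_avg_load,
				     unsigned long imbalance,
				     int imbalance_scale)
	{
		/* zone 1: significantly less runnable_load -> take it */
		if (min_runnable_load > (runnable_load + imbalance))
			return true;

		/*
		 * zone 2: runnable loads are close -> break the tie with
		 * avg_load, which also includes the blocked load
		 */
		if ((runnable_load < (min_runnable_load + imbalance)) &&
		    (100*min_avg_load > imbalance_scale*avg_load))
			return true;

		/* zone 3: significantly more runnable_load -> keep the current one */
		return false;
	}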

The scale factor that is currently used for comparing runnable_load
doesn't work well with small values. As an example, the use of a scaling
factor fails as soon as this_runnable_load == 0, because we then always
select the local rq even if min_runnable_load is only 1, which doesn't
really make sense because they are practically the same. So instead of a
scaling factor, we use an absolute margin for runnable_load to detect
CPUs with similar runnable_load, and we keep using the scaling factor
for blocked load.
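
As a rough worked example (assuming a domain with imbalance_pct == 125 and
scale_load_down(NICE_0_LOAD) == 1024; both can differ per level and config):
the absolute margin is imbalance = 1024 * (125 - 100) / 100 = 256, so a group
whose runnable_load is within 256 of the current minimum is treated as close
and falls through to the avg_load comparison, while imbalance_scale =
100 + (125 - 100)/2 = 112 keeps a ~12% relative margin for that avg_load
(blocked + runnable load) comparison.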

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for several
kernels. The results below give the min, max, avg and stdev values of 18
runs with each configuration.

The v4.8+patches configuration also includes the change below, which is
part of the proposal made by Peter to ensure that the clock will be up to
date when the forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 	post_init_entity_util_avg(&p->se);

 	activate_task(rq, p, 0);

hackbench -P -g 1

       ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
min    0.049         0.050         0.051       0.048
avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
max    0.066         0.068         0.070       0.063
stdev  +/-9%         +/-9%         +/-8%       +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1da846b..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5405,16 +5405,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load, spare_cap, max_spare_cap;
+		unsigned long load, avg_load, runnable_load;
+		unsigned long spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5431,6 +5435,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		 * the group containing the CPU with most spare capacity.
 		 */
 		avg_load = 0;
+		runnable_load = 0;
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
@@ -5440,7 +5445,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 
 			spare_cap = capacity_spare_wake(i, p);
 
@@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
 			this_spare = max_spare_cap;
 		} else {
-			if (avg_load < min_load) {
-				min_load = avg_load;
+			if (min_runnable_load > (runnable_load + imbalance)) {
+				/*
+				 * The runnable load is significantly smaller
+				 *  so we can pick this new cpu
+				 */
+				min_runnable_load = runnable_load;
+				min_avg_load = avg_load;
+				idlest = group;
+			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+					(100*min_avg_load > imbalance_scale*avg_load)) {
+				/*
+				 * The runnable loads are close so we take
+				 * into account blocked load through avg_load
+				 *  which is blocked + runnable load
+				 */
+				min_avg_load = avg_load;
 				idlest = group;
 			}
 
@@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		goto skip_spare;
 
 	if (this_spare > task_util(p) / 2 &&
-	    imbalance*this_spare > 100*most_spare)
+	    imbalance_scale*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
 skip_spare:
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest ||
+	    (min_runnable_load > (this_runnable_load + imbalance)) ||
+	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
+			(100*this_avg_load < imbalance_scale*min_avg_load)))
 		return NULL;
 	return idlest;
 }
-- 
2.7.4


* Re: [PATCH 1/2 v3] sched: fix find_idlest_group for fork
  2016-12-08 16:56 ` [PATCH 1/2 v3] sched: fix find_idlest_group for fork Vincent Guittot
@ 2016-12-09 13:18   ` Matt Fleming
  2016-12-12  6:50   ` [tip:sched/core] sched/core: Fix find_idlest_group() " tip-bot for Vincent Guittot
  1 sibling, 0 replies; 9+ messages in thread
From: Matt Fleming @ 2016-12-09 13:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, Morten.Rasmussen, dietmar.eggemann,
	kernellwp, yuyang.du, umgwanakikbuti

On Thu, 08 Dec, at 05:56:53PM, Vincent Guittot wrote:
> During fork, the utilization of a task is initialized once the rq has been
> selected, because the current utilization level of the rq is used to set
> the utilization of the forked task. As the task's utilization is still
> null at this step of the fork sequence, it doesn't make sense to look for
> some spare capacity that can fit the task's utilization.
> Furthermore, I can see perf regressions for the test "hackbench -P -g 1"
> because the least loaded policy is always bypassed and tasks are not
> spread during fork.
> 
> With this patch and the fix below, we are back to the same performance as
> v4.8. The fix below is only a temporary one, used for the test until a
> smarter solution is found, because we can't simply remove the test, which
> is useful for other benchmarks:
> 
> @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> 
>  	avg_cost = this_sd->avg_scan_cost;
> 
> -	/*
> -	 * Due to large variance we need a large fuzz factor; hackbench in
> -	 * particularly is sensitive here.
> -	 */
> -	if ((avg_idle / 512) < avg_cost)
> -		return -1;
> -
>  	time = local_clock();
> 
>  	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>


* Re: [PATCH 2/2 v3] sched: use load_avg for selecting idlest group
  2016-12-08 16:56 ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Vincent Guittot
@ 2016-12-09 13:22   ` Matt Fleming
  2016-12-09 15:22   ` Peter Zijlstra
  2016-12-12  6:51   ` [tip:sched/core] sched/core: Use " tip-bot for Vincent Guittot
  2 siblings, 0 replies; 9+ messages in thread
From: Matt Fleming @ 2016-12-09 13:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, Morten.Rasmussen, dietmar.eggemann,
	kernellwp, yuyang.du, umgwanakikbuti

On Thu, 08 Dec, at 05:56:54PM, Vincent Guittot wrote:
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on fork-intensive use cases like hackbench,
> where tasks block quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs which have a similar runnable load but a
> lower load_avg.
> 
> When the runnable_load_avg values of 2 CPUs are close, we now take into
> account the amount of blocked load as a 2nd selection factor. There are
> now 3 zones for the runnable_load of the rq:
> -[0 .. (runnable_load - imbalance)] : Select the new rq which has
> significantly less runnable_load
> -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
> runnable loads are close so we use load_avg to choose between the 2 rqs
> -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
> has significantly less runnable_load
> 
> The scale factor that is currently used for comparing runnable_load
> doesn't work well with small values. As an example, the use of a scaling
> factor fails as soon as this_runnable_load == 0, because we then always
> select the local rq even if min_runnable_load is only 1, which doesn't
> really make sense because they are practically the same. So instead of a
> scaling factor, we use an absolute margin for runnable_load to detect
> CPUs with similar runnable_load, and we keep using the scaling factor
> for blocked load.
> 
> For use cases like hackbench, this enables the scheduler to select
> different CPUs during the fork sequence and to spread tasks across the
> system.
> 
> Tests have been done on a Hikey board (ARM based octo cores) for several
> kernels. The results below give the min, max, avg and stdev values of 18
> runs with each configuration.
> 
> The v4.8+patches configuration also includes the change below, which is
> part of the proposal made by Peter to ensure that the clock will be up to
> date when the forked task is attached to the rq.
> 
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>  	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>  #endif
>  	rq = __task_rq_lock(p, &rf);
> +	update_rq_clock(rq);
>  	post_init_entity_util_avg(&p->se);
> 
>  	activate_task(rq, p, 0);
> 
> hackbench -P -g 1
> 
>        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
> min    0.049         0.050         0.051       0.048
> avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
> max    0.066         0.068         0.070       0.063
> stdev  +/-9%         +/-9%         +/-8%       +/-9%
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 38 insertions(+), 10 deletions(-)

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>

Peter, Ingo, when you pick this up would you also consider adding the
following tag which links to an email describing the problem this
patch solves and the performance test results when it's applied?

Link: https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk


* Re: [PATCH 2/2 v3] sched: use load_avg for selecting idlest group
  2016-12-08 16:56 ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Vincent Guittot
  2016-12-09 13:22   ` Matt Fleming
@ 2016-12-09 15:22   ` Peter Zijlstra
  2016-12-09 16:28     ` Vincent Guittot
  2016-12-12  6:51   ` [tip:sched/core] sched/core: Use " tip-bot for Vincent Guittot
  2 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2016-12-09 15:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, matt, Morten.Rasmussen, dietmar.eggemann,
	kernellwp, yuyang.du, umgwanakikbuti

On Thu, Dec 08, 2016 at 05:56:54PM +0100, Vincent Guittot wrote:
> @@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  		}
>  
>  		/* Adjust by relative CPU capacity of the group */
> -		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
> +		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
> +					group->sgc->capacity;
> +		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
> +					group->sgc->capacity;
>  
>  		if (local_group) {
> -			this_load = avg_load;
> +			this_runnable_load = runnable_load;
> +			this_avg_load = avg_load;
>  			this_spare = max_spare_cap;
>  		} else {
> -			if (avg_load < min_load) {
> -				min_load = avg_load;
> +			if (min_runnable_load > (runnable_load + imbalance)) {
> +				/*
> +				 * The runnable load is significantly smaller
> +				 *  so we can pick this new cpu
> +				 */
> +				min_runnable_load = runnable_load;
> +				min_avg_load = avg_load;
> +				idlest = group;
> +			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
> +					(100*min_avg_load > imbalance_scale*avg_load)) {
> +				/*
> +				 * The runnable loads are close so we take
> +				 * into account blocked load through avg_load
> +				 *  which is blocked + runnable load
> +				 */
> +				min_avg_load = avg_load;
>  				idlest = group;
>  			}
>  
> @@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  		goto skip_spare;
>  
>  	if (this_spare > task_util(p) / 2 &&
> -	    imbalance*this_spare > 100*most_spare)
> +	    imbalance_scale*this_spare > 100*most_spare)
>  		return NULL;
>  	else if (most_spare > task_util(p) / 2)
>  		return most_spare_sg;
>  
>  skip_spare:
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest ||
> +	    (min_runnable_load > (this_runnable_load + imbalance)) ||
> +	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
> +			(100*this_avg_load < imbalance_scale*min_avg_load)))
>  		return NULL;
>  	return idlest;
>  }

I did the below on top for readability.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5469,17 +5469,16 @@ find_idlest_group(struct sched_domain *s
 			if (min_runnable_load > (runnable_load + imbalance)) {
 				/*
 				 * The runnable load is significantly smaller
-				 *  so we can pick this new cpu
+				 * so we can pick this new cpu
 				 */
 				min_runnable_load = runnable_load;
 				min_avg_load = avg_load;
 				idlest = group;
 			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
-					(100*min_avg_load > imbalance_scale*avg_load)) {
+				   (100*min_avg_load > imbalance_scale*avg_load)) {
 				/*
-				 * The runnable loads are close so we take
-				 * into account blocked load through avg_load
-				 *  which is blocked + runnable load
+				 * The runnable loads are close so take the
+				 * blocked load into account through avg_load.
 				 */
 				min_avg_load = avg_load;
 				idlest = group;
@@ -5509,15 +5508,21 @@ find_idlest_group(struct sched_domain *s
 	if (this_spare > task_util(p) / 2 &&
 	    imbalance_scale*this_spare > 100*most_spare)
 		return NULL;
-	else if (most_spare > task_util(p) / 2)
+
+	if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
 skip_spare:
-	if (!idlest ||
-	    (min_runnable_load > (this_runnable_load + imbalance)) ||
-	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
-			(100*this_avg_load < imbalance_scale*min_avg_load)))
+	if (!idlest)
+		return NULL;
+
+	if (min_runnable_load > (this_runnable_load + imbalance))
 		return NULL;
+
+	if ((this_runnable_load < (min_runnable_load + imbalance)) &&
+	     (100*this_avg_load < imbalance_scale*min_avg_load))
+		return NULL;
+
 	return idlest;
 }
 


* Re: [PATCH 2/2 v3] sched: use load_avg for selecting idlest group
  2016-12-09 15:22   ` Peter Zijlstra
@ 2016-12-09 16:28     ` Vincent Guittot
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2016-12-09 16:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Matt Fleming, Morten Rasmussen,
	Dietmar Eggemann, Wanpeng Li, yuyang.du, Mike Galbraith

On 9 December 2016 at 16:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Dec 08, 2016 at 05:56:54PM +0100, Vincent Guittot wrote:
>> @@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>               }
>>
>>               /* Adjust by relative CPU capacity of the group */
>> -             avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
>> +             avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
>> +                                     group->sgc->capacity;
>> +             runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
>> +                                     group->sgc->capacity;
>>
>>               if (local_group) {
>> -                     this_load = avg_load;
>> +                     this_runnable_load = runnable_load;
>> +                     this_avg_load = avg_load;
>>                       this_spare = max_spare_cap;
>>               } else {
>> -                     if (avg_load < min_load) {
>> -                             min_load = avg_load;
>> +                     if (min_runnable_load > (runnable_load + imbalance)) {
>> +                             /*
>> +                              * The runnable load is significantly smaller
>> +                              *  so we can pick this new cpu
>> +                              */
>> +                             min_runnable_load = runnable_load;
>> +                             min_avg_load = avg_load;
>> +                             idlest = group;
>> +                     } else if ((runnable_load < (min_runnable_load + imbalance)) &&
>> +                                     (100*min_avg_load > imbalance_scale*avg_load)) {
>> +                             /*
>> +                              * The runnable loads are close so we take
>> +                              * into account blocked load through avg_load
>> +                              *  which is blocked + runnable load
>> +                              */
>> +                             min_avg_load = avg_load;
>>                               idlest = group;
>>                       }
>>
>> @@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>               goto skip_spare;
>>
>>       if (this_spare > task_util(p) / 2 &&
>> -         imbalance*this_spare > 100*most_spare)
>> +         imbalance_scale*this_spare > 100*most_spare)
>>               return NULL;
>>       else if (most_spare > task_util(p) / 2)
>>               return most_spare_sg;
>>
>>  skip_spare:
>> -     if (!idlest || 100*this_load < imbalance*min_load)
>> +     if (!idlest ||
>> +         (min_runnable_load > (this_runnable_load + imbalance)) ||
>> +         ((this_runnable_load < (min_runnable_load + imbalance)) &&
>> +                     (100*this_avg_load < imbalance_scale*min_avg_load)))
>>               return NULL;
>>       return idlest;
>>  }
>
> I did the below on top for readability.

Changes look good to me

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5469,17 +5469,16 @@ find_idlest_group(struct sched_domain *s
>                         if (min_runnable_load > (runnable_load + imbalance)) {
>                                 /*
>                                  * The runnable load is significantly smaller
> -                                *  so we can pick this new cpu
> +                                * so we can pick this new cpu
>                                  */
>                                 min_runnable_load = runnable_load;
>                                 min_avg_load = avg_load;
>                                 idlest = group;
>                         } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> -                                       (100*min_avg_load > imbalance_scale*avg_load)) {
> +                                  (100*min_avg_load > imbalance_scale*avg_load)) {
>                                 /*
> -                                * The runnable loads are close so we take
> -                                * into account blocked load through avg_load
> -                                *  which is blocked + runnable load
> +                                * The runnable loads are close so take the
> +                                * blocked load into account through avg_load.
>                                  */
>                                 min_avg_load = avg_load;
>                                 idlest = group;
> @@ -5509,15 +5508,21 @@ find_idlest_group(struct sched_domain *s
>         if (this_spare > task_util(p) / 2 &&
>             imbalance_scale*this_spare > 100*most_spare)
>                 return NULL;
> -       else if (most_spare > task_util(p) / 2)
> +
> +       if (most_spare > task_util(p) / 2)
>                 return most_spare_sg;
>
>  skip_spare:
> -       if (!idlest ||
> -           (min_runnable_load > (this_runnable_load + imbalance)) ||
> -           ((this_runnable_load < (min_runnable_load + imbalance)) &&
> -                       (100*this_avg_load < imbalance_scale*min_avg_load)))
> +       if (!idlest)
> +               return NULL;
> +
> +       if (min_runnable_load > (this_runnable_load + imbalance))
>                 return NULL;
> +
> +       if ((this_runnable_load < (min_runnable_load + imbalance)) &&
> +            (100*this_avg_load < imbalance_scale*min_avg_load))
> +               return NULL;
> +
>         return idlest;
>  }
>


* [tip:sched/core] sched/core: Fix find_idlest_group() for fork
  2016-12-08 16:56 ` [PATCH 1/2 v3] sched: fix find_idlest_group for fork Vincent Guittot
  2016-12-09 13:18   ` Matt Fleming
@ 2016-12-12  6:50   ` tip-bot for Vincent Guittot
  1 sibling, 0 replies; 9+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-12-12  6:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, morten.rasmussen, linux-kernel, matt, torvalds, hpa,
	mingo, vincent.guittot, tglx

Commit-ID:  f519a3f1c6b7a990e5aed37a8f853c6ecfdee945
Gitweb:     http://git.kernel.org/tip/f519a3f1c6b7a990e5aed37a8f853c6ecfdee945
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 8 Dec 2016 17:56:53 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 11 Dec 2016 13:10:56 +0100

sched/core: Fix find_idlest_group() for fork

During fork, the utilization of a task is initialized once the rq has
been selected, because the current utilization level of the rq is used
to set the utilization of the forked task. As the task's utilization is
still 0 at this step of the fork sequence, it doesn't make sense to
look for some spare capacity that can fit the task's utilization.
Furthermore, I can see perf regressions for the test:

   hackbench -P -g 1

because the least loaded policy is always bypassed and tasks are not
spread during fork.

With this patch and the fix below, we are back to the same performance
as v4.8. The fix below is only a temporary one, used for the test
until a smarter solution is found, because we can't simply remove the
test, which is useful for other benchmarks:

| @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
|
|	avg_cost = this_sd->avg_scan_cost;
|
| -	/*
| -	 * Due to large variance we need a large fuzz factor; hackbench in
| -	 * particularly is sensitive here.
| -	 */
| -	if ((avg_idle / 512) < avg_cost)
| -		return -1;
| -
|	time = local_clock();
|
|	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18d9e75..ebb815f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,13 +5473,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	 * utilized systems if we require spare_capacity > task_util(p),
 	 * so we allow for some task stuffing by using
 	 * spare_capacity > task_util(p)/2.
+	 *
+	 * Spare capacity can't be used for fork because the utilization has
+	 * not been set yet, we must first select a rq to compute the initial
+	 * utilization.
 	 */
+	if (sd_flag & SD_BALANCE_FORK)
+		goto skip_spare;
+
 	if (this_spare > task_util(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
 	else if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
+skip_spare:
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;


* [tip:sched/core] sched/core: Use load_avg for selecting idlest group
  2016-12-08 16:56 ` [PATCH 2/2 v3] sched: use load_avg for selecting idlest group Vincent Guittot
  2016-12-09 13:22   ` Matt Fleming
  2016-12-09 15:22   ` Peter Zijlstra
@ 2016-12-12  6:51   ` tip-bot for Vincent Guittot
  2 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Vincent Guittot @ 2016-12-12  6:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-kernel, torvalds, peterz, hpa, vincent.guittot, mingo, matt

Commit-ID:  6b94780e45c17b83e3e75f8aaca5a328db583c74
Gitweb:     http://git.kernel.org/tip/6b94780e45c17b83e3e75f8aaca5a328db583c74
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 8 Dec 2016 17:56:54 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 11 Dec 2016 13:10:57 +0100

sched/core: Use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork-intensive use cases like
hackbench, where tasks block quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs which have a similar
runnable load but a lower load_avg.

When the runnable_load_avg values of 2 CPUs are close, we now take
into account the amount of blocked load as a 2nd selection factor.
There are now 3 zones for the runnable_load of the rq:

 - [0 .. (runnable_load - imbalance)]:
	Select the new rq which has significantly less runnable_load

 - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
	The runnable loads are close so we use load_avg to choose
	between the 2 rqs

 - [(runnable_load + imbalance) .. ULONG_MAX]:
	Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load
doesn't work well with small values. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0, because we
then always select the local rq even if min_runnable_load is only 1,
which doesn't really make sense because they are practically the same.
So instead of a scaling factor, we use an absolute margin for
runnable_load to detect CPUs with similar runnable_load, and we keep
using the scaling factor for blocked load.

For use cases like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernels. The results below give the min, max, avg and stdev
values of 18 runs with each configuration.

The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
  min    0.049         0.050         0.051       0.048
  avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
  max    0.066         0.068         0.070       0.063
  stdev  +/-9%         +/-9%         +/-8%       +/-9%

More performance numbers here:

  https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebb815f..6559d19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5405,16 +5405,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load, spare_cap, max_spare_cap;
+		unsigned long load, avg_load, runnable_load;
+		unsigned long spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5431,6 +5435,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		 * the group containing the CPU with most spare capacity.
 		 */
 		avg_load = 0;
+		runnable_load = 0;
 		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
@@ -5440,7 +5445,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 
 			spare_cap = capacity_spare_wake(i, p);
 
@@ -5449,14 +5456,31 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		}
 
 		/* Adjust by relative CPU capacity of the group */
-		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
 			this_spare = max_spare_cap;
 		} else {
-			if (avg_load < min_load) {
-				min_load = avg_load;
+			if (min_runnable_load > (runnable_load + imbalance)) {
+				/*
+				 * The runnable load is significantly smaller
+				 * so we can pick this new cpu
+				 */
+				min_runnable_load = runnable_load;
+				min_avg_load = avg_load;
+				idlest = group;
+			} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+				   (100*min_avg_load > imbalance_scale*avg_load)) {
+				/*
+				 * The runnable loads are close so take the
+				 * blocked load into account through avg_load.
+				 */
+				min_avg_load = avg_load;
 				idlest = group;
 			}
 
@@ -5482,14 +5506,23 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		goto skip_spare;
 
 	if (this_spare > task_util(p) / 2 &&
-	    imbalance*this_spare > 100*most_spare)
+	    imbalance_scale*this_spare > 100*most_spare)
 		return NULL;
-	else if (most_spare > task_util(p) / 2)
+
+	if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
 skip_spare:
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest)
+		return NULL;
+
+	if (min_runnable_load > (this_runnable_load + imbalance))
 		return NULL;
+
+	if ((this_runnable_load < (min_runnable_load + imbalance)) &&
+	     (100*this_avg_load < imbalance_scale*min_avg_load))
+		return NULL;
+
 	return idlest;
 }
 

