linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4 0/2] load_balance() fixes for affinity
@ 2017-06-02 22:27 Jeffrey Hugo
  2017-06-02 22:27 ` [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
  2017-06-02 22:27 ` [PATCH V4 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo
  0 siblings, 2 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-02 22:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo

We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the CPU workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler trace to try and profile the issue and noticed that at least one
core was left idle for the duration of the test in the low score runs.

The specific test methodology used in our investigation involved using an
executable that was compiled to fork a specific number of workers. The
executable was then affined to a number of cores equal to the number of
workers forked, using the userspace taskset utility.  The expectation was
that all cores in the mask of runnable CPUs would run a process until the
work was complete. In practice, for several affinity masks, one to two
cores would not receive work until late in the test or not at all.

Our first observation was that it appeared initial cpu assignment of the
forked threads appeared suboptimal.  We observed that it was highly
probable that many or all of the threads would be initially assigned to the
same CPU. Our limited investigation into this indicated that the forking of
the threads occurs fast enough that load estimates for CPUs have not
adjusted to the work they have been assigned. The function
select_task_rq_fair() does not take number of running process into
consideration and thus can easily assign several tasks spawned close
together to the same CPU within the allowed CPUs for a workload. While this
seems suboptimal, the load_balance() algorithm run on other idle cores in
the affinity mask should resolve imbalances that result from it.

As our test case was leaving cores idle for many seconds at a time, we dug
into load_balance() code and specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessments of the ability of the
algorithm to migrate load. The group_imbalance path correctly sets the flag
to indicate the group can not be properly balanced due to affinity, but the
redo condition right after this branch incorrectly assumes that there may
be other cores with work to be pulled by considering cores outside of the
scheduling domain in question.

Consider the following circumstance where (T) represents tasks and (*)
represents the allowed CPUs:

   A       B       C
{ 0 1 } { 2 3 } { 4 5 }
    T     T       T T
    T
    *     * *     * *

In this case, core 0 in scheduling group 'A' should run load balance on its
local scheduling domain and try to pull work from core 1. When it fails to
move any work due to affinity, it will mark scheduling group 'A' as
"group_imbalance". However, right after doing so, core 0 load_balance()
evaluates the redo path with core 1 removed from consideration. This should
leave no other cores to consider and result in a jump to "out_all_pinned",
leaving the "group_imbalance" flag to be seen by future runs of the
algorithm on core 3. As the code stands currently, core 0 evaluates all
cores instead of those in the scheduling domain under consideration.
It incorrectly sees other cores in the system and attempts a redo.
With core 1 removed from the mask, the redo identifies no load imbalance
and results in a jump to "out_balanced", clearing the "group_imbalance"
flag.

We propose correcting this logic with patch 1 of this series.

Making this correction revealed an issue in the calculate_imbalance() path.
Patch 2 removes a branch that does not make sense with the current
load_balance() algorithm because it has no scenario where it benifits the
"group_imbalance" case in calculate_imbalance() and which causes problems
in systems with affined workloads and many idle or lightly loaded CPUs.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>

[V4]
-restricted active cpus mask to the domain span
-fixed dst_cpu masking logic to work for the entire local group

[V3]
-fixed botched subject lines

[V2]
-Corrected use of Signed-off-by tags
-Removed temp cpumask variable from stack

Jeffrey Hugo (2):
  sched/fair: Fix load_balance() affinity redo path
  sched/fair: Remove group imbalance from calculate_imbalance()

 kernel/sched/fair.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-02 22:27 [PATCH V4 0/2] load_balance() fixes for affinity Jeffrey Hugo
@ 2017-06-02 22:27 ` Jeffrey Hugo
  2017-06-05 17:23   ` Jeffrey Hugo
  2017-06-06 12:30   ` Peter Zijlstra
  2017-06-02 22:27 ` [PATCH V4 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo
  1 sibling, 2 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-02 22:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo

If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source cpu from consideration and
attempts to redo and balance among the new subset of cpus.

There is a bug in this code path where the algorithm considers all active
cpus in the system (minus the source that was just masked out).  This is
not valid for two reasons: some active cpus may not be in the current
scheduling domain and one of the active cpus is dst_cpu. These cpus should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid cpus and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other cpus in future
load_balance() runs as that algorithm intends.

Fix the check by removing cpus not in the current domain and the dst_cpu
from considertation, thus limiting the evaluation to valid remaining cpus
from which load might be migrated.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d711093..84255ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		 * our sched_group. We may want to revisit it if we couldn't
 		 * meet load balance goals by pulling other tasks on src_cpu.
 		 *
-		 * Also avoid computing new_dst_cpu if we have already computed
-		 * one in current iteration.
+		 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+		 * already computed one in current iteration.
 		 */
-		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
 			return 0;
 
 		/* Prevent to re-select dst_cpu via env's cpus */
@@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
 
-	/*
-	 * For NEWLY_IDLE load_balancing, we don't need to consider
-	 * other cpus in our group
-	 */
-	if (idle == CPU_NEWLY_IDLE)
-		env.dst_grpmask = NULL;
-
-	cpumask_copy(cpus, cpu_active_mask);
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
@@ -8220,7 +8213,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
 			cpumask_clear_cpu(cpu_of(busiest), cpus);
-			if (!cpumask_empty(cpus)) {
+			/*
+			 * Go back to "redo" iff the load-balance cpumask
+			 * contains other potential busiest cpus for the
+			 * current sched domain.
+			 */
+			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH V4 2/2] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-06-02 22:27 [PATCH V4 0/2] load_balance() fixes for affinity Jeffrey Hugo
  2017-06-02 22:27 ` [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
@ 2017-06-02 22:27 ` Jeffrey Hugo
  1 sibling, 0 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-02 22:27 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi, Jeffrey Hugo

The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. That is not the case today.

The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.

At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
Calculate_imbalance() would not choose to run fix_small_imbalance() when we
expect it should. In the worst case, this can result in idle cpus.

Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84255ab..3600713 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure cpu-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-02 22:27 ` [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
@ 2017-06-05 17:23   ` Jeffrey Hugo
  2017-06-05 18:33     ` Jeffrey Hugo
  2017-06-06 12:30   ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-05 17:23 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi

On 6/2/2017 4:27 PM, Jeffrey Hugo wrote:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d711093..84255ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>   		 * our sched_group. We may want to revisit it if we couldn't
>   		 * meet load balance goals by pulling other tasks on src_cpu.
>   		 *
> -		 * Also avoid computing new_dst_cpu if we have already computed
> -		 * one in current iteration.
> +		 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
> +		 * already computed one in current iteration.
>   		 */
> -		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
> +		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
>   			return 0;

Self NACK.  This breaks active_load_balance_cpu_stop().  Looks like 
env->idle == CPU_IDLE, but env->dst_grpmask is uninitialized, so it can 
be NULL, which causes a null pointer dereference a few lines later.

I'm still having a look to see what makes sense to address the issue.


-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-05 17:23   ` Jeffrey Hugo
@ 2017-06-05 18:33     ` Jeffrey Hugo
  0 siblings, 0 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-05 18:33 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi

On 6/5/2017 11:23 AM, Jeffrey Hugo wrote:
> On 6/2/2017 4:27 PM, Jeffrey Hugo wrote:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d711093..84255ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, 
>> struct lb_env *env)
>>            * our sched_group. We may want to revisit it if we couldn't
>>            * meet load balance goals by pulling other tasks on src_cpu.
>>            *
>> -         * Also avoid computing new_dst_cpu if we have already computed
>> -         * one in current iteration.
>> +         * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
>> +         * already computed one in current iteration.
>>            */
>> -        if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
>> +        if (env->idle == CPU_NEWLY_IDLE || (env->flags & 
>> LBF_DST_PINNED))
>>               return 0;
> 
> Self NACK.  This breaks active_load_balance_cpu_stop().  Looks like 
> env->idle == CPU_IDLE, but env->dst_grpmask is uninitialized, so it can 
> be NULL, which causes a null pointer dereference a few lines later.
> 
> I'm still having a look to see what makes sense to address the issue.
> 
> 

As far as I can see, there appears to be two options to resolve the issue -

1. Update active_load_balance_cpu_stop() to initialize dst_grpmask to a 
sane value

2. Undo the proposed changes in load_balance() to "ensure" dst_grpmask 
is valid, and calculate the value on demand when checking to see if the 
redo path needs to be done.

The downside to #1 is that dst_grpmask is not needed in the 
active_load_balance_cpu_stop() path, and the loop to calculate a new 
dst_cpu will be used.  Extra code is evaluated, but there appears to be 
no side effects.

The downside to #2 is that dst_grpmask is valid the majority of the time 
in load_balance(), so calculating it on demand is redundant most of the 
time, but again there appears to be no side effects.

It somewhat feels like a choice of which option is less bad.

Peter/Dietmar, any preferences?

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-02 22:27 ` [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
  2017-06-05 17:23   ` Jeffrey Hugo
@ 2017-06-06 12:30   ` Peter Zijlstra
  2017-06-06 15:51     ` Jeffrey Hugo
  2017-06-06 16:33     ` Peter Zijlstra
  1 sibling, 2 replies; 8+ messages in thread
From: Peter Zijlstra @ 2017-06-06 12:30 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar, Timur Tabi

On Fri, Jun 02, 2017 at 04:27:11PM -0600, Jeffrey Hugo wrote:
> If load_balance() fails to migrate any tasks because all tasks were
> affined, load_balance() removes the source cpu from consideration and
> attempts to redo and balance among the new subset of cpus.
> 
> There is a bug in this code path where the algorithm considers all active
> cpus in the system (minus the source that was just masked out).  This is
> not valid for two reasons: some active cpus may not be in the current
> scheduling domain and one of the active cpus is dst_cpu. These cpus should
> not be considered, as we cannot pull load from them.
> 
> Instead of failing out of load_balance(), we may end up redoing the search
> with no valid cpus and incorrectly concluding the domain is balanced.
> Additionally, if the group_imbalance flag was just set, it may also be
> incorrectly unset, thus the flag will not be seen by other cpus in future
> load_balance() runs as that algorithm intends.
> 
> Fix the check by removing cpus not in the current domain and the dst_cpu
> from considertation, thus limiting the evaluation to valid remaining cpus
> from which load might be migrated.
> 
> Co-authored-by: Austin Christ <austinwc@codeaurora.org>
> Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
> ---
>  kernel/sched/fair.c | 22 ++++++++++------------
>  1 file changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d711093..84255ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  		 * our sched_group. We may want to revisit it if we couldn't
>  		 * meet load balance goals by pulling other tasks on src_cpu.
>  		 *
> -		 * Also avoid computing new_dst_cpu if we have already computed
> -		 * one in current iteration.
> +		 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
> +		 * already computed one in current iteration.
>  		 */
> -		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
> +		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
>  			return 0;
>  
>  		/* Prevent to re-select dst_cpu via env's cpus */
> @@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
>  	};
>  
> -	/*
> -	 * For NEWLY_IDLE load_balancing, we don't need to consider
> -	 * other cpus in our group
> -	 */
> -	if (idle == CPU_NEWLY_IDLE)
> -		env.dst_grpmask = NULL;
> -
> -	cpumask_copy(cpus, cpu_active_mask);
> +	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
> @@ -8220,7 +8213,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		/* All tasks on this runqueue were pinned by CPU affinity */
>  		if (unlikely(env.flags & LBF_ALL_PINNED)) {
>  			cpumask_clear_cpu(cpu_of(busiest), cpus);
> -			if (!cpumask_empty(cpus)) {
> +			/*
> +			 * Go back to "redo" iff the load-balance cpumask
> +			 * contains other potential busiest cpus for the
> +			 * current sched domain.
> +			 */
> +			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>  				env.loop = 0;
>  				env.loop_break = sched_nr_migrate_break;
>  				goto redo;

So I was struggling with that subset condition. You want to ensure there
are CPUs outside of dst_grpmask left, otherwise balancing at this SD
level doesn't make sense anymore, right?

I think you might want to spell that out a little in that comment.
Currently the comment only explains what it does, which is something we
can read from the code. Comments should explain _why_ we do things and
its failing there.


So with that the problem is that active_load_balance_cpu_stop() calls
into can_migrate_task() with ->idle = CPU_IDLE and !dst_grpmask, which
then goes *bang*. Now active_load_balance_cpu_stop() doesn't need to
re-evaluate anything, so ideally it would just skip this entirely,
right?

So why not do #3:

---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47a0c552c77b..fd639d32fa4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8523,6 +8523,13 @@ static int active_load_balance_cpu_stop(void *data)
 			.src_cpu	= busiest_rq->cpu,
 			.src_rq		= busiest_rq,
 			.idle		= CPU_IDLE,
+			/*
+			 * can_migrate_task() doesn't need to compute new_dst_cpu
+			 * for active balancing. Since we have CPU_IDLE, but no
+			 * @dst_grpmask we need to make that test go away with lying
+			 * about DST_PINNED.
+			 */
+			.flags		= LBF_DST_PINNED,
 		};
 
 		schedstat_inc(sd->alb_count);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-06 12:30   ` Peter Zijlstra
@ 2017-06-06 15:51     ` Jeffrey Hugo
  2017-06-06 16:33     ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Jeffrey Hugo @ 2017-06-06 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar, Timur Tabi

On 6/6/2017 6:30 AM, Peter Zijlstra wrote:
> On Fri, Jun 02, 2017 at 04:27:11PM -0600, Jeffrey Hugo wrote:
>> If load_balance() fails to migrate any tasks because all tasks were
>> affined, load_balance() removes the source cpu from consideration and
>> attempts to redo and balance among the new subset of cpus.
>>
>> There is a bug in this code path where the algorithm considers all active
>> cpus in the system (minus the source that was just masked out).  This is
>> not valid for two reasons: some active cpus may not be in the current
>> scheduling domain and one of the active cpus is dst_cpu. These cpus should
>> not be considered, as we cannot pull load from them.
>>
>> Instead of failing out of load_balance(), we may end up redoing the search
>> with no valid cpus and incorrectly concluding the domain is balanced.
>> Additionally, if the group_imbalance flag was just set, it may also be
>> incorrectly unset, thus the flag will not be seen by other cpus in future
>> load_balance() runs as that algorithm intends.
>>
>> Fix the check by removing cpus not in the current domain and the dst_cpu
>> from considertation, thus limiting the evaluation to valid remaining cpus
>> from which load might be migrated.
>>
>> Co-authored-by: Austin Christ <austinwc@codeaurora.org>
>> Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
>> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
>> ---
>>   kernel/sched/fair.c | 22 ++++++++++------------
>>   1 file changed, 10 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d711093..84255ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>   		 * our sched_group. We may want to revisit it if we couldn't
>>   		 * meet load balance goals by pulling other tasks on src_cpu.
>>   		 *
>> -		 * Also avoid computing new_dst_cpu if we have already computed
>> -		 * one in current iteration.
>> +		 * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
>> +		 * already computed one in current iteration.
>>   		 */
>> -		if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
>> +		if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
>>   			return 0;
>>   
>>   		/* Prevent to re-select dst_cpu via env's cpus */
>> @@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>   		.tasks		= LIST_HEAD_INIT(env.tasks),
>>   	};
>>   
>> -	/*
>> -	 * For NEWLY_IDLE load_balancing, we don't need to consider
>> -	 * other cpus in our group
>> -	 */
>> -	if (idle == CPU_NEWLY_IDLE)
>> -		env.dst_grpmask = NULL;
>> -
>> -	cpumask_copy(cpus, cpu_active_mask);
>> +	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>>   
>>   	schedstat_inc(sd->lb_count[idle]);
>>   
>> @@ -8220,7 +8213,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>   		/* All tasks on this runqueue were pinned by CPU affinity */
>>   		if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>   			cpumask_clear_cpu(cpu_of(busiest), cpus);
>> -			if (!cpumask_empty(cpus)) {
>> +			/*
>> +			 * Go back to "redo" iff the load-balance cpumask
>> +			 * contains other potential busiest cpus for the
>> +			 * current sched domain.
>> +			 */
>> +			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>>   				env.loop = 0;
>>   				env.loop_break = sched_nr_migrate_break;
>>   				goto redo;
> 
> So I was struggling with that subset condition. You want to ensure there
> are CPUs outside of dst_grpmask left, otherwise balancing at this SD
> level doesn't make sense anymore, right?

Correct.

> 
> I think you might want to spell that out a little in that comment.
> Currently the comment only explains what it does, which is something we
> can read from the code. Comments should explain _why_ we do things and
> its failing there.

Agreed.  We will attempt to improve the comment in the next version.

> 
> 
> So with that the problem is that active_load_balance_cpu_stop() calls
> into can_migrate_task() with ->idle = CPU_IDLE and !dst_grpmask, which
> then goes *bang*. Now active_load_balance_cpu_stop() doesn't need to
> re-evaluate anything, so ideally it would just skip this entirely,
> right?

Correct.  Your #3 option below would also work based on our 
understanding, and seems cleaner.  We will validate it as a solution, 
and incorporate into the next version.

Thanks

> 
> So why not do #3:
> 
> ---
>   kernel/sched/fair.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 47a0c552c77b..fd639d32fa4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8523,6 +8523,13 @@ static int active_load_balance_cpu_stop(void *data)
>   			.src_cpu	= busiest_rq->cpu,
>   			.src_rq		= busiest_rq,
>   			.idle		= CPU_IDLE,
> +			/*
> +			 * can_migrate_task() doesn't need to compute new_dst_cpu
> +			 * for active balancing. Since we have CPU_IDLE, but no
> +			 * @dst_grpmask we need to make that test go away with lying
> +			 * about DST_PINNED.
> +			 */
> +			.flags		= LBF_DST_PINNED,
>   		};
>   
>   		schedstat_inc(sd->alb_count);
> 


-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-06-06 12:30   ` Peter Zijlstra
  2017-06-06 15:51     ` Jeffrey Hugo
@ 2017-06-06 16:33     ` Peter Zijlstra
  1 sibling, 0 replies; 8+ messages in thread
From: Peter Zijlstra @ 2017-06-06 16:33 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar, Timur Tabi

On Tue, Jun 06, 2017 at 02:30:11PM +0200, Peter Zijlstra wrote:

> So with that the problem is that active_load_balance_cpu_stop() calls
> into can_migrate_task() with ->idle = CPU_IDLE and !dst_grpmask, which
> then goes *bang*. Now active_load_balance_cpu_stop() doesn't need to
> re-evaluate anything, so ideally it would just skip this entirely,
> right?
> 
> So why not do #3:
> 
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 47a0c552c77b..fd639d32fa4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8523,6 +8523,13 @@ static int active_load_balance_cpu_stop(void *data)
>  			.src_cpu	= busiest_rq->cpu,
>  			.src_rq		= busiest_rq,
>  			.idle		= CPU_IDLE,

So I realized that setting CPU_NEW_IDLE here does the 'same' thing.
Except that would be a 'visible' change in the schedstat output (not
sure anybody really cares about that).

> +			/*
> +			 * can_migrate_task() doesn't need to compute new_dst_cpu
> +			 * for active balancing. Since we have CPU_IDLE, but no
> +			 * @dst_grpmask we need to make that test go away with lying
> +			 * about DST_PINNED.
> +			 */
> +			.flags		= LBF_DST_PINNED,

But given this is trivial and doesn't have any visible side effects,
this seems 'better'.

>  		};
>  
>  		schedstat_inc(sd->alb_count);

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-06-06 16:34 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-02 22:27 [PATCH V4 0/2] load_balance() fixes for affinity Jeffrey Hugo
2017-06-02 22:27 ` [PATCH V4 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
2017-06-05 17:23   ` Jeffrey Hugo
2017-06-05 18:33     ` Jeffrey Hugo
2017-06-06 12:30   ` Peter Zijlstra
2017-06-06 15:51     ` Jeffrey Hugo
2017-06-06 16:33     ` Peter Zijlstra
2017-06-02 22:27 ` [PATCH V4 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).