* [PATCH 0/2] sched/fair: Tweaks for select_task_rq_fair slowpath
@ 2017-08-21 15:21 Brendan Jackman
  2017-08-21 15:21 ` [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1 Brendan Jackman
  2017-08-21 15:21 ` [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group Brendan Jackman
  0 siblings, 2 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-21 15:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Andres Oportus, Dietmar Eggemann,
	Vincent Guittot, Josef Bacik, Ingo Molnar, Morten Rasmussen,
	Peter Zijlstra

This patchset optimises away an unnecessary comparison and fixes some corner
cases in the find_idlest_group path of select_task_rq_fair.

Brendan Jackman (2):
  sched/fair: Remove unnecessary comparison with -1
  sched/fair: Fix use of NULL with find_idlest_group

 kernel/sched/fair.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

-- 
2.14.1


* [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1
  2017-08-21 15:21 [PATCH 0/2] sched/fair: Tweaks for select_task_rq_fair slowpath Brendan Jackman
@ 2017-08-21 15:21 ` Brendan Jackman
  2017-08-21 17:30   ` Josef Bacik
  2017-08-21 15:21 ` [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group Brendan Jackman
  1 sibling, 1 reply; 16+ messages in thread
From: Brendan Jackman @ 2017-08-21 15:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Andres Oportus, Dietmar Eggemann,
	Vincent Guittot, Josef Bacik, Ingo Molnar, Morten Rasmussen,
	Peter Zijlstra

Since 83a0a96a5f26 (sched/fair: Leverage the idle state info when
choosing the "idlest" cpu) find_idlest_cpu no longer returns -1.
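
For reference, since that commit the function ends with, roughly:

	/* least_loaded_cpu is initialised to this_cpu, so this is never -1 */
	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;

so a valid cpu is always returned and the -1 check in the caller can
never fire.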

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e216f6..64618d768546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5934,7 +5934,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		}
 
 		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
+		if (new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
 			continue;
-- 
2.14.1


* [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 15:21 [PATCH 0/2] sched/fair: Tweaks for select_task_rq_fair slowpath Brendan Jackman
  2017-08-21 15:21 ` [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1 Brendan Jackman
@ 2017-08-21 15:21 ` Brendan Jackman
  2017-08-21 17:26   ` Josef Bacik
                     ` (2 more replies)
  1 sibling, 3 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-21 15:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Andres Oportus, Ingo Molnar, Morten Rasmussen,
	Peter Zijlstra, Dietmar Eggemann, Vincent Guittot

The current use of returning NULL from find_idlest_group is broken in
two cases:

a1) The local group is not allowed.

   In this case, we currently do not change this_runnable_load or
   this_avg_load from its initial value of 0, which means we return
   NULL regardless of the load of the other, allowed groups. This
   results in pointlessly continuing the find_idlest_group search
   within the local group and then returning prev_cpu from
   select_task_rq_fair.

a2) No CPUs in the sched_domain are allowed.

   In this case we also return NULL and again pointlessly continue
   the search.

b) smp_processor_id() is the "idlest" and != prev_cpu.

   find_idlest_group also returns NULL when the local group is
   allowed and is the idlest. The caller then continues the
   find_idlest_group search at a lower level of the current CPU's
   sched_domain hierarchy. However new_cpu is not updated. This means
   the search is pointless and we return prev_cpu from
   select_task_rq_fair.

This is fixed by:

1. Returning NULL from find_idlest_group only when _no_ groups were
   allowed in the current sched_domain. In this case, we now break
   from the while(sd) loop and immediately return prev_cpu. This
   fixes case a2).

2. Initializing this_runnable_load and this_avg_load to ULONG_MAX
   instead of 0. This means in case a1) we now return the idlest
   non-local group.

3. Explicitly updating new_cpu when find_idlest_group returns the
   local group, fixing case b).

This patch also re-words the check for whether the group in
consideration is local, under the assumption that the first group in
the sched domain is always the local one.

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 64618d768546..7cb5ed719cf9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
  * domain.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
-		  int this_cpu, int sd_flag)
+find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *local_group = sd->groups;
 	struct sched_group *most_spare_sg = NULL;
-	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
-	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
 	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
 	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
 				(sd->imbalance_pct-100) / 100;
 
+	if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
+		return NULL;
+
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
 		unsigned long load, avg_load, runnable_load;
 		unsigned long spare_cap, max_spare_cap;
-		int local_group;
+		bool group_is_local = group == local_group;
 		int i;
 
 		/* Skip over this group if it has no CPUs allowed */
@@ -5409,9 +5412,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 					&p->cpus_allowed))
 			continue;
 
-		local_group = cpumask_test_cpu(this_cpu,
-					       sched_group_span(group));
-
 		/*
 		 * Tally up the load of all CPUs in the group and find
 		 * the group containing the CPU with most spare capacity.
@@ -5422,7 +5422,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		for_each_cpu(i, sched_group_span(group)) {
 			/* Bias balancing toward cpus of our domain */
-			if (local_group)
+			if (group_is_local)
 				load = source_load(i, load_idx);
 			else
 				load = target_load(i, load_idx);
@@ -5443,7 +5443,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
 					group->sgc->capacity;
 
-		if (local_group) {
+		if (group_is_local) {
 			this_runnable_load = runnable_load;
 			this_avg_load = avg_load;
 			this_spare = max_spare_cap;
@@ -5489,21 +5489,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 	if (this_spare > task_util(p) / 2 &&
 	    imbalance_scale*this_spare > 100*most_spare)
-		return NULL;
+		return local_group;
 
 	if (most_spare > task_util(p) / 2)
 		return most_spare_sg;
 
 skip_spare:
 	if (!idlest)
-		return NULL;
+		return local_group;
 
 	if (min_runnable_load > (this_runnable_load + imbalance))
-		return NULL;
+		return local_group;
 
 	if ((this_runnable_load < (min_runnable_load + imbalance)) &&
 	     (100*this_avg_load < imbalance_scale*min_avg_load))
-		return NULL;
+		return local_group;
 
 	return idlest;
 }
@@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		group = find_idlest_group(sd, p, sd_flag);
 		if (!group) {
+			break;
+		} else if (group == sd->groups) {
+			new_cpu = cpu;
+			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
 			continue;
 		}
-- 
2.14.1


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 15:21 ` [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group Brendan Jackman
@ 2017-08-21 17:26   ` Josef Bacik
  2017-08-21 17:59     ` Brendan Jackman
  2017-08-21 21:14   ` Peter Zijlstra
  2017-08-22  7:48   ` Vincent Guittot
  2 siblings, 1 reply; 16+ messages in thread
From: Josef Bacik @ 2017-08-21 17:26 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot

On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> The current use of returning NULL from find_idlest_group is broken in
> two cases:
> 
> a1) The local group is not allowed.
> 
>    In this case, we currently do not change this_runnable_load or
>    this_avg_load from its initial value of 0, which means we return
>    NULL regardless of the load of the other, allowed groups. This
>    results in pointlessly continuing the find_idlest_group search
>    within the local group and then returning prev_cpu from
>    select_task_rq_fair.
> 
> a2) No CPUs in the sched_domain are allowed.
> 
>    In this case we also return NULL and again pointlessly continue
>    the search.
> 
> b) smp_processor_id() is the "idlest" and != prev_cpu.
> 
>    find_idlest_group also returns NULL when the local group is
>    allowed and is the idlest. The caller then continues the
>    find_idlest_group search at a lower level of the current CPU's
>    sched_domain hierarchy. However new_cpu is not updated. This means
>    the search is pointless and we return prev_cpu from
>    select_task_rq_fair.
> 
> This is fixed by:
> 
> 1. Returning NULL from find_idlest_group only when _no_ groups were
>    allowed in the current sched_domain. In this case, we now break
>    from the while(sd) loop and immediately return prev_cpu. This
>    fixes case a2).
> 
> 2. Initializing this_runnable_load and this_avg_load to ULONG_MAX
>    instead of 0. This means in case a1) we now return the idlest
>    non-local group.
> 
> 3. Explicitly updating new_cpu when find_idlest_group returns the
>    local group, fixing case b).
> 
> This patch also re-words the check for whether the group in
> consideration is local, under the assumption that the first group in
> the sched domain is always the local one.
> 
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 64618d768546..7cb5ed719cf9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>   * domain.
>   */
>  static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> -		  int this_cpu, int sd_flag)
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
>  {
>  	struct sched_group *idlest = NULL, *group = sd->groups;
> +	struct sched_group *local_group = sd->groups;
>  	struct sched_group *most_spare_sg = NULL;
> -	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> -	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> +	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
> +	unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
>  	unsigned long most_spare = 0, this_spare = 0;
>  	int load_idx = sd->forkexec_idx;
>  	int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
>  	unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
>  				(sd->imbalance_pct-100) / 100;
>  
> +	if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
> +		return NULL;
> +
>  	if (sd_flag & SD_BALANCE_WAKE)
>  		load_idx = sd->wake_idx;
>  
>  	do {
>  		unsigned long load, avg_load, runnable_load;
>  		unsigned long spare_cap, max_spare_cap;
> -		int local_group;
> +		bool group_is_local = group == local_group;
>  		int i;
>  
>  		/* Skip over this group if it has no CPUs allowed */
> @@ -5409,9 +5412,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  					&p->cpus_allowed))
>  			continue;
>  
> -		local_group = cpumask_test_cpu(this_cpu,
> -					       sched_group_span(group));
> -

This isn't right is it?  cpu isn't necessarily in the very first group of a sd
right?  Thanks,

Josef


* Re: [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1
  2017-08-21 15:21 ` [PATCH 1/2] sched/fair: Remove unnecessary comparison with -1 Brendan Jackman
@ 2017-08-21 17:30   ` Josef Bacik
  0 siblings, 0 replies; 16+ messages in thread
From: Josef Bacik @ 2017-08-21 17:30 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Dietmar Eggemann,
	Vincent Guittot, Josef Bacik, Ingo Molnar, Morten Rasmussen,
	Peter Zijlstra

On Mon, Aug 21, 2017 at 04:21:27PM +0100, Brendan Jackman wrote:
> Since 83a0a96a5f26 (sched/fair: Leverage the idle state info when
> choosing the "idlest" cpu) find_idlest_cpu no longer returns -1.
> 
> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Morten Rasmussen <morten.rasmussen@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>

Reviewed-by: Josef Bacik <jbacik@fb.com>

Thanks,

Josef


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 17:26   ` Josef Bacik
@ 2017-08-21 17:59     ` Brendan Jackman
  2017-08-21 20:22       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Brendan Jackman @ 2017-08-21 17:59 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot

Hi Josef,

Thanks for taking a look.

On Mon, Aug 21 2017 at 17:26, Josef Bacik wrote:
> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
[...]
>> -		local_group = cpumask_test_cpu(this_cpu,
>> -					       sched_group_span(group));
>> -
>
> This isn't right is it?  cpu isn't necessarily in the very first group of a sd
> right?

I think it is - I haven't grokked the sd/sg setup code in
kernel/sched/topology.c but there is a comment[1] that I interpret as
saying this. I'll take a more careful look tomorrow.

[1] http://elixir.free-electrons.com/linux/v4.13-rc6/source/kernel/sched/topology.c#L786

If I'm wrong, this can be rewritten not to use that assumption - I did
it this way in the caller ("else if (group == sd->groups)") because I
didn't want to use cpumask_test_cpu, and then changed it inside
find_idlest_group so there weren't two ways of doing the same thing in
the same neighbourhood of code.

Cheers,
Brendan


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 17:59     ` Brendan Jackman
@ 2017-08-21 20:22       ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-21 20:22 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Josef Bacik, linux-kernel, Joel Fernandes, Andres Oportus,
	Ingo Molnar, Morten Rasmussen, Dietmar Eggemann, Vincent Guittot

On Mon, Aug 21, 2017 at 06:59:28PM +0100, Brendan Jackman wrote:
> Hi Josef,
> 
> Thanks for taking a look.
> 
> On Mon, Aug 21 2017 at 17:26, Josef Bacik wrote:
> > On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> [...]
> >> -		local_group = cpumask_test_cpu(this_cpu,
> >> -					       sched_group_span(group));
> >> -
> >
> > This isn't right is it?  cpu isn't necessarily in the very first group of a sd
> > right?
> 
> I think it is - I haven't grokked the sd/sg setup code in
> kernel/sched/topology.c but there is a comment[1] that I interpret as
> saying this. I'll take a more careful look tomorrow.
> 
> [1] http://elixir.free-electrons.com/linux/v4.13-rc6/source/kernel/sched/topology.c#L786
> 
> If I'm wrong, this can be rewritten not to use that assumption - I did
> it this way in the caller ("else if (group == sd->groups)") because I
> didn't want to use cpumask_test_cpu, and then changed it inside
> find_idlest_group so there weren't two ways of doing the same thing in
> the same neighbourhood of code.

No, you are quite correct. The sched_domain of a CPU always includes that
CPU, and the first group of a sched_domain is its child domain and
therefore also includes that CPU.
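
A minimal sketch of how that invariant could be asserted (a hypothetical
debug helper, not part of the patches in this thread; it only uses
helpers already present in kernel/sched):

/* Check that the first group of every sched_domain of @cpu spans @cpu. */
static void check_local_group_is_first(int cpu)
{
	struct sched_domain *sd;

	rcu_read_lock();
	for_each_domain(cpu, sd)
		WARN_ON_ONCE(!cpumask_test_cpu(cpu,
					       sched_group_span(sd->groups)));
	rcu_read_unlock();
}

If the assumption ever failed, a "group == sd->groups" test would
misidentify the local group.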


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 15:21 ` [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group Brendan Jackman
  2017-08-21 17:26   ` Josef Bacik
@ 2017-08-21 21:14   ` Peter Zijlstra
  2017-08-21 21:23     ` Peter Zijlstra
  2017-08-22  4:34     ` Joel Fernandes
  2017-08-22  7:48   ` Vincent Guittot
  2 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-21 21:14 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot

On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
> The current use of returning NULL from find_idlest_group is broken in
> two cases:
> 
> a1) The local group is not allowed.
> 
>    In this case, we currently do not change this_runnable_load or
>    this_avg_load from its initial value of 0, which means we return
>    NULL regardless of the load of the other, allowed groups. This
>    results in pointlessly continuing the find_idlest_group search
>    within the local group and then returning prev_cpu from
>    select_task_rq_fair.

> b) smp_processor_id() is the "idlest" and != prev_cpu.
> 
>    find_idlest_group also returns NULL when the local group is
>    allowed and is the idlest. The caller then continues the
>    find_idlest_group search at a lower level of the current CPU's
>    sched_domain hierarchy. However new_cpu is not updated. This means
>    the search is pointless and we return prev_cpu from
>    select_task_rq_fair.
> 

I think its much simpler than that.. but its late, so who knows ;-)

Both cases seem predicated on the assumption that we'll return @cpu when
we don't find any idler CPU. Consider, if the local group is the idlest,
we should stick with @cpu and simply proceed with the child domain.

The confusion, and the bugs, seem to have snuck in when we started
considering @prev_cpu, whenever that was. The below is mostly code
movement to put that whole while(sd) loop into its own function.

The effective change is setting @new_cpu = @cpu when we start that loop:

@@ -6023,6 +6023,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 		struct sched_group *group;
 		int weight;
 
+		new_cpu = cpu;
+
 		if (!(sd->flags & sd_flag)) {
 			sd = sd->child;
 			continue;




---
 kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 48 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c77e4b1d51c0..3e77265c480a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 }
 
 /*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
 	unsigned int min_exit_latency = UINT_MAX;
@@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
+static int
+find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
+{
+	struct sched_domain *tmp;
+	int new_cpu = cpu;
+
+	while (sd) {
+		struct sched_group *group;
+		int weight;
+
+		if (!(sd->flags & sd_flag)) {
+			sd = sd->child;
+			continue;
+		}
+
+		group = find_idlest_group(sd, p, cpu, sd_flag);
+		if (!group) {
+			sd = sd->child;
+			continue;
+		}
+
+		new_cpu = find_idlest_group_cpu(group, p, cpu);
+		if (new_cpu == -1 || new_cpu == cpu) {
+			/* Now try balancing at a lower domain level of cpu */
+			sd = sd->child;
+			continue;
+		}
+
+		/* Now try balancing at a lower domain level of new_cpu */
+		cpu = new_cpu;
+		weight = sd->span_weight;
+		sd = NULL;
+		for_each_domain(cpu, tmp) {
+			if (weight <= tmp->span_weight)
+				break;
+			if (tmp->flags & sd_flag)
+				sd = tmp;
+		}
+		/* while loop will break here if sd == NULL */
+	}
+
+	return new_cpu;
+}
+
 /*
  * Implement a for_each_cpu() variant that starts the scan at a given cpu
  * (@start), and wraps around.
@@ -6019,39 +6063,8 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
-	} else while (sd) {
-		struct sched_group *group;
-		int weight;
-
-		if (!(sd->flags & sd_flag)) {
-			sd = sd->child;
-			continue;
-		}
-
-		group = find_idlest_group(sd, p, cpu, sd_flag);
-		if (!group) {
-			sd = sd->child;
-			continue;
-		}
-
-		new_cpu = find_idlest_cpu(group, p, cpu);
-		if (new_cpu == -1 || new_cpu == cpu) {
-			/* Now try balancing at a lower domain level of cpu */
-			sd = sd->child;
-			continue;
-		}
-
-		/* Now try balancing at a lower domain level of new_cpu */
-		cpu = new_cpu;
-		weight = sd->span_weight;
-		sd = NULL;
-		for_each_domain(cpu, tmp) {
-			if (weight <= tmp->span_weight)
-				break;
-			if (tmp->flags & sd_flag)
-				sd = tmp;
-		}
-		/* while loop will break here if sd == NULL */
+	} else {
+		new_cpu = find_idlest_cpu(sd, p, cpu, sd_flag);
 	}
 	rcu_read_unlock();
 


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 21:14   ` Peter Zijlstra
@ 2017-08-21 21:23     ` Peter Zijlstra
  2017-08-22  4:34     ` Joel Fernandes
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-21 21:23 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot

On Mon, Aug 21, 2017 at 11:14:00PM +0200, Peter Zijlstra wrote:
> +static int
> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
> +{
> +	struct sched_domain *tmp;
> +	int new_cpu = cpu;
> +
> +	while (sd) {
> +		struct sched_group *group;
> +		int weight;
> +
> +		if (!(sd->flags & sd_flag)) {
> +			sd = sd->child;
> +			continue;
> +		}
> +
> +		group = find_idlest_group(sd, p, cpu, sd_flag);
> +		if (!group) {
> +			sd = sd->child;
> +			continue;
> +		}
> +
> +		new_cpu = find_idlest_group_cpu(group, p, cpu);
> +		if (new_cpu == -1 || new_cpu == cpu) {
> +			/* Now try balancing at a lower domain level of cpu */
> +			sd = sd->child;
> +			continue;
> +		}
> +
> +		/* Now try balancing at a lower domain level of new_cpu */
> +		cpu = new_cpu;
> +		weight = sd->span_weight;
> +		sd = NULL;
> +		for_each_domain(cpu, tmp) {
> +			if (weight <= tmp->span_weight)
> +				break;
> +			if (tmp->flags & sd_flag)
> +				sd = tmp;
> +		}

This find-the-sd-for-another-cpu thing is horrific. And it has always
bugged me that the whole thing is O(n^2) to find a CPU.

I understand why it has this form, but scanning each CPU more than once
is just offensive.

> +		/* while loop will break here if sd == NULL */
> +	}
> +
> +	return new_cpu;
> +}


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 21:14   ` Peter Zijlstra
  2017-08-21 21:23     ` Peter Zijlstra
@ 2017-08-22  4:34     ` Joel Fernandes
  2017-08-22 10:39       ` Brendan Jackman
  1 sibling, 1 reply; 16+ messages in thread
From: Joel Fernandes @ 2017-08-22  4:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Brendan Jackman, LKML, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot

Hi Peter,

On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>> The current use of returning NULL from find_idlest_group is broken in
>> two cases:
>>
>> a1) The local group is not allowed.
>>
>>    In this case, we currently do not change this_runnable_load or
>>    this_avg_load from its initial value of 0, which means we return
>>    NULL regardless of the load of the other, allowed groups. This
>>    results in pointlessly continuing the find_idlest_group search
>>    within the local group and then returning prev_cpu from
>>    select_task_rq_fair.
>
>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>
>>    find_idlest_group also returns NULL when the local group is
>>    allowed and is the idlest. The caller then continues the
>>    find_idlest_group search at a lower level of the current CPU's
>>    sched_domain hierarchy. However new_cpu is not updated. This means
>>    the search is pointless and we return prev_cpu from
>>    select_task_rq_fair.
>>
>
> I think its much simpler than that.. but its late, so who knows ;-)
>
> Both cases seem predicated on the assumption that we'll return @cpu when
> we don't find any idler CPU. Consider, if the local group is the idlest,
> we should stick with @cpu and simply proceed with the child domain.
>
> The confusion, and the bugs, seem to have snuck in when we started
> considering @prev_cpu, whenever that was. The below is mostly code
> movement to put that whole while(sd) loop into its own function.
>
> The effective change is setting @new_cpu = @cpu when we start that loop:
>
<snip>
> ---
>  kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>  1 file changed, 48 insertions(+), 35 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c77e4b1d51c0..3e77265c480a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>  }
>
>  /*
> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>   */
>  static int
> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  {
>         unsigned long load, min_load = ULONG_MAX;
>         unsigned int min_exit_latency = UINT_MAX;
> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>         return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>  }
>
> +static int
> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
> +{
> +       struct sched_domain *tmp;
> +       int new_cpu = cpu;
> +
> +       while (sd) {
> +               struct sched_group *group;
> +               int weight;
> +
> +               if (!(sd->flags & sd_flag)) {
> +                       sd = sd->child;
> +                       continue;
> +               }
> +
> +               group = find_idlest_group(sd, p, cpu, sd_flag);
> +               if (!group) {
> +                       sd = sd->child;
> +                       continue;

But this will still have the issue of pointlessly searching in the
local_group when the idlest CPU is in the non-local group? Stemming
from the fact that find_idlest_group is broken if the local group is
not allowed.

I believe this is fixed by Brendan's patch? :

"Initializing this_runnable_load and this_avg_load to ULONG_MAX
   instead of 0. This means in case a1) we now return the idlest
   non-local group."


Hopefully I didn't missing something. Sorry if I did, thanks,

-Joel


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-21 15:21 ` [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group Brendan Jackman
  2017-08-21 17:26   ` Josef Bacik
  2017-08-21 21:14   ` Peter Zijlstra
@ 2017-08-22  7:48   ` Vincent Guittot
  2017-08-22 10:41     ` Brendan Jackman
  2 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2017-08-22  7:48 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann

On 21 August 2017 at 17:21, Brendan Jackman <brendan.jackman@arm.com> wrote:
> The current use of returning NULL from find_idlest_group is broken in
[snip]
> ---
>  kernel/sched/fair.c | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 64618d768546..7cb5ed719cf9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>   * domain.
>   */
>  static struct sched_group *
> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> -                 int this_cpu, int sd_flag)
> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
>  {
>         struct sched_group *idlest = NULL, *group = sd->groups;
> +       struct sched_group *local_group = sd->groups;
>         struct sched_group *most_spare_sg = NULL;
> -       unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
> -       unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
> +       unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
> +       unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;

Good catch for this_runnable_load; indeed there is a problem if
local_group is not allowed.

>         unsigned long most_spare = 0, this_spare = 0;
>         int load_idx = sd->forkexec_idx;
>         int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
>         unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
>                                 (sd->imbalance_pct-100) / 100;
>
> +       if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
> +               return NULL;

You should move that test up a level, into the while (sd) loop or even
before the while(sd) loop, as there is no need to even start that loop
in this case. There is no need to call find_idlest_group() if no cpu is
allowed and there is no group to find. Then you can keep the current
behavior of find_idlest_group(), which returns NULL when there is no
better group than the local one.
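
For illustration, a rough sketch of that suggestion (hypothetical, not a
patch posted in this thread; it glosses over the current
"} else while (sd) {" structure in select_task_rq_fair()):

	/*
	 * If no cpu in the whole domain is allowed, skip the walk and
	 * fall through with new_cpu == prev_cpu.
	 */
	if (sd && !cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
		sd = NULL;

	while (sd) {
		/* ... existing domain walk, unchanged ... */
	}

find_idlest_group() could then keep its current convention of returning
NULL to mean "no group better than the local one".
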

> +
>         if (sd_flag & SD_BALANCE_WAKE)
>                 load_idx = sd->wake_idx;
>
>         do {
>                 unsigned long load, avg_load, runnable_load;
>                 unsigned long spare_cap, max_spare_cap;
> -               int local_group;
> +               bool group_is_local = group == local_group;
>                 int i;
>
>                 /* Skip over this group if it has no CPUs allowed */

[snip]

> @@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>                         continue;
>                 }
>
> -               group = find_idlest_group(sd, p, cpu, sd_flag);
> +               group = find_idlest_group(sd, p, sd_flag);
>                 if (!group) {
> +                       break;
> +               } else if (group == sd->groups) {
> +                       new_cpu = cpu;

The " new_cpu = cpu" above should be put before the while(sd) loop
like in Peter's proposal

> +                       /* Now try balancing at a lower domain level of cpu */
>                         sd = sd->child;
>                         continue;
>                 }
> --
> 2.14.1
>


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-22  4:34     ` Joel Fernandes
@ 2017-08-22 10:39       ` Brendan Jackman
  2017-08-22 10:45         ` Brendan Jackman
  2017-08-22 11:03         ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-22 10:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, LKML, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot


On Tue, Aug 22 2017 at 04:34, Joel Fernandes wrote:
> Hi Peter,
>
> On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>>> The current use of returning NULL from find_idlest_group is broken in
>>> two cases:
>>>
>>> a1) The local group is not allowed.
>>>
>>>    In this case, we currently do not change this_runnable_load or
>>>    this_avg_load from its initial value of 0, which means we return
>>>    NULL regardless of the load of the other, allowed groups. This
>>>    results in pointlessly continuing the find_idlest_group search
>>>    within the local group and then returning prev_cpu from
>>>    select_task_rq_fair.
>>
>>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>>
>>>    find_idlest_group also returns NULL when the local group is
>>>    allowed and is the idlest. The caller then continues the
>>>    find_idlest_group search at a lower level of the current CPU's
>>>    sched_domain hierarchy. However new_cpu is not updated. This means
>>>    the search is pointless and we return prev_cpu from
>>>    select_task_rq_fair.
>>>
>>
>> I think its much simpler than that.. but its late, so who knows ;-)
>>
>> Both cases seem predicated on the assumption that we'll return @cpu when
>> we don't find any idler CPU. Consider, if the local group is the idlest,
>> we should stick with @cpu and simply proceed with the child domain.
>>
>> The confusion, and the bugs, seem to have snuck in when we started
>> considering @prev_cpu, whenever that was. The below is mostly code
>> movement to put that whole while(sd) loop into its own function.
>>
>> The effective change is setting @new_cpu = @cpu when we start that loop:
>>
> <snip>
>> ---
>>  kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 48 insertions(+), 35 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index c77e4b1d51c0..3e77265c480a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>  }
>>
>>  /*
>> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
>> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>>   */
>>  static int
>> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>  {
>>         unsigned long load, min_load = ULONG_MAX;
>>         unsigned int min_exit_latency = UINT_MAX;
>> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>         return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>>  }
>>
>> +static int
>> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
>> +{
>> +       struct sched_domain *tmp;
>> +       int new_cpu = cpu;
>> +
>> +       while (sd) {
>> +               struct sched_group *group;
>> +               int weight;
>> +
>> +               if (!(sd->flags & sd_flag)) {
>> +                       sd = sd->child;
>> +                       continue;
>> +               }
>> +
>> +               group = find_idlest_group(sd, p, cpu, sd_flag);
>> +               if (!group) {
>> +                       sd = sd->child;
>> +                       continue;
>
> But this will still have the issue of pointlessly searching in the
> local_group when the idlest CPU is in the non-local group? Stemming
> from the fact that find_idlest_group is broken if the local group is
> not allowed.
>
> I believe this is fixed by Brendan's patch? :
>
> "Initializing this_runnable_load and this_avg_load to ULONG_MAX
>    instead of 0. This means in case a1) we now return the idlest
>    non-local group."
>

Yeah, I don't think this is enough, firstly because of affinity and
secondly because I _guess_ we don't want to migrate from prev_cpu to
@cpu if none of the sched_domains have sd_flag set (...right? That would
feel like wake balancing even when there's no SD_BALANCE_WAKE anywhere).

However the code movement helps - I'll combine it with Vincent's
suggestions and post a v2.


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-22  7:48   ` Vincent Guittot
@ 2017-08-22 10:41     ` Brendan Jackman
  0 siblings, 0 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-22 10:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, Joel Fernandes, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann


On Tue, Aug 22 2017 at 07:48, Vincent Guittot wrote:
> On 21 August 2017 at 17:21, Brendan Jackman <brendan.jackman@arm.com> wrote:
>> The current use of returning NULL from find_idlest_group is broken in
> [snip]
>> ---
>>  kernel/sched/fair.c | 34 +++++++++++++++++++---------------
>>  1 file changed, 19 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 64618d768546..7cb5ed719cf9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5382,26 +5382,29 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>   * domain.
>>   */
>>  static struct sched_group *
>> -find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>> -                 int this_cpu, int sd_flag)
>> +find_idlest_group(struct sched_domain *sd, struct task_struct *p, int sd_flag)
>>  {
>>         struct sched_group *idlest = NULL, *group = sd->groups;
>> +       struct sched_group *local_group = sd->groups;
>>         struct sched_group *most_spare_sg = NULL;
>> -       unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
>> -       unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
>> +       unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = ULONG_MAX;
>> +       unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX;
>
> Good catch for this_runnable_load; indeed there is a problem if
> local_group is not allowed.
>
>>         unsigned long most_spare = 0, this_spare = 0;
>>         int load_idx = sd->forkexec_idx;
>>         int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
>>         unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
>>                                 (sd->imbalance_pct-100) / 100;
>>
>> +       if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
>> +               return NULL;
>
> You should move that test up a level, into the while (sd) loop or even
> before the while(sd) loop, as there is no need to even start that loop
> in this case. There is no need to call find_idlest_group() if no cpu is
> allowed and there is no group to find. Then you can keep the current
> behavior of find_idlest_group(), which returns NULL when there is no
> better group than the local one.

True. I have to admit the reason I didn't do this is that, due to the
current "else while", doing this outside the loop would have meant
another level of indentation. But actually if I break out a new function
as in Peter's proposal I can avoid that (plus I probably shouldn't let
code beauty overrule warm-path performance :| ).

>> +
>>         if (sd_flag & SD_BALANCE_WAKE)
>>                 load_idx = sd->wake_idx;
>>
>>         do {
>>                 unsigned long load, avg_load, runnable_load;
>>                 unsigned long spare_cap, max_spare_cap;
>> -               int local_group;
>> +               bool group_is_local = group == local_group;
>>                 int i;
>>
>>                 /* Skip over this group if it has no CPUs allowed */
>
> [snip]
>
>> @@ -5927,8 +5927,12 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>                         continue;
>>                 }
>>
>> -               group = find_idlest_group(sd, p, cpu, sd_flag);
>> +               group = find_idlest_group(sd, p, sd_flag);
>>                 if (!group) {
>> +                       break;
>> +               } else if (group == sd->groups) {
>> +                       new_cpu = cpu;
>
> The " new_cpu = cpu" above should be put before the while(sd) loop
> like in Peter's proposal

I don't think that would work - I've responded along with Joel in the
other subthread.

>> +                       /* Now try balancing at a lower domain level of cpu */
>>                         sd = sd->child;
>>                         continue;
>>                 }
>> --
>> 2.14.1
>>


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-22 10:39       ` Brendan Jackman
@ 2017-08-22 10:45         ` Brendan Jackman
  2017-08-22 11:03         ` Peter Zijlstra
  1 sibling, 0 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-22 10:45 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Peter Zijlstra, LKML, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot


On Tue, Aug 22 2017 at 10:39, Brendan Jackman wrote:
> On Tue, Aug 22 2017 at 04:34, Joel Fernandes wrote:
>> Hi Peter,
>>
>> On Mon, Aug 21, 2017 at 2:14 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Mon, Aug 21, 2017 at 04:21:28PM +0100, Brendan Jackman wrote:
>>>> The current use of returning NULL from find_idlest_group is broken in
>>>> two cases:
>>>>
>>>> a1) The local group is not allowed.
>>>>
>>>>    In this case, we currently do not change this_runnable_load or
>>>>    this_avg_load from its initial value of 0, which means we return
>>>>    NULL regardless of the load of the other, allowed groups. This
>>>>    results in pointlessly continuing the find_idlest_group search
>>>>    within the local group and then returning prev_cpu from
>>>>    select_task_rq_fair.
>>>
>>>> b) smp_processor_id() is the "idlest" and != prev_cpu.
>>>>
>>>>    find_idlest_group also returns NULL when the local group is
>>>>    allowed and is the idlest. The caller then continues the
>>>>    find_idlest_group search at a lower level of the current CPU's
>>>>    sched_domain hierarchy. However new_cpu is not updated. This means
>>>>    the search is pointless and we return prev_cpu from
>>>>    select_task_rq_fair.
>>>>
>>>
>>> I think its much simpler than that.. but its late, so who knows ;-)
>>>
>>> Both cases seem predicated on the assumption that we'll return @cpu when
>>> we don't find any idler CPU. Consider, if the local group is the idlest,
>>> we should stick with @cpu and simply proceed with the child domain.
>>>
>>> The confusion, and the bugs, seem to have snuck in when we started
>>> considering @prev_cpu, whenever that was. The below is mostly code
>>> movement to put that whole while(sd) loop into its own function.
>>>
>>> The effective change is setting @new_cpu = @cpu when we start that loop:
>>>
>> <snip>
>>> ---
>>>  kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++----------------------
>>>  1 file changed, 48 insertions(+), 35 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index c77e4b1d51c0..3e77265c480a 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5588,10 +5588,10 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>>  }
>>>
>>>  /*
>>> - * find_idlest_cpu - find the idlest cpu among the cpus in group.
>>> + * find_idlest_group_cpu - find the idlest cpu among the cpus in group.
>>>   */
>>>  static int
>>> -find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>> +find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>>>  {
>>>         unsigned long load, min_load = ULONG_MAX;
>>>         unsigned int min_exit_latency = UINT_MAX;
>>> @@ -5640,6 +5640,50 @@ static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
>>>         return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
>>>  }
>>>
>>> +static int
>>> +find_idlest_cpu(struct sched_domain *sd, struct task_struct *p, int cpu, int sd_flag)
>>> +{
>>> +       struct sched_domain *tmp;
>>> +       int new_cpu = cpu;
>>> +
>>> +       while (sd) {
>>> +               struct sched_group *group;
>>> +               int weight;
>>> +
>>> +               if (!(sd->flags & sd_flag)) {
>>> +                       sd = sd->child;
>>> +                       continue;
>>> +               }
>>> +
>>> +               group = find_idlest_group(sd, p, cpu, sd_flag);
>>> +               if (!group) {
>>> +                       sd = sd->child;
>>> +                       continue;
>>
>> But this will still have the issue of pointlessly searching in the
>> local_group when the idlest CPU is in the non-local group? Stemming
>> from the fact that find_idlest_group is broken if the local group is
>> not allowed.
>>
>> I believe this is fixed by Brendan's patch? :
>>
>> "Initializing this_runnable_load and this_avg_load to ULONG_MAX
>>    instead of 0. This means in case a1) we now return the idlest
>>    non-local group."
>>
>
> Yeah, I don't think this is enough, firstly because of affinity and
> secondly because I _guess_ we don't want to migrate from prev_cpu to
> @cpu if none of the sched_domains have sd_flag set (...right? That would
> feel like wake balancing even when there's no SD_BALANCE_WAKE anywhere).

Hmm, actually we always have sd_flag set, because otherwise we wouldn't
have done sd = tmp at the beginning of select_task_rq_fair.
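
For context, that domain-selection loop looks roughly like this in
kernels of this era (abridged):

	for_each_domain(cpu, tmp) {
		if (!(tmp->flags & SD_LOAD_BALANCE))
			break;

		/* Is cpu a valid SD_WAKE_AFFINE target? */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			affine_sd = tmp;
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
		else if (!want_affine)
			break;
	}

so if we reach the slow path with sd != NULL, some level of the
hierarchy did have sd_flag set.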

>
> However the code movement helps - I'll combine it with Vincent's
> suggestions and post a v2.


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-22 10:39       ` Brendan Jackman
  2017-08-22 10:45         ` Brendan Jackman
@ 2017-08-22 11:03         ` Peter Zijlstra
  2017-08-22 12:46           ` Brendan Jackman
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2017-08-22 11:03 UTC (permalink / raw)
  To: Brendan Jackman
  Cc: Joel Fernandes, LKML, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot

On Tue, Aug 22, 2017 at 11:39:26AM +0100, Brendan Jackman wrote:

> However the code movement helps - I'll combine it with Vincent's
> suggestions and post a v2.

Please also split into multiple patches, as I feel this 2/2 does too
many things. Fixing that this_*_load initialization stuff for instance
could be a separate patch.

And unlike what I did, do the whole code movement as a pure code
movement, without any additional changes.


* Re: [PATCH 2/2] sched/fair: Fix use of NULL with find_idlest_group
  2017-08-22 11:03         ` Peter Zijlstra
@ 2017-08-22 12:46           ` Brendan Jackman
  0 siblings, 0 replies; 16+ messages in thread
From: Brendan Jackman @ 2017-08-22 12:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Joel Fernandes, LKML, Andres Oportus, Ingo Molnar,
	Morten Rasmussen, Dietmar Eggemann, Vincent Guittot


On Tue, Aug 22 2017 at 11:03, Peter Zijlstra wrote:
> On Tue, Aug 22, 2017 at 11:39:26AM +0100, Brendan Jackman wrote:
>
>> However the code movement helps - I'll combine it with Vincent's
>> suggestions and post a v2.
>
> Please also split into multiple patches, as I feel this 2/2 does too
> many things. Fixing that this_*_load initialization stuff for instance
> could be a separate patch.

Sure. It was initially multiple patches but there were confusing
dependencies between them, so I squashed it. I don't think that will be a
problem in light of the new suggestions from you & Vincent.

> And unlike what I did, do the whole code movement as a pure code
> movement, without any additional changes.

Will do.

