On Tue, 2016-04-05 at 14:08 -0400, Chris Mason wrote:
> Now, on to the patch.  I pushed some code around and narrowed the
> problem down to select_idle_sibling().  We have cores going into and
> out of idle fast enough that even this cut our latencies in half:

Are you using NO_HZ?  If so, you may want to try the attached.

>  static int select_idle_sibling(struct task_struct *p, int target)
>  				goto next;
>
>  			for_each_cpu(i, sched_group_cpus(sg)) {
> -				if (i == target || !idle_cpu(i))
> +				if (!idle_cpu(i))
>  					goto next;
>  			}
>
> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
> done at the top of the function is no longer valid.

Ok, that's only an optimization, could go if it's causing trouble.

> I tried a few variations on select_idle_sibling() that preserved the
> underlying goal of returning idle cores before idle SMT threads.  They
> were all horrible in different ways, and none of them were fast.
>
> The patch below just makes select_idle_sibling pick the first idle
> thread it can find.  When I ran it through production workloads here,
> it was faster than the patch we've been carrying around for the last
> few years.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 56b7d4b..c41baa6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4974,7 +4974,6 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
>  	struct sched_domain *sd;
> -	struct sched_group *sg;
>  	int i = task_cpu(p);
>
>  	if (idle_cpu(target))
> @@ -4990,24 +4989,14 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	 * Otherwise, iterate the domains and find an elegible idle cpu.
>  	 */
>  	sd = rcu_dereference(per_cpu(sd_llc, target));
> -	for_each_lower_domain(sd) {
> -		sg = sd->groups;
> -		do {
> -			if (!cpumask_intersects(sched_group_cpus(sg),
> -						tsk_cpus_allowed(p)))
> -				goto next;
> -
> -			for_each_cpu(i, sched_group_cpus(sg)) {
> -				if (i == target || !idle_cpu(i))
> -					goto next;
> -			}
> +	if (!sd)
> +		goto done;
>
> -			target = cpumask_first_and(sched_group_cpus(sg),
> -					tsk_cpus_allowed(p));
> +	for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
> +		if (cpu_active(i) && idle_cpu(i)) {
> +			target = i;
>  			goto done;
> -next:
> -			sg = sg->next;
> -		} while (sg != sd->groups);
> +		}
>  	}
>  done:
>  	return target;

Ew.  That may improve your latency-is-everything load, but worst case
package walk will hurt like hell on CPUs with an insane number of
threads.

That full search also turns the evil face of two-faced little
select_idle_sibling() into its only face, the one that bounces tasks
about much more than they appreciate.  Looking for an idle core first
delivers the most throughput boost, and only looking at target's
threads if you don't find one keeps the bounce and traverse pain down
to a dull roar, while at least trying to get that latency win.

To me, your patch looks like it trades harm to many for good to a few.
A behavior switch would be better.  It can't get any dumber, but trying
to make it smarter makes it too damn fat.  As it sits, it's aiming in
the general direction of the bullseye.. and occasionally hits the wall.

	-Mike
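
For reference, the shape Mike is describing (scan the LLC for a fully
idle core first, and only fall back to target's SMT threads if there is
none) would look roughly like the sketch below.  This is purely
illustrative: it is not from this thread, has not been compiled, and
only assumes scheduler helpers of that era (idle_cpu(), cpu_smt_mask(),
sched_domain_span(), sd_llc, tsk_cpus_allowed()).

	/*
	 * Illustrative sketch only: "idle core first, then target's SMT
	 * threads".  Not the actual kernel implementation.
	 */
	static int select_idle_sibling_sketch(struct task_struct *p, int target)
	{
		struct sched_domain *sd;
		int core, cpu;

		if (idle_cpu(target))
			return target;

		sd = rcu_dereference(per_cpu(sd_llc, target));
		if (!sd)
			return target;

		/* Pass 1: look for a core in the LLC whose threads are all idle. */
		for_each_cpu_and(core, sched_domain_span(sd), tsk_cpus_allowed(p)) {
			bool idle = true;

			for_each_cpu(cpu, cpu_smt_mask(core)) {
				if (!idle_cpu(cpu)) {
					idle = false;
					break;
				}
			}
			if (idle)
				return core;
		}

		/* Pass 2: no idle core, settle for an idle thread of target's core. */
		for_each_cpu_and(cpu, cpu_smt_mask(target), tsk_cpus_allowed(p)) {
			if (cpu != target && idle_cpu(cpu))
				return cpu;
		}

		return target;
	}

Note that pass 1 is still a full LLC walk in the worst case, which is
exactly the cost Mike is worried about on parts with many threads; pass
2 is what keeps the bounce and traverse pain down when no idle core is
found.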