* [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-27 21:22 UTC
To: riel, mingo, peterz, linux-kernel
3 replies; 73+ messages in thread

[ sorry if you get this twice, it seems like the first submission got lost ]

At Facebook we have a heavily multi-threaded application that is sensitive
to latency.  We have been pulling forward the old SD_WAKE_IDLE code because
it gives us a significant performance gain (around 20%).  It turns out this
is because there are cases where the scheduler puts our task on a busy CPU
when there are idle CPUs in the system.  We verified this by reading
cpu_delay_req_avg_us from the scheduler's netlink delay-accounting
interface; with our patch we get much lower numbers than baseline.

SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, but it only
looks for an idle sibling, preferring affinity over all else.  This is not
helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu,
not guarantee affinity.  Fix this by first trying to find an idle sibling,
and then, if that cpu is not idle, falling through to the logic that finds
an idle cpu.  With this patch we get slightly better performance than with
our forward port of SD_WAKE_IDLE.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 241213b..03dafa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
 	}
 
 	while (sd) {
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28  3:46 UTC
To: Josef Bacik; Cc: riel, mingo, peterz, linux-kernel

On Wed, 2015-05-27 at 17:22 -0400, Josef Bacik wrote:
> [...]
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, but it only
> looks for an idle sibling, preferring affinity over all else.  This is not
> helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu,
> not guarantee affinity.

The job description isn't really "find idle", it's "find least loaded".

> @@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> 
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		new_cpu = select_idle_sibling(p, prev_cpu);
> -		goto unlock;
> +		if (idle_cpu(new_cpu))
> +			goto unlock;
>  	}
> 
>  	while (sd) {

Instead of doing what for most wakeups will be a redundant idle_cpu() call,
perhaps a couple of cycles can be saved if you move the sd assignment above
the affine_sd assignment, and say if (!sd || idle_cpu(new_cpu))?

You could also stop find_idlest_group() at the first completely idle group
to shave cycles off the not fully committed search.  It ain't likely to
find a negative load.. cool as that would be ;-)

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Morten Rasmussen @ 2015-05-28  9:49 UTC
To: Mike Galbraith; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, May 28, 2015 at 04:46:38AM +0100, Mike Galbraith wrote:
> The job description isn't really "find idle", it's "find least loaded".

And make sure that the task doesn't migrate away from any data that might
still be in the last-level cache?

IIUC, the goal of SD_BALANCE_WAKE is changed from finding the least loaded
target cpu that shares last-level cache with the previous cpu, to finding
an idle cpu, preferring ones that share the last-level cache but extending
the search beyond sd_llc if necessary.

> Instead of doing what for most wakeups will be a redundant idle_cpu()
> call, perhaps a couple of cycles can be saved if you move the sd
> assignment above the affine_sd assignment, and say
> if (!sd || idle_cpu(new_cpu))?

Isn't sd == NULL in most cases if you don't move the sd assignment before
the affine_sd assignment?  The break after the affine_sd assignment means
that sd isn't assigned under 'normal' circumstances (sibling waking cpu, no
restrictions in tsk_cpus_allowed(p), and SD_WAKE_AFFINE set), which causes
the while (sd) loop to be bypassed, and we end up returning the result of
select_idle_sibling() anyway.  I must be missing something?

Morten
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28 10:57 UTC
To: Morten Rasmussen; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, 2015-05-28 at 10:49 +0100, Morten Rasmussen wrote:
> Isn't sd == NULL in most cases if you don't move the sd assignment
> before the affine_sd assignment?

sd will usually be NULL regardless of where the assignment is, as
SD_BALANCE_WAKE is usually off in ->flags.  Josef is turning it on.

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Morten Rasmussen @ 2015-05-28 11:48 UTC
To: Mike Galbraith; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, May 28, 2015 at 11:57:44AM +0100, Mike Galbraith wrote:
> sd will usually be NULL regardless of where the assignment is, as
> SD_BALANCE_WAKE is usually off in ->flags.  Josef is turning it on.

Right.  SD_BALANCE_WAKE needs to be set in the sd flags, and the assignment
has to happen before the break, for this to work.  I just don't see
SD_BALANCE_WAKE being enabled for the sched_domain anywhere in the patch?
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28 11:49 UTC
To: Morten Rasmussen; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, 2015-05-28 at 12:48 +0100, Morten Rasmussen wrote:
> Right.  SD_BALANCE_WAKE needs to be set in the sd flags, and the
> assignment has to happen before the break, for this to work.  I just
> don't see SD_BALANCE_WAKE being enabled for the sched_domain anywhere
> in the patch?

He's doing that via the proc interface.. well, I presume he is anyway.

	-Mike
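[Editorial note] Mike's guess about the proc interface can be made concrete. This is a sketch, not a verified recipe: the 0x10 bit value for SD_BALANCE_WAKE and the exact sysctl layout are assumptions about kernels of this era (CONFIG_SCHED_DEBUG is required for the files to exist), and the script only prints the commands it would run rather than writing anything.

```shell
# Assumed flag value for kernels of this era; check include/linux/sched.h.
SD_BALANCE_WAKE=0x10

add_balance_wake() {
    # OR the SD_BALANCE_WAKE bit into an existing flags value
    echo $(( $1 | SD_BALANCE_WAKE ))
}

# With CONFIG_SCHED_DEBUG, per-domain flags appear under
# /proc/sys/kernel/sched_domain/.  Print the (root-only) update commands.
for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
    [ -r "$f" ] || continue        # skip if SCHED_DEBUG is not enabled
    echo "echo $(add_balance_wake "$(cat "$f")") > $f"
done
```

This matches the thread's observation that the patch itself never sets SD_BALANCE_WAKE; it only changes behavior on systems where the flag was enabled out of band.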
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-05-28 10:21 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel

On Wed, May 27, 2015 at 05:22:16PM -0400, Josef Bacik wrote:
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however

sd->flags's SD_BALANCE_WAKE, not sd_flags.  And note that SD_BALANCE_WAKE
is not set on domains by default.

> it is just looking for an idle sibling, preferring affinity over all
> else.

Not sure what you're saying here, what affinity?

> This is not helpful in all cases, and SD_BALANCE_WAKE's job is
> to find us an idle cpu, not guarantee affinity.

Your argument is going backwards, SD_BALANCE_WAKE is not actually set.

> Fix this by first trying to find an idle sibling, and then if the cpu is
> not idle fall through to the logic to find an idle cpu.  With this patch
> we get slightly better performance than with our forward port of
> SD_WAKE_IDLE.

This is broken.  You most certainly do not want to go do that whole load
balance pass on wakeups.  It should be controlled by sd->flags; it is far
too expensive to consider turning that on by default.

In fact, select_idle_sibling() is already too expensive on current server
hardware (far too damn many cpus in an LLC domain).
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-05-28 11:05 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen

So maybe you want something like the below; that cures the thing Morten
raised, and we continue looking for sd even after we found affine_sd.

It also avoids the pointless idle_cpu() check Mike raised, by making
select_idle_sibling() return -1 if it doesn't find anything.

Then it continues doing the full balance IFF sd was set, which is keyed
off of sd->flags.

And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
CPUs, it looks for the least loaded CPU.  And it's damn expensive.

Rewriting this entire thing is somewhere on the todo list :/

---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d597aea43030..7330fb4fea9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1385,8 +1385,13 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
-	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+	if (!cur) {
+		int cpu = select_idle_sibling(env->p, env->dst_cpu);
+		if (cpu < 0)
+			cpu = env->dst_cpu;
+
+		env->dst_cpu = cpu;
+	}
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -4951,16 +4956,15 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
+			return cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-done:
-	return target;
+	return -1;
 }
+
 /*
  * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
@@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		 * If both cpu and prev_cpu are part of this domain,
 		 * cpu is a valid SD_WAKE_AFFINE target.
 		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+		if (want_affine && !affine_sd &&
+		    (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
 			affine_sd = tmp;
-			break;
-		}
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
+		else if (!want_affine || (want_affine && affine_sd))
+			break;
 	}
 
-	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
+		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
+	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		int tmp = select_idle_sibling(p, prev_cpu);
+		if (tmp >= 0) {
+			new_cpu = tmp;
+			goto unlock;
+		}
 	}
 
 	while (sd) {
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-28 14:27 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen

On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> CPUs, it looks for the least loaded CPU.  And it's damn expensive.

Sorry, I was just assuming based on the commit message from when WAKE_IDLE
was removed; this isn't my area.

> Rewriting this entire thing is somewhere on the todo list :/

Thanks, I'm building and deploying this so I can run our perf test; I'll
have results in ~3 hours.

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-29 21:03 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
> So maybe you want something like the below; that cures the thing Morten
> raised, and we continue looking for sd even after we found affine_sd.
> [...]

Summarizing what I've found so far:

- We turn on SD_BALANCE_WAKE on our domains for our 3.10 boxes, but not
  for our 4.0 boxes (due to some weird configuration issue).

- Running with this patch is better than plain 4.0 but not as good as my
  patch; running with SD_BALANCE_WAKE set and not set makes no difference
  to the runs.

- I took out the sd = NULL; bit from the affine case like you said on IRC
  and I get similar results as before.

- I'm thoroughly confused as to why my patch did anything, since we
  weren't turning on SD_BALANCE_WAKE on 4.0 in my previous runs (I assume;
  it isn't set now, so I'm pretty sure the problem has always been there).
  We should have always had sd == NULL, which means we would only ever
  have gotten the task cpu, I guess.

Now I'm looking at the code in select_idle_sibling and we do this

	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			if (!cpumask_intersects(sched_group_cpus(sg),
						tsk_cpus_allowed(p)))
				goto next;

			for_each_cpu(i, sched_group_cpus(sg)) {
				if (i == target || !idle_cpu(i))
					goto next;
			}

			return cpumask_first_and(sched_group_cpus(sg),
						 tsk_cpus_allowed(p));
next:
			sg = sg->next;
		} while (sg != sd->groups);
	}

We walk all the scheduling groups of the scheduling domain, and if any of
the group's cpus is not idle (or is the target) we skip the whole group.
Isn't a scheduling group a group of CPUs?  Why can't we pick an idle CPU
in a group that also contains a non-idle cpu or the target cpu?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-30  3:55 UTC
To: Josef Bacik; Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Fri, 2015-05-29 at 17:03 -0400, Josef Bacik wrote:
> We walk all the scheduling groups of the scheduling domain, and if any
> of the group's cpus is not idle (or is the target) we skip the whole
> group.  Isn't a scheduling group a group of CPUs?  Why can't we pick an
> idle CPU in a group that also contains a non-idle cpu or the target cpu?

We select an idle core if we can get one.  Yes, that leaves a pile of SMT
threads not checked/selected, but if you're gonna do a full search of a
large socket (humongous sparc-thing, shudder), you may as well eat the
BALANCE_WAKE overhead.

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-01 19:38 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 05/29/2015 05:03 PM, Josef Bacik wrote:
> On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>> So maybe you want something like the below; that cures the thing Morten
>> raised, and we continue looking for sd even after we found affine_sd.
>> [...]

Ok, I got this patch to give me the same performance as all our other crap,
I just need to apply this incremental:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..e11cfec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
-		else if (!want_affine || (want_affine && affine_sd))
-			break;
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
-		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
 	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {

And everything works fine.  Does that seem reasonable?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-06-01 20:42 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On Mon, 2015-06-01 at 15:38 -0400, Josef Bacik wrote:
> Ok, I got this patch to give me the same performance as all our other
> crap, I just need to apply this incremental:
>
>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
> -		else if (!want_affine || (want_affine && affine_sd))
> -			break;
>  	}

That bit worries me a bit, because it causes us to have a weird definition
for what sd is.

Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
other sd_flag) set.

But with WAKE_AFFINE, it's the first domain that satisfies the wake affine
constraint of covering both the previous and waking cpu.  It basically
reduces sd to affine_sd.
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-01 21:03 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> That bit worries me a bit, because it causes us to have a weird
> definition for what sd is.
>
> Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
> other sd_flag) set.
>
> But with WAKE_AFFINE, it's the first domain that satisfies the wake
> affine constraint of covering both the previous and waking cpu.  It
> basically reduces sd to affine_sd.

I'm just giving you what made the performance good for us, I'm relying on
you to make it sane ;).  With that bit we break out, if !want_affine, as
soon as we don't find an sd that matches sd_flag, which seems less than
helpful.  I was going to change it to

	else if (want_affine && affine_sd)

but the thing is we don't do anything with affine_sd unless the check
below matches:

	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))

so it seems like we're leaving ourselves without sd set in a few cases
where we'd actually want it set, so I just nuked the whole thing and
carried on.

I'm all ears for other ideas; I'm just letting you know what my results
are, since you are the guys who actually understand this stuff.  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-02 17:12 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> That bit worries me a bit, because it causes us to have a weird
> definition for what sd is.
> [...]

Ok, I took Rik's idea and came out with this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..75073d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
-		else if (!want_affine || (want_affine && affine_sd))
+		else if (want_affine && affine_sd) {
+			sd = affine_sd;
 			break;
+		}
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
-		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
 	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {

But now that I re-read your response, I think this is even more what you
were worried about, not less.

Basically it comes down to: if sd isn't set then we get shit performance.
I realize that this magic to find an idle cpu when sd is set is pretty
heavy handed, but it's obviously helpful in our case.

So let me ask this question.  When do we want to do the heavy handed
search and when do we not?  With WAKE_AFFINE, what is our ultimate goal
vs the other SDs?  If we don't have an sd that matches our sd_flags, what
should we be doing?  Should we just go with whatever cpu we're on and
carry on?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Rik van Riel @ 2015-06-03 14:12 UTC
To: Josef Bacik, Peter Zijlstra; Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/02/2015 01:12 PM, Josef Bacik wrote:
> But now that I re-read your response, I think this is even more what
> you were worried about, not less.
>
> Basically it comes down to: if sd isn't set then we get shit
> performance.  I realize that this magic to find an idle cpu when sd is
> set is pretty heavy handed, but it's obviously helpful in our case.

Ingo and Peter appear to be worried that searching through too many idle
CPUs leads to bad performance.  Your test results seem to indicate that
finding an idle CPU really, really helps performance.

There is a policy vs mechanism thing here.  Ingo and Peter are worried
about the overhead in the mechanism of finding an idle CPU.  Your
measurements show that the policy of finding an idle CPU is the correct
one.

Can we make the policy change now, and optimize the mechanism later?
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-06-03 14:24 UTC
To: Rik van Riel; Cc: Josef Bacik, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
> There is a policy vs mechanism thing here.  Ingo and Peter are worried
> about the overhead in the mechanism of finding an idle CPU.  Your
> measurements show that the policy of finding an idle CPU is the correct
> one.

For his workload; I'm sure I can find a workload where it hurts.

In fact, I'm fairly sure Mike knows one from the top of his head, seeing
how he's the one playing about trying to shrink that idle search :-)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 14:24 ` Peter Zijlstra @ 2015-06-03 14:49 ` Josef Bacik 2015-06-03 15:30 ` Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 14:49 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/03/2015 10:24 AM, Peter Zijlstra wrote:
> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>
>> There is a policy vs mechanism thing here.  Ingo and Peter
>> are worried about the overhead in the mechanism of finding
>> an idle CPU.  Your measurements show that the policy of
>> finding an idle CPU is the correct one.
>
> For his workload; I'm sure I can find a workload where it hurts.
>
> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
> how he's the one playing about trying to shrink that idle search :-)
>

So the perf bench sched microbenchmarks are a pretty good analog for our
workload.  I run

  perf bench sched messaging -g 100 -l 10000
  perf bench sched pipe

5 times each and average the results to get an answer; the messaging one
is the closest analog and the one I look at.  I get like 56 seconds of
runtime on plain 4.0 and 47 seconds patched, it's how I check my little
experiments before doing the full real workload.

I don't want to tune the scheduler just for our workload, but the
microbenchmarks we have are also showing the same performance
improvements.  I would be super interested in workloads where this patch
doesn't help so we could integrate that workload into perf sched bench
to make us more confident in making policy changes in the scheduler.  So
Mike, if you have something specific in mind please elaborate and I'm
happy to do the legwork to get it into perf bench and to test things
until we're happy.
In the meantime I really want to get this fixed for us, I do not want to pull some weird old patch around for the next year until we rebase again next year, and then do this whole dance again. What would be the way forward for getting this fixed now? Do I need to hide it behind a sysctl or config option? Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
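[Editor's note: the two microbenchmark runs Josef describes above are easy to wrap in a small harness. The sketch below is illustrative only; `avg_runs` is a made-up helper, and the parsing assumes the benchmark prints a `Total time: <secs> [sec]` summary line the way `perf bench` does, so adjust the awk pattern for other tools.]

```shell
# avg_runs N CMD...: run CMD N times and average the "Total time" seconds.
# Sketch only -- assumes a "Total time: <secs> [sec]" output line.
avg_runs() {
    local runs="$1"; shift
    local i t sum=0
    for i in $(seq "$runs"); do
        # run the benchmark once and pull out the seconds field
        t=$("$@" 2>&1 | awk '/Total time:/ {print $3}')
        sum=$(awk -v a="$sum" -v b="$t" 'BEGIN { printf "%.3f", a + b }')
    done
    # print the mean over all runs
    awk -v s="$sum" -v n="$runs" 'BEGIN { printf "%.3f", s / n }'
}

# e.g.: avg_runs 5 perf bench sched messaging -g 100 -l 10000
#       avg_runs 5 perf bench sched pipe
```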
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 14:24 ` Peter Zijlstra 2015-06-03 14:49 ` Josef Bacik @ 2015-06-03 15:30 ` Mike Galbraith 2015-06-03 15:57 ` Josef Bacik 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 15:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, Josef Bacik, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: > On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: > > > There is a policy vs mechanism thing here. Ingo and Peter > > are worried about the overhead in the mechanism of finding > > an idle CPU. Your measurements show that the policy of > > finding an idle CPU is the correct one. > > For his workload; I'm sure I can find a workload where it hurts. > > In fact, I'm fairly sure Mike knows one from the top of his head, seeing > how he's the one playing about trying to shrink that idle search :-) Like anything where scheduling latency doesn't heavily dominate. Even if searching were free, bounces aren't, even for the very light. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 15:30 ` Mike Galbraith @ 2015-06-03 15:57 ` Josef Bacik 2015-06-03 16:53 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-03 15:57 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/03/2015 11:30 AM, Mike Galbraith wrote: > On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: >> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: >> >>> There is a policy vs mechanism thing here. Ingo and Peter >>> are worried about the overhead in the mechanism of finding >>> an idle CPU. Your measurements show that the policy of >>> finding an idle CPU is the correct one. >> >> For his workload; I'm sure I can find a workload where it hurts. >> >> In fact, I'm fairly sure Mike knows one from the top of his head, seeing >> how he's the one playing about trying to shrink that idle search :-) > > Like anything where scheduling latency doesn't heavily dominate. Even > if searching were free, bounces aren't, even for the very light. > If scheduling latency doesn't hurt then making the search shouldn't matter should it? I get that migrations aren't free, but it seems like they can't hurt that much. This application is huge, it's our webserver, we're doing like 400 requests per second on these things, and hands down moving stuff to idle cpus is beating the pants off of staying on the same cpu. Is there a specific workload I could build a test for that you think this approach would hurt? Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 15:57 ` Josef Bacik @ 2015-06-03 16:53 ` Mike Galbraith 2015-06-03 17:16 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 16:53 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote: > On 06/03/2015 11:30 AM, Mike Galbraith wrote: > > On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: > >> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: > >> > >>> There is a policy vs mechanism thing here. Ingo and Peter > >>> are worried about the overhead in the mechanism of finding > >>> an idle CPU. Your measurements show that the policy of > >>> finding an idle CPU is the correct one. > >> > >> For his workload; I'm sure I can find a workload where it hurts. > >> > >> In fact, I'm fairly sure Mike knows one from the top of his head, seeing > >> how he's the one playing about trying to shrink that idle search :-) > > > > Like anything where scheduling latency doesn't heavily dominate. Even > > if searching were free, bounces aren't, even for the very light. > > > > If scheduling latency doesn't hurt then making the search shouldn't > matter should it? I get that migrations aren't free, but it seems like > they can't hurt that much. Nah, they don't hurt much :) commit e0a79f529d5ba2507486d498b25da40911d95cf6 Author: Mike Galbraith <bitbucket@online.de> Date: Mon Jan 28 12:19:25 2013 +0100 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs > This application is huge, it's our > webserver, we're doing like 400 requests per second on these things, and > hands down moving stuff to idle cpus is beating the pants off of staying > on the same cpu. Is there a specific workload I could build a test for > that you think this approach would hurt? Thanks, Search cost hurts fast movers, as does dragging even a small footprint all over the place, as you can see above. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 16:53 ` Mike Galbraith @ 2015-06-03 17:16 ` Josef Bacik 2015-06-03 17:43 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-03 17:16 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/03/2015 12:53 PM, Mike Galbraith wrote: > On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote: >> On 06/03/2015 11:30 AM, Mike Galbraith wrote: >>> On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: >>>> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: >>>> >>>>> There is a policy vs mechanism thing here. Ingo and Peter >>>>> are worried about the overhead in the mechanism of finding >>>>> an idle CPU. Your measurements show that the policy of >>>>> finding an idle CPU is the correct one. >>>> >>>> For his workload; I'm sure I can find a workload where it hurts. >>>> >>>> In fact, I'm fairly sure Mike knows one from the top of his head, seeing >>>> how he's the one playing about trying to shrink that idle search :-) >>> >>> Like anything where scheduling latency doesn't heavily dominate. Even >>> if searching were free, bounces aren't, even for the very light. >>> >> >> If scheduling latency doesn't hurt then making the search shouldn't >> matter should it? I get that migrations aren't free, but it seems like >> they can't hurt that much. > > Nah, they don't hurt much :) > > commit e0a79f529d5ba2507486d498b25da40911d95cf6 > Author: Mike Galbraith <bitbucket@online.de> > Date: Mon Jan 28 12:19:25 2013 +0100 > > sched: Fix select_idle_sibling() bouncing cow syndrome > > If the previous CPU is cache affine and idle, select it. > > The current implementation simply traverses the sd_llc domain, > taking the first idle CPU encountered, which walks buddy pairs > hand in hand over the package, inflicting excruciating pain. 
> > 1 tbench pair (worst case) in a 10 core + SMT package: > > pre 15.22 MB/sec 1 procs > post 252.01 MB/sec 1 procs > > >> This application is huge, it's our >> webserver, we're doing like 400 requests per second on these things, and >> hands down moving stuff to idle cpus is beating the pants off of staying >> on the same cpu. Is there a specific workload I could build a test for >> that you think this approach would hurt? Thanks, > > Search cost hurts fast movers, as does dragging even a small footprint > all over the place, as you can see above. > Eesh ok, do you happen to remember how you ran tbench so I can add it to my tests here? In addition to fixing this problem we're also interested in tracking performance of new kernels so we don't have to do this "what the hell went wrong in the last 6 releases" dance every year, so I'm throwing every performance thing we find useful in our test infrastructure. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 17:16 ` Josef Bacik @ 2015-06-03 17:43 ` Mike Galbraith 2015-06-03 20:34 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 17:43 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote: > Eesh ok, do you happen to remember how you ran tbench so I can add it to > my tests here? In addition to fixing this problem we're also interested > in tracking performance of new kernels so we don't have to do this "what > the hell went wrong in the last 6 releases" dance every year, so I'm > throwing every performance thing we find useful in our test > infrastructure. Thanks, > > Josef Start a tbench server, then tbench -t 30 1 localhost. You're unlikely to find anything as painful as that bouncing cow bug was, but you won't have to look hard at all to find bounce pain. There are also other loads like your server where waking to an idle cpu dominates all else, pgbench is one of those. In that case, you've got a 1:N waker/wakee relationship, and what matters above ALL else is when the mother of all work (the single server thread) wants a CPU, it had better get it NOW, else the load stalls. Likewise, 'mom' being preempted hurts truckloads. Perhaps your server has a similar thing going on, keeping wakees the hell away from the waker rules all. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
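[Editor's note: Mike's worst-case pair amounts to two commands. Here it is sketched as a small wrapper; `run_pair` is a made-up helper, and it assumes `tbench`/`tbench_srv` (from the dbench package) are in PATH and that tbench prints a `Throughput <n> MB/sec ...` summary line.]

```shell
# run_pair [SECS]: start a loopback tbench server, run a single client
# pair against it, and print just the MB/sec figure.  Sketch only.
run_pair() {
    local secs="${1:-30}" srv
    tbench_srv &                 # loopback server
    srv=$!
    sleep 1                      # give it a moment to start listening
    # one client pair; keep only the throughput number
    tbench -t "$secs" 1 localhost | awk '/^Throughput/ {print $2}'
    kill "$srv" 2>/dev/null
}
```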
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 17:43 ` Mike Galbraith @ 2015-06-03 20:34 ` Josef Bacik 2015-06-04 4:52 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 20:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On 06/03/2015 01:43 PM, Mike Galbraith wrote:
> On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote:
>
>> Eesh ok, do you happen to remember how you ran tbench so I can add it to
>> my tests here?  In addition to fixing this problem we're also interested
>> in tracking performance of new kernels so we don't have to do this "what
>> the hell went wrong in the last 6 releases" dance every year, so I'm
>> throwing every performance thing we find useful in our test
>> infrastructure.  Thanks,
>>
>> Josef
>
> Start a tbench server, then tbench -t 30 1 localhost.  You're unlikely
> to find anything as painful as that bouncing cow bug was, but you won't
> have to look hard at all to find bounce pain.
>
> There are also other loads like your server where waking to an idle cpu
> dominates all else, pgbench is one of those.  In that case, you've got a
> 1:N waker/wakee relationship, and what matters above ALL else is when
> the mother of all work (the single server thread) wants a CPU, it had
> better get it NOW, else the load stalls.  Likewise, 'mom' being
> preempted hurts truckloads.  Perhaps your server has a similar thing
> going on, keeping wakees the hell away from the waker rules all.
>

Yeah our server has two waker threads (one per numa node) and then the N
number of wakee threads.  I'll run tbench and pgbench with the new
patches and see if there's a degradation.  Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 20:34 ` Josef Bacik @ 2015-06-04 4:52 ` Mike Galbraith 0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-06-04 4:52 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Wed, 2015-06-03 at 16:34 -0400, Josef Bacik wrote:
> On 06/03/2015 01:43 PM, Mike Galbraith wrote:

> > There are also other loads like your server where waking to an idle cpu
> > dominates all else, pgbench is one of those.  In that case, you've got a
> > 1:N waker/wakee relationship, and what matters above ALL else is when
> > the mother of all work (the single server thread) wants a CPU, it had
> > better get it NOW, else the load stalls.  Likewise, 'mom' being
> > preempted hurts truckloads.  Perhaps your server has a similar thing
> > going on, keeping wakees the hell away from the waker rules all.
> >
>
> Yeah our server has two waker threads (one per numa node) and then the N
> number of wakee threads.  I'll run tbench and pgbench with the new
> patches and see if there's a degradation.  Thanks,

If you look for wake_wide(), it could perhaps be used to select wider
search for only the right flavor load component when BALANCE_WAKE is
set.  That would let the cache lovers in your box continue to perform
while improving the 1:N component.

That wider search still needs to become cheaper though, low hanging
fruit being to stop searching when you find load = 0.. but you may meet
the energy efficient folks, who iirc want to make it even more
expensive.

wake_wide() inadvertently helped another sore spot btw - a gaggle of
pretty light tasks being awakened from an interrupt source tended to
cluster around that source, preventing such loads from being all they
can be in a very similar manner.  Xen (shudder;) showed that nicely in
older kernels, due to the way its weird dom0 gizmo works.
-Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-01 19:38 ` Josef Bacik 2015-06-01 20:42 ` Peter Zijlstra @ 2015-06-01 22:15 ` Rik van Riel 1 sibling, 0 replies; 73+ messages in thread
From: Rik van Riel @ 2015-06-01 22:15 UTC (permalink / raw)
To: Josef Bacik, Peter Zijlstra
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 03:38 PM, Josef Bacik wrote:
> Ok I got this patch to give me the same performance as all our other
> crap, just need to apply this incremental
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b71eb2b..e11cfec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
> -               else if (!want_affine || (want_affine && affine_sd))
> -                       break;
>         }
>
>         if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>                 prev_cpu = cpu;
> -               sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */

Given Peter's worries about wake_affine and affine_sd, should
the above be sd = affine_sd, in case select_idle_sibling cannot
find an idle sibling?

That way we can attempt to at least find an idle cpu inside the
affine_sd.

Of course, there may be subtleties here I am overlooking...

>         }
>
>         if (sd_flag & SD_BALANCE_WAKE) {

--
All rights reversed

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:05 ` Peter Zijlstra 2015-05-28 14:27 ` Josef Bacik 2015-05-29 21:03 ` Josef Bacik @ 2015-06-11 20:33 ` Josef Bacik 2015-06-12 3:42 ` Rik van Riel 2015-06-12 5:35 ` Mike Galbraith 3 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-11 20:33 UTC (permalink / raw) To: Peter Zijlstra Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team On 05/28/2015 07:05 AM, Peter Zijlstra wrote: > > So maybe you want something like the below; that cures the thing Morten > raised, and we continue looking for sd, even after we found affine_sd. > > It also avoids the pointless idle_cpu() check Mike raised by making > select_idle_sibling() return -1 if it doesn't find anything. > > Then it continues doing the full balance IFF sd was set, which is keyed > off of sd->flags. > > And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle > CPUs, it looks for the least loaded CPU. And its damn expensive. > > > Rewriting this entire thing is somewhere on the todo list :/ > Ugh I'm sorry, I've been running tests trying to get the numbers to look good when I noticed I was getting some inconsistencies in my results. Turns out I never actually tested your patch just plain, I had been testing it with BALANCE_WAKE, because I was under the assumption that was what was best for our workload. Since then I had fixed all of our scripts and such and noticed that it actually super duper sucks for us. So testing with this original patch everything is significantly better (this is with the default SD flags set, no changes at all). So now that I've wasted a good bit of my time and everybody elses, can we go about pushing this patch upstream? If you are happy with it the way it is I'll go ahead and pull it into our kernels and just watch to make sure it ends upstream at some point. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-11 20:33 ` Josef Bacik @ 2015-06-12 3:42 ` Rik van Riel 0 siblings, 0 replies; 73+ messages in thread From: Rik van Riel @ 2015-06-12 3:42 UTC (permalink / raw) To: Josef Bacik, Peter Zijlstra Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team On 06/11/2015 04:33 PM, Josef Bacik wrote: > Ugh I'm sorry, I've been running tests trying to get the numbers to look > good when I noticed I was getting some inconsistencies in my results. > Turns out I never actually tested your patch just plain, I had been > testing it with BALANCE_WAKE, because I was under the assumption that > was what was best for our workload. Since then I had fixed all of our > scripts and such and noticed that it actually super duper sucks for us. > So testing with this original patch everything is significantly better > (this is with the default SD flags set, no changes at all). > > So now that I've wasted a good bit of my time and everybody elses, can > we go about pushing this patch upstream? If you are happy with it the > way it is I'll go ahead and pull it into our kernels and just watch to > make sure it ends upstream at some point. Thanks, FWIW, Jirka has run some tests with the patch as well, and seen significant performance improvements on various tests, on various systems. Merging the patch makes perfect sense to me. Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:05 ` Peter Zijlstra ` (2 preceding siblings ...) 2015-06-11 20:33 ` Josef Bacik @ 2015-06-12 5:35 ` Mike Galbraith 2015-06-17 18:06 ` Josef Bacik 3 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-12 5:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen

On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:

> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>                  * If both cpu and prev_cpu are part of this domain,
>                  * cpu is a valid SD_WAKE_AFFINE target.
>                  */
> -               if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> +               if (want_affine && !affine_sd &&
> +                   (tmp->flags & SD_WAKE_AFFINE) &&
> +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>                         affine_sd = tmp;
> -                       break;
> -               }
>
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
> +               else if (!want_affine || (want_affine && affine_sd))
> +                       break;
>         }

Hm, new_cpu == cpu.

> -       if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +       if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>                 prev_cpu = cpu;
> +               sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
> +       }

If branch above is not taken, new_cpu remains cpu.

>         if (sd_flag & SD_BALANCE_WAKE) {
> -               new_cpu = select_idle_sibling(p, prev_cpu);
> -               goto unlock;
> +               int tmp = select_idle_sibling(p, prev_cpu);
> +               if (tmp >= 0) {
> +                       new_cpu = tmp;
> +                       goto unlock;
> +               }
>         }

If select_idle_sibling() returns -1, new_cpu remains cpu.

>
>         while (sd) {

If sd == NULL, we fall through and try to pull wakee despite nacked-by
tsk_cpus_allowed() or wake_affine().

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-12 5:35 ` Mike Galbraith @ 2015-06-17 18:06 ` Josef Bacik 2015-06-18 0:55 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-17 18:06 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen On 06/11/2015 10:35 PM, Mike Galbraith wrote: > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > >> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> * If both cpu and prev_cpu are part of this domain, >> * cpu is a valid SD_WAKE_AFFINE target. >> */ >> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && >> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { >> + if (want_affine && !affine_sd && >> + (tmp->flags & SD_WAKE_AFFINE) && >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) >> affine_sd = tmp; >> - break; >> - } >> >> if (tmp->flags & sd_flag) >> sd = tmp; >> + else if (!want_affine || (want_affine && affine_sd)) >> + break; >> } > > Hm, new_cpu == cpu. > >> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) >> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) { >> prev_cpu = cpu; >> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */ >> + } > > If branch above is not taken, new_cpu remains cpu. > >> if (sd_flag & SD_BALANCE_WAKE) { >> - new_cpu = select_idle_sibling(p, prev_cpu); >> - goto unlock; >> + int tmp = select_idle_sibling(p, prev_cpu); >> + if (tmp >= 0) { >> + new_cpu = tmp; >> + goto unlock; >> + } >> } > > If select_idle_sibling() returns -1, new_cpu remains cpu. > >> >> while (sd) { > > If sd == NULL, we fall through and try to pull wakee despite nacked-by > tsk_cpus_allowed() or wake_affine(). 
>

So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
like this

        if (tmp >= 0) {
                new_cpu = tmp;
                goto unlock;
        } else if (!want_affine) {
                new_cpu = prev_cpu;
        }

so we can make sure we're not being pushed onto a cpu that we aren't
allowed on?  Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-17 18:06 ` Josef Bacik @ 2015-06-18 0:55 ` Mike Galbraith 2015-06-18 3:46 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-18 0:55 UTC (permalink / raw) To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: > On 06/11/2015 10:35 PM, Mike Galbraith wrote: > > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > > If sd == NULL, we fall through and try to pull wakee despite nacked-by > > tsk_cpus_allowed() or wake_affine(). > > > > So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something > like this > > if (tmp >= 0) { > new_cpu = tmp; > goto unlock; > } else if (!want_affine) { > new_cpu = prev_cpu; > } > > so we can make sure we're not being pushed onto a cpu that we aren't > allowed on? Thanks, The buglet is a messenger methinks. You saying the patch helped without SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say that preferring cache is not harming your load after all. It now sounds as though wake_wide() may be what you're squabbling with. Things aren't adding up all that well. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 0:55 ` Mike Galbraith @ 2015-06-18 3:46 ` Josef Bacik 2015-06-18 4:12 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-18 3:46 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen

On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>> tsk_cpus_allowed() or wake_affine().
>>>
>>
>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>> like this
>>
>> if (tmp >= 0) {
>>         new_cpu = tmp;
>>         goto unlock;
>> } else if (!want_affine) {
>>         new_cpu = prev_cpu;
>> }
>>
>> so we can make sure we're not being pushed onto a cpu that we aren't
>> allowed on?  Thanks,
>
> The buglet is a messenger methinks.  You saying the patch helped without
> SD_BALANCE_WAKE being set is why I looked.  The buglet would seem to say
> that preferring cache is not harming your load after all.  It now sounds
> as though wake_wide() may be what you're squabbling with.
>
> Things aren't adding up all that well.

Yeah I'm horribly confused.  The other thing is I had to switch clusters
(I know, I know, I'm changing the parameters of the test).  So these new
boxes are haswell boxes, but basically the same otherwise, 2 socket 12
core with HT, just newer/faster CPUs.  I'll re-run everything again and
give the numbers so we're all on the same page again, but as it stands
now I think we have this

3.10 with wake_idle forward ported - good
4.0 stock - 20% perf drop
4.0 w/ Peter's patch - good
4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop

I can do all these iterations again to verify, is there any other
permutation you'd like to see?
Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 3:46 ` Josef Bacik @ 2015-06-18 4:12 ` Mike Galbraith 2015-07-02 17:44 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-18 4:12 UTC (permalink / raw) To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote: > On 06/17/2015 05:55 PM, Mike Galbraith wrote: > > On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: > >> On 06/11/2015 10:35 PM, Mike Galbraith wrote: > >>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > > > >>> If sd == NULL, we fall through and try to pull wakee despite nacked-by > >>> tsk_cpus_allowed() or wake_affine(). > >>> > >> > >> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something > >> like this > >> > >> if (tmp >= 0) { > >> new_cpu = tmp; > >> goto unlock; > >> } else if (!want_affine) { > >> new_cpu = prev_cpu; > >> } > >> > >> so we can make sure we're not being pushed onto a cpu that we aren't > >> allowed on? Thanks, > > > > The buglet is a messenger methinks. You saying the patch helped without > > SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say > > that preferring cache is not harming your load after all. It now sounds > > as though wake_wide() may be what you're squabbling with. > > > > Things aren't adding up all that well. > > Yeah I'm horribly confused. The other thing is I had to switch clusters > (I know, I know, I'm changing the parameters of the test). So these new > boxes are haswell boxes, but basically the same otherwise, 2 socket 12 > core with HT, just newer/faster CPUs. 
> I'll re-run everything again and
> give the numbers so we're all on the same page again, but as it stands
> now I think we have this
>
> 3.10 with wake_idle forward ported - good
> 4.0 stock - 20% perf drop
> 4.0 w/ Peter's patch - good
> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>
> I can do all these iterations again to verify, is there any other
> permutation you'd like to see?  Thanks,

Yeah, after re-baseline, please apply/poke these buttons individually in
4.0-virgin.

(cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)

---
 kernel/sched/fair.c     | 4 ++--
 kernel/sched/features.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4506,7 +4506,7 @@ static int wake_affine(struct sched_doma
         * If we wake multiple tasks be careful to not bounce
         * ourselves around too much.
         */
-       if (wake_wide(p))
+       if (sched_feat(WAKE_WIDE) && wake_wide(p))
                return 0;

        idx = sd->wake_idx;
@@ -4682,7 +4682,7 @@ static int select_idle_sibling(struct ta
        struct sched_group *sg;
        int i = task_cpu(p);

-       if (idle_cpu(target))
+       if (!sched_feat(PREFER_IDLE) || idle_cpu(target))
                return target;

        /*
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(PREFER_IDLE, true)
+SCHED_FEAT(WAKE_WIDE, true)

 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically

^ permalink raw reply [flat|nested] 73+ messages in thread
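[Editor's note: Mike's cat/prepend-NO_/echo recipe can be wrapped as below. The debugfs path is the real knob, but `toggle_feat` is a made-up helper; note that the real knob parses the written token and flips only that feature, whereas a plain stand-in file is simply overwritten. Flipping the real knob needs root and a mounted debugfs.]

```shell
# toggle_feat FEAT off|on: flip a scheduler feature via sched_features.
# Hypothetical helper; FEATURES_FILE defaults to the real debugfs knob.
FEATURES_FILE="${FEATURES_FILE:-/sys/kernel/debug/sched_features}"

toggle_feat() {
    local feat="$1" state="$2"
    case "$state" in
        # the kernel interprets a NO_-prefixed token as "turn it off"
        off) echo "NO_$feat" > "$FEATURES_FILE" ;;
        on)  echo "$feat"    > "$FEATURES_FILE" ;;
        *)   echo "usage: toggle_feat FEAT off|on" >&2; return 1 ;;
    esac
}

# e.g.: cat /sys/kernel/debug/sched_features   # see current flags
#       toggle_feat WAKE_WIDE off              # test NO_WAKE_WIDE
```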
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 4:12 ` Mike Galbraith @ 2015-07-02 17:44 ` Josef Bacik 2015-07-03 6:40 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-02 17:44 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/18/2015 12:12 AM, Mike Galbraith wrote: > On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote: >> On 06/17/2015 05:55 PM, Mike Galbraith wrote: >>> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: >>>> On 06/11/2015 10:35 PM, Mike Galbraith wrote: >>>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: >>> >>>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by >>>>> tsk_cpus_allowed() or wake_affine(). >>>>> >>>> >>>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something >>>> like this >>>> >>>> if (tmp >= 0) { >>>> new_cpu = tmp; >>>> goto unlock; >>>> } else if (!want_affine) { >>>> new_cpu = prev_cpu; >>>> } >>>> >>>> so we can make sure we're not being pushed onto a cpu that we aren't >>>> allowed on? Thanks, >>> >>> The buglet is a messenger methinks. You saying the patch helped without >>> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say >>> that preferring cache is not harming your load after all. It now sounds >>> as though wake_wide() may be what you're squabbling with. >>> >>> Things aren't adding up all that well. >> >> Yeah I'm horribly confused. The other thing is I had to switch clusters >> (I know, I know, I'm changing the parameters of the test). So these new >> boxes are haswell boxes, but basically the same otherwise, 2 socket 12 >> core with HT, just newer/faster CPUs. 
I'll re-run everything again and >> give the numbers so we're all on the same page again, but as it stands >> now I think we have this >> >> 3.10 with wake_idle forward ported - good >> 4.0 stock - 20% perf drop >> 4.0 w/ Peter's patch - good >> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop >> >> I can do all these iterations again to verify, is there any other >> permutation you'd like to see? Thanks, > > Yeah, after re-baseline, please apply/poke these buttons individually in > 4.0-virgin. > > (cat /sys/kernel/debug/sched_features, prepend NO_, echo it back) > Sorry it took me a while to get these numbers to you, migrating the whole fleet to a new setup broke the performance test suite thing so I've only just been able to run tests again. I'll do my best to describe what is going on and hopefully that will make the results make sense. This is on our webservers, which is HHVM. A request comes in for a page and this goes onto one of the two hhvm.node.# threads, one thread per NUMA node. From there it is farmed off to one of the worker threads. If there are no idle workers the request gets put on what is called the "select_queue". Basically the select_queue should never be larger than 0 in a perfect world. If it's more than we've hit latency somewhere and that's not good. The other measurement we care about is how long a thread spends on a request before it sends a response (this would be the actual work being done). Our tester slowly increases load to a group of servers until the select queue is consistently >= 1. That means we've loaded the boxes so high that they can't process the requests as soon as they've come in. Then it backs down and then ramps up a second time. It takes all of these measurements and puts them into these pretty graphs. There are 2 graphs we care about, the duration of the requests vs the requests per second and the probability that our select queue is >= 1 vs requests per second. 
Now for 3.10 vs 4.0 our request duration time is the same if not slightly better on 4.0, so once the workers are doing their job everything is a-ok. The problem is the probability the select queue >= 1 is way different on 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up to some RPS (requests per second) threshold and then shoots up to 100% after the threshold. I'll make a table of these graphs that hopefully makes sense, the numbers are different from run to run because of traffic and such, the test and control are both run at the same time. The header is the probability the select queue >=1 25% 50% 75% 4.0 plain: 371 388 402 control: 386 394 402 difference: 15 6 0 So with 4.0 its basically a straight line, at lower RPS we are getting a higher probability of a select queue >= 1. We are measuring the cpu delay avg ms thing from the scheduler netlink stuff which is how I noticed it was scheduler related, our cpu delay is way higher on 4.0 than it is on 3.10 or 4.0 with the wake idle patch. So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain 25% 50% 75% NO_PREFER_IDLE: 399 401 414 control: 385 408 416 difference: 14 7 2 The numbers don't really show it well, but the graphs are closer together, it's slightly more s shaped, but still not great. Next is NO_WAKE_WIDE, which is horrible 25% 50% 75% NO_WAKE_WIDE: 315 344 369 control: 373 380 388 difference: 58 36 19 This isn't even in the same ballpark, it's a way worse regression than plain. The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad 25% 50% 75% EVERYTHING: 327 360 383 control: 381 390 399 difference: 54 30 19 Hopefully that helps. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-02 17:44 ` Josef Bacik @ 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-03 6:40 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Thu, 2015-07-02 at 13:44 -0400, Josef Bacik wrote: > Now for 3.10 vs 4.0 our request duration time is the same if not > slightly better on 4.0, so once the workers are doing their job > everything is a-ok. > > The problem is the probability the select queue >= 1 is way different on > 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up > to some RPS (requests per second) threshold and then shoots up to 100% > after the threshold. I'll make a table of these graphs that hopefully > makes sense, the numbers are different from run to run because of > traffic and such, the test and control are both run at the same time. > The header is the probability the select queue >=1 > > 25% 50% 75% > 4.0 plain: 371 388 402 > control: 386 394 402 > difference: 15 6 0 So control is 3.10? Virgin? > So with 4.0 its basically a straight line, at lower RPS we are getting a > higher probability of a select queue >= 1. We are measuring the cpu > delay avg ms thing from the scheduler netlink stuff which is how I > noticed it was scheduler related, our cpu delay is way higher on 4.0 > than it is on 3.10 or 4.0 with the wake idle patch. > > So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain > 25% 50% 75% > NO_PREFER_IDLE: 399 401 414 > control: 385 408 416 > difference: 14 7 2 Hm. Throttling nohz may make larger delta. But never mind that. > The numbers don't really show it well, but the graphs are closer > together, it's slightly more s shaped, but still not great. 
> > Next is NO_WAKE_WIDE, which is horrible > > 25% 50% 75% > NO_WAKE_WIDE: 315 344 369 > control: 373 380 388 > difference: 58 36 19 > > This isn't even in the same ballpark, it's a way worse regression than > plain. Ok, this jibes perfectly with 1:N waker/wakee thingy. > The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad > > 25% 50% 75% > EVERYTHING: 327 360 383 > control: 381 390 399 > difference: 54 30 19 Ditto. Hm. Seems what this load should like best is if we detect 1:N, skip all of the routine gyrations, ie move the N (workers) infrequently, expend search cycles frequently only on the 1 (dispatch). Ponder.. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-03 6:40 ` Mike Galbraith @ 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-03 9:29 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote: > Hm. Seems what this load should like best is if we detect 1:N, skip all > of the routine gyrations, ie move the N (workers) infrequently, expend > search cycles frequently only on the 1 (dispatch). > > Ponder.. While taking a refresher peek at the wake_wide() thing, seems it's not really paying attention when the waker of many is awakened. I wonder if your load would see more benefit if it watched like so.. rashly assuming I didn't wreck it completely (iow, completely untested). --- kernel/sched/fair.c | 36 ++++++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 14 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4586,10 +4586,23 @@ static void record_wakee(struct task_str current->wakee_flips >>= 1; current->wakee_flip_decay_ts = jiffies; } + if (time_after(jiffies, p->wakee_flip_decay_ts + HZ)) { + p->wakee_flips >>= 1; + p->wakee_flip_decay_ts = jiffies; + } if (current->last_wakee != p) { current->last_wakee = p; current->wakee_flips++; + /* + * Flip the buddy as well. It's the ratio of flips + * with a socket size decayed cutoff that determines + * whether the pair are considered to be part of 1:N + * or M*N loads of a size that we need to spread, so + * ensure flips of both load components. The waker + * of many will have many more flips than its wakees. 
+ */ + p->wakee_flips++; } } @@ -4732,24 +4745,19 @@ static long effective_load(struct task_g static int wake_wide(struct task_struct *p) { + unsigned long max = max(current->wakee_flips, p->wakee_flips); + unsigned long min = min(current->wakee_flips, p->wakee_flips); int factor = this_cpu_read(sd_llc_size); /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. + * Yeah, it's a switching-frequency heuristic, and could mean the + * intended many wakees/waker relationship, or rapidly switching + * between a few. Use factor to try to automatically adjust such + * that the load spreads when it grows beyond what will fit in llc. */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (min < factor) + return 0; + return max > min * factor; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith @ 2015-07-04 15:57 ` Mike Galbraith 2015-07-05 7:17 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-04 15:57 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote: > Hm. Seems what this load should like best is if we detect 1:N, skip all > of the routine gyrations, ie move the N (workers) infrequently, expend > search cycles frequently only on the 1 (dispatch). > > Ponder.. Since it was too hot to do outside chores (any excuse will do;)... If we're (read /me) on track, the below should help. Per my tracing, it may want a wee bit of toning down actually, though when I trace virgin source I expect to see the same, namely Xorg and friends having "wide-load" tattooed across their hindquarters earlier than they should. It doesn't seem to hurt anything, but then demolishing a single llc box is a tad more difficult than demolishing a NUMA box. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when waking one of its N minions. Correct that, and give the user the option to do an expensive balance IFF select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the 1:N dispatcher of work, thus worth some extra effort. 
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> --- kernel/sched/fair.c | 89 +++++++++++++++++++++++++----------------------- kernel/sched/features.h | 6 +++ 2 files changed, 54 insertions(+), 41 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c } #ifdef CONFIG_SMP -static int select_idle_sibling(struct task_struct *p, int cpu); +static int select_idle_sibling(struct task_struct *p, int cpu, void *clear); static unsigned long task_h_load(struct task_struct *p); static inline void __update_task_entity_contrib(struct sched_entity *se); @@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas * Call select_idle_sibling to maybe find a better one. */ if (!cur) - env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu); + env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL); assign: task_numa_assign(env, cur, imp); @@ -4730,26 +4730,30 @@ static long effective_load(struct task_g #endif +/* + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolidating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of llc_size higher frequency in the other. With + * both conditions met, we can be relatively sure that we are seeing a 1:N + * relationship, and that load size exceeds socket size. + */ static int wake_wide(struct task_struct *p) { - int factor = this_cpu_read(sd_llc_size); - - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. 
- */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; + unsigned long waker_flips = current->wakee_flips; + unsigned long wakee_flips = p->wakee_flips; + int factor = this_cpu_read(sd_llc_size), ret = 1; + + if (waker_flips < wakee_flips) { + swap(waker_flips, wakee_flips); + /* Tell the caller that we're waking a 1:N waker */ + ret += sched_feat(WAKE_WIDE_BALANCE); } - - return 0; + if (wakee_flips < factor || waker_flips < wakee_flips * factor) + return 0; + return ret; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. - */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou /* * Try and locate an idle CPU in the sched_domain. */ -static int select_idle_sibling(struct task_struct *p, int target) +static int select_idle_sibling(struct task_struct *p, int target, void *clear) { struct sched_domain *sd; struct sched_group *sg; int i = task_cpu(p); if (idle_cpu(target)) - return target; + goto done; /* * If the prevous cpu is cache affine and idle, don't be stupid. */ - if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) - return i; + if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) { + target = i; + goto done; + } /* * Otherwise, iterate the domains and find an elegible idle cpu. 
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta sg = sg->next; } while (sg != sd->groups); } + return target; done: + if (clear) + *(void **)clear = 0; + return target; } /* @@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; - int want_affine = 0; + int new_cpu = prev_cpu; + int want_affine = 0, want_balance = 0; int sync = wake_flags & WF_SYNC; - if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); - rcu_read_lock(); + if (sd_flag & SD_BALANCE_WAKE) { + want_affine = wake_wide(p); + want_balance = want_affine > 1; + want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + if (!want_affine && !want_balance) + goto select; + } + for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) continue; @@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct * break; } - if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag || want_balance) sd = tmp; } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; +select: + new_cpu = select_idle_sibling(p, new_cpu, &sd); } while (sd) { struct sched_group *group; int weight; - if (!(sd->flags & sd_flag)) { + if (!(sd->flags & sd_flag) && !want_balance) { sd = sd->child; continue; } @@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) */ SCHED_FEAT(NUMA_RESIST_LOWER, false) #endif + +/* + * Perform expensive full wake balance for 1:N wakers when the + * selected cpu is not completely idle. 
+ */ +SCHED_FEAT(WAKE_WIDE_BALANCE, false) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-04 15:57 ` Mike Galbraith @ 2015-07-05 7:17 ` Mike Galbraith 2015-07-06 5:13 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-05 7:17 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Sat, 2015-07-04 at 17:57 +0200, Mike Galbraith wrote: > If we're (read /me) on track, the below should help. Per my tracing, > it may want a wee bit of toning down actually, though when I trace > virgin source I expect to see the same, namely Xorg and friends having > "wide-load" tattooed across their hindquarters earlier than they should. That's true, the only difference is that the virgin kernel will still try to pull the multi-waker when it is being awakened, which I can easily imagine going either way performance wise, depending on a pile of variables.. so let's just ask your box. V2: drop select_idle_sibling() changes (ick), try to save some cycles. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions. Correct that, and give the user the option to do an expensive balance IFF select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the 1:N waker, the dispatcher of work, thus worth some extra effort. Don't drill down through all domains though, stop searching at highest, we'll either have found the desired completely idle CPU, or if heavily loaded, the least loaded CPU of the least loaded group, which should still add up to an average scheduling latency improvement (knock wood). 
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> --- include/linux/sched.h | 7 ++- kernel/sched/fair.c | 86 ++++++++++++++++++++++++++++-------------------- kernel/sched/features.h | 6 +++ 3 files changed, 61 insertions(+), 38 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -962,7 +962,8 @@ extern void wake_up_q(struct wake_q_head #define SD_BALANCE_EXEC 0x0004 /* Balance on exec */ #define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */ #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ -#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ +#define SD_BALANCE_WAKE_IDLE 0x0020 /* Balance on wakeup, searching for an idle CPU */ +#define SD_WAKE_AFFINE 0x0040 /* Wake task to waking CPU */ #define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */ #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ @@ -1353,9 +1354,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,30 @@ static long effective_load(struct task_g #endif +/* + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolidating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of llc_size higher frequency in the other. With + * both conditions met, we can be relatively sure that we are seeing a 1:N + * relationship, and that load size exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { - int factor = this_cpu_read(sd_llc_size); - - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; + unsigned int waker_flips = current->wakee_flips; + unsigned int wakee_flips = p->wakee_flips; + int factor = this_cpu_read(sd_llc_size), ret = 1; + + if (waker_flips < wakee_flips) { + swap(waker_flips, wakee_flips); + /* Tell the caller that we're waking a 1:N waker */ + ret += sched_feat(WAKE_WIDE_IDLE); } - - return 0; + if (wakee_flips < factor || waker_flips < wakee_flips * factor) + return 0; + return ret; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -4865,6 +4862,10 @@ find_idlest_group(struct sched_domain *s load = target_load(i, load_idx); avg_load += load; + + if (sd_flag & SD_BALANCE_WAKE_IDLE && idle_cpu(i) && + cpumask_test_cpu(i, tsk_cpus_allowed(p))) + return group; } /* Adjust by relative CPU capacity of the group */ @@ -5021,14 +5022,21 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; - int want_affine = 0; + int new_cpu = prev_cpu; + int want_affine = 0, want_idle = 0; int sync = wake_flags & WF_SYNC; - if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); - rcu_read_lock(); + if (sd_flag & SD_BALANCE_WAKE) { + want_idle = wake_wide(p); + want_affine = !want_idle && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_idle = want_idle > 1; + if (!want_affine && !want_idle) + goto select; + if (want_idle) + sd_flag |= SD_BALANCE_WAKE_IDLE; + } + for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) continue; @@ -5043,23 +5051,25 @@ select_task_rq_fair(struct task_struct * break; } - if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag || want_idle) sd = tmp; } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; +select: + new_cpu = select_idle_sibling(p, new_cpu); + if (want_idle && (new_cpu != prev_cpu || idle_cpu(prev_cpu))) + sd = NULL; } while (sd) { struct sched_group *group; int weight; - if (!(sd->flags & sd_flag)) { + if (!(sd->flags & sd_flag) && !want_idle) { sd = sd->child; continue; } @@ -5077,6 +5087,13 @@ select_task_rq_fair(struct task_struct * continue; } + /* + * We've either found an idle CPU, or the least loaded CPU in + * the least load group of the highest domain. 
Good enough. + */ + if (want_idle) + break; + /* Now try balancing at a lower domain level of new_cpu */ cpu = new_cpu; weight = sd->span_weight; @@ -5089,7 +5106,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) */ SCHED_FEAT(NUMA_RESIST_LOWER, false) #endif + +/* + * Perform expensive full wake balance for 1:N wakers when the + * selected cpu is not completely idle. + */ +SCHED_FEAT(WAKE_WIDE_IDLE, false) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-05 7:17 ` Mike Galbraith @ 2015-07-06 5:13 ` Mike Galbraith 2015-07-06 14:34 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-06 5:13 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team Hm. Piddling with pgbench, which doesn't seem to collapse into a quivering heap when load exceeds cores these days, deltas weren't all that impressive, but it does appreciate the extra effort a bit, and a bit more when clients receive it as well. If you test, and have time to piddle, you could try letting wake_wide() return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is the dispatcher. Numbers from my little desktop box. NO_WAKE_WIDE_IDLE postgres@homer:~> pgbench.sh clients 8 tps = 116697.697662 clients 12 tps = 115160.230523 clients 16 tps = 115569.804548 clients 20 tps = 117879.230514 clients 24 tps = 118281.753040 clients 28 tps = 116974.796627 clients 32 tps = 119082.163998 avg 117092.239 1.000 WAKE_WIDE_IDLE postgres@homer:~> pgbench.sh clients 8 tps = 124351.735754 clients 12 tps = 124419.673135 clients 16 tps = 125050.716498 clients 20 tps = 124813.042352 clients 24 tps = 126047.442307 clients 28 tps = 125373.719401 clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 WAKE_WIDE_IDLE (clients as well as server) postgres@homer:~> pgbench.sh clients 8 tps = 130539.795246 clients 12 tps = 128984.648554 clients 16 tps = 130564.386447 clients 20 tps = 129149.693118 clients 24 tps = 130211.119780 clients 28 tps = 130325.355433 clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 5:13 ` Mike Galbraith @ 2015-07-06 14:34 ` Josef Bacik 2015-07-06 18:36 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-06 14:34 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/06/2015 01:13 AM, Mike Galbraith wrote: > Hm. Piddling with pgbench, which doesn't seem to collapse into a > quivering heap when load exceeds cores these days, deltas weren't all > that impressive, but it does appreciate the extra effort a bit, and a > bit more when clients receive it as well. > > If you test, and have time to piddle, you could try letting wake_wide() > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is > the dispatcher. > > Numbers from my little desktop box. > > NO_WAKE_WIDE_IDLE > postgres@homer:~> pgbench.sh > clients 8 tps = 116697.697662 > clients 12 tps = 115160.230523 > clients 16 tps = 115569.804548 > clients 20 tps = 117879.230514 > clients 24 tps = 118281.753040 > clients 28 tps = 116974.796627 > clients 32 tps = 119082.163998 avg 117092.239 1.000 > > WAKE_WIDE_IDLE > postgres@homer:~> pgbench.sh > clients 8 tps = 124351.735754 > clients 12 tps = 124419.673135 > clients 16 tps = 125050.716498 > clients 20 tps = 124813.042352 > clients 24 tps = 126047.442307 > clients 28 tps = 125373.719401 > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 > > WAKE_WIDE_IDLE (clients as well as server) > postgres@homer:~> pgbench.sh > clients 8 tps = 130539.795246 > clients 12 tps = 128984.648554 > clients 16 tps = 130564.386447 > clients 20 tps = 129149.693118 > clients 24 tps = 130211.119780 > clients 28 tps = 130325.355433 > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 > I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get this solved upstream and then I'll rip out the old and put in the new, I'm happy to 
screw around until we're all happy. I'll throw this in a kernel this morning and run stuff today. Barring any issues with the testing infrastructure I should have results today. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 14:34 ` Josef Bacik @ 2015-07-06 18:36 ` Mike Galbraith 2015-07-06 19:41 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-06 18:36 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote: > On 07/06/2015 01:13 AM, Mike Galbraith wrote: > > Hm. Piddling with pgbench, which doesn't seem to collapse into a > > quivering heap when load exceeds cores these days, deltas weren't all > > that impressive, but it does appreciate the extra effort a bit, and a > > bit more when clients receive it as well. > > > > If you test, and have time to piddle, you could try letting wake_wide() > > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is > > the dispatcher. > > > > Numbers from my little desktop box. > > > > NO_WAKE_WIDE_IDLE > > postgres@homer:~> pgbench.sh > > clients 8 tps = 116697.697662 > > clients 12 tps = 115160.230523 > > clients 16 tps = 115569.804548 > > clients 20 tps = 117879.230514 > > clients 24 tps = 118281.753040 > > clients 28 tps = 116974.796627 > > clients 32 tps = 119082.163998 avg 117092.239 1.000 > > > > WAKE_WIDE_IDLE > > postgres@homer:~> pgbench.sh > > clients 8 tps = 124351.735754 > > clients 12 tps = 124419.673135 > > clients 16 tps = 125050.716498 > > clients 20 tps = 124813.042352 > > clients 24 tps = 126047.442307 > > clients 28 tps = 125373.719401 > > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 > > > > WAKE_WIDE_IDLE (clients as well as server) > > postgres@homer:~> pgbench.sh > > clients 8 tps = 130539.795246 > > clients 12 tps = 128984.648554 > > clients 16 tps = 130564.386447 > > clients 20 tps = 129149.693118 > > clients 24 tps = 130211.119780 > > clients 28 tps = 130325.355433 > > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 I had a typo in my script, 
so those desktop box numbers were all doing the same number of clients. It doesn't invalidate anything, but the individual deltas are just run to run variance.. not to mention that single cache box is not all that interesting for this anyway. That happens when interconnect becomes a player. > I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get > this solved upstream and then I'll rip out the old and put in the new, > I'm happy to screw around until we're all happy. I'll throw this in a > kernel this morning and run stuff today. Barring any issues with the > testing infrastructure I should have results today. Thanks, I'll be interested in your results. Taking pgbench to a little NUMA box, I'm seeing _nada_ outside of variance with master (crap). I have a way to win significantly for _older_ kernels, and that win over master _may_ provide some useful insight, but I don't trust postgres/pgbench as far as I can toss the planet, so don't have a warm fuzzy about trying to use it to approximate your real world load. BTW, what's your topology look like (numactl --hardware). -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 18:36 ` Mike Galbraith @ 2015-07-06 19:41 ` Josef Bacik 2015-07-07 4:01 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-06 19:41 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/06/2015 02:36 PM, Mike Galbraith wrote: > On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote: >> On 07/06/2015 01:13 AM, Mike Galbraith wrote: >>> Hm. Piddling with pgbench, which doesn't seem to collapse into a >>> quivering heap when load exceeds cores these days, deltas weren't all >>> that impressive, but it does appreciate the extra effort a bit, and a >>> bit more when clients receive it as well. >>> >>> If you test, and have time to piddle, you could try letting wake_wide() >>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is >>> the dispatcher. >>> >>> Numbers from my little desktop box. 
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 116697.697662
>>> clients 12 tps = 115160.230523
>>> clients 16 tps = 115569.804548
>>> clients 20 tps = 117879.230514
>>> clients 24 tps = 118281.753040
>>> clients 28 tps = 116974.796627
>>> clients 32 tps = 119082.163998  avg 117092.239 1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 124351.735754
>>> clients 12 tps = 124419.673135
>>> clients 16 tps = 125050.716498
>>> clients 20 tps = 124813.042352
>>> clients 24 tps = 126047.442307
>>> clients 28 tps = 125373.719401
>>> clients 32 tps = 126711.243383  avg 125252.510 1.069 1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 130539.795246
>>> clients 12 tps = 128984.648554
>>> clients 16 tps = 130564.386447
>>> clients 20 tps = 129149.693118
>>> clients 24 tps = 130211.119780
>>> clients 28 tps = 130325.355433
>>> clients 32 tps = 129585.656963  avg 129908.665 1.109 1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients. It doesn't invalidate anything, but the
> individual deltas are just run to run variance.. not to mention that
> single cache box is not all that interesting for this anyway. That
> happens when interconnect becomes a player.
>
>> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
>> this solved upstream and then I'll rip out the old and put in the new,
>> I'm happy to screw around until we're all happy. I'll throw this in a
>> kernel this morning and run stuff today. Barring any issues with the
>> testing infrastructure I should have results today. Thanks,
>
> I'll be interested in your results. Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap). I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so don't have a warm fuzzy about trying to
> use it to approximate your real world load.
>
> BTW, what's your topology look like (numactl --hardware).

So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the baseline, with a slight regression at lower RPS and a slight improvement at high RPS. I'm running with WAKE_WIDE_IDLE set now; that should be done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing next, so those results should come in the morning.

Here is the NUMA information from one of the boxes in the test cluster:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 19:41 ` Josef Bacik @ 2015-07-07 4:01 ` Mike Galbraith 2015-07-07 9:43 ` [patch] " Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-07 4:01 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the > baseline with a slight regression at lower RPS and a slight improvement > at high RPS. Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm not real keen on the feature unless your numbers are really good, and odds are that ain't gonna happen. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 4:01 ` Mike Galbraith
@ 2015-07-07 9:43 ` Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
2015-07-07 17:06 ` Josef Bacik
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 9:43 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>
>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>> baseline with a slight regression at lower RPS and a slight improvement
>> at high RPS.
>
> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
> not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.

More extensive testing in pedantic-man mode increased my confidence in that enough to sign off and ship the dirt-simple version. Any further twiddles should grow their own wings if they want to fly anyway; the simplest form helps your real-world load, as well as the not-so-real pgbench. My numbers for that are below.
virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12 tps =  96233.854271 1.000
clients 24 tps = 142234.686166 1.000
clients 36 tps = 148433.534531 1.000
clients 48 tps = 133105.634302 1.000
clients 60 tps = 128903.080371 1.000
clients 72 tps = 128591.821782 1.000
clients 84 tps = 114445.967116 1.000
clients 96 tps = 109803.557524 1.000  avg 125219.017 1.000

V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637 1.255
clients 24 tps = 144668.961468 1.017
clients 36 tps = 156705.239251 1.055
clients 48 tps = 152004.886893 1.141
clients 60 tps = 138582.113864 1.075
clients 72 tps = 136286.891104 1.059
clients 84 tps = 137420.986043 1.200
clients 96 tps = 135199.060242 1.231  avg 140207.645 1.119 1.000

V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12 tps = 121821.966162 1.265
clients 24 tps = 146446.388366 1.029
clients 36 tps = 151373.362190 1.019
clients 48 tps = 156806.730746 1.178
clients 60 tps = 133933.491567 1.039
clients 72 tps = 131460.489424 1.022
clients 84 tps = 130859.340261 1.143
clients 96 tps = 130787.476584 1.191  avg 137936.155 1.101 0.983

V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12 tps = 121297.791570
clients 24 tps = 145939.488312
clients 36 tps = 155336.090263
clients 48 tps = 149018.245323
clients 60 tps = 136730.079391
clients 72 tps = 134886.116831
clients 84 tps = 130493.283398
clients 96 tps = 126043.336074

sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions. Correct that, and don't bother doing domain traversal when we know that all we need to do is check for an idle cpu.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 	int wake_cpu;
 #endif

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-07 9:43 ` [patch] " Mike Galbraith @ 2015-07-07 13:40 ` Josef Bacik 2015-07-07 15:24 ` Mike Galbraith 2015-07-07 17:06 ` Josef Bacik 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-07 13:40 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/07/2015 05:43 AM, Mike Galbraith wrote: > On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote: >> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: >> >>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the >>> baseline with a slight regression at lower RPS and a slight improvement >>> at high RPS. >> >> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm >> not real keen on the feature unless your numbers are really good, and >> odds are that ain't gonna happen. > > More extensive testing in pedantic-man mode increased my confidence of > that enough to sign off and ship the dirt simple version. Any further > twiddles should grow their own wings if they want to fly anyway, the > simplest form helps your real world load, as well as the not so real > pgbench, my numbers for that below. > The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS version. I'll run that through the load tests this morning and let you know how it goes. I'm still seeing a slight regression at lower RPS, but it's like 1-2%, compared to ~15%. Once load ramps up we're good to go, not sure why that is but it may also be my sample size (the cluster is only 32 boxes, the minimum for decent results). Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 13:40 ` Josef Bacik
@ 2015-07-07 15:24 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 15:24 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Tue, 2015-07-07 at 09:40 -0400, Josef Bacik wrote:

> The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS
> version. I'll run that through the load tests this morning and let you
> know how it goes. I'm still seeing a slight regression at lower RPS,
> but it's like 1-2%, compared to ~15%.

If I'm to believe pgbench, and configs really are as close as they should be, there may be something between 3.12 and master which could account for your delta and then some. I may go hunting... heck, if I trusted pgbench I would be hunting.

patched SLE (resembles 3.12 if you squint)
postgres@nessler:~> pgbench.sh
clients 12 tps = 134795.901777
clients 24 tps = 156955.584523
clients 36 tps = 161450.044316
clients 48 tps = 171181.069044
clients 60 tps = 157896.858179
clients 72 tps = 155816.051709
clients 84 tps = 152456.003468
clients 96 tps = 121129.848008

patched master
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637
clients 24 tps = 144668.961468
clients 36 tps = 156705.239251
clients 48 tps = 152004.886893
clients 60 tps = 138582.113864
clients 72 tps = 136286.891104
clients 84 tps = 137420.986043
clients 96 tps = 135199.060242

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-07 9:43 ` [patch] " Mike Galbraith 2015-07-07 13:40 ` Josef Bacik @ 2015-07-07 17:06 ` Josef Bacik 2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-07 17:06 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/07/2015 05:43 AM, Mike Galbraith wrote: > On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote: >> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: >> >>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the >>> baseline with a slight regression at lower RPS and a slight improvement >>> at high RPS. >> >> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm >> not real keen on the feature unless your numbers are really good, and >> odds are that ain't gonna happen. > > More extensive testing in pedantic-man mode increased my confidence of > that enough to sign off and ship the dirt simple version. Any further > twiddles should grow their own wings if they want to fly anyway, the > simplest form helps your real world load, as well as the not so real > pgbench, my numbers for that below. 
>
> virgin master, 2 socket box
> postgres@nessler:~> pgbench.sh
> clients 12 tps =  96233.854271 1.000
> clients 24 tps = 142234.686166 1.000
> clients 36 tps = 148433.534531 1.000
> clients 48 tps = 133105.634302 1.000
> clients 60 tps = 128903.080371 1.000
> clients 72 tps = 128591.821782 1.000
> clients 84 tps = 114445.967116 1.000
> clients 96 tps = 109803.557524 1.000  avg 125219.017 1.000
>
> V3 (KISS, below)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 120793.023637 1.255
> clients 24 tps = 144668.961468 1.017
> clients 36 tps = 156705.239251 1.055
> clients 48 tps = 152004.886893 1.141
> clients 60 tps = 138582.113864 1.075
> clients 72 tps = 136286.891104 1.059
> clients 84 tps = 137420.986043 1.200
> clients 96 tps = 135199.060242 1.231  avg 140207.645 1.119 1.000
>
> V2 NO_WAKE_WIDE_IDLE
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121821.966162 1.265
> clients 24 tps = 146446.388366 1.029
> clients 36 tps = 151373.362190 1.019
> clients 48 tps = 156806.730746 1.178
> clients 60 tps = 133933.491567 1.039
> clients 72 tps = 131460.489424 1.022
> clients 84 tps = 130859.340261 1.143
> clients 96 tps = 130787.476584 1.191  avg 137936.155 1.101 0.983
>
> V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121297.791570
> clients 24 tps = 145939.488312
> clients 36 tps = 155336.090263
> clients 48 tps = 149018.245323
> clients 60 tps = 136730.079391
> clients 72 tps = 134886.116831
> clients 84 tps = 130493.283398
> clients 96 tps = 126043.336074
>
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to wakeup of the 1:N waker,
> returning 1 only when the 1:N waker is waking one of its minions.
>
> Correct that, and don't bother doing domain traversal when we know
> that all we need to do is check for an idle cpu.
>
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>

Ok, you can add

Tested-by: Josef Bacik <jbacik@fb.com>

The new patch is the best across the board: there are no regressions, and there's about a 5% improvement for the bulk of the run (25th percentile and up). Thanks for your help on this!

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* [patch] sched: beef up wake_wide()
2015-07-07 17:06 ` Josef Bacik
@ 2015-07-08 6:13 ` Mike Galbraith
2015-07-09 13:26 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-08 6:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know that all we need to do is check for an idle cpu. While at it, adjust task_struct bits; we don't need a 64-bit counter.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 	int wake_cpu;
 #endif

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
@ 2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-09 13:26 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>
> +/*
> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> + * A waker of many should wake a different task than the one last awakened
> + * at a frequency roughly N times higher than one of its wakees. In order
> + * to determine whether we should let the load spread vs consolidating to
> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> + * partner, and a factor of llc_size higher frequency in the other. With
> + * both conditions met, we can be relatively sure that we are seeing a 1:N
> + * relationship, and that load size exceeds socket size.
> + */
> static int wake_wide(struct task_struct *p)
> {
> +	unsigned int waker_flips = current->wakee_flips;
> +	unsigned int wakee_flips = p->wakee_flips;
> 	int factor = this_cpu_read(sd_llc_size);
>
> +	if (waker_flips < wakee_flips)
> +		swap(waker_flips, wakee_flips);

This makes the wakee/waker names useless; the end result is more like wakee_flips := client_flips, waker_flips := server_flips.

> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> +		return 0;

I don't get the first condition... why would the client ever flip? It only talks to that one server.

> +	return 1;
> }

> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> {
> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> 	int cpu = smp_processor_id();
> +	int new_cpu = prev_cpu;
> 	int want_affine = 0;
> 	int sync = wake_flags & WF_SYNC;
>
> 	rcu_read_lock();
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +		if (!want_affine)
> +			goto select_idle;
> +	}

So this preserves/makes worse the bug Morten spotted: even without want_affine we should still attempt SD_BALANCE_WAKE if set.

> +
> 	for_each_domain(cpu, tmp) {
> 		if (!(tmp->flags & SD_LOAD_BALANCE))
> 			continue;

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
@ 2015-07-09 14:07 ` Mike Galbraith
2015-07-09 14:46 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>
>> +/*
>> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
>> + * A waker of many should wake a different task than the one last awakened
>> + * at a frequency roughly N times higher than one of its wakees. In order
>> + * to determine whether we should let the load spread vs consolidating to
>> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
>> + * partner, and a factor of llc_size higher frequency in the other. With
>> + * both conditions met, we can be relatively sure that we are seeing a 1:N
>> + * relationship, and that load size exceeds socket size.
>> + */
>> static int wake_wide(struct task_struct *p)
>> {
>> +	unsigned int waker_flips = current->wakee_flips;
>> +	unsigned int wakee_flips = p->wakee_flips;
>> 	int factor = this_cpu_read(sd_llc_size);
>>
>> +	if (waker_flips < wakee_flips)
>> +		swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

True, perhaps a rename is in order.

>> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>> +		return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

So I was thinking too, and I initially cemented the relationship by flipping both. However, the thing works in virgin source, i.e. clients do in fact flip, so I removed that cementing based on the hard evidence.

>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>> {
>> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> 	int cpu = smp_processor_id();
>> +	int new_cpu = prev_cpu;
>> 	int want_affine = 0;
>> 	int sync = wake_flags & WF_SYNC;
>>
>> 	rcu_read_lock();
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +		if (!want_affine)
>> +			goto select_idle;
>> +	}
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Yeah. I can redo it if you want, but it seems a shame to traverse for nothing given we know SD_BALANCE_WAKE is so painful that nobody really really wants to do that. One has to override the other in any case, no?

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-09 14:46 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 16:07 +0200, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>>
>>> +/*
>>> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
>>> + * A waker of many should wake a different task than the one last awakened
>>> + * at a frequency roughly N times higher than one of its wakees. In order
>>> + * to determine whether we should let the load spread vs consolidating to
>>> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
>>> + * partner, and a factor of llc_size higher frequency in the other. With
>>> + * both conditions met, we can be relatively sure that we are seeing a 1:N
>>> + * relationship, and that load size exceeds socket size.
>>> + */
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> +	unsigned int waker_flips = current->wakee_flips;
>>> +	unsigned int wakee_flips = p->wakee_flips;
>>> 	int factor = this_cpu_read(sd_llc_size);
>>>
>>> +	if (waker_flips < wakee_flips)
>>> +		swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> True, perhaps a rename is in order.

I should perhaps add that waker/wakee_flips does make sense to me, in the sense that the task whose flips end up in the waker_flips bucket is our waker of many, vs being one of its many wakees.

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
1 sibling, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-10 5:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>> static int wake_wide(struct task_struct *p)
>> {
>> +	unsigned int waker_flips = current->wakee_flips;
>> +	unsigned int wakee_flips = p->wakee_flips;
>> 	int factor = this_cpu_read(sd_llc_size);
>>
>> +	if (waker_flips < wakee_flips)
>> +		swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

I settled on master/slave plus hopefully improved comment block.

>> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>> +		return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

(tightening the heuristic up a bit by one means or another would be good, but "if it ain't broke, don't fix it" applies for this patchlet)

>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>> {
>> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> 	int cpu = smp_processor_id();
>> +	int new_cpu = prev_cpu;
>> 	int want_affine = 0;
>> 	int sync = wake_flags & WF_SYNC;
>>
>> 	rcu_read_lock();
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +		if (!want_affine)
>> +			goto select_idle;
>> +	}
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Fixed.
wake_wide() may override want_affine as before, want_affine may override other ->flags as before, but a surviving domain selection now results in a full balance instead of a select_idle_sibling() call. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> --- include/linux/sched.h | 4 +-- kernel/sched/fair.c | 57 ++++++++++++++++++++++---------------------------- 2 files changed, 28 insertions(+), 33 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. 
With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. + */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5021,12 +5017,12 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { @@ -5040,6 +5036,8 @@ select_task_rq_fair(struct task_struct * if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { affine_sd = tmp; + /* Prefer affinity over any other flags */ + sd = NULL; break; } @@ -5048,12 +5046,10 @@ select_task_rq_fair(struct task_struct * } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; - } + if ((sd_flag & SD_BALANCE_WAKE) && (!sd || (!(sd->flags & SD_BALANCE_WAKE)))) + new_cpu = select_idle_sibling(p, new_cpu); while (sd) { struct sched_group *group; @@ -5089,7 +5085,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
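The flip heuristic in the patch above is compact enough to model outside the kernel. A standalone C sketch (illustrative only, not kernel code: `wake_wide_sketch` is an invented name, and the `llc_size` parameter stands in for `this_cpu_read(sd_llc_size)`):

```c
#include <assert.h>

/*
 * Standalone model of the wake_wide() heuristic (not kernel code).
 * The larger flip count becomes "master"; we spread ("wake wide") only
 * when the smaller count reaches llc_size AND the larger count is at
 * least llc_size times the smaller, i.e. the apparent partner count
 * exceeds the socket size.
 */
static int wake_wide_sketch(unsigned int waker_flips,
                            unsigned int wakee_flips,
                            unsigned int llc_size)
{
    unsigned int master = waker_flips;
    unsigned int slave = wakee_flips;
    unsigned int tmp;

    if (master < slave) {           /* open-coded swap() */
        tmp = master;
        master = slave;
        slave = tmp;
    }
    if (slave < llc_size || master < slave * llc_size)
        return 0;
    return 1;
}
```

With an 8-core LLC, a dispatcher flipping between dozens of workers trips the heuristic, while a 1:1 client/server pair never does, regardless of which side is current.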
* Re: [patch] sched: beef up wake_wide() 2015-07-10 5:19 ` Mike Galbraith @ 2015-07-10 13:41 ` Josef Bacik 2015-07-10 20:59 ` Josef Bacik 1 sibling, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-10 13:41 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 01:19 AM, Mike Galbraith wrote: > On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote: >> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote: >>> static int wake_wide(struct task_struct *p) >>> { >>> + unsigned int waker_flips = current->wakee_flips; >>> + unsigned int wakee_flips = p->wakee_flips; >>> int factor = this_cpu_read(sd_llc_size); >>> >>> + if (waker_flips < wakee_flips) >>> + swap(waker_flips, wakee_flips); >> >> This makes the wakee/waker names useless, the end result is more like >> wakee_flips := client_flips, waker_flips := server_flips. > > I settled on master/slave plus hopefully improved comment block. > >>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor) >>> + return 0; >> >> I don't get the first condition... why would the client ever flip? It >> only talks to that one server. > > (tightening heuristic up a bit by one means or another would be good, > but "if it ain't broke, don't fix it" applies for this patchlet) > >>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct * >>> { >>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; >>> int cpu = smp_processor_id(); >>> + int new_cpu = prev_cpu; >>> int want_affine = 0; >>> int sync = wake_flags & WF_SYNC; >>> >>> rcu_read_lock(); >>> + if (sd_flag & SD_BALANCE_WAKE) { >>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); >>> + if (!want_affine) >>> + goto select_idle; >>> + } >> >> So this preserves/makes worse the bug Morten spotted, even without >> want_affine we should still attempt SD_BALANCE_WAKE if set. > > Fixed. 
wake_wide() may override want_affine as before, want_affine may > override other ->flags as before, but a surviving domain selection now > results in a full balance instead of a select_idle_sibling() call. > > sched: beef up wake_wide() > > Josef Bacik reported that Facebook sees better performance with their > 1:N load (1 dispatch/node, N workers/node) when carrying an old patch > to try very hard to wake to an idle CPU. While looking at wake_wide(), > I noticed that it doesn't pay attention to the wakeup of a many partner > waker, returning 1 only when waking one of its many partners. > > Correct that, letting explicit domain flags override the heuristic. > > While at it, adjust task_struct bits, we don't need a 64bit counter. > > Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> > Tested-by: Josef Bacik <jbacik@fb.com> I'll give this new one a whirl and let you know how it goes. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-10 5:19 ` Mike Galbraith 2015-07-10 13:41 ` Josef Bacik @ 2015-07-10 20:59 ` Josef Bacik 2015-07-11 3:11 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-10 20:59 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 01:19 AM, Mike Galbraith wrote: > On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote: >> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote: >>> static int wake_wide(struct task_struct *p) >>> { >>> + unsigned int waker_flips = current->wakee_flips; >>> + unsigned int wakee_flips = p->wakee_flips; >>> int factor = this_cpu_read(sd_llc_size); >>> >>> + if (waker_flips < wakee_flips) >>> + swap(waker_flips, wakee_flips); >> >> This makes the wakee/waker names useless, the end result is more like >> wakee_flips := client_flips, waker_flips := server_flips. > > I settled on master/slave plus hopefully improved comment block. > >>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor) >>> + return 0; >> >> I don't get the first condition... why would the client ever flip? It >> only talks to that one server. > > (tightening heuristic up a bit by one means or another would be good, > but "if it ain't broke, don't fix it" applies for this patchlet) > >>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct * >>> { >>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; >>> int cpu = smp_processor_id(); >>> + int new_cpu = prev_cpu; >>> int want_affine = 0; >>> int sync = wake_flags & WF_SYNC; >>> >>> rcu_read_lock(); >>> + if (sd_flag & SD_BALANCE_WAKE) { >>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); >>> + if (!want_affine) >>> + goto select_idle; >>> + } >> >> So this preserves/makes worse the bug Morten spotted, even without >> want_affine we should still attempt SD_BALANCE_WAKE if set. > > Fixed. 
wake_wide() may override want_affine as before, want_affine may > override other ->flags as before, but a surviving domain selection now > results in a full balance instead of a select_idle_sibling() call. > > sched: beef up wake_wide() > > Josef Bacik reported that Facebook sees better performance with their > 1:N load (1 dispatch/node, N workers/node) when carrying an old patch > to try very hard to wake to an idle CPU. While looking at wake_wide(), > I noticed that it doesn't pay attention to the wakeup of a many partner > waker, returning 1 only when waking one of its many partners. > > Correct that, letting explicit domain flags override the heuristic. > > While at it, adjust task_struct bits, we don't need a 64bit counter. > > Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> > Tested-by: Josef Bacik <jbacik@fb.com> Not quite as awesome but still better than the baseline so we're good. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-10 20:59 ` Josef Bacik @ 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-11 3:11 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > Not quite as awesome but still better than the baseline so we're good. > Thanks, I personally like the other much better. We're not doing the user any favor by making the thing balance when SD_BALANCE_WAKE is set. Until such time as it ceases to be horribly painful, we're just paying a few cycles to help a theoretically existing masochist torture his box. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-11 3:11 ` Mike Galbraith @ 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 1 sibling, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-13 13:53 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 11:11 PM, Mike Galbraith wrote: > On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > >> Not quite as awesome but still better than the baseline so we're good. >> Thanks, > > I personally like the other much better. We're not doing the user any > favor by making the thing balance when SD_BALANCE_WAKE is set. Until > such time as it ceases to be horribly painful, we're just paying a few > cycles to help a theoretically existing masochist torture his box. > Agreed, I'm all for the faster version ;), Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik @ 2015-07-14 11:19 ` Peter Zijlstra 2015-07-14 13:49 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 11:19 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Sat, Jul 11, 2015 at 05:11:51AM +0200, Mike Galbraith wrote: > On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > > > Not quite as awesome but still better than the baseline so we're good. > > Thanks, > > I personally like the other much better. We're not doing the user any > favor by making the thing balance when SD_BALANCE_WAKE is set. Until > such time as it ceases to be horribly painful, we're just paying a few > cycles to help a theoretically existing masochist torture his box. OK, how about something like the below; it tightens things up by applying two rules: - We really should not continue looking for a balancing domain once SD_LOAD_BALANCE is not set. - SD (balance) flags should really be set in a single contiguous range, always starting at the bottom. The latter means what if !want_affine and the (first) sd doesn't have BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration junk you didn't like.. Hmm? --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 72 +++++++++++++++++++++++--------------------------- 2 files changed, 36 insertions(+), 40 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5035,20 +5031,14 @@ select_task_rq_fair(struct task_struct * */ if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { - affine_sd = tmp; - break; + /* Prefer affinity over any other flags */ + goto do_affine; } if (tmp->flags & sd_flag) sd = tmp; - } - - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + else if (!want_affine) + break; } while (sd) { @@ -5087,8 +5077,14 @@ select_task_rq_fair(struct task_struct * } unlock: rcu_read_unlock(); - return new_cpu; + +do_affine: + if (cpu != prev_cpu && wake_affine(tmp, p, sync)) + new_cpu = cpu; + + new_cpu = select_idle_sibling(p, new_cpu); + goto unlock; } /* ^ permalink raw reply [flat|nested] 73+ messages in thread
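The two rules Peter applies (stop once SD_LOAD_BALANCE is clear; balance flags contiguous from the bottom) can be modeled as a walk over an array of per-domain flag words. A toy sketch (not kernel code: the flag values are invented, `walk_domains` and the `pick` enum are made up for illustration, and the `prev_cpu` span test guarding the affine case is elided):

```c
#include <assert.h>

/* Flag bits: values are invented for this sketch, not the kernel's. */
#define SD_LOAD_BALANCE 0x1
#define SD_WAKE_AFFINE  0x2
#define SD_BALANCE_WAKE 0x4

enum pick { PICK_AFFINE, PICK_BALANCE, PICK_PREV };

/*
 * Model of the tightened walk: domains are ordered bottom-up, the walk
 * stops at the first domain with SD_LOAD_BALANCE clear, an affine domain
 * wins outright, and since balance flags are assumed contiguous from the
 * bottom, a !want_affine walk can stop at the first domain missing the
 * requested flag.
 */
static enum pick walk_domains(const unsigned int *flags, int n,
                              int want_affine, unsigned int sd_flag)
{
    int i, have_sd = 0;

    for (i = 0; i < n; i++) {
        if (!(flags[i] & SD_LOAD_BALANCE))
            break;
        if (want_affine && (flags[i] & SD_WAKE_AFFINE))
            return PICK_AFFINE;     /* prefer affinity over other flags */
        if (flags[i] & sd_flag)
            have_sd = 1;
        else if (!want_affine)
            break;                  /* flags contiguous from the bottom */
    }
    return have_sd ? PICK_BALANCE : PICK_PREV;
}

/* Example two-level hierarchies, bottom (e.g. SMT) first. */
static const unsigned int sd_wake_all[] = {
    SD_LOAD_BALANCE | SD_WAKE_AFFINE | SD_BALANCE_WAKE,
    SD_LOAD_BALANCE | SD_BALANCE_WAKE,
};
static const unsigned int sd_no_wake[] = {
    SD_LOAD_BALANCE | SD_WAKE_AFFINE,
    SD_LOAD_BALANCE,
};
static const unsigned int sd_no_lb[] = {
    0,
    SD_LOAD_BALANCE | SD_BALANCE_WAKE,
};
```

Note how `sd_no_wake` with `!want_affine` falls through with no domain selected: that case is exactly where the follow-up messages find the remaining trouble.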
* Re: [patch] sched: beef up wake_wide() 2015-07-14 11:19 ` Peter Zijlstra @ 2015-07-14 13:49 ` Mike Galbraith 2015-07-14 14:07 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 13:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > OK, how about something like the below; it tightens things up by > applying two rules: > > - We really should not continue looking for a balancing domain once > SD_LOAD_BALANCE is not set. > > - SD (balance) flags should really be set in a single contiguous range, > always starting at the bottom. > > The latter means what if !want_affine and the (first) sd doesn't have > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > junk you didn't like.. > > Hmm? Yeah, that's better. It's not big hairy deal either way, it just bugged me to knowingly toss those cycles out the window ;-) select_idle_sibling() looks kinda funny down there, but otoh when the day comes (hah) that we can just balance, it's closer to the exit. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 13:49 ` Mike Galbraith @ 2015-07-14 14:07 ` Peter Zijlstra 2015-07-14 14:17 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 14:07 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote: > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > > > OK, how about something like the below; it tightens things up by > > applying two rules: > > > > - We really should not continue looking for a balancing domain once > > SD_LOAD_BALANCE is not set. > > > > - SD (balance) flags should really be set in a single contiguous range, > > always starting at the bottom. > > > > The latter means what if !want_affine and the (first) sd doesn't have > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > > junk you didn't like.. > > > > Hmm? > > Yeah, that's better. It's not big hairy deal either way, it just bugged > me to knowingly toss those cycles out the window ;-) > > select_idle_sibling() looks kinda funny down there, but otoh when the > day comes (hah) that we can just balance, it's closer to the exit. Right, not too pretty, does this look beter? --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: mingo@redhat.com Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 68 +++++++++++++++++++++++--------------------------- 2 files changed, 34 insertions(+), 38 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd) { /* Prefer affinity over any other flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; - } + new_cpu = select_idle_sibling(p, new_cpu); - while (sd) { + } else while (sd) { struct sched_group *group; int weight; @@ -5085,10 +5081,10 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: - rcu_read_unlock(); + rcu_read_unlock(); return new_cpu; + } /* ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 14:07 ` Peter Zijlstra @ 2015-07-14 14:17 ` Mike Galbraith 2015-07-14 15:04 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 14:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 16:07 +0200, Peter Zijlstra wrote: > On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote: > > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > > > > > OK, how about something like the below; it tightens things up by > > > applying two rules: > > > > > > - We really should not continue looking for a balancing domain once > > > SD_LOAD_BALANCE is not set. > > > > > > - SD (balance) flags should really be set in a single contiguous range, > > > always starting at the bottom. > > > > > > The latter means what if !want_affine and the (first) sd doesn't have > > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > > > junk you didn't like.. > > > > > > Hmm? > > > > Yeah, that's better. It's not big hairy deal either way, it just bugged > > me to knowingly toss those cycles out the window ;-) > > > > select_idle_sibling() looks kinda funny down there, but otoh when the > > day comes (hah) that we can just balance, it's closer to the exit. > > Right, not too pretty, does this look beter? There's a buglet, I was just about to mention the inverse in the other. 
> @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct * > > if (tmp->flags & sd_flag) > sd = tmp; > + else if (!want_affine) > + break; > } > > - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) > - prev_cpu = cpu; > + if (affine_sd) { /* Prefer affinity over any other flags */ > + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) > + new_cpu = cpu; > > - if (sd_flag & SD_BALANCE_WAKE) { > - new_cpu = select_idle_sibling(p, prev_cpu); > - goto unlock; > - } > + new_cpu = select_idle_sibling(p, new_cpu); We'll not look for a idle cpu when wake_wide() naks want_affine. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
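The buglet is easy to make concrete with toy dispatch models for the tail of select_task_rq_fair() (illustrative only, not kernel code: the function names, enum, and boolean summaries of the domain walk are all invented for the sketch):

```c
#include <assert.h>

enum action { USE_PREV, IDLE_SIBLING, FULL_BALANCE };

/*
 * 'have_affine_sd' and 'have_balance_sd' summarize the domain walk;
 * 'is_wakeup' models SD_BALANCE_WAKE being set in sd_flag.
 */

/* Buggy version: the idle-cpu search is only reachable via affine_sd. */
static enum action dispatch_buggy(int have_affine_sd, int have_balance_sd,
                                  int is_wakeup)
{
    (void)is_wakeup;
    if (have_affine_sd)
        return IDLE_SIBLING;        /* wake_affine() + select_idle_sibling() */
    if (have_balance_sd)
        return FULL_BALANCE;
    return USE_PREV;                /* wake_wide() wakeup: no idle search */
}

/* Fixed version: wakeups without a balance domain still search for idle. */
static enum action dispatch_fixed(int have_affine_sd, int have_balance_sd,
                                  int is_wakeup)
{
    if (have_affine_sd)
        have_balance_sd = 0;        /* prefer wake_affine over balance flags */
    if (!have_balance_sd)
        return is_wakeup ? IDLE_SIBLING : USE_PREV;
    return FULL_BALANCE;
}
```

The interesting case is a wake_wide() wakeup on a box without SD_BALANCE_WAKE set anywhere: no affine domain, no balance domain, yet the whole point of the patch is to still hunt for an idle CPU.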
* Re: [patch] sched: beef up wake_wide() 2015-07-14 14:17 ` Mike Galbraith @ 2015-07-14 15:04 ` Peter Zijlstra 2015-07-14 15:39 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 15:04 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: > There's a buglet, > We'll not look for a idle cpu when wake_wide() naks want_affine. *sigh* indeed.. fixing that'll bring us very close to what we started out wiht.. The one XXX there raises the question on whether we should always so select_idle_sibling() if we do not have a suitable balance flag, or only on wakeups. --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 69 ++++++++++++++++++++++++-------------------------- 2 files changed, 36 insertions(+), 37 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flags & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 15:04 ` Peter Zijlstra @ 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik 2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 15:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote: > On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: > > There's a buglet, > > > We'll not look for an idle cpu when wake_wide() naks want_affine. > > *sigh* indeed.. fixing that'll bring us very close to what we started > out with.. > > The one XXX there raises the question on whether we should always do > select_idle_sibling() if we do not have a suitable balance flag, or only > on wakeups. That's what I've been sitting here waffling over, finally convinced myself that should the user turn FORK/EXEC off, he shouldn't find that a substitute quietly slipped in.. though otoh.. crap, guess I'm not done waffling after all. Yeah, this will work just fine ;-) (typos fixed) --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many-partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 67 ++++++++++++++++++++++++-------------------------- 2 files changed, 35 insertions(+), 36 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5021,17 +5017,17 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5045,17 +5041,21 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5089,7 +5089,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 15:39 ` Mike Galbraith @ 2015-07-14 16:01 ` Josef Bacik 2015-07-14 17:59 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-14 16:01 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/14/2015 11:39 AM, Mike Galbraith wrote: > On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote: >> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: >>> There's a buglet, >> >>> We'll not look for an idle cpu when wake_wide() naks want_affine. >> >> *sigh* indeed.. fixing that'll bring us very close to what we started >> out with.. >> >> The one XXX there raises the question on whether we should always do >> select_idle_sibling() if we do not have a suitable balance flag, or only >> on wakeups. > > That's what I've been sitting here waffling over, finally convinced > myself that should the user turn FORK/EXEC off, he shouldn't find that a > substitute quietly slipped in.. though otoh.. crap, guess I'm not done > waffling after all. Yeah, this will work just fine ;-) > > (typos fixed) > We happy with this or should I wait for more patches to fly by before I test something ;)? Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 16:01 ` Josef Bacik @ 2015-07-14 17:59 ` Mike Galbraith 2015-07-15 17:11 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 17:59 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote: > We happy with this or should I wait for more patches to fly by before I > test something ;)? Yeah, it should be The End time. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 17:59 ` Mike Galbraith @ 2015-07-15 17:11 ` Josef Bacik 0 siblings, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-15 17:11 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/14/2015 01:59 PM, Mike Galbraith wrote: > On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote: > >> We happy with this or should I wait for more patches to fly by before I >> test something ;)? > > Yeah, it should be The End time. It's showing the same thing as one of the many earlier patches: it's better at high RPS, but at low RPS there's a 3% regression. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* [tip:sched/core] sched/fair: Beef up wake_wide() 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik @ 2015-08-03 17:07 ` tip-bot for Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread From: tip-bot for Mike Galbraith @ 2015-08-03 17:07 UTC (permalink / raw) To: linux-tip-commits Cc: tglx, linux-kernel, hpa, Kernel-team, jbacik, mingo, peterz, torvalds, umgwanakikbuti, efault Commit-ID: 63b0e9edceec10fa41ec33393a1515a5ff444277 Gitweb: http://git.kernel.org/tip/63b0e9edceec10fa41ec33393a1515a5ff444277 Author: Mike Galbraith <umgwanakikbuti@gmail.com> AuthorDate: Tue, 14 Jul 2015 17:39:50 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 3 Aug 2015 12:21:23 +0200 sched/fair: Beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64-bit counter. Tested-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [ Tidy things up. 
] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team<Kernel-team@fb.com> Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/linux/sched.h | 4 +-- kernel/sched/fair.c | 67 +++++++++++++++++++++++++-------------------------- 2 files changed, 35 insertions(+), 36 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7412070..65a8a86 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8b384b8d..ea23f9f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. 
Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. + */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5017,17 +5013,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 10:21 ` Peter Zijlstra 2015-05-28 11:05 ` Peter Zijlstra @ 2015-05-28 11:16 ` Mike Galbraith 2015-05-28 11:49 ` Ingo Molnar 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-05-28 11:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > In fact, select_idle_sibling() is already too expensive on current > server hardware (far too damn many cpus in a LLC domain). Yup. I've played with rate limiting motion per task because of that. Packages have gotten way too damn big. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith @ 2015-05-28 11:49 ` Ingo Molnar 2015-05-28 12:15 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Ingo Molnar @ 2015-05-28 11:49 UTC (permalink / raw) To: Mike Galbraith; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel * Mike Galbraith <mgalbraith@novell.com> wrote: > On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > > > In fact, select_idle_sibling() is already too expensive on current > > server hardware (far too damn many cpus in a LLC domain). > > Yup. I've played with rate limiting motion per task because of that. > Packages have gotten way too damn big. What's the biggest you've seen? Thanks, Ingo ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:49 ` Ingo Molnar @ 2015-05-28 12:15 ` Mike Galbraith 2015-05-28 12:19 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-05-28 12:15 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > * Mike Galbraith <mgalbraith@novell.com> wrote: > > > On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > > > > > In fact, select_idle_sibling() is already too expensive on current > > > server hardware (far too damn many cpus in a LLC domain). > > > > Yup. I've played with rate limiting motion per task because of that. > > Packages have gotten way too damn big. > > What's the biggest you've seen? 15 cores so far. It'll no doubt grow. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:15 ` Mike Galbraith @ 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 0 siblings, 2 replies; 73+ messages in thread From: Peter Zijlstra @ 2015-05-28 12:19 UTC (permalink / raw) To: Mike Galbraith; +Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel > On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > > What's the biggest you've seen? Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Tells us HSW-E[PX] have 18-core / 36-thread SKUs. But yes, what Mike says, it's bound to only get bigger. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:19 ` Peter Zijlstra @ 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 1 sibling, 0 replies; 73+ messages in thread From: Ingo Molnar @ 2015-05-28 12:29 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Mike Galbraith, Josef Bacik, riel, mingo, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > > > > What's the biggest you've seen? > > Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 > > Tells us HSW-E[PX] have 18-core / 36-thread SKUs. > > But yes, what Mike says, it's bound to only get bigger. So it's starting to get big enough to warrant an optimization of the way we account and discover idle CPUs: So when a CPU goes idle, it has idle cycles it could spend on registering itself in either an idle-CPUs bitmap, or in an idle-CPUs queue. The queue (or bitmap) would strictly be only shared between CPUs within the same domain, so the cache bouncing cost from that is still small and package-local. (We have remote access overhead in select_idle_sibling() already, due to having to access half of all remote rqs on average.) Such an approach would make select_idle_sibling() independent of the size of the cores domain, it would make it essentially O(1). ( There's a bit of a complication with rq->wake_list, but I think it would be good enough to just register/unregister from the idle handler, if something is idle only short term it should probably not be considered for SMP balancing. ) But I'd definitely not go towards making our SMP balancing macro idle selection decisions poorer, just because our internal implementation is O(nr_cores_per_package) ... Agreed? Thanks, Ingo ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar @ 2015-05-28 15:22 ` David Ahern 1 sibling, 0 replies; 73+ messages in thread From: David Ahern @ 2015-05-28 15:22 UTC (permalink / raw) To: Peter Zijlstra, Mike Galbraith Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel On 5/28/15 6:19 AM, Peter Zijlstra wrote: >> On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > >>> What's the biggest you've seen? > > Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 > > Tell us HSW-E[PX] have 18 cores 36 thread SKUs. > > But yes, what Mike says, its bound to only get bigger. sparc M7: 8 threads per core, 32 cores per socket = 256 cpus per socket ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik 2015-05-28 3:46 ` Mike Galbraith 2015-05-28 10:21 ` Peter Zijlstra @ 2015-05-28 11:55 ` Srikar Dronamraju 2 siblings, 0 replies; 73+ messages in thread From: Srikar Dronamraju @ 2015-05-28 11:55 UTC (permalink / raw) To: Josef Bacik; +Cc: riel, mingo, peterz, linux-kernel > At Facebook we have a pretty heavily multi-threaded application that is > sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code > because it gives us a pretty significant performance gain (like 20%). It turns > out this is because there are cases where the scheduler puts our task on a busy > CPU when there are idle CPU's in the system. We verify this by reading the > cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we > get much lower numbers vs baseline. > Was this application run under a cpu cgroup? Because we were seeing bursty workloads exhibiting this behaviour esp when run under cpu cgroups. http://mid.gmane.org/53A11A89.5000602@linux.vnet.ibm.com -- Thanks and Regards Srikar ^ permalink raw reply [flat|nested] 73+ messages in thread
end of thread, other threads:[~2015-08-03 17:08 UTC | newest] Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik 2015-05-28 3:46 ` Mike Galbraith 2015-05-28 9:49 ` Morten Rasmussen 2015-05-28 10:57 ` Mike Galbraith 2015-05-28 11:48 ` Morten Rasmussen 2015-05-28 11:49 ` Mike Galbraith 2015-05-28 10:21 ` Peter Zijlstra 2015-05-28 11:05 ` Peter Zijlstra 2015-05-28 14:27 ` Josef Bacik 2015-05-29 21:03 ` Josef Bacik 2015-05-30 3:55 ` Mike Galbraith 2015-06-01 19:38 ` Josef Bacik 2015-06-01 20:42 ` Peter Zijlstra 2015-06-01 21:03 ` Josef Bacik 2015-06-02 17:12 ` Josef Bacik 2015-06-03 14:12 ` Rik van Riel 2015-06-03 14:24 ` Peter Zijlstra 2015-06-03 14:49 ` Josef Bacik 2015-06-03 15:30 ` Mike Galbraith 2015-06-03 15:57 ` Josef Bacik 2015-06-03 16:53 ` Mike Galbraith 2015-06-03 17:16 ` Josef Bacik 2015-06-03 17:43 ` Mike Galbraith 2015-06-03 20:34 ` Josef Bacik 2015-06-04 4:52 ` Mike Galbraith 2015-06-01 22:15 ` Rik van Riel 2015-06-11 20:33 ` Josef Bacik 2015-06-12 3:42 ` Rik van Riel 2015-06-12 5:35 ` Mike Galbraith 2015-06-17 18:06 ` Josef Bacik 2015-06-18 0:55 ` Mike Galbraith 2015-06-18 3:46 ` Josef Bacik 2015-06-18 4:12 ` Mike Galbraith 2015-07-02 17:44 ` Josef Bacik 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 2015-07-05 7:17 ` Mike Galbraith 2015-07-06 5:13 ` Mike Galbraith 2015-07-06 14:34 ` Josef Bacik 2015-07-06 18:36 ` Mike Galbraith 2015-07-06 19:41 ` Josef Bacik 2015-07-07 4:01 ` Mike Galbraith 2015-07-07 9:43 ` [patch] " Mike Galbraith 2015-07-07 13:40 ` Josef Bacik 2015-07-07 15:24 ` Mike Galbraith 2015-07-07 17:06 ` Josef Bacik 2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith 2015-07-09 13:26 ` Peter Zijlstra 2015-07-09 14:07 ` Mike Galbraith 2015-07-09 14:46 ` Mike Galbraith 2015-07-10 5:19 ` Mike Galbraith 
2015-07-10 13:41 ` Josef Bacik 2015-07-10 20:59 ` Josef Bacik 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 2015-07-14 13:49 ` Mike Galbraith 2015-07-14 14:07 ` Peter Zijlstra 2015-07-14 14:17 ` Mike Galbraith 2015-07-14 15:04 ` Peter Zijlstra 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik 2015-07-14 17:59 ` Mike Galbraith 2015-07-15 17:11 ` Josef Bacik 2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith 2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith 2015-05-28 11:49 ` Ingo Molnar 2015-05-28 12:15 ` Mike Galbraith 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 2015-05-28 11:55 ` Srikar Dronamraju