* [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
@ 2015-05-27 21:22 Josef Bacik
2015-05-28 3:46 ` Mike Galbraith
` (2 more replies)
0 siblings, 3 replies; 73+ messages in thread
From: Josef Bacik @ 2015-05-27 21:22 UTC (permalink / raw)
To: riel, mingo, peterz, linux-kernel
[ sorry if you get this twice, it seems like the first submission got lost ]
At Facebook we have a pretty heavily multi-threaded application that is
sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
because it gives us a pretty significant performance gain (like 20%). It turns
out this is because there are cases where the scheduler puts our task on a busy
CPU when there are idle CPUs in the system. We verify this by reading the
cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we
get much lower numbers vs baseline.
SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however it is just
looking for an idle sibling, preferring affinity over all else. This is not
helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu, not
guarantee affinity. Fix this by first trying to find an idle sibling, and then
if the cpu is not idle fall through to the logic to find an idle cpu. With this
patch we get slightly better performance than with our forward port of
SD_WAKE_IDLE. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 241213b..03dafa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (sd_flag & SD_BALANCE_WAKE) {
new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (idle_cpu(new_cpu))
+ goto unlock;
}
while (sd) {
--
1.8.1
^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik
@ 2015-05-28 3:46 ` Mike Galbraith
2015-05-28 9:49 ` Morten Rasmussen
2015-05-28 10:21 ` Peter Zijlstra
2015-05-28 11:55 ` Srikar Dronamraju
2 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-05-28 3:46 UTC (permalink / raw)
To: Josef Bacik; +Cc: riel, mingo, peterz, linux-kernel
On Wed, 2015-05-27 at 17:22 -0400, Josef Bacik wrote:
> [ sorry if you get this twice, it seems like the first submission got lost ]
>
> At Facebook we have a pretty heavily multi-threaded application that is
> sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
> because it gives us a pretty significant performance gain (like 20%). It turns
> out this is because there are cases where the scheduler puts our task on a busy
> CPU when there are idle CPUs in the system. We verify this by reading the
> cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we
> get much lower numbers vs baseline.
>
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however it is just
> looking for an idle sibling, preferring affinity over all else. This is not
> helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu, not
> guarantee affinity. Fix this by first trying to find an idle sibling, and then
> if the cpu is not idle fall through to the logic to find an idle cpu. With this
> patch we get slightly better performance than with our forward port of
> SD_WAKE_IDLE. Thanks,
The job description isn't really find idle, it's find least loaded.
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 241213b..03dafa3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> if (sd_flag & SD_BALANCE_WAKE) {
> new_cpu = select_idle_sibling(p, prev_cpu);
> - goto unlock;
> + if (idle_cpu(new_cpu))
> + goto unlock;
> }
>
> while (sd) {
Instead of doing what for most will be a redundant idle_cpu() call,
perhaps a couple cycles can be saved if you move the sd assignment above
affine_sd assignment, and say if (!sd || idle_cpu(new_cpu)) ?
You could also stop find_idlest_group() at the first completely idle
group to shave cycles off the not fully committed search. It ain't
likely to find a negative load.. cool as that would be ;-)
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 3:46 ` Mike Galbraith
@ 2015-05-28 9:49 ` Morten Rasmussen
2015-05-28 10:57 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Morten Rasmussen @ 2015-05-28 9:49 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Josef Bacik, riel, mingo, peterz, linux-kernel
On Thu, May 28, 2015 at 04:46:38AM +0100, Mike Galbraith wrote:
> On Wed, 2015-05-27 at 17:22 -0400, Josef Bacik wrote:
> > [ sorry if you get this twice, it seems like the first submission got lost ]
> >
> > At Facebook we have a pretty heavily multi-threaded application that is
> > sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
> > because it gives us a pretty significant performance gain (like 20%). It turns
> > out this is because there are cases where the scheduler puts our task on a busy
> > CPU when there are idle CPUs in the system. We verify this by reading the
> > cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we
> > get much lower numbers vs baseline.
> >
> > SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however it is just
> > looking for an idle sibling, preferring affinity over all else. This is not
> > helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu, not
> > guarantee affinity. Fix this by first trying to find an idle sibling, and then
> > if the cpu is not idle fall through to the logic to find an idle cpu. With this
> > patch we get slightly better performance than with our forward port of
> > SD_WAKE_IDLE. Thanks,
>
> The job description isn't really find idle, it's find least loaded.
And make sure that the task doesn't migrate away from any data that
might still be in the last-level cache?
IIUC, the goal of SD_BALANCE_WAKE is changed from finding the least
loaded target cpu that shares last-level cache with the previous cpu, to
finding an idle cpu, preferring ones that share the last-level cache but
extending the search beyond sd_llc if necessary.
>
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > kernel/sched/fair.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 241213b..03dafa3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >
> > if (sd_flag & SD_BALANCE_WAKE) {
> > new_cpu = select_idle_sibling(p, prev_cpu);
> > - goto unlock;
> > + if (idle_cpu(new_cpu))
> > + goto unlock;
> > }
> >
> > while (sd) {
>
> Instead of doing what for most will be a redundant idle_cpu() call,
> perhaps a couple cycles can be saved if you move the sd assignment above
> affine_sd assignment, and say if (!sd || idle_cpu(new_cpu)) ?
Isn't sd == NULL in most cases if you don't move the sd assignment
before the affine_sd assignment? The break after the affine_sd
assignment means that sd isn't assigned under 'normal' circumstances
(sibling waking cpu, no restrictions in tsk_cpus_allowed(p), and
SD_WAKE_AFFINE set) which causes the while (sd) loop to be bypassed and
we end up returning the result of select_idle_sibling() anyway.
I must be missing something?
Morten
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik
2015-05-28 3:46 ` Mike Galbraith
@ 2015-05-28 10:21 ` Peter Zijlstra
2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith
2015-05-28 11:55 ` Srikar Dronamraju
2 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-05-28 10:21 UTC (permalink / raw)
To: Josef Bacik; +Cc: riel, mingo, linux-kernel
On Wed, May 27, 2015 at 05:22:16PM -0400, Josef Bacik wrote:
>
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however
sd->flags's SD_BALANCE_WAKE, not sd_flags.
And note that SD_BALANCE_WAKE is not set on domains by default.
> it is just looking for an idle sibling, preferring affinity over all
> else.
Not sure what you're saying here, what affinity?
> This is not helpful in all cases, and SD_BALANCE_WAKE's job is
> to find us an idle cpu, not guarantee affinity.
Your argument is going backwards, SD_BALANCE_WAKE is not actually set.
> Fix this by first
> trying to find an idle sibling, and then if the cpu is not idle fall
> through to the logic to find an idle cpu. With this patch we get
> slightly better performance than with our forward port of
> SD_WAKE_IDLE.
This is broken. You most certainly do not want to go do that whole load
balance pass on wakeups. It should be controlled by sd->flags. It is far
too expensive to consider turning that on by default.
In fact, select_idle_sibling() is already too expensive on current
server hardware (far too damn many cpus in a LLC domain).
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 9:49 ` Morten Rasmussen
@ 2015-05-28 10:57 ` Mike Galbraith
2015-05-28 11:48 ` Morten Rasmussen
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-05-28 10:57 UTC (permalink / raw)
To: Morten Rasmussen; +Cc: Josef Bacik, riel, mingo, peterz, linux-kernel
On Thu, 2015-05-28 at 10:49 +0100, Morten Rasmussen wrote:
> Isn't sd == NULL in most cases if you don't move the sd assignment
> before the affine_sd assignment?
sd will usually be NULL regardless of where the assignment is, as
SD_BALANCE_WAKE is usually off in ->flags. Josef is turning it on.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 10:21 ` Peter Zijlstra
@ 2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 14:27 ` Josef Bacik
` (3 more replies)
2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith
1 sibling, 4 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-05-28 11:05 UTC (permalink / raw)
To: Josef Bacik; +Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen
So maybe you want something like the below; that cures the thing Morten
raised, and we continue looking for sd, even after we found affine_sd.
It also avoids the pointless idle_cpu() check Mike raised by making
select_idle_sibling() return -1 if it doesn't find anything.
Then it continues doing the full balance IFF sd was set, which is keyed
off of sd->flags.
And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
CPUs, it looks for the least loaded CPU. And it's damn expensive.
Rewriting this entire thing is somewhere on the todo list :/
---
kernel/sched/fair.c | 36 +++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d597aea43030..7330fb4fea9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1385,8 +1385,13 @@ static void task_numa_compare(struct task_numa_env *env,
* One idle CPU per node is evaluated for a task numa move.
* Call select_idle_sibling to maybe find a better one.
*/
- if (!cur)
- env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ if (!cur) {
+ int cpu = select_idle_sibling(env->p, env->dst_cpu);
+ if (cpu < 0)
+ cpu = env->dst_cpu;
+
+ env->dst_cpu = cpu;
+ }
assign:
task_numa_assign(env, cur, imp);
@@ -4951,16 +4956,15 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;
}
- target = cpumask_first_and(sched_group_cpus(sg),
+ return cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
- goto done;
next:
sg = sg->next;
} while (sg != sd->groups);
}
-done:
- return target;
+ return -1;
}
+
/*
* get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
* tasks. The unit of the return value must be the one of capacity so we can
@@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
* If both cpu and prev_cpu are part of this domain,
* cpu is a valid SD_WAKE_AFFINE target.
*/
- if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
- cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+ if (want_affine && !affine_sd &&
+ (tmp->flags & SD_WAKE_AFFINE) &&
+ cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
affine_sd = tmp;
- break;
- }
if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine || (want_affine && affine_sd))
+ break;
}
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
prev_cpu = cpu;
+ sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
+ }
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ int tmp = select_idle_sibling(p, prev_cpu);
+ if (tmp >= 0) {
+ new_cpu = tmp;
+ goto unlock;
+ }
}
while (sd) {
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 10:21 ` Peter Zijlstra
2015-05-28 11:05 ` Peter Zijlstra
@ 2015-05-28 11:16 ` Mike Galbraith
2015-05-28 11:49 ` Ingo Molnar
1 sibling, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-05-28 11:16 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel
On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote:
> In fact, select_idle_sibling() is already too expensive on current
> server hardware (far too damn many cpus in a LLC domain).
Yup. I've played with rate limiting motion per task because of that.
Packages have gotten way too damn big.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 10:57 ` Mike Galbraith
@ 2015-05-28 11:48 ` Morten Rasmussen
2015-05-28 11:49 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Morten Rasmussen @ 2015-05-28 11:48 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Josef Bacik, riel, mingo, peterz, linux-kernel
On Thu, May 28, 2015 at 11:57:44AM +0100, Mike Galbraith wrote:
> On Thu, 2015-05-28 at 10:49 +0100, Morten Rasmussen wrote:
>
> > Isn't sd == NULL in most cases if you don't move the sd assignment
> > before the affine_sd assignment?
>
> sd will usually be NULL regardless of where the assignment is, as
> SD_BALANCE_WAKE is usually off in ->flags. Josef is turning it on.
Right. SD_BALANCE_WAKE needs to be set in the sd flags and the assignment
has to happen before the break for this to work. I just don't see
SD_BALANCE_WAKE being enabled for the sched_domain anywhere in the
patch?
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith
@ 2015-05-28 11:49 ` Ingo Molnar
2015-05-28 12:15 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Ingo Molnar @ 2015-05-28 11:49 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel
* Mike Galbraith <mgalbraith@novell.com> wrote:
> On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote:
>
> > In fact, select_idle_sibling() is already too expensive on current
> > server hardware (far too damn many cpus in a LLC domain).
>
> Yup. I've played with rate limiting motion per task because of that.
> Packages have gotten way too damn big.
What's the biggest you've seen?
Thanks,
Ingo
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:48 ` Morten Rasmussen
@ 2015-05-28 11:49 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-05-28 11:49 UTC (permalink / raw)
To: Morten Rasmussen; +Cc: Josef Bacik, riel, mingo, peterz, linux-kernel
On Thu, 2015-05-28 at 12:48 +0100, Morten Rasmussen wrote:
> On Thu, May 28, 2015 at 11:57:44AM +0100, Mike Galbraith wrote:
> > On Thu, 2015-05-28 at 10:49 +0100, Morten Rasmussen wrote:
> >
> > > Isn't sd == NULL in most cases if you don't move the sd assignment
> > > before the affine_sd assignment?
> >
> > sd will usually be NULL regardless of where the assignment is, as
> > SD_BALANCE_WAKE is usually off in ->flags. Josef is turning it on.
>
> Right. SD_BALANCE_WAKE needs to be set in the sd flags and the assignment
> has to happen before the break for this to work. I just don't see
> SD_BALANCE_WAKE being enabled for the sched_domain anywhere in the
> patch?
He's doing that via proc interface.. well, I presume he is anyway.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik
2015-05-28 3:46 ` Mike Galbraith
2015-05-28 10:21 ` Peter Zijlstra
@ 2015-05-28 11:55 ` Srikar Dronamraju
2 siblings, 0 replies; 73+ messages in thread
From: Srikar Dronamraju @ 2015-05-28 11:55 UTC (permalink / raw)
To: Josef Bacik; +Cc: riel, mingo, peterz, linux-kernel
> At Facebook we have a pretty heavily multi-threaded application that is
> sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
> because it gives us a pretty significant performance gain (like 20%). It turns
> out this is because there are cases where the scheduler puts our task on a busy
> CPU when there are idle CPUs in the system. We verify this by reading the
> cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we
> get much lower numbers vs baseline.
>
Was this application run under a cpu cgroup? We were seeing bursty
workloads exhibiting this behaviour, especially when run under cpu cgroups.
http://mid.gmane.org/53A11A89.5000602@linux.vnet.ibm.com
--
Thanks and Regards
Srikar
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:49 ` Ingo Molnar
@ 2015-05-28 12:15 ` Mike Galbraith
2015-05-28 12:19 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-05-28 12:15 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel
On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote:
> * Mike Galbraith <mgalbraith@novell.com> wrote:
>
> > On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote:
> >
> > > In fact, select_idle_sibling() is already too expensive on current
> > > server hardware (far too damn many cpus in a LLC domain).
> >
> > Yup. I've played with rate limiting motion per task because of that.
> > Packages have gotten way too damn big.
>
> What's the biggest you've seen?
15 cores so far. It'll no doubt grow.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 12:15 ` Mike Galbraith
@ 2015-05-28 12:19 ` Peter Zijlstra
2015-05-28 12:29 ` Ingo Molnar
2015-05-28 15:22 ` David Ahern
0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-05-28 12:19 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel
> On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote:
> > What's the biggest you've seen?
Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
Tells us HSW-E[PX] has 18-core/36-thread SKUs.
But yes, what Mike says, it's bound to only get bigger.
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 12:19 ` Peter Zijlstra
@ 2015-05-28 12:29 ` Ingo Molnar
2015-05-28 15:22 ` David Ahern
1 sibling, 0 replies; 73+ messages in thread
From: Ingo Molnar @ 2015-05-28 12:29 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Josef Bacik, riel, mingo, linux-kernel
* Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote:
>
> > > What's the biggest you've seen?
>
> Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
>
> Tells us HSW-E[PX] has 18-core/36-thread SKUs.
>
> But yes, what Mike says, it's bound to only get bigger.
So it's starting to get big enough to warrant an optimization of the way we
account and discover idle CPUs:
So when a CPU goes idle, it has idle cycles it could spend on registering itself
in either an idle-CPUs bitmap, or in an idle-CPUs queue. The queue (or bitmap)
would strictly be only shared between CPUs within the same domain, so the cache
bouncing cost from that is still small and package-local. (We pay remote access
overhead in select_idle_sibling() already, due to having to access half of all
remote rqs on average.)
Such an approach would make select_idle_sibling() independent of the size of the
cores domain, it would make it essentially O(1).
( There's a bit of a complication with rq->wake_list, but I think it would be good
enough to just register/unregister from the idle handler, if something is idle
only short term it should probably not be considered for SMP balancing. )
But I'd definitely not go towards making our SMP balancing macro idle selection
decisions poorer, just because our internal implementation is
O(nr_cores_per_package) ...
Agreed?
Thanks,
Ingo
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:05 ` Peter Zijlstra
@ 2015-05-28 14:27 ` Josef Bacik
2015-05-29 21:03 ` Josef Bacik
` (2 subsequent siblings)
3 siblings, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-05-28 14:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen
On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>
> So maybe you want something like the below; that cures the thing Morten
> raised, and we continue looking for sd, even after we found affine_sd.
>
> It also avoids the pointless idle_cpu() check Mike raised by making
> select_idle_sibling() return -1 if it doesn't find anything.
>
> Then it continues doing the full balance IFF sd was set, which is keyed
> off of sd->flags.
>
> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> CPUs, it looks for the least loaded CPU. And it's damn expensive.
>
Sorry, I was just assuming based on the commit message from when WAKE_IDLE
was removed; this isn't my area.
>
> Rewriting this entire thing is somewhere on the todo list :/
Thanks I'm building and deploying this so I can run our perf test, I'll
have results in ~3 hours.
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 12:19 ` Peter Zijlstra
2015-05-28 12:29 ` Ingo Molnar
@ 2015-05-28 15:22 ` David Ahern
1 sibling, 0 replies; 73+ messages in thread
From: David Ahern @ 2015-05-28 15:22 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith
Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel
On 5/28/15 6:19 AM, Peter Zijlstra wrote:
>> On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote:
>
>>> What's the biggest you've seen?
>
> Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29
>
> Tells us HSW-E[PX] has 18-core/36-thread SKUs.
>
> But yes, what Mike says, it's bound to only get bigger.
sparc M7: 8 threads per core, 32 cores per socket = 256 cpus per socket
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 14:27 ` Josef Bacik
@ 2015-05-29 21:03 ` Josef Bacik
2015-05-30 3:55 ` Mike Galbraith
2015-06-01 19:38 ` Josef Bacik
2015-06-11 20:33 ` Josef Bacik
2015-06-12 5:35 ` Mike Galbraith
3 siblings, 2 replies; 73+ messages in thread
From: Josef Bacik @ 2015-05-29 21:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>
> So maybe you want something like the below; that cures the thing Morten
> raised, and we continue looking for sd, even after we found affine_sd.
>
> It also avoids the pointless idle_cpu() check Mike raised by making
> select_idle_sibling() return -1 if it doesn't find anything.
>
> Then it continues doing the full balance IFF sd was set, which is keyed
> off of sd->flags.
>
> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> CPUs, it looks for the least loaded CPU. And it's damn expensive.
>
>
> Rewriting this entire thing is somewhere on the todo list :/
>
Summarizing what I've found so far.
-We turn on SD_BALANCE_WAKE on our domains for our 3.10 boxes, but not
for our 4.0 boxes (due to some weird configuration issue).
-Running with this patch is better than plain 4.0 but not as good as my
patch, running with SD_BALANCE_WAKE set and not set makes no difference
to the runs.
-I took out the sd = NULL; bit from the affine case like you said on IRC
and I get similar results as before.
-I'm thoroughly confused as to why my patch did anything, since we
weren't turning on SD_BALANCE_WAKE on 4.0 in my previous runs (I assume;
it isn't set now, so I'm pretty sure the problem has always been there),
so we should have always had sd == NULL, which means we would have only
ever gotten the task cpu, I guess.
Now I'm looking at the code in select_idle_sibling and we do this
for_each_lower_domain(sd) {
sg = sd->groups;
do {
if (!cpumask_intersects(sched_group_cpus(sg),
tsk_cpus_allowed(p)))
goto next;
for_each_cpu(i, sched_group_cpus(sg)) {
if (i == target || !idle_cpu(i))
goto next;
}
return cpumask_first_and(sched_group_cpus(sg),
tsk_cpus_allowed(p));
next:
sg = sg->next;
} while (sg != sd->groups);
}
We get all the schedule groups for the schedule domain, and if any of the
cpus are not idle, or are the target, then we skip the whole scheduling
group. Isn't the scheduling group a group of CPUs? Why can't we pick
an idle CPU in a group that has a non-idle cpu or the target cpu?
Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-29 21:03 ` Josef Bacik
@ 2015-05-30 3:55 ` Mike Galbraith
2015-06-01 19:38 ` Josef Bacik
1 sibling, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-05-30 3:55 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Fri, 2015-05-29 at 17:03 -0400, Josef Bacik wrote:
> for_each_lower_domain(sd) {
> sg = sd->groups;
> do {
> if (!cpumask_intersects(sched_group_cpus(sg),
> tsk_cpus_allowed(p)))
> goto next;
>
> for_each_cpu(i, sched_group_cpus(sg)) {
> if (i == target || !idle_cpu(i))
> goto next;
> }
>
> return cpumask_first_and(sched_group_cpus(sg),
> tsk_cpus_allowed(p));
> next:
> sg = sg->next;
> } while (sg != sd->groups);
> }
>
> We get all the schedule groups for the schedule domain, and if any of the
> cpus are not idle, or are the target, then we skip the whole scheduling
> group. Isn't the scheduling group a group of CPUs? Why can't we pick
> an idle CPU in a group that has a non-idle cpu or the target cpu?
> Thanks,
We select an idle core if we can get one. Yes, that leaves a pile of
SMT threads not checked/selected, but if you're gonna do a full search
of a large socket (humongous sparc-thing, shudder), you may as well eat
the BALANCE_WAKE overhead.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-29 21:03 ` Josef Bacik
2015-05-30 3:55 ` Mike Galbraith
@ 2015-06-01 19:38 ` Josef Bacik
2015-06-01 20:42 ` Peter Zijlstra
2015-06-01 22:15 ` Rik van Riel
1 sibling, 2 replies; 73+ messages in thread
From: Josef Bacik @ 2015-06-01 19:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 05/29/2015 05:03 PM, Josef Bacik wrote:
> On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>>
>> So maybe you want something like the below; that cures the thing Morten
>> raised, and we continue looking for sd, even after we found affine_sd.
>>
>> It also avoids the pointless idle_cpu() check Mike raised by making
>> select_idle_sibling() return -1 if it doesn't find anything.
>>
>> Then it continues doing the full balance IFF sd was set, which is keyed
>> off of sd->flags.
>>
>> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> > CPUs, it looks for the least loaded CPU. And it's damn expensive.
>>
>>
>> Rewriting this entire thing is somewhere on the todo list :/
>>
>
Ok I got this patch to give me the same performance as all our other
crap, just need to apply this incremental
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..e11cfec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (tmp->flags & sd_flag)
sd = tmp;
- else if (!want_affine || (want_affine && affine_sd))
- break;
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
prev_cpu = cpu;
- sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
}
if (sd_flag & SD_BALANCE_WAKE) {
And everything works fine. Does that seem reasonable? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-01 19:38 ` Josef Bacik
@ 2015-06-01 20:42 ` Peter Zijlstra
2015-06-01 21:03 ` Josef Bacik
2015-06-02 17:12 ` Josef Bacik
2015-06-01 22:15 ` Rik van Riel
1 sibling, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-06-01 20:42 UTC (permalink / raw)
To: Josef Bacik
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On Mon, 2015-06-01 at 15:38 -0400, Josef Bacik wrote:
> Ok I got this patch to give me the same performance as all our other
> crap, just need to apply this incremental
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b71eb2b..e11cfec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> - else if (!want_affine || (want_affine && affine_sd))
> - break;
> }
That bit worries me a bit, because that causes us to have a weird
definition for what sd is.
Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
other sd_flag) set.
But with WAKE_AFFINE, its the first domain that satisfies the wake
affine constraint of covering both the previous and waking cpu. It
basically reduces sd to affine_sd.
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-01 20:42 ` Peter Zijlstra
@ 2015-06-01 21:03 ` Josef Bacik
2015-06-02 17:12 ` Josef Bacik
1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-06-01 21:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> On Mon, 2015-06-01 at 15:38 -0400, Josef Bacik wrote:
>
>> Ok I got this patch to give me the same performance as all our other
>> crap, just need to apply this incremental
>>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b71eb2b..e11cfec 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int
>> prev_cpu, int sd_flag, int wake_f
>>
>> if (tmp->flags & sd_flag)
>> sd = tmp;
>> - else if (!want_affine || (want_affine && affine_sd))
>> - break;
>> }
>
> That bit worries me a bit, because that causes us to have a weird
> definition for what sd is.
>
> Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
> other sd_flag) set.
>
> But with WAKE_AFFINE, its the first domain that satisfies the wake
> affine constraint of covering both the previous and waking cpu. It
> basically reduces sd to affine_sd.
>
I'm just giving you what made the performance good for us; I'm relying
on you to make it sane ;). So with that bit, when !want_affine, we break
out as soon as we don't find an sd that matches sd_flag, which seems
less than helpful. I was going to change it to

	else if (want_affine && affine_sd)

but the thing is we don't do anything with affine_sd unless the bit
below matches:

	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))

so it seems like we're leaving ourselves without sd set in a few cases
where we'd actually want it set, so I just nuked the whole thing and
carried on. I'm all ears for other ideas; I'm just letting you know
what my results are, since you are the guys who actually understand this
stuff. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-01 19:38 ` Josef Bacik
2015-06-01 20:42 ` Peter Zijlstra
@ 2015-06-01 22:15 ` Rik van Riel
1 sibling, 0 replies; 73+ messages in thread
From: Rik van Riel @ 2015-06-01 22:15 UTC (permalink / raw)
To: Josef Bacik, Peter Zijlstra
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/01/2015 03:38 PM, Josef Bacik wrote:
> Ok I got this patch to give me the same performance as all our other
> crap, just need to apply this incremental
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b71eb2b..e11cfec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int
> prev_cpu, int sd_flag, int wake_f
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> - else if (!want_affine || (want_affine && affine_sd))
> - break;
> }
>
> if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
> prev_cpu = cpu;
> - sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
Given Peter's worries about wake_affine and affine_sd,
should the above be sd = affine_sd, in case select_idle_sibling
cannot find an idle sibling?
That way we can attempt to at least find an idle cpu inside
the affine_sd.
Of course, there may be subtleties here I am overlooking...
> }
>
> if (sd_flag & SD_BALANCE_WAKE) {
--
All rights reversed
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-01 20:42 ` Peter Zijlstra
2015-06-01 21:03 ` Josef Bacik
@ 2015-06-02 17:12 ` Josef Bacik
2015-06-03 14:12 ` Rik van Riel
1 sibling, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-02 17:12 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> On Mon, 2015-06-01 at 15:38 -0400, Josef Bacik wrote:
>
>> Ok I got this patch to give me the same performance as all our other
>> crap, just need to apply this incremental
>>
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b71eb2b..e11cfec 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int
>> prev_cpu, int sd_flag, int wake_f
>>
>> if (tmp->flags & sd_flag)
>> sd = tmp;
>> - else if (!want_affine || (want_affine && affine_sd))
>> - break;
>> }
>
> That bit worries me a bit, because that causes us to have a weird
> definition for what sd is.
>
> Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
> other sd_flag) set.
>
> But with WAKE_AFFINE, its the first domain that satisfies the wake
> affine constraint of covering both the previous and waking cpu. It
> basically reduces sd to affine_sd.
>
Ok I took Rik's idea and came out with this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..75073d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f

 		if (tmp->flags & sd_flag)
 			sd = tmp;
-		else if (!want_affine || (want_affine && affine_sd))
+		else if (want_affine && affine_sd) {
+			sd = affine_sd;
 			break;
+		}
 	}

 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
-		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
 	}

 	if (sd_flag & SD_BALANCE_WAKE) {
But now that I re-read your response, I think this is even more of what
you were worried about, not less.

Basically it comes down to this: if sd isn't set, we get shit performance.
I realize that this magic to find an idle cpu when sd is set is pretty
heavy-handed, but it's obviously helpful in our case.

So let me ask this question. When do we want to do the heavy-handed
search, and when do we not? With WAKE_AFFINE, what is our ultimate goal
vs the other SDs? If we don't have an sd that matches our sd_flags,
what should we be doing? Should we just go with whatever cpu we're on
and carry on? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-02 17:12 ` Josef Bacik
@ 2015-06-03 14:12 ` Rik van Riel
2015-06-03 14:24 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Rik van Riel @ 2015-06-03 14:12 UTC (permalink / raw)
To: Josef Bacik, Peter Zijlstra
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/02/2015 01:12 PM, Josef Bacik wrote:
> But now that I re-read your response I think this is even more what you
> were worried about than less.
>
> Basically it comes down to if sd isn't set then we get shit performance.
> I realize that this magic to find an idle cpu when sd is set is pretty
> heavy handed, but it's obviously helpful in our case.
Ingo and Peter appear to be worried that searching
through too many CPUs for an idle one leads to bad performance.
Your test results seem to indicate that finding an
idle CPU really really helps performance.
There is a policy vs mechanism thing here. Ingo and Peter
are worried about the overhead in the mechanism of finding
an idle CPU. Your measurements show that the policy of
finding an idle CPU is the correct one.
Can we make the policy change now, and optimize the
mechanism later?
> So let me ask this question. When do we want to do the heavy handed
> search and when do we not? With WAKE_AFFINE what is our ultimate goal
> vs the other SD's? If we don't have an sd that matches our sd_flags
> what should we be doing, should we just go with whatever cpu we're on
> and carry on? Thanks,
>
> Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 14:12 ` Rik van Riel
@ 2015-06-03 14:24 ` Peter Zijlstra
2015-06-03 14:49 ` Josef Bacik
2015-06-03 15:30 ` Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-06-03 14:24 UTC (permalink / raw)
To: Rik van Riel
Cc: Josef Bacik, mingo, linux-kernel, umgwanakikbuti,
morten.rasmussen, kernel-team
On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
> There is a policy vs mechanism thing here. Ingo and Peter
> are worried about the overhead in the mechanism of finding
> an idle CPU. Your measurements show that the policy of
> finding an idle CPU is the correct one.
For his workload; I'm sure I can find a workload where it hurts.
In fact, I'm fairly sure Mike knows one from the top of his head, seeing
how he's the one playing about trying to shrink that idle search :-)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 14:24 ` Peter Zijlstra
@ 2015-06-03 14:49 ` Josef Bacik
2015-06-03 15:30 ` Mike Galbraith
1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 14:49 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/03/2015 10:24 AM, Peter Zijlstra wrote:
> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>
>> There is a policy vs mechanism thing here. Ingo and Peter
>> are worried about the overhead in the mechanism of finding
>> an idle CPU. Your measurements show that the policy of
>> finding an idle CPU is the correct one.
>
> For his workload; I'm sure I can find a workload where it hurts.
>
> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
> how he's the one playing about trying to shrink that idle search :-)
>
So the perf bench sched microbenchmarks are a pretty good analog for our
workload. I run

	perf bench sched messaging -g 100 -l 10000
	perf bench sched pipe

5 times and average the results to get an answer; really the messaging
one is the closest analog and the one I look at. I get around 56 seconds
of runtime on plain 4.0 and 47 seconds patched; it's how I check my
little experiments before doing the full real workload.
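A throwaway harness for that run-5-times-and-average procedure might look like the sketch below. This is illustrative only: it assumes the benchmark prints a "Total time: N.NNN [sec]" line (as perf bench sched messaging does), and any command printing such a line can be substituted.

```shell
#!/bin/sh
# Run a benchmark command 5 times and average the "Total time" it prints.
# Usage: average_total_time <command and args...>
average_total_time() {
	runs=5
	total=0
	for i in $(seq "$runs"); do
		# Grab the numeric field from "Total time: N.NNN [sec]".
		t=$("$@" | awk '/Total time/ {print $3}')
		total=$(awk -v a="$total" -v b="$t" 'BEGIN {print a + b}')
	done
	awk -v t="$total" -v n="$runs" 'BEGIN {printf "%.3f\n", t / n}'
}

# Example (assumes perf is installed):
#   average_total_time perf bench sched messaging -g 100 -l 10000
```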
I don't want to tune the scheduler just for our workload, but the
microbenchmarks we have are also showing the same performance
improvements. I would be super interested in workloads where this patch
doesn't help, so we could integrate that workload into perf sched bench
to make us more confident in making policy changes in the scheduler. So
Mike, if you have something specific in mind, please elaborate; I'm
happy to do the legwork to get it into perf bench and to test things
until we're happy.

In the meantime I really want to get this fixed for us. I do not want to
pull some weird old patch around until we rebase again next year, and
then do this whole dance again. What would be the way forward for
getting this fixed now? Do I need to hide it behind a sysctl or config
option? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 14:24 ` Peter Zijlstra
2015-06-03 14:49 ` Josef Bacik
@ 2015-06-03 15:30 ` Mike Galbraith
2015-06-03 15:57 ` Josef Bacik
1 sibling, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-03 15:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, Josef Bacik, mingo, linux-kernel, morten.rasmussen,
kernel-team
On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote:
> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>
> > There is a policy vs mechanism thing here. Ingo and Peter
> > are worried about the overhead in the mechanism of finding
> > an idle CPU. Your measurements show that the policy of
> > finding an idle CPU is the correct one.
>
> For his workload; I'm sure I can find a workload where it hurts.
>
> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
> how he's the one playing about trying to shrink that idle search :-)
Like anything where scheduling latency doesn't heavily dominate. Even
if searching were free, bounces aren't, even for the very light.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 15:30 ` Mike Galbraith
@ 2015-06-03 15:57 ` Josef Bacik
2015-06-03 16:53 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 15:57 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 06/03/2015 11:30 AM, Mike Galbraith wrote:
> On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote:
>> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>>
>>> There is a policy vs mechanism thing here. Ingo and Peter
>>> are worried about the overhead in the mechanism of finding
>>> an idle CPU. Your measurements show that the policy of
>>> finding an idle CPU is the correct one.
>>
>> For his workload; I'm sure I can find a workload where it hurts.
>>
>> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
>> how he's the one playing about trying to shrink that idle search :-)
>
> Like anything where scheduling latency doesn't heavily dominate. Even
> if searching were free, bounces aren't, even for the very light.
>
If scheduling latency doesn't hurt, then making the search shouldn't
matter, should it? I get that migrations aren't free, but it seems like
they can't hurt that much. This application is huge, it's our
webserver; we're doing like 400 requests per second on these things, and
hands down moving stuff to idle cpus is beating the pants off of staying
on the same cpu. Is there a specific workload I could build a test for
that you think this approach would hurt? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 15:57 ` Josef Bacik
@ 2015-06-03 16:53 ` Mike Galbraith
2015-06-03 17:16 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-03 16:53 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel,
morten.rasmussen, kernel-team
On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote:
> On 06/03/2015 11:30 AM, Mike Galbraith wrote:
> > On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote:
> >> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
> >>
> >>> There is a policy vs mechanism thing here. Ingo and Peter
> >>> are worried about the overhead in the mechanism of finding
> >>> an idle CPU. Your measurements show that the policy of
> >>> finding an idle CPU is the correct one.
> >>
> >> For his workload; I'm sure I can find a workload where it hurts.
> >>
> >> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
> >> how he's the one playing about trying to shrink that idle search :-)
> >
> > Like anything where scheduling latency doesn't heavily dominate. Even
> > if searching were free, bounces aren't, even for the very light.
> >
>
> If scheduling latency doesn't hurt then making the search shouldn't
> matter should it? I get that migrations aren't free, but it seems like
> they can't hurt that much.
Nah, they don't hurt much :)
commit e0a79f529d5ba2507486d498b25da40911d95cf6
Author: Mike Galbraith <bitbucket@online.de>
Date: Mon Jan 28 12:19:25 2013 +0100
sched: Fix select_idle_sibling() bouncing cow syndrome
If the previous CPU is cache affine and idle, select it.
The current implementation simply traverses the sd_llc domain,
taking the first idle CPU encountered, which walks buddy pairs
hand in hand over the package, inflicting excruciating pain.
1 tbench pair (worst case) in a 10 core + SMT package:
pre 15.22 MB/sec 1 procs
post 252.01 MB/sec 1 procs
> This application is huge, it's our
> webserver, we're doing like 400 requests per second on these things, and
> hands down moving stuff to idle cpus is beating the pants off of staying
> on the same cpu. Is there a specific workload I could build a test for
> that you think this approach would hurt? Thanks,
Search cost hurts fast movers, as does dragging even a small footprint
all over the place, as you can see above.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 16:53 ` Mike Galbraith
@ 2015-06-03 17:16 ` Josef Bacik
2015-06-03 17:43 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 17:16 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel,
morten.rasmussen, kernel-team
On 06/03/2015 12:53 PM, Mike Galbraith wrote:
> On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote:
>> On 06/03/2015 11:30 AM, Mike Galbraith wrote:
>>> On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote:
>>>> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>>>>
>>>>> There is a policy vs mechanism thing here. Ingo and Peter
>>>>> are worried about the overhead in the mechanism of finding
>>>>> an idle CPU. Your measurements show that the policy of
>>>>> finding an idle CPU is the correct one.
>>>>
>>>> For his workload; I'm sure I can find a workload where it hurts.
>>>>
>>>> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
>>>> how he's the one playing about trying to shrink that idle search :-)
>>>
>>> Like anything where scheduling latency doesn't heavily dominate. Even
>>> if searching were free, bounces aren't, even for the very light.
>>>
>>
>> If scheduling latency doesn't hurt then making the search shouldn't
>> matter should it? I get that migrations aren't free, but it seems like
>> they can't hurt that much.
>
> Nah, they don't hurt much :)
>
> commit e0a79f529d5ba2507486d498b25da40911d95cf6
> Author: Mike Galbraith <bitbucket@online.de>
> Date: Mon Jan 28 12:19:25 2013 +0100
>
> sched: Fix select_idle_sibling() bouncing cow syndrome
>
> If the previous CPU is cache affine and idle, select it.
>
> The current implementation simply traverses the sd_llc domain,
> taking the first idle CPU encountered, which walks buddy pairs
> hand in hand over the package, inflicting excruciating pain.
>
> 1 tbench pair (worst case) in a 10 core + SMT package:
>
> pre 15.22 MB/sec 1 procs
> post 252.01 MB/sec 1 procs
>
>
>> This application is huge, it's our
>> webserver, we're doing like 400 requests per second on these things, and
>> hands down moving stuff to idle cpus is beating the pants off of staying
>> on the same cpu. Is there a specific workload I could build a test for
>> that you think this approach would hurt? Thanks,
>
> Search cost hurts fast movers, as does dragging even a small footprint
> all over the place, as you can see above.
>
Eesh, ok. Do you happen to remember how you ran tbench, so I can add it to
my tests here? In addition to fixing this problem we're also interested
in tracking performance of new kernels, so we don't have to do this "what
the hell went wrong in the last 6 releases" dance every year; I'm
throwing every performance thing we find useful into our test
infrastructure. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 17:16 ` Josef Bacik
@ 2015-06-03 17:43 ` Mike Galbraith
2015-06-03 20:34 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-03 17:43 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel,
morten.rasmussen, kernel-team
On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote:
> Eesh ok, do you happen to remember how you ran tbench so I can add it to
> my tests here? In addition to fixing this problem we're also interested
> in tracking performance of new kernels so we don't have to do this "what
> the hell went wrong in the last 6 releases" dance every year, so I'm
> throwing every performance thing we find useful in our test
> infrastructure. Thanks,
>
> Josef
Start a tbench server, then tbench -t 30 1 localhost. You're unlikely
to find anything as painful as that bouncing cow bug was, but you won't
have to look hard at all to find bounce pain.
There are also other loads like your server where waking to an idle cpu
dominates all else, pgbench is one of those. In that case, you've got a
1:N waker/wakee relationship, and what matters above ALL else is when
the mother of all work (the single server thread) wants a CPU, it had
better get it NOW, else the load stalls. Likewise, 'mom' being
preempted hurts truckloads. Perhaps your server has a similar thing
going on, keeping wakees the hell away from the waker rules all.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 17:43 ` Mike Galbraith
@ 2015-06-03 20:34 ` Josef Bacik
2015-06-04 4:52 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 20:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel,
morten.rasmussen, kernel-team
On 06/03/2015 01:43 PM, Mike Galbraith wrote:
> On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote:
>
>> Eesh ok, do you happen to remember how you ran tbench so I can add it to
>> my tests here? In addition to fixing this problem we're also interested
>> in tracking performance of new kernels so we don't have to do this "what
>> the hell went wrong in the last 6 releases" dance every year, so I'm
>> throwing every performance thing we find useful in our test
>> infrastructure. Thanks,
>>
>> Josef
>
> Start a tbench server, then tbench -t 30 1 localhost. You're unlikely
> to find anything as painful as that bouncing cow bug was, but you won't
> have to look hard at all to find bounce pain.
>
> There are also other loads like your server where waking to an idle cpu
> dominates all else, pgbench is one of those. In that case, you've got a
> 1:N waker/wakee relationship, and what matters above ALL else is when
> the mother of all work (the single server thread) wants a CPU, it had
> better get it NOW, else the load stalls. Likewise, 'mom' being
> preempted hurts truckloads. Perhaps your server has a similar thing
> going on, keeping wakees the hell away from the waker rules all.
>
Yeah, our server has two waker threads (one per numa node) and then the
N wakee threads. I'll run tbench and pgbench with the new
patches and see if there's a degradation. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-03 20:34 ` Josef Bacik
@ 2015-06-04 4:52 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-06-04 4:52 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel,
morten.rasmussen, kernel-team
On Wed, 2015-06-03 at 16:34 -0400, Josef Bacik wrote:
> On 06/03/2015 01:43 PM, Mike Galbraith wrote:
> > There are also other loads like your server where waking to an idle cpu
> > dominates all else, pgbench is one of those. In that case, you've got a
> > 1:N waker/wakee relationship, and what matters above ALL else is when
> > the mother of all work (the single server thread) wants a CPU, it had
> > better get it NOW, else the load stalls. Likewise, 'mom' being
> > preempted hurts truckloads. Perhaps your server has a similar thing
> > going on, keeping wakees the hell away from the waker rules all.
> >
>
> Yeah our server has two waker threads (one per numa node) and then the N
> number of wakee threads. I'll run tbench and pgbench with the new
> patches and see if there's a degredation. Thanks,
If you look for wake_wide(), it could perhaps be used to select the
wider search only for the right flavor of load component when
BALANCE_WAKE is set. That would let the cache lovers in your box
continue to perform while improving the 1:N component. That wider
search still needs to become cheaper though, the low-hanging fruit being
to stop searching when you find load == 0... but there you may meet the
energy-efficient folks, who iirc want to make it even more expensive.
wake_wide() inadvertently helped another sore spot btw: a gaggle of
pretty light tasks being awakened from an interrupt source tended to
cluster around that source, preventing such loads from being all they
can be, in a very similar manner. Xen (shudder;) showed that nicely in
older kernels, due to the way its weird dom0 gizmo works.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 14:27 ` Josef Bacik
2015-05-29 21:03 ` Josef Bacik
@ 2015-06-11 20:33 ` Josef Bacik
2015-06-12 3:42 ` Rik van Riel
2015-06-12 5:35 ` Mike Galbraith
3 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-11 20:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>
> So maybe you want something like the below; that cures the thing Morten
> raised, and we continue looking for sd, even after we found affine_sd.
>
> It also avoids the pointless idle_cpu() check Mike raised by making
> select_idle_sibling() return -1 if it doesn't find anything.
>
> Then it continues doing the full balance IFF sd was set, which is keyed
> off of sd->flags.
>
> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> CPUs, it looks for the least loaded CPU. And its damn expensive.
>
>
> Rewriting this entire thing is somewhere on the todo list :/
>
Ugh, I'm sorry. I've been running tests trying to get the numbers to
look good when I noticed I was getting some inconsistencies in my
results. Turns out I never actually tested your patch plain; I had been
testing it with BALANCE_WAKE, because I was under the assumption that
was what was best for our workload. Since then I have fixed all of our
scripts and such, and noticed that it actually super duper sucks for us.
So testing with this original patch, everything is significantly better
(this is with the default SD flags set, no changes at all).

So now that I've wasted a good bit of my time and everybody else's, can
we go about pushing this patch upstream? If you are happy with it the
way it is, I'll go ahead and pull it into our kernels and just watch to
make sure it ends up upstream at some point. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-11 20:33 ` Josef Bacik
@ 2015-06-12 3:42 ` Rik van Riel
0 siblings, 0 replies; 73+ messages in thread
From: Rik van Riel @ 2015-06-12 3:42 UTC (permalink / raw)
To: Josef Bacik, Peter Zijlstra
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team
On 06/11/2015 04:33 PM, Josef Bacik wrote:
> Ugh I'm sorry, I've been running tests trying to get the numbers to look
> good when I noticed I was getting some inconsistencies in my results.
> Turns out I never actually tested your patch just plain, I had been
> testing it with BALANCE_WAKE, because I was under the assumption that
> was what was best for our workload. Since then I had fixed all of our
> scripts and such and noticed that it actually super duper sucks for us.
> So testing with this original patch everything is significantly better
> (this is with the default SD flags set, no changes at all).
>
> So now that I've wasted a good bit of my time and everybody elses, can
> we go about pushing this patch upstream? If you are happy with it the
> way it is I'll go ahead and pull it into our kernels and just watch to
> make sure it ends upstream at some point. Thanks,
FWIW, Jirka has run some tests with the patch as well,
and seen significant performance improvements on
various tests, on various systems.
Merging the patch makes perfect sense to me.
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-05-28 11:05 ` Peter Zijlstra
` (2 preceding siblings ...)
2015-06-11 20:33 ` Josef Bacik
@ 2015-06-12 5:35 ` Mike Galbraith
2015-06-17 18:06 ` Josef Bacik
3 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-12 5:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen
On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> * If both cpu and prev_cpu are part of this domain,
> * cpu is a valid SD_WAKE_AFFINE target.
> */
> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> + if (want_affine && !affine_sd &&
> + (tmp->flags & SD_WAKE_AFFINE) &&
> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> affine_sd = tmp;
> - break;
> - }
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> + else if (!want_affine || (want_affine && affine_sd))
> + break;
> }
Hm, new_cpu == cpu.
> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
> prev_cpu = cpu;
> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
> + }
If branch above is not taken, new_cpu remains cpu.
> if (sd_flag & SD_BALANCE_WAKE) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> - goto unlock;
> + int tmp = select_idle_sibling(p, prev_cpu);
> + if (tmp >= 0) {
> + new_cpu = tmp;
> + goto unlock;
> + }
> }
If select_idle_sibling() returns -1, new_cpu remains cpu.
>
> while (sd) {
If sd == NULL, we fall through and try to pull wakee despite nacked-by
tsk_cpus_allowed() or wake_affine().
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-12 5:35 ` Mike Galbraith
@ 2015-06-17 18:06 ` Josef Bacik
2015-06-18 0:55 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-17 18:06 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: riel, mingo, linux-kernel, morten.rasmussen
On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> * If both cpu and prev_cpu are part of this domain,
>> * cpu is a valid SD_WAKE_AFFINE target.
>> */
>> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
>> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
>> + if (want_affine && !affine_sd &&
>> + (tmp->flags & SD_WAKE_AFFINE) &&
>> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> affine_sd = tmp;
>> - break;
>> - }
>>
>> if (tmp->flags & sd_flag)
>> sd = tmp;
>> + else if (!want_affine || (want_affine && affine_sd))
>> + break;
>> }
>
> Hm, new_cpu == cpu.
>
>> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
>> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>> prev_cpu = cpu;
>> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
>> + }
>
> If branch above is not taken, new_cpu remains cpu.
>
>> if (sd_flag & SD_BALANCE_WAKE) {
>> - new_cpu = select_idle_sibling(p, prev_cpu);
>> - goto unlock;
>> + int tmp = select_idle_sibling(p, prev_cpu);
>> + if (tmp >= 0) {
>> + new_cpu = tmp;
>> + goto unlock;
>> + }
>> }
>
> If select_idle_sibling() returns -1, new_cpu remains cpu.
>
>>
>> while (sd) {
>
> If sd == NULL, we fall through and try to pull wakee despite nacked-by
> tsk_cpus_allowed() or wake_affine().
>
So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
like this
	if (tmp >= 0) {
		new_cpu = tmp;
		goto unlock;
	} else if (!want_affine) {
		new_cpu = prev_cpu;
	}
so we can make sure we're not being pushed onto a cpu that we aren't
allowed on? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-17 18:06 ` Josef Bacik
@ 2015-06-18 0:55 ` Mike Galbraith
2015-06-18 3:46 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-18 0:55 UTC (permalink / raw)
To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen
On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
> > If sd == NULL, we fall through and try to pull wakee despite nacked-by
> > tsk_cpus_allowed() or wake_affine().
> >
>
> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
> like this
>
> if (tmp >= 0) {
> new_cpu = tmp;
> goto unlock;
> } else if (!want_affine) {
> new_cpu = prev_cpu;
> }
>
> so we can make sure we're not being pushed onto a cpu that we aren't
> allowed on? Thanks,
The buglet is a messenger methinks. You saying the patch helped without
SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
that preferring cache is not harming your load after all. It now sounds
as though wake_wide() may be what you're squabbling with.
Things aren't adding up all that well.
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-18 0:55 ` Mike Galbraith
@ 2015-06-18 3:46 ` Josef Bacik
2015-06-18 4:12 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-18 3:46 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen
On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>> tsk_cpus_allowed() or wake_affine().
>>>
>>
>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>> like this
>>
>> if (tmp >= 0) {
>> new_cpu = tmp;
>> goto unlock;
>> } else if (!want_affine) {
>> new_cpu = prev_cpu;
>> }
>>
>> so we can make sure we're not being pushed onto a cpu that we aren't
>> allowed on? Thanks,
>
> The buglet is a messenger methinks. You saying the patch helped without
> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
> that preferring cache is not harming your load after all. It now sounds
> as though wake_wide() may be what you're squabbling with.
>
> Things aren't adding up all that well.
Yeah I'm horribly confused. The other thing is I had to switch clusters
(I know, I know, I'm changing the parameters of the test). So these new
boxes are haswell boxes, but basically the same otherwise, 2 socket 12
core with HT, just newer/faster CPUs. I'll re-run everything again and
give the numbers so we're all on the same page again, but as it stands
now I think we have this
3.10 with wake_idle forward ported - good
4.0 stock - 20% perf drop
4.0 w/ Peter's patch - good
4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
I can do all these iterations again to verify, is there any other
permutation you'd like to see? Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-18 3:46 ` Josef Bacik
@ 2015-06-18 4:12 ` Mike Galbraith
2015-07-02 17:44 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-18 4:12 UTC (permalink / raw)
To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen
On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote:
> On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> > On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
> >> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
> >>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
> >
> >>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
> >>> tsk_cpus_allowed() or wake_affine().
> >>>
> >>
> >> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
> >> like this
> >>
> >> if (tmp >= 0) {
> >> new_cpu = tmp;
> >> goto unlock;
> >> } else if (!want_affine) {
> >> new_cpu = prev_cpu;
> >> }
> >>
> >> so we can make sure we're not being pushed onto a cpu that we aren't
> >> allowed on? Thanks,
> >
> > The buglet is a messenger methinks. You saying the patch helped without
> > SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
> > that preferring cache is not harming your load after all. It now sounds
> > as though wake_wide() may be what you're squabbling with.
> >
> > Things aren't adding up all that well.
>
> Yeah I'm horribly confused. The other thing is I had to switch clusters
> (I know, I know, I'm changing the parameters of the test). So these new
> boxes are haswell boxes, but basically the same otherwise, 2 socket 12
> core with HT, just newer/faster CPUs. I'll re-run everything again and
> give the numbers so we're all on the same page again, but as it stands
> now I think we have this
>
> 3.10 with wake_idle forward ported - good
> 4.0 stock - 20% perf drop
> 4.0 w/ Peter's patch - good
> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>
> I can do all these iterations again to verify, is there any other
> permutation you'd like to see? Thanks,
Yeah, after re-baseline, please apply/poke these buttons individually in
4.0-virgin.
(cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)
---
kernel/sched/fair.c | 4 ++--
kernel/sched/features.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4506,7 +4506,7 @@ static int wake_affine(struct sched_doma
* If we wake multiple tasks be careful to not bounce
* ourselves around too much.
*/
- if (wake_wide(p))
+ if (sched_feat(WAKE_WIDE) && wake_wide(p))
return 0;
idx = sd->wake_idx;
@@ -4682,7 +4682,7 @@ static int select_idle_sibling(struct ta
struct sched_group *sg;
int i = task_cpu(p);
- if (idle_cpu(target))
+ if (!sched_feat(PREFER_IDLE) || idle_cpu(target))
return target;
/*
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(PREFER_IDLE, true)
+SCHED_FEAT(WAKE_WIDE, true)
/*
* Apply the automatic NUMA scheduling policy. Enabled automatically
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-06-18 4:12 ` Mike Galbraith
@ 2015-07-02 17:44 ` Josef Bacik
2015-07-03 6:40 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-02 17:44 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 06/18/2015 12:12 AM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote:
>> On 06/17/2015 05:55 PM, Mike Galbraith wrote:
>>> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>>>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>>>
>>>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>>>> tsk_cpus_allowed() or wake_affine().
>>>>>
>>>>
>>>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>>>> like this
>>>>
>>>> if (tmp >= 0) {
>>>> new_cpu = tmp;
>>>> goto unlock;
>>>> } else if (!want_affine) {
>>>> new_cpu = prev_cpu;
>>>> }
>>>>
>>>> so we can make sure we're not being pushed onto a cpu that we aren't
>>>> allowed on? Thanks,
>>>
>>> The buglet is a messenger methinks. You saying the patch helped without
>>> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say
>>> that preferring cache is not harming your load after all. It now sounds
>>> as though wake_wide() may be what you're squabbling with.
>>>
>>> Things aren't adding up all that well.
>>
>> Yeah I'm horribly confused. The other thing is I had to switch clusters
>> (I know, I know, I'm changing the parameters of the test). So these new
>> boxes are haswell boxes, but basically the same otherwise, 2 socket 12
>> core with HT, just newer/faster CPUs. I'll re-run everything again and
>> give the numbers so we're all on the same page again, but as it stands
>> now I think we have this
>>
>> 3.10 with wake_idle forward ported - good
>> 4.0 stock - 20% perf drop
>> 4.0 w/ Peter's patch - good
>> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>>
>> I can do all these iterations again to verify, is there any other
>> permutation you'd like to see? Thanks,
>
> Yeah, after re-baseline, please apply/poke these buttons individually in
> 4.0-virgin.
>
> (cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)
>
Sorry it took me a while to get these numbers to you; migrating the
whole fleet to a new setup broke the performance test suite thing, so
I've only just been able to run tests again. I'll do my best to
describe what is going on and hopefully that will make the results make
sense.
This is on our webservers, which is HHVM. A request comes in for a page
and this goes onto one of the two hhvm.node.# threads, one thread per
NUMA node. From there it is farmed off to one of the worker threads.
If there are no idle workers the request gets put on what is called the
"select_queue". Basically the select_queue should never be larger than
0 in a perfect world. If it's more than 0 we've hit latency somewhere and
that's not good. The other measurement we care about is how long a
thread spends on a request before it sends a response (this would be the
actual work being done).
Our tester slowly increases load to a group of servers until the select
queue is consistently >= 1. That means we've loaded the boxes so high
that they can't process the requests as soon as they've come in. Then
it backs down and then ramps up a second time. It takes all of these
measurements and puts them into these pretty graphs. There are 2 graphs
we care about, the duration of the requests vs the requests per second
and the probability that our select queue is >= 1 vs requests per second.
Now for 3.10 vs 4.0 our request duration time is the same if not
slightly better on 4.0, so once the workers are doing their job
everything is a-ok.
The problem is the probability the select queue >= 1 is way different on
4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up
to some RPS (requests per second) threshold and then shoots up to 100%
after the threshold. I'll make a table of these graphs that hopefully
makes sense; the numbers differ from run to run because of traffic and
such, and the test and control are both run at the same time.
The header is the probability the select queue >=1
                25%    50%    75%
4.0 plain:      371    388    402
control:        386    394    402
difference:      15      6      0
So with 4.0 it's basically a straight line; at lower RPS we are getting a
higher probability of a select queue >= 1. We are measuring the cpu
delay avg ms thing from the scheduler netlink stuff which is how I
noticed it was scheduler related, our cpu delay is way higher on 4.0
than it is on 3.10 or 4.0 with the wake idle patch.
So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain
                25%    50%    75%
NO_PREFER_IDLE: 399    401    414
control:        385    408    416
difference:      14      7      2
The numbers don't really show it well, but the graphs are closer
together, it's slightly more s shaped, but still not great.
Next is NO_WAKE_WIDE, which is horrible
                25%    50%    75%
NO_WAKE_WIDE:   315    344    369
control:        373    380    388
difference:      58     36     19
This isn't even in the same ballpark, it's a way worse regression than
plain.
The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad
                25%    50%    75%
EVERYTHING:     327    360    383
control:        381    390    399
difference:      54     30     19
Hopefully that helps. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-02 17:44 ` Josef Bacik
@ 2015-07-03 6:40 ` Mike Galbraith
2015-07-03 9:29 ` Mike Galbraith
2015-07-04 15:57 ` Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-03 6:40 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Thu, 2015-07-02 at 13:44 -0400, Josef Bacik wrote:
> Now for 3.10 vs 4.0 our request duration time is the same if not
> slightly better on 4.0, so once the workers are doing their job
> everything is a-ok.
>
> The problem is the probability the select queue >= 1 is way different on
> 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up
> to some RPS (requests per second) threshold and then shoots up to 100%
> after the threshold. I'll make a table of these graphs that hopefully
> makes sense, the numbers are different from run to run because of
> traffic and such, the test and control are both run at the same time.
> The header is the probability the select queue >=1
>
> 25% 50% 75%
> 4.0 plain: 371 388 402
> control: 386 394 402
> difference: 15 6 0
So control is 3.10? Virgin?
> So with 4.0 it's basically a straight line; at lower RPS we are getting a
> higher probability of a select queue >= 1. We are measuring the cpu
> delay avg ms thing from the scheduler netlink stuff which is how I
> noticed it was scheduler related, our cpu delay is way higher on 4.0
> than it is on 3.10 or 4.0 with the wake idle patch.
>
> So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain
> 25% 50% 75%
> NO_PREFER_IDLE: 399 401 414
> control: 385 408 416
> difference: 14 7 2
Hm. Throttling nohz may make a larger delta. But never mind that.
> The numbers don't really show it well, but the graphs are closer
> together, it's slightly more s shaped, but still not great.
>
> Next is NO_WAKE_WIDE, which is horrible
>
> 25% 50% 75%
> NO_WAKE_WIDE: 315 344 369
> control: 373 380 388
> difference: 58 36 19
>
> This isn't even in the same ballpark, it's a way worse regression than
> plain.
Ok, this jibes perfectly with 1:N waker/wakee thingy.
> The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad
>
> 25% 50% 75%
> EVERYTHING: 327 360 383
> control: 381 390 399
> difference: 54 30 19
Ditto.
Hm. Seems what this load should like best is if we detect 1:N, skip all
of the routine gyrations, ie move the N (workers) infrequently, expend
search cycles frequently only on the 1 (dispatch).
Ponder..
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-03 6:40 ` Mike Galbraith
@ 2015-07-03 9:29 ` Mike Galbraith
2015-07-04 15:57 ` Mike Galbraith
1 sibling, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-03 9:29 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:
> Hm. Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
>
> Ponder..
While taking a refresher peek at the wake_wide() thing, seems it's not
really paying attention when the waker of many is awakened. I wonder if
your load would see more benefit if it watched like so.. rashly assuming
I didn't wreck it completely (iow, completely untested).
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++--------------
1 file changed, 22 insertions(+), 14 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4586,10 +4586,23 @@ static void record_wakee(struct task_str
current->wakee_flips >>= 1;
current->wakee_flip_decay_ts = jiffies;
}
+ if (time_after(jiffies, p->wakee_flip_decay_ts + HZ)) {
+ p->wakee_flips >>= 1;
+ p->wakee_flip_decay_ts = jiffies;
+ }
if (current->last_wakee != p) {
current->last_wakee = p;
current->wakee_flips++;
+ /*
+ * Flip the buddy as well. It's the ratio of flips
+ * with a socket size decayed cutoff that determines
+ * whether the pair are considered to be part of 1:N
+ * or M*N loads of a size that we need to spread, so
+ * ensure flips of both load components. The waker
+ * of many will have many more flips than its wakees.
+ */
+ p->wakee_flips++;
}
}
@@ -4732,24 +4745,19 @@ static long effective_load(struct task_g
static int wake_wide(struct task_struct *p)
{
+ unsigned long max = max(current->wakee_flips, p->wakee_flips);
+ unsigned long min = min(current->wakee_flips, p->wakee_flips);
int factor = this_cpu_read(sd_llc_size);
/*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
+ * Yeah, it's a switching-frequency heuristic, and could mean the
+ * intended many wakees/waker relationship, or rapidly switching
+ * between a few. Use factor to try to automatically adjust such
+ * that the load spreads when it grows beyond what will fit in llc.
*/
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (min < factor)
+ return 0;
+ return max > min * factor;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-03 6:40 ` Mike Galbraith
2015-07-03 9:29 ` Mike Galbraith
@ 2015-07-04 15:57 ` Mike Galbraith
2015-07-05 7:17 ` Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-04 15:57 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote:
> Hm. Seems what this load should like best is if we detect 1:N, skip all
> of the routine gyrations, ie move the N (workers) infrequently, expend
> search cycles frequently only on the 1 (dispatch).
>
> Ponder..
Since it was too hot to do outside chores (any excuse will do;)...
If we're (read /me) on track, the below should help. Per my tracing,
it may want a wee bit of toning down actually, though when I trace
virgin source I expect to see the same, namely Xorg and friends having
"wide-load" tattooed across their hindquarters earlier than they should.
It doesn't seem to hurt anything, but then demolishing a single llc box
is a tad more difficult than demolishing a NUMA box.
sched: beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when waking one of its N minions.
Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is
the 1:N dispatcher of work, thus worth some extra effort.
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
kernel/sched/fair.c | 89 +++++++++++++++++++++++++-----------------------
kernel/sched/features.h | 6 +++
2 files changed, 54 insertions(+), 41 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c
}
#ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int cpu, void *clear);
static unsigned long task_h_load(struct task_struct *p);
static inline void __update_task_entity_contrib(struct sched_entity *se);
@@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas
* Call select_idle_sibling to maybe find a better one.
*/
if (!cur)
- env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL);
assign:
task_numa_assign(env, cur, imp);
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
- int factor = this_cpu_read(sd_llc_size);
-
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
+ unsigned long waker_flips = current->wakee_flips;
+ unsigned long wakee_flips = p->wakee_flips;
+ int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+ if (waker_flips < wakee_flips) {
+ swap(waker_flips, wakee_flips);
+ /* Tell the caller that we're waking a 1:N waker */
+ ret += sched_feat(WAKE_WIDE_BALANCE);
}
-
- return 0;
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return ret;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou
/*
* Try and locate an idle CPU in the sched_domain.
*/
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, void *clear)
{
struct sched_domain *sd;
struct sched_group *sg;
int i = task_cpu(p);
if (idle_cpu(target))
- return target;
+ goto done;
/*
* If the prevous cpu is cache affine and idle, don't be stupid.
*/
- if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
- return i;
+ if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) {
+ target = i;
+ goto done;
+ }
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta
sg = sg->next;
} while (sg != sd->groups);
}
+ return target;
done:
+ if (clear)
+ *(void **)clear = 0;
+
return target;
}
/*
@@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
- int want_affine = 0;
+ int new_cpu = prev_cpu;
+ int want_affine = 0, want_balance = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = wake_wide(p);
+ want_balance = want_affine > 1;
+ want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine && !want_balance)
+ goto select;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct *
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag || want_balance)
sd = tmp;
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+select:
+ new_cpu = select_idle_sibling(p, new_cpu, &sd);
}
while (sd) {
struct sched_group *group;
int weight;
- if (!(sd->flags & sd_flag)) {
+ if (!(sd->flags & sd_flag) && !want_balance) {
sd = sd->child;
continue;
}
@@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_BALANCE, false)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-04 15:57 ` Mike Galbraith
@ 2015-07-05 7:17 ` Mike Galbraith
2015-07-06 5:13 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-05 7:17 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Sat, 2015-07-04 at 17:57 +0200, Mike Galbraith wrote:
> If we're (read /me) on track, the below should help. Per my tracing,
> it may want a wee bit of toning down actually, though when I trace
> virgin source I expect to see the same, namely Xorg and friends having
> "wide-load" tattooed across their hindquarters earlier than they should.
That's true, the only difference is that the virgin kernel will still
try to pull the multi-waker when it is being awakened, which I can
easily imagine going either way performance wise, depending on a pile of
variables.. so let's just ask your box.
V2: drop select_idle_sibling() changes (ick), try to save some cycles.
sched: beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.
Correct that, and give the user the option to do an expensive balance IFF
select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the
1:N waker, the dispatcher of work, thus worth some extra effort. Don't
drill down through all domains though, stop searching at highest, we'll
either have found the desired completely idle CPU, or if heavily loaded,
the least loaded CPU of the least loaded group, which should still add
up to an average scheduling latency improvement (knock wood).
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
include/linux/sched.h | 7 ++-
kernel/sched/fair.c | 86 ++++++++++++++++++++++++++++--------------------
kernel/sched/features.h | 6 +++
3 files changed, 61 insertions(+), 38 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -962,7 +962,8 @@ extern void wake_up_q(struct wake_q_head
#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
-#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_BALANCE_WAKE_IDLE 0x0020 /* Balance on wakeup, searching for an idle CPU */
+#define SD_WAKE_AFFINE 0x0040 /* Wake task to waking CPU */
#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */
#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
@@ -1353,9 +1354,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,30 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
- int factor = this_cpu_read(sd_llc_size);
-
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
+ int factor = this_cpu_read(sd_llc_size), ret = 1;
+
+ if (waker_flips < wakee_flips) {
+ swap(waker_flips, wakee_flips);
+ /* Tell the caller that we're waking a 1:N waker */
+ ret += sched_feat(WAKE_WIDE_IDLE);
}
-
- return 0;
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return ret;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -4865,6 +4862,10 @@ find_idlest_group(struct sched_domain *s
load = target_load(i, load_idx);
avg_load += load;
+
+ if (sd_flag & SD_BALANCE_WAKE_IDLE && idle_cpu(i) &&
+ cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+ return group;
}
/* Adjust by relative CPU capacity of the group */
@@ -5021,14 +5022,21 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
- int want_affine = 0;
+ int new_cpu = prev_cpu;
+ int want_affine = 0, want_idle = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_idle = wake_wide(p);
+ want_affine = !want_idle && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_idle = want_idle > 1;
+ if (!want_affine && !want_idle)
+ goto select;
+ if (want_idle)
+ sd_flag |= SD_BALANCE_WAKE_IDLE;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5043,23 +5051,25 @@ select_task_rq_fair(struct task_struct *
break;
}
- if (tmp->flags & sd_flag)
+ if (tmp->flags & sd_flag || want_idle)
sd = tmp;
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+select:
+ new_cpu = select_idle_sibling(p, new_cpu);
+ if (want_idle && (new_cpu != prev_cpu || idle_cpu(prev_cpu)))
+ sd = NULL;
}
while (sd) {
struct sched_group *group;
int weight;
- if (!(sd->flags & sd_flag)) {
+ if (!(sd->flags & sd_flag) && !want_idle) {
sd = sd->child;
continue;
}
@@ -5077,6 +5087,13 @@ select_task_rq_fair(struct task_struct *
continue;
}
+ /*
+ * We've either found an idle CPU, or the least loaded CPU in
+ * the least load group of the highest domain. Good enough.
+ */
+ if (want_idle)
+ break;
+
/* Now try balancing at a lower domain level of new_cpu */
cpu = new_cpu;
weight = sd->span_weight;
@@ -5089,7 +5106,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
*/
SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
+
+/*
+ * Perform expensive full wake balance for 1:N wakers when the
+ * selected cpu is not completely idle.
+ */
+SCHED_FEAT(WAKE_WIDE_IDLE, false)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-05 7:17 ` Mike Galbraith
@ 2015-07-06 5:13 ` Mike Galbraith
2015-07-06 14:34 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-06 5:13 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
Hm. Piddling with pgbench, which doesn't seem to collapse into a
quivering heap when load exceeds cores these days, deltas weren't all
that impressive, but it does appreciate the extra effort a bit, and a
bit more when clients receive it as well.
If you test, and have time to piddle, you could try letting wake_wide()
return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
the dispatcher.
Numbers from my little desktop box.
NO_WAKE_WIDE_IDLE
postgres@homer:~> pgbench.sh
clients 8 tps = 116697.697662
clients 12 tps = 115160.230523
clients 16 tps = 115569.804548
clients 20 tps = 117879.230514
clients 24 tps = 118281.753040
clients 28 tps = 116974.796627
clients 32 tps = 119082.163998 avg 117092.239 1.000
WAKE_WIDE_IDLE
postgres@homer:~> pgbench.sh
clients 8 tps = 124351.735754
clients 12 tps = 124419.673135
clients 16 tps = 125050.716498
clients 20 tps = 124813.042352
clients 24 tps = 126047.442307
clients 28 tps = 125373.719401
clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
WAKE_WIDE_IDLE (clients as well as server)
postgres@homer:~> pgbench.sh
clients 8 tps = 130539.795246
clients 12 tps = 128984.648554
clients 16 tps = 130564.386447
clients 20 tps = 129149.693118
clients 24 tps = 130211.119780
clients 28 tps = 130325.355433
clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-06 5:13 ` Mike Galbraith
@ 2015-07-06 14:34 ` Josef Bacik
2015-07-06 18:36 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-06 14:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/06/2015 01:13 AM, Mike Galbraith wrote:
> Hm. Piddling with pgbench, which doesn't seem to collapse into a
> quivering heap when load exceeds cores these days, deltas weren't all
> that impressive, but it does appreciate the extra effort a bit, and a
> bit more when clients receive it as well.
>
> If you test, and have time to piddle, you could try letting wake_wide()
> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
> the dispatcher.
>
> Numbers from my little desktop box.
>
> NO_WAKE_WIDE_IDLE
> postgres@homer:~> pgbench.sh
> clients 8 tps = 116697.697662
> clients 12 tps = 115160.230523
> clients 16 tps = 115569.804548
> clients 20 tps = 117879.230514
> clients 24 tps = 118281.753040
> clients 28 tps = 116974.796627
> clients 32 tps = 119082.163998 avg 117092.239 1.000
>
> WAKE_WIDE_IDLE
> postgres@homer:~> pgbench.sh
> clients 8 tps = 124351.735754
> clients 12 tps = 124419.673135
> clients 16 tps = 125050.716498
> clients 20 tps = 124813.042352
> clients 24 tps = 126047.442307
> clients 28 tps = 125373.719401
> clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
>
> WAKE_WIDE_IDLE (clients as well as server)
> postgres@homer:~> pgbench.sh
> clients 8 tps = 130539.795246
> clients 12 tps = 128984.648554
> clients 16 tps = 130564.386447
> clients 20 tps = 129149.693118
> clients 24 tps = 130211.119780
> clients 28 tps = 130325.355433
> clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
>
I have time for twiddling; we're carrying ye olde WAKE_IDLE until we get
this solved upstream, and then I'll rip out the old and put in the new.
I'm happy to screw around until we're all happy. I'll throw this in a
kernel this morning and run stuff today. Barring any issues with the
testing infrastructure I should have results today. Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-06 14:34 ` Josef Bacik
@ 2015-07-06 18:36 ` Mike Galbraith
2015-07-06 19:41 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-06 18:36 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
> > Hm. Piddling with pgbench, which doesn't seem to collapse into a
> > quivering heap when load exceeds cores these days, deltas weren't all
> > that impressive, but it does appreciate the extra effort a bit, and a
> > bit more when clients receive it as well.
> >
> > If you test, and have time to piddle, you could try letting wake_wide()
> > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
> > the dispatcher.
> >
> > Numbers from my little desktop box.
> >
> > NO_WAKE_WIDE_IDLE
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 116697.697662
> > clients 12 tps = 115160.230523
> > clients 16 tps = 115569.804548
> > clients 20 tps = 117879.230514
> > clients 24 tps = 118281.753040
> > clients 28 tps = 116974.796627
> > clients 32 tps = 119082.163998 avg 117092.239 1.000
> >
> > WAKE_WIDE_IDLE
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 124351.735754
> > clients 12 tps = 124419.673135
> > clients 16 tps = 125050.716498
> > clients 20 tps = 124813.042352
> > clients 24 tps = 126047.442307
> > clients 28 tps = 125373.719401
> > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
> >
> > WAKE_WIDE_IDLE (clients as well as server)
> > postgres@homer:~> pgbench.sh
> > clients 8 tps = 130539.795246
> > clients 12 tps = 128984.648554
> > clients 16 tps = 130564.386447
> > clients 20 tps = 129149.693118
> > clients 24 tps = 130211.119780
> > clients 28 tps = 130325.355433
> > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
I had a typo in my script, so those desktop box numbers were all run with
the same number of clients. It doesn't invalidate anything, but the
individual deltas are just run-to-run variance... not to mention that a
single-cache box is not all that interesting for this anyway. That
changes when the interconnect becomes a player.
> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
> this solved upstream and then I'll rip out the old and put in the new,
> I'm happy to screw around until we're all happy. I'll throw this in a
> kernel this morning and run stuff today. Barring any issues with the
> testing infrastructure I should have results today. Thanks,
I'll be interested in your results. Taking pgbench to a little NUMA
box, I'm seeing _nada_ outside of variance with master (crap). I have a
way to win significantly for _older_ kernels, and that win over master
_may_ provide some useful insight, but I don't trust postgres/pgbench as
far as I can toss the planet, so I don't have a warm fuzzy about trying to
use it to approximate your real world load.
BTW, what's your topology look like (numactl --hardware)?
-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-06 18:36 ` Mike Galbraith
@ 2015-07-06 19:41 ` Josef Bacik
2015-07-07 4:01 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-06 19:41 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/06/2015 02:36 PM, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote:
>> On 07/06/2015 01:13 AM, Mike Galbraith wrote:
>>> Hm. Piddling with pgbench, which doesn't seem to collapse into a
>>> quivering heap when load exceeds cores these days, deltas weren't all
>>> that impressive, but it does appreciate the extra effort a bit, and a
>>> bit more when clients receive it as well.
>>>
>>> If you test, and have time to piddle, you could try letting wake_wide()
>>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is
>>> the dispatcher.
>>>
>>> Numbers from my little desktop box.
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 116697.697662
>>> clients 12 tps = 115160.230523
>>> clients 16 tps = 115569.804548
>>> clients 20 tps = 117879.230514
>>> clients 24 tps = 118281.753040
>>> clients 28 tps = 116974.796627
>>> clients 32 tps = 119082.163998 avg 117092.239 1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 124351.735754
>>> clients 12 tps = 124419.673135
>>> clients 16 tps = 125050.716498
>>> clients 20 tps = 124813.042352
>>> clients 24 tps = 126047.442307
>>> clients 28 tps = 125373.719401
>>> clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients 8 tps = 130539.795246
>>> clients 12 tps = 128984.648554
>>> clients 16 tps = 130564.386447
>>> clients 20 tps = 129149.693118
>>> clients 24 tps = 130211.119780
>>> clients 28 tps = 130325.355433
>>> clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients. It doesn't invalidate anything, but the
> individual deltas are just run to run variance.. not to mention that
> single cache box is not all that interesting for this anyway. That
> happens when interconnect becomes a player.
>
>> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
>> this solved upstream and then I'll rip out the old and put in the new,
>> I'm happy to screw around until we're all happy. I'll throw this in a
>> kernel this morning and run stuff today. Barring any issues with the
>> testing infrastructure I should have results today. Thanks,
>
> I'll be interested in your results. Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap). I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so don't have a warm fuzzy about trying to
> use it to approximate your real world load.
>
> BTW, what's your topology look like (numactl --hardware).
>
So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
baseline with a slight regression at lower RPS and a slight improvement
at high RPS. I'm running with WAKE_WIDE_IDLE set now; that should be
done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing
next, with those results coming in the morning. Here is the NUMA
information from one of the boxes in the test cluster:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node 0 1
0: 10 20
1: 20 10
Thanks,
Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-06 19:41 ` Josef Bacik
@ 2015-07-07 4:01 ` Mike Galbraith
2015-07-07 9:43 ` [patch] " Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 4:01 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
> baseline with a slight regression at lower RPS and a slight improvement
> at high RPS.
Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
not real keen on the feature unless your numbers are really good, and
odds are that ain't gonna happen.
-Mike
* [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 4:01 ` Mike Galbraith
@ 2015-07-07 9:43 ` Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
2015-07-07 17:06 ` Josef Bacik
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 9:43 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>
> > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
> > baseline with a slight regression at lower RPS and a slight improvement
> > at high RPS.
>
> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
> not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.
More extensive testing in pedantic-man mode increased my confidence in
that enough to sign off and ship the dirt-simple version. Any further
twiddles should grow their own wings if they want to fly anyway; the
simplest form helps your real-world load, as well as the not-so-real
pgbench. My numbers for that are below.
virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12 tps = 96233.854271 1.000
clients 24 tps = 142234.686166 1.000
clients 36 tps = 148433.534531 1.000
clients 48 tps = 133105.634302 1.000
clients 60 tps = 128903.080371 1.000
clients 72 tps = 128591.821782 1.000
clients 84 tps = 114445.967116 1.000
clients 96 tps = 109803.557524 1.000 avg 125219.017 1.000
V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637 1.255
clients 24 tps = 144668.961468 1.017
clients 36 tps = 156705.239251 1.055
clients 48 tps = 152004.886893 1.141
clients 60 tps = 138582.113864 1.075
clients 72 tps = 136286.891104 1.059
clients 84 tps = 137420.986043 1.200
clients 96 tps = 135199.060242 1.231 avg 140207.645 1.119 1.000
V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12 tps = 121821.966162 1.265
clients 24 tps = 146446.388366 1.029
clients 36 tps = 151373.362190 1.019
clients 48 tps = 156806.730746 1.178
clients 60 tps = 133933.491567 1.039
clients 72 tps = 131460.489424 1.022
clients 84 tps = 130859.340261 1.143
clients 96 tps = 130787.476584 1.191 avg 137936.155 1.101 0.983
V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12 tps = 121297.791570
clients 24 tps = 145939.488312
clients 36 tps = 155336.090263
clients 48 tps = 149018.245323
clients 60 tps = 136730.079391
clients 72 tps = 134886.116831
clients 84 tps = 130493.283398
clients 96 tps = 126043.336074
sched: beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.
Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 56 ++++++++++++++++++++++++--------------------------
2 files changed, 29 insertions(+), 31 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (waker_flips < wakee_flips)
+ swap(waker_flips, wakee_flips);
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine)
+ goto select_idle;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+ new_cpu = select_idle_sibling(p, new_cpu);
goto unlock;
}
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 9:43 ` [patch] " Mike Galbraith
@ 2015-07-07 13:40 ` Josef Bacik
2015-07-07 15:24 ` Mike Galbraith
2015-07-07 17:06 ` Josef Bacik
1 sibling, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-07 13:40 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/07/2015 05:43 AM, Mike Galbraith wrote:
> On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
>> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>>
>>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>>> baseline with a slight regression at lower RPS and a slight improvement
>>> at high RPS.
>>
>> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
>> not real keen on the feature unless your numbers are really good, and
>> odds are that ain't gonna happen.
>
> More extensive testing in pedantic-man mode increased my confidence of
> that enough to sign off and ship the dirt simple version. Any further
> twiddles should grow their own wings if they want to fly anyway, the
> simplest form helps your real world load, as well as the not so real
> pgbench, my numbers for that below.
>
The WAKE_WIDE_IDLE run was basically the same, so I'm good with the KISS
version. I'll run that through the load tests this morning and let you
know how it goes. I'm still seeing a slight regression at lower RPS,
but it's like 1-2%, compared to ~15%. Once load ramps up we're good to
go; not sure why that is, but it may also be my sample size (the cluster
is only 32 boxes, the minimum for decent results). Thanks,
Josef
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 13:40 ` Josef Bacik
@ 2015-07-07 15:24 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 15:24 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-07 at 09:40 -0400, Josef Bacik wrote:
> The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS
> version. I'll run that through the load tests this morning and let you
> know how it goes. I'm still seeing a slight regression at lower RPS,
> but it's like 1-2%, compared to ~15%.
If I'm to believe pgbench, and configs really are as close as they
should be, there may be something between 3.12 and master which could
account for your delta and then some. I may go hunting.. heck, if I
trusted pgbench I would be hunting.
patched SLE (resembles 3.12 if you squint)
postgres@nessler:~> pgbench.sh
clients 12 tps = 134795.901777
clients 24 tps = 156955.584523
clients 36 tps = 161450.044316
clients 48 tps = 171181.069044
clients 60 tps = 157896.858179
clients 72 tps = 155816.051709
clients 84 tps = 152456.003468
clients 96 tps = 121129.848008
patched master
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637
clients 24 tps = 144668.961468
clients 36 tps = 156705.239251
clients 48 tps = 152004.886893
clients 60 tps = 138582.113864
clients 72 tps = 136286.891104
clients 84 tps = 137420.986043
clients 96 tps = 135199.060242
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 9:43 ` [patch] " Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
@ 2015-07-07 17:06 ` Josef Bacik
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-07 17:06 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/07/2015 05:43 AM, Mike Galbraith wrote:
> On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
>> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>>
>>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>>> baseline with a slight regression at lower RPS and a slight improvement
>>> at high RPS.
>>
>> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
>> not real keen on the feature unless your numbers are really good, and
>> odds are that ain't gonna happen.
>
> More extensive testing in pedantic-man mode increased my confidence of
> that enough to sign off and ship the dirt simple version. Any further
> twiddles should grow their own wings if they want to fly anyway, the
> simplest form helps your real world load, as well as the not so real
> pgbench, my numbers for that below.
>
> virgin master, 2 socket box
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 96233.854271 1.000
> clients 24 tps = 142234.686166 1.000
> clients 36 tps = 148433.534531 1.000
> clients 48 tps = 133105.634302 1.000
> clients 60 tps = 128903.080371 1.000
> clients 72 tps = 128591.821782 1.000
> clients 84 tps = 114445.967116 1.000
> clients 96 tps = 109803.557524 1.000 avg 125219.017 1.000
>
> V3 (KISS, below)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 120793.023637 1.255
> clients 24 tps = 144668.961468 1.017
> clients 36 tps = 156705.239251 1.055
> clients 48 tps = 152004.886893 1.141
> clients 60 tps = 138582.113864 1.075
> clients 72 tps = 136286.891104 1.059
> clients 84 tps = 137420.986043 1.200
> clients 96 tps = 135199.060242 1.231 avg 140207.645 1.119 1.000
>
> V2 NO_WAKE_WIDE_IDLE
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121821.966162 1.265
> clients 24 tps = 146446.388366 1.029
> clients 36 tps = 151373.362190 1.019
> clients 48 tps = 156806.730746 1.178
> clients 60 tps = 133933.491567 1.039
> clients 72 tps = 131460.489424 1.022
> clients 84 tps = 130859.340261 1.143
> clients 96 tps = 130787.476584 1.191 avg 137936.155 1.101 0.983
>
> V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121297.791570
> clients 24 tps = 145939.488312
> clients 36 tps = 155336.090263
> clients 48 tps = 149018.245323
> clients 60 tps = 136730.079391
> clients 72 tps = 134886.116831
> clients 84 tps = 130493.283398
> clients 96 tps = 126043.336074
>
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to wakeup of the 1:N waker,
> returning 1 only when the 1:N waker is waking one of its minions.
>
> Correct that, and don't bother doing domain traversal when we know
> that all we need to do is check for an idle cpu.
>
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Ok you can add
Tested-by: Josef Bacik <jbacik@fb.com>
The new patch is the best across the board: there are no regressions, and
about a 5% improvement for the bulk of the run (25th percentile and up).
Thanks for your help on this!
Josef
* [patch] sched: beef up wake_wide()
2015-07-07 17:06 ` Josef Bacik
@ 2015-07-08 6:13 ` Mike Galbraith
2015-07-09 13:26 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-08 6:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to wakeup of the 1:N waker,
returning 1 only when the 1:N waker is waking one of its minions.
Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.
While at it, adjust task_struct bits; we don't need a 64-bit counter.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 56 ++++++++++++++++++++++++--------------------------
2 files changed, 29 insertions(+), 31 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int waker_flips = current->wakee_flips;
+ unsigned int wakee_flips = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (waker_flips < wakee_flips)
+ swap(waker_flips, wakee_flips);
+ if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
- if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
rcu_read_lock();
+ if (sd_flag & SD_BALANCE_WAKE) {
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ if (!want_affine)
+ goto select_idle;
+ }
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+ new_cpu = select_idle_sibling(p, new_cpu);
goto unlock;
}
* Re: [patch] sched: beef up wake_wide()
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
@ 2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-09 13:26 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>
> +/*
> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> + * A waker of many should wake a different task than the one last awakened
> + * at a frequency roughly N times higher than one of its wakees. In order
> + * to determine whether we should let the load spread vs consolidating to
> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> + * partner, and a factor of llc_size higher frequency in the other. With
> + * both conditions met, we can be relatively sure that we are seeing a 1:N
> + * relationship, and that load size exceeds socket size.
> + */
> static int wake_wide(struct task_struct *p)
> {
> + unsigned int waker_flips = current->wakee_flips;
> + unsigned int wakee_flips = p->wakee_flips;
> int factor = this_cpu_read(sd_llc_size);
>
> + if (waker_flips < wakee_flips)
> + swap(waker_flips, wakee_flips);
This makes the wakee/waker names useless, the end result is more like
wakee_flips := client_flips, waker_flips := server_flips.
> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> + return 0;
I don't get the first condition... why would the client ever flip? It
only talks to that one server.
> + return 1;
> }
> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> {
> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> int cpu = smp_processor_id();
> + int new_cpu = prev_cpu;
> int want_affine = 0;
> int sync = wake_flags & WF_SYNC;
>
> rcu_read_lock();
> + if (sd_flag & SD_BALANCE_WAKE) {
> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> + if (!want_affine)
> + goto select_idle;
> + }
So this preserves/makes worse the bug Morten spotted, even without
want_affine we should still attempt SD_BALANCE_WAKE if set.
> +
> for_each_domain(cpu, tmp) {
> if (!(tmp->flags & SD_LOAD_BALANCE))
> continue;
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
@ 2015-07-09 14:07 ` Mike Galbraith
2015-07-09 14:46 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> >
> > +/*
> > + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> > + * A waker of many should wake a different task than the one last awakened
> > + * at a frequency roughly N times higher than one of its wakees. In order
> > + * to determine whether we should let the load spread vs consolidating to
> > + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> > + * partner, and a factor of llc_size higher frequency in the other. With
> > + * both conditions met, we can be relatively sure that we are seeing a 1:N
> > + * relationship, and that load size exceeds socket size.
> > + */
> > static int wake_wide(struct task_struct *p)
> > {
> > + unsigned int waker_flips = current->wakee_flips;
> > + unsigned int wakee_flips = p->wakee_flips;
> > int factor = this_cpu_read(sd_llc_size);
> >
> > + if (waker_flips < wakee_flips)
> > + swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.
True, perhaps a rename is in order.
> > + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> > + return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.
So I was thinking too, and I initially cemented the relationship by
flipping both. However, the thing works in virgin source, i.e. clients do
in fact flip, so I removed that cementing based on the hard evidence.
> > @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> > {
> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > int cpu = smp_processor_id();
> > + int new_cpu = prev_cpu;
> > int want_affine = 0;
> > int sync = wake_flags & WF_SYNC;
> >
> > rcu_read_lock();
> > + if (sd_flag & SD_BALANCE_WAKE) {
> > + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > + if (!want_affine)
> > + goto select_idle;
> > + }
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.
Yeah. I can redo it if you want, but it seems a shame to traverse for
nothing given we know SD_BALANCE_WAKE is so painful that nobody
really wants to do that. One has to override the other in any case, no?
-Mike
* Re: [patch] sched: beef up wake_wide()
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-09 14:46 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Thu, 2015-07-09 at 16:07 +0200, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> > On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> > >
> > > +/*
> > > + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> > > + * A waker of many should wake a different task than the one last awakened
> > > + * at a frequency roughly N times higher than one of its wakees. In order
> > > + * to determine whether we should let the load spread vs consolidating to
> > > + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> > > + * partner, and a factor of llc_size higher frequency in the other. With
> > > + * both conditions met, we can be relatively sure that we are seeing a 1:N
> > > + * relationship, and that load size exceeds socket size.
> > > + */
> > > static int wake_wide(struct task_struct *p)
> > > {
> > > + unsigned int waker_flips = current->wakee_flips;
> > > + unsigned int wakee_flips = p->wakee_flips;
> > > int factor = this_cpu_read(sd_llc_size);
> > >
> > > + if (waker_flips < wakee_flips)
> > > + swap(waker_flips, wakee_flips);
> >
> > This makes the wakee/waker names useless, the end result is more like
> > wakee_flips := client_flips, waker_flips := server_flips.
>
> True, perhaps a rename is in order.
I should perhaps add that waker/wakee_flips does make sense to me in the
sense that the task whose flips end up in the waker_flips bucket is our
waker of many vs being one of its many wakees.
-Mike
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
1 sibling, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-10 5:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
> > static int wake_wide(struct task_struct *p)
> > {
> > + unsigned int waker_flips = current->wakee_flips;
> > + unsigned int wakee_flips = p->wakee_flips;
> > int factor = this_cpu_read(sd_llc_size);
> >
> > + if (waker_flips < wakee_flips)
> > + swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.
I settled on master/slave plus hopefully improved comment block.
> > + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> > + return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.
(tightening heuristic up a bit by one means or another would be good,
but "if it ain't broke, don't fix it" applies for this patchlet)
> > @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> > {
> > struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > int cpu = smp_processor_id();
> > + int new_cpu = prev_cpu;
> > int want_affine = 0;
> > int sync = wake_flags & WF_SYNC;
> >
> > rcu_read_lock();
> > + if (sd_flag & SD_BALANCE_WAKE) {
> > + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > + if (!want_affine)
> > + goto select_idle;
> > + }
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.
Fixed. wake_wide() may override want_affine as before, want_affine may
override other ->flags as before, but a surviving domain selection now
results in a full balance instead of a select_idle_sibling() call.
sched: beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64bit counter.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 57 ++++++++++++++++++++++----------------------------
2 files changed, 28 insertions(+), 33 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,29 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,12 +5017,12 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
@@ -5040,6 +5036,8 @@ select_task_rq_fair(struct task_struct *
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
affine_sd = tmp;
+ /* Prefer affinity over any other flags */
+ sd = NULL;
break;
}
@@ -5048,12 +5046,10 @@ select_task_rq_fair(struct task_struct *
}
if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ new_cpu = cpu;
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
- }
+ if ((sd_flag & SD_BALANCE_WAKE) && (!sd || (!(sd->flags & SD_BALANCE_WAKE))))
+ new_cpu = select_idle_sibling(p, new_cpu);
while (sd) {
struct sched_group *group;
@@ -5089,7 +5085,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
* Re: [patch] sched: beef up wake_wide()
2015-07-10 5:19 ` Mike Galbraith
@ 2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-07-10 13:41 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/10/2015 01:19 AM, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> + unsigned int waker_flips = current->wakee_flips;
>>> + unsigned int wakee_flips = p->wakee_flips;
>>> int factor = this_cpu_read(sd_llc_size);
>>>
>>> + if (waker_flips < wakee_flips)
>>> + swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> I settled on master/slave plus hopefully improved comment block.
>
>>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>>> + return 0;
>>
>> I don't get the first condition... why would the client ever flip? It
>> only talks to that one server.
>
> (tightening heuristic up a bit by one means or another would be good,
> but "if it ain't broke, don't fix it" applies for this patchlet)
>
>>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>>> {
>>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>> int cpu = smp_processor_id();
>>> + int new_cpu = prev_cpu;
>>> int want_affine = 0;
>>> int sync = wake_flags & WF_SYNC;
>>>
>>> rcu_read_lock();
>>> + if (sd_flag & SD_BALANCE_WAKE) {
>>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>> + if (!want_affine)
>>> + goto select_idle;
>>> + }
>>
>> So this preserves/makes worse the bug Morten spotted, even without
>> want_affine we should still attempt SD_BALANCE_WAKE if set.
>
> Fixed. wake_wide() may override want_affine as before, want_affine may
> override other ->flags as before, but a surviving domain selection now
> results in a full balance instead of a select_idle_sibling() call.
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to the wakeup of a many partner
> waker, returning 1 only when waking one of its many partners.
>
> Correct that, letting explicit domain flags override the heuristic.
>
> While at it, adjust task_struct bits, we don't need a 64bit counter.
>
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Tested-by: Josef Bacik <jbacik@fb.com>
I'll give this new one a whirl and let you know how it goes. Thanks,
Josef
* Re: [patch] sched: beef up wake_wide()
2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
@ 2015-07-10 20:59 ` Josef Bacik
2015-07-11 3:11 ` Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-10 20:59 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/10/2015 01:19 AM, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> + unsigned int waker_flips = current->wakee_flips;
>>> + unsigned int wakee_flips = p->wakee_flips;
>>> int factor = this_cpu_read(sd_llc_size);
>>>
>>> + if (waker_flips < wakee_flips)
>>> + swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> I settled on master/slave plus hopefully improved comment block.
>
>>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>>> + return 0;
>>
>> I don't get the first condition... why would the client ever flip? It
>> only talks to that one server.
>
> (tightening heuristic up a bit by one means or another would be good,
> but "if it ain't broke, don't fix it" applies for this patchlet)
>
>>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>>> {
>>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>>> int cpu = smp_processor_id();
>>> + int new_cpu = prev_cpu;
>>> int want_affine = 0;
>>> int sync = wake_flags & WF_SYNC;
>>>
>>> rcu_read_lock();
>>> + if (sd_flag & SD_BALANCE_WAKE) {
>>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>>> + if (!want_affine)
>>> + goto select_idle;
>>> + }
>>
>> So this preserves/makes worse the bug Morten spotted, even without
>> want_affine we should still attempt SD_BALANCE_WAKE if set.
>
> Fixed. wake_wide() may override want_affine as before, want_affine may
> override other ->flags as before, but a surviving domain selection now
> results in a full balance instead of a select_idle_sibling() call.
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to the wakeup of a many partner
> waker, returning 1 only when waking one of its many partners.
>
> Correct that, letting explicit domain flags override the heuristic.
>
> While at it, adjust task_struct bits, we don't need a 64bit counter.
>
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
> Tested-by: Josef Bacik <jbacik@fb.com>
Not quite as awesome but still better than the baseline so we're good.
Thanks,
Josef
* Re: [patch] sched: beef up wake_wide()
2015-07-10 20:59 ` Josef Bacik
@ 2015-07-11 3:11 ` Mike Galbraith
2015-07-13 13:53 ` Josef Bacik
2015-07-14 11:19 ` Peter Zijlstra
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-11 3:11 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:
> Not quite as awesome but still better than the baseline so we're good.
> Thanks,
I personally like the other much better. We're not doing the user any
favor by making the thing balance when SD_BALANCE_WAKE is set. Until
such time as it ceases to be horribly painful, we're just paying a few
cycles to help a theoretically existing masochist torture his box.
-Mike
* Re: [patch] sched: beef up wake_wide()
2015-07-11 3:11 ` Mike Galbraith
@ 2015-07-13 13:53 ` Josef Bacik
2015-07-14 11:19 ` Peter Zijlstra
1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-07-13 13:53 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/10/2015 11:11 PM, Mike Galbraith wrote:
> On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:
>
>> Not quite as awesome but still better than the baseline so we're good.
>> Thanks,
>
> I personally like the other much better. We're not doing the user any
> favor by making the thing balance when SD_BALANCE_WAKE is set. Until
> such time as it ceases to be horribly painful, we're just paying a few
> cycles to help a theoretically existing masochist torture his box.
>
Agreed, I'm all for the faster version ;),
Josef
* Re: [patch] sched: beef up wake_wide()
2015-07-11 3:11 ` Mike Galbraith
2015-07-13 13:53 ` Josef Bacik
@ 2015-07-14 11:19 ` Peter Zijlstra
2015-07-14 13:49 ` Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-14 11:19 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Sat, Jul 11, 2015 at 05:11:51AM +0200, Mike Galbraith wrote:
> On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote:
>
> > Not quite as awesome but still better than the baseline so we're good.
> > Thanks,
>
> I personally like the other much better. We're not doing the user any
> favor by making the thing balance when SD_BALANCE_WAKE is set. Until
> such time as it ceases to be horribly painful, we're just paying a few
> cycles to help a theoretically existing masochist torture his box.
OK, how about something like the below; it tightens things up by
applying two rules:
- We really should not continue looking for a balancing domain once
SD_LOAD_BALANCE is not set.
- SD (balance) flags should really be set in a single contiguous range,
always starting at the bottom.
The latter means that if !want_affine and the (first) sd doesn't have
BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
junk you didn't like..
Hmm?
---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <umgwanakikbuti@gmail.com>
Date: Fri, 10 Jul 2015 07:19:26 +0200
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64bit counter.
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Cc: mingo@redhat.com
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 72 +++++++++++++++++++++++---------------------------
2 files changed, 36 insertions(+), 40 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;
/*
* If both cpu and prev_cpu are part of this domain,
@@ -5035,20 +5031,14 @@ select_task_rq_fair(struct task_struct *
*/
if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
- affine_sd = tmp;
- break;
+ /* Prefer affinity over any other flags */
+ goto do_affine;
}
if (tmp->flags & sd_flag)
sd = tmp;
- }
-
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ else if (!want_affine)
+ break;
}
while (sd) {
@@ -5087,8 +5077,14 @@ select_task_rq_fair(struct task_struct *
}
unlock:
rcu_read_unlock();
-
return new_cpu;
+
+do_affine:
+ if (cpu != prev_cpu && wake_affine(tmp, p, sync))
+ new_cpu = cpu;
+
+ new_cpu = select_idle_sibling(p, new_cpu);
+ goto unlock;
}
/*
* Re: [patch] sched: beef up wake_wide()
2015-07-14 11:19 ` Peter Zijlstra
@ 2015-07-14 13:49 ` Mike Galbraith
2015-07-14 14:07 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-14 13:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
> OK, how about something like the below; it tightens things up by
> applying two rules:
>
> - We really should not continue looking for a balancing domain once
> SD_LOAD_BALANCE is not set.
>
> - SD (balance) flags should really be set in a single contiguous range,
> always starting at the bottom.
>
> The latter means that if !want_affine and the (first) sd doesn't have
> BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> junk you didn't like..
>
> Hmm?
Yeah, that's better. It's not a big hairy deal either way, it just bugged
me to knowingly toss those cycles out the window ;-)
select_idle_sibling() looks kinda funny down there, but otoh when the
day comes (hah) that we can just balance, it's closer to the exit.
-Mike
* Re: [patch] sched: beef up wake_wide()
2015-07-14 13:49 ` Mike Galbraith
@ 2015-07-14 14:07 ` Peter Zijlstra
2015-07-14 14:17 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-14 14:07 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
>
> > OK, how about something like the below; it tightens things up by
> > applying two rules:
> >
> > - We really should not continue looking for a balancing domain once
> > SD_LOAD_BALANCE is not set.
> >
> > - SD (balance) flags should really be set in a single contiguous range,
> > always starting at the bottom.
> >
> > The latter means that if !want_affine and the (first) sd doesn't have
> > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> > junk you didn't like..
> >
> > Hmm?
>
> Yeah, that's better. It's not a big hairy deal either way, it just bugged
> me to knowingly toss those cycles out the window ;-)
>
> select_idle_sibling() looks kinda funny down there, but otoh when the
> day comes (hah) that we can just balance, it's closer to the exit.
Right, not too pretty, does this look better?
---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <umgwanakikbuti@gmail.com>
Date: Fri, 10 Jul 2015 07:19:26 +0200
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64bit counter.
Cc: mingo@redhat.com
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 68 +++++++++++++++++++++++---------------------------
2 files changed, 34 insertions(+), 38 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;
/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct *
if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
+ if (affine_sd) { /* Prefer affinity over any other flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
- }
+ new_cpu = select_idle_sibling(p, new_cpu);
- while (sd) {
+ } else while (sd) {
struct sched_group *group;
int weight;
@@ -5085,10 +5081,10 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
- rcu_read_unlock();
+ rcu_read_unlock();
return new_cpu;
+
}
/*
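For readers following the thread, the reworked heuristic is small enough to model in isolation. The sketch below is a userspace approximation, not the kernel code: `llc_size` stands in for `this_cpu_read(sd_llc_size)`, and the flip counts for `current->wakee_flips` / `p->wakee_flips`:

```c
#include <assert.h>

/*
 * Userspace model of the reworked wake_wide(): returns 1 (wake wide,
 * let the load spread) when the apparent waker/wakee partner count
 * exceeds the LLC size, 0 (stay affine) otherwise.
 */
static int wake_wide_model(unsigned int waker_flips,
			   unsigned int wakee_flips,
			   unsigned int llc_size)
{
	unsigned int master = waker_flips;
	unsigned int slave = wakee_flips;

	if (master < slave) {
		/* swap() so that master is always the bigger flipper */
		unsigned int tmp = master;
		master = slave;
		slave = tmp;
	}
	/*
	 * Both partners must flip at least llc_size times per decay
	 * period, and the hot one at a rate llc_size times higher
	 * than the cool one.
	 */
	if (slave < llc_size || master < slave * llc_size)
		return 0;
	return 1;
}
```

With an LLC of 4 CPUs, a dispatcher at 64 flips against a worker at 4 trips the heuristic (64 >= 4 * 4), while two tasks ping-ponging at 4 flips each do not.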
* Re: [patch] sched: beef up wake_wide()
2015-07-14 14:07 ` Peter Zijlstra
@ 2015-07-14 14:17 ` Mike Galbraith
2015-07-14 15:04 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-14 14:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-14 at 16:07 +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote:
> > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote:
> >
> > > OK, how about something like the below; it tightens things up by
> > > applying two rules:
> > >
> > > - We really should not continue looking for a balancing domain once
> > > SD_LOAD_BALANCE is not set.
> > >
> > > - SD (balance) flags should really be set in a single contiguous range,
> > > always starting at the bottom.
> > >
> > > The latter means what if !want_affine and the (first) sd doesn't have
> > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration
> > > junk you didn't like..
> > >
> > > Hmm?
> >
> > Yeah, that's better. It's not big hairy deal either way, it just bugged
> > me to knowingly toss those cycles out the window ;-)
> >
> > select_idle_sibling() looks kinda funny down there, but otoh when the
> > day comes (hah) that we can just balance, it's closer to the exit.
>
> Right, not too pretty, does this look beter?
There's a buglet, I was just about to mention the inverse in the other.
> @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct *
>
> if (tmp->flags & sd_flag)
> sd = tmp;
> + else if (!want_affine)
> + break;
> }
>
> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> - prev_cpu = cpu;
> + if (affine_sd) { /* Prefer affinity over any other flags */
> + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> + new_cpu = cpu;
>
> - if (sd_flag & SD_BALANCE_WAKE) {
> - new_cpu = select_idle_sibling(p, prev_cpu);
> - goto unlock;
> - }
> + new_cpu = select_idle_sibling(p, new_cpu);
We'll not look for an idle cpu when wake_wide() naks want_affine.
-Mike
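To make the buglet concrete: in the variant quoted above, select_idle_sibling() is called only inside `if (affine_sd)`, so a wakeup that wake_wide() pushed out of the affine path, and that finds no domain carrying the requested balance flag, falls out with prev_cpu untouched. A toy model of the two control flows (hypothetical names, heavily simplified; it assumes the exposing case where the slow path's sd ends up NULL):

```c
#include <assert.h>

/*
 * Toy model of the wakeup fast path. Returns 1 if the flow reaches
 * select_idle_sibling(), 0 if it falls through with prev_cpu.
 */
static int searches_idle_v1(int wide)	/* variant quoted above */
{
	int affine_sd = !wide;	/* wake_wide() naks want_affine */

	if (affine_sd)
		return 1;	/* select_idle_sibling() inside if (affine_sd) */
	return 0;		/* sd == NULL: while (sd) never runs, no search */
}

static int searches_idle_fixed(int wide)	/* shape of the follow-up fix */
{
	int affine_sd = !wide;
	int sd = 0;		/* no domain matched the balance flag */

	if (affine_sd)
		sd = 0;		/* prefer wake_affine over balance flags */
	if (!sd)
		return 1;	/* select_idle_sibling() on SD_BALANCE_WAKE */
	return 0;
}
```

A wide wakeup (`wide == 1`) misses the idle search entirely in the first variant, which is exactly the behavior Mike flags.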
* Re: [patch] sched: beef up wake_wide()
2015-07-14 14:17 ` Mike Galbraith
@ 2015-07-14 15:04 ` Peter Zijlstra
2015-07-14 15:39 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-14 15:04 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
> There's a buglet,
> We'll not look for an idle cpu when wake_wide() naks want_affine.
*sigh* indeed.. fixing that'll bring us very close to what we started
out with..
The one XXX there raises the question on whether we should always do
select_idle_sibling() if we do not have a suitable balance flag, or only
on wakeups.
---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <umgwanakikbuti@gmail.com>
Date: Fri, 10 Jul 2015 07:19:26 +0200
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64bit counter.
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Cc: mingo@redhat.com
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 69 ++++++++++++++++++++++++--------------------------
2 files changed, 36 insertions(+), 37 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu)
static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
- struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+ struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;
/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *
if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}
- while (sd) {
+ if (!sd) {
+ if (sd_flags & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;
@@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
* Re: [patch] sched: beef up wake_wide()
2015-07-14 15:04 ` Peter Zijlstra
@ 2015-07-14 15:39 ` Mike Galbraith
2015-07-14 16:01 ` Josef Bacik
2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-14 15:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote:
> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
> > There's a buglet,
>
> > We'll not look for an idle cpu when wake_wide() naks want_affine.
>
> *sigh* indeed.. fixing that'll bring us very close to what we started
> out with..
>
> The one XXX there raises the question on whether we should always do
> select_idle_sibling() if we do not have a suitable balance flag, or only
> on wakeups.
That's what I've been sitting here waffling over, finally convinced
myself that should the user turn FORK/EXEC off, he shouldn't find that a
substitute quietly slipped in.. though otoh.. crap, guess I'm not done
waffling after all. Yeah, this will work just fine ;-)
(typos fixed)
---
Subject: sched: Beef up wake_wide()
From: Mike Galbraith <umgwanakikbuti@gmail.com>
Date: Fri, 10 Jul 2015 07:19:26 +0200
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64bit counter.
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Cc: mingo@redhat.com
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
[peterz: frobbings]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com
---
include/linux/sched.h | 4 +-
kernel/sched/fair.c | 67 ++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 36 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,29 @@ static long effective_load(struct task_g
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5021,17 +5017,17 @@ select_task_rq_fair(struct task_struct *
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;
/*
* If both cpu and prev_cpu are part of this domain,
@@ -5045,17 +5041,21 @@ select_task_rq_fair(struct task_struct *
if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}
- while (sd) {
+ if (!sd) {
+ if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;
@@ -5089,7 +5089,6 @@ select_task_rq_fair(struct task_struct *
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
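A side note on the include/linux/sched.h hunk repeated above: beyond "we don't need a 64bit counter", shrinking wakee_flips to `unsigned int` and moving `last_wakee` below `wakee_flip_decay_ts` plausibly lets the 32-bit counter slot into the alignment hole after `on_cpu` on LP64 targets. A minimal standalone illustration (these structs only mimic the shape of the surrounding task_struct fields, they are not the kernel layout):

```c
#include <assert.h>
#include <stddef.h>

/* Old ordering: the pointer after on_cpu forces 4 bytes of padding on
 * LP64, and wakee_flips burns a full unsigned long. */
struct old_layout {
	int on_cpu;
	void *last_wakee;
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
	int wake_cpu;
};

/* New ordering: the 32-bit wakee_flips packs next to on_cpu, and
 * last_wakee moves below the remaining unsigned long. */
struct new_layout {
	int on_cpu;
	unsigned int wakee_flips;
	unsigned long wakee_flip_decay_ts;
	void *last_wakee;
	int wake_cpu;
};
```

On a typical x86-64 build the new ordering is 8 bytes smaller; on ILP32 the two come out the same, so the reorder costs nothing either way.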
* Re: [patch] sched: beef up wake_wide()
2015-07-14 15:39 ` Mike Galbraith
@ 2015-07-14 16:01 ` Josef Bacik
2015-07-14 17:59 ` Mike Galbraith
2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith
1 sibling, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-07-14 16:01 UTC (permalink / raw)
To: Mike Galbraith, Peter Zijlstra
Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/14/2015 11:39 AM, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote:
>> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote:
>>> There's a buglet,
>>
>>> We'll not look for an idle cpu when wake_wide() naks want_affine.
>>
>> *sigh* indeed.. fixing that'll bring us very close to what we started
>> out with..
>>
>> The one XXX there raises the question on whether we should always do
>> select_idle_sibling() if we do not have a suitable balance flag, or only
>> on wakeups.
>
> That's what I've been sitting here waffling over, finally convinced
> myself that should the user turn FORK/EXEC off, he shouldn't find that a
> substitute quietly slipped in.. though otoh.. crap, guess I'm not done
> waffling after all. Yeah, this will work just fine ;-)
>
> (typos fixed)
>
We happy with this or should I wait for more patches to fly by before I
test something ;)?
Josef
* Re: [patch] sched: beef up wake_wide()
2015-07-14 16:01 ` Josef Bacik
@ 2015-07-14 17:59 ` Mike Galbraith
2015-07-15 17:11 ` Josef Bacik
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-14 17:59 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote:
> We happy with this or should I wait for more patches to fly by before I
> test something ;)?
Yeah, it should be The End time.
-Mike
* Re: [patch] sched: beef up wake_wide()
2015-07-14 17:59 ` Mike Galbraith
@ 2015-07-15 17:11 ` Josef Bacik
0 siblings, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-07-15 17:11 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team
On 07/14/2015 01:59 PM, Mike Galbraith wrote:
> On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote:
>
>> We happy with this or should I wait for more patches to fly by before I
>> test something ;)?
>
> Yeah, it should be The End time.
It's showing the same thing as one of the many earlier patches: it's
better at high RPS, but at low RPS there's a 3% regression. Thanks,
Josef
* [tip:sched/core] sched/fair: Beef up wake_wide()
2015-07-14 15:39 ` Mike Galbraith
2015-07-14 16:01 ` Josef Bacik
@ 2015-08-03 17:07 ` tip-bot for Mike Galbraith
1 sibling, 0 replies; 73+ messages in thread
From: tip-bot for Mike Galbraith @ 2015-08-03 17:07 UTC (permalink / raw)
To: linux-tip-commits
Cc: tglx, linux-kernel, hpa, Kernel-team, jbacik, mingo, peterz,
torvalds, umgwanakikbuti, efault
Commit-ID: 63b0e9edceec10fa41ec33393a1515a5ff444277
Gitweb: http://git.kernel.org/tip/63b0e9edceec10fa41ec33393a1515a5ff444277
Author: Mike Galbraith <umgwanakikbuti@gmail.com>
AuthorDate: Tue, 14 Jul 2015 17:39:50 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 3 Aug 2015 12:21:23 +0200
sched/fair: Beef up wake_wide()
Josef Bacik reported that Facebook sees better performance with their
1:N load (1 dispatch/node, N workers/node) when carrying an old patch
to try very hard to wake to an idle CPU. While looking at wake_wide(),
I noticed that it doesn't pay attention to the wakeup of a many partner
waker, returning 1 only when waking one of its many partners.
Correct that, letting explicit domain flags override the heuristic.
While at it, adjust task_struct bits, we don't need a 64-bit counter.
Tested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
[ Tidy things up. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team <Kernel-team@fb.com>
Cc: morten.rasmussen@arm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
include/linux/sched.h | 4 +--
kernel/sched/fair.c | 67 +++++++++++++++++++++++++--------------------------
2 files changed, 35 insertions(+), 36 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7412070..65a8a86 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,9 +1359,9 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
- struct task_struct *last_wakee;
- unsigned long wakee_flips;
+ unsigned int wakee_flips;
unsigned long wakee_flip_decay_ts;
+ struct task_struct *last_wakee;
int wake_cpu;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b384b8d..ea23f9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,26 +4726,29 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
+/*
+ * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that the relationship is
+ * non-monogamous, with partner count exceeding socket size. Waker/wakee
+ * being client/server, worker/dispatcher, interrupt source or whatever is
+ * irrelevant, spread criteria is apparent partner count exceeds socket size.
+ */
static int wake_wide(struct task_struct *p)
{
+ unsigned int master = current->wakee_flips;
+ unsigned int slave = p->wakee_flips;
int factor = this_cpu_read(sd_llc_size);
- /*
- * Yeah, it's the switching-frequency, could means many wakee or
- * rapidly switch, use factor here will just help to automatically
- * adjust the loose-degree, so bigger node will lead to more pull.
- */
- if (p->wakee_flips > factor) {
- /*
- * wakee is somewhat hot, it needs certain amount of cpu
- * resource, so if waker is far more hot, prefer to leave
- * it alone.
- */
- if (current->wakee_flips > (factor * p->wakee_flips))
- return 1;
- }
-
- return 0;
+ if (master < slave)
+ swap(master, slave);
+ if (slave < factor || master < slave * factor)
+ return 0;
+ return 1;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
unsigned long weight;
int balanced;
- /*
- * If we wake multiple tasks be careful to not bounce
- * ourselves around too much.
- */
- if (wake_wide(p))
- return 0;
-
idx = sd->wake_idx;
this_cpu = smp_processor_id();
prev_cpu = task_cpu(p);
@@ -5017,17 +5013,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int new_cpu = cpu;
+ int new_cpu = prev_cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
if (sd_flag & SD_BALANCE_WAKE)
- want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+ want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
- continue;
+ break;
/*
* If both cpu and prev_cpu are part of this domain,
@@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (tmp->flags & sd_flag)
sd = tmp;
+ else if (!want_affine)
+ break;
}
- if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
- prev_cpu = cpu;
-
- if (sd_flag & SD_BALANCE_WAKE) {
- new_cpu = select_idle_sibling(p, prev_cpu);
- goto unlock;
+ if (affine_sd) {
+ sd = NULL; /* Prefer wake_affine over balance flags */
+ if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+ new_cpu = cpu;
}
- while (sd) {
+ if (!sd) {
+ if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
+ new_cpu = select_idle_sibling(p, new_cpu);
+
+ } else while (sd) {
struct sched_group *group;
int weight;
@@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
-unlock:
rcu_read_unlock();
return new_cpu;
Thread overview: 73+ messages
2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik
2015-05-28 3:46 ` Mike Galbraith
2015-05-28 9:49 ` Morten Rasmussen
2015-05-28 10:57 ` Mike Galbraith
2015-05-28 11:48 ` Morten Rasmussen
2015-05-28 11:49 ` Mike Galbraith
2015-05-28 10:21 ` Peter Zijlstra
2015-05-28 11:05 ` Peter Zijlstra
2015-05-28 14:27 ` Josef Bacik
2015-05-29 21:03 ` Josef Bacik
2015-05-30 3:55 ` Mike Galbraith
2015-06-01 19:38 ` Josef Bacik
2015-06-01 20:42 ` Peter Zijlstra
2015-06-01 21:03 ` Josef Bacik
2015-06-02 17:12 ` Josef Bacik
2015-06-03 14:12 ` Rik van Riel
2015-06-03 14:24 ` Peter Zijlstra
2015-06-03 14:49 ` Josef Bacik
2015-06-03 15:30 ` Mike Galbraith
2015-06-03 15:57 ` Josef Bacik
2015-06-03 16:53 ` Mike Galbraith
2015-06-03 17:16 ` Josef Bacik
2015-06-03 17:43 ` Mike Galbraith
2015-06-03 20:34 ` Josef Bacik
2015-06-04 4:52 ` Mike Galbraith
2015-06-01 22:15 ` Rik van Riel
2015-06-11 20:33 ` Josef Bacik
2015-06-12 3:42 ` Rik van Riel
2015-06-12 5:35 ` Mike Galbraith
2015-06-17 18:06 ` Josef Bacik
2015-06-18 0:55 ` Mike Galbraith
2015-06-18 3:46 ` Josef Bacik
2015-06-18 4:12 ` Mike Galbraith
2015-07-02 17:44 ` Josef Bacik
2015-07-03 6:40 ` Mike Galbraith
2015-07-03 9:29 ` Mike Galbraith
2015-07-04 15:57 ` Mike Galbraith
2015-07-05 7:17 ` Mike Galbraith
2015-07-06 5:13 ` Mike Galbraith
2015-07-06 14:34 ` Josef Bacik
2015-07-06 18:36 ` Mike Galbraith
2015-07-06 19:41 ` Josef Bacik
2015-07-07 4:01 ` Mike Galbraith
2015-07-07 9:43 ` [patch] " Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
2015-07-07 15:24 ` Mike Galbraith
2015-07-07 17:06 ` Josef Bacik
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
2015-07-09 14:46 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
2015-07-11 3:11 ` Mike Galbraith
2015-07-13 13:53 ` Josef Bacik
2015-07-14 11:19 ` Peter Zijlstra
2015-07-14 13:49 ` Mike Galbraith
2015-07-14 14:07 ` Peter Zijlstra
2015-07-14 14:17 ` Mike Galbraith
2015-07-14 15:04 ` Peter Zijlstra
2015-07-14 15:39 ` Mike Galbraith
2015-07-14 16:01 ` Josef Bacik
2015-07-14 17:59 ` Mike Galbraith
2015-07-15 17:11 ` Josef Bacik
2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith
2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith
2015-05-28 11:49 ` Ingo Molnar
2015-05-28 12:15 ` Mike Galbraith
2015-05-28 12:19 ` Peter Zijlstra
2015-05-28 12:29 ` Ingo Molnar
2015-05-28 15:22 ` David Ahern
2015-05-28 11:55 ` Srikar Dronamraju