* [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-27 21:22 UTC
To: riel, mingo, peterz, linux-kernel
3 replies; 73+ messages in thread

[ sorry if you get this twice, it seems like the first submission got lost ]

At Facebook we have a heavily multi-threaded application that is sensitive
to latency.  We have been pulling forward the old SD_WAKE_IDLE code because
it gives us a significant performance gain (around 20%).  It turns out this
is because there are cases where the scheduler puts our task on a busy CPU
when there are idle CPUs in the system.  We verified this by reading
cpu_delay_req_avg_us from the scheduler's netlink delay-accounting
interface; with our patch we get much lower numbers than baseline.

SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, but it only
looks for an idle sibling, preferring affinity over all else.  This is not
helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu,
not guarantee affinity.  Fix this by first trying to find an idle sibling,
and then, if that cpu is not idle, falling through to the logic that finds
an idle cpu.  With this patch we get slightly better performance than with
our forward port of SD_WAKE_IDLE.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 241213b..03dafa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		if (idle_cpu(new_cpu))
+			goto unlock;
 	}
 
 	while (sd) {
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28  3:46 UTC
To: Josef Bacik; Cc: riel, mingo, peterz, linux-kernel

On Wed, 2015-05-27 at 17:22 -0400, Josef Bacik wrote:
> [...]
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, but it only
> looks for an idle sibling, preferring affinity over all else.  This is not
> helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle cpu,
> not guarantee affinity.

The job description isn't really "find idle", it's "find least loaded".

> @@ -4766,7 +4766,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> 
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		new_cpu = select_idle_sibling(p, prev_cpu);
> -		goto unlock;
> +		if (idle_cpu(new_cpu))
> +			goto unlock;
>  	}
> 
>  	while (sd) {

Instead of doing what for most wakeups will be a redundant idle_cpu() call,
perhaps a couple of cycles can be saved if you move the sd assignment above
the affine_sd assignment, and say if (!sd || idle_cpu(new_cpu))?

You could also stop find_idlest_group() at the first completely idle group
to shave cycles off the not fully committed search.  It ain't likely to
find a negative load.. cool as that would be ;-)

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Morten Rasmussen @ 2015-05-28  9:49 UTC
To: Mike Galbraith; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, May 28, 2015 at 04:46:38AM +0100, Mike Galbraith wrote:
> The job description isn't really "find idle", it's "find least loaded".

And make sure that the task doesn't migrate away from any data that might
still be in the last-level cache?

IIUC, the goal of SD_BALANCE_WAKE is changed from finding the least loaded
target cpu that shares last-level cache with the previous cpu, to finding
an idle cpu, preferring ones that share the last-level cache but extending
the search beyond sd_llc if necessary.

> Instead of doing what for most wakeups will be a redundant idle_cpu()
> call, perhaps a couple of cycles can be saved if you move the sd
> assignment above the affine_sd assignment, and say
> if (!sd || idle_cpu(new_cpu))?

Isn't sd == NULL in most cases if you don't move the sd assignment before
the affine_sd assignment?  The break after the affine_sd assignment means
that sd isn't assigned under 'normal' circumstances (sibling waking cpu, no
restrictions in tsk_cpus_allowed(p), and SD_WAKE_AFFINE set), which causes
the while (sd) loop to be bypassed, and we end up returning the result of
select_idle_sibling() anyway.  I must be missing something?

Morten
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28 10:57 UTC
To: Morten Rasmussen; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, 2015-05-28 at 10:49 +0100, Morten Rasmussen wrote:
> Isn't sd == NULL in most cases if you don't move the sd assignment
> before the affine_sd assignment?

sd will usually be NULL regardless of where the assignment is, as
SD_BALANCE_WAKE is usually off in ->flags.  Josef is turning it on.

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Morten Rasmussen @ 2015-05-28 11:48 UTC
To: Mike Galbraith; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, May 28, 2015 at 11:57:44AM +0100, Mike Galbraith wrote:
> sd will usually be NULL regardless of where the assignment is, as
> SD_BALANCE_WAKE is usually off in ->flags.  Josef is turning it on.

Right.  SD_BALANCE_WAKE needs to be set in the sd flags, and the assignment
has to happen before the break, for this to work.  I just don't see
SD_BALANCE_WAKE being enabled for the sched_domain anywhere in the patch?
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-28 11:49 UTC
To: Morten Rasmussen; Cc: Josef Bacik, riel, mingo, peterz, linux-kernel

On Thu, 2015-05-28 at 12:48 +0100, Morten Rasmussen wrote:
> Right.  SD_BALANCE_WAKE needs to be set in the sd flags, and the
> assignment has to happen before the break, for this to work.  I just
> don't see SD_BALANCE_WAKE being enabled for the sched_domain anywhere
> in the patch?

He's doing that via the proc interface.. well, I presume he is anyway.

	-Mike
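[Editorial note] Mike's guess about the proc interface can be made concrete. This is a sketch, not a verified recipe: the 0x10 bit value for SD_BALANCE_WAKE and the exact sysctl layout are assumptions about kernels of this era (CONFIG_SCHED_DEBUG is required for the files to exist), and the script only prints the commands it would run rather than writing anything.

```shell
# Assumed flag value for kernels of this era; check include/linux/sched.h.
SD_BALANCE_WAKE=0x10

add_balance_wake() {
    # OR the SD_BALANCE_WAKE bit into an existing flags value
    echo $(( $1 | SD_BALANCE_WAKE ))
}

# With CONFIG_SCHED_DEBUG, per-domain flags appear under
# /proc/sys/kernel/sched_domain/.  Print the (root-only) update commands.
for f in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; do
    [ -r "$f" ] || continue        # skip if SCHED_DEBUG is not enabled
    echo "echo $(add_balance_wake "$(cat "$f")") > $f"
done
```

This matches the thread's observation that the patch itself never sets SD_BALANCE_WAKE; it only changes behavior on systems where the flag was enabled out of band.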
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-05-28 10:21 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel

On Wed, May 27, 2015 at 05:22:16PM -0400, Josef Bacik wrote:
> SD_BALANCE_WAKE is supposed to find us an idle cpu to run on, however

sd->flags's SD_BALANCE_WAKE, not sd_flags.  And note that SD_BALANCE_WAKE
is not set on domains by default.

> it is just looking for an idle sibling, preferring affinity over all
> else.

Not sure what you're saying here, what affinity?

> This is not helpful in all cases, and SD_BALANCE_WAKE's job is
> to find us an idle cpu, not guarantee affinity.

Your argument is going backwards, SD_BALANCE_WAKE is not actually set.

> Fix this by first trying to find an idle sibling, and then if the cpu is
> not idle fall through to the logic to find an idle cpu.  With this patch
> we get slightly better performance than with our forward port of
> SD_WAKE_IDLE.

This is broken.  You most certainly do not want to go do that whole load
balance pass on wakeups.  It should be controlled by sd->flags; it is far
too expensive to consider turning that on by default.

In fact, select_idle_sibling() is already too expensive on current server
hardware (far too damn many cpus in an LLC domain).
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-05-28 11:05 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen

So maybe you want something like the below; that cures the thing Morten
raised, and we continue looking for sd even after we found affine_sd.

It also avoids the pointless idle_cpu() check Mike raised, by making
select_idle_sibling() return -1 if it doesn't find anything.

Then it continues doing the full balance IFF sd was set, which is keyed
off of sd->flags.

And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
CPUs, it looks for the least loaded CPU.  And it's damn expensive.

Rewriting this entire thing is somewhere on the todo list :/

---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d597aea43030..7330fb4fea9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1385,8 +1385,13 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
-	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+	if (!cur) {
+		int cpu = select_idle_sibling(env->p, env->dst_cpu);
+		if (cpu < 0)
+			cpu = env->dst_cpu;
+
+		env->dst_cpu = cpu;
+	}
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -4951,16 +4956,15 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
+			return cpumask_first_and(sched_group_cpus(sg),
 					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-done:
-	return target;
+	return -1;
 }
+
 /*
  * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
@@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		 * If both cpu and prev_cpu are part of this domain,
 		 * cpu is a valid SD_WAKE_AFFINE target.
 		 */
-		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
-		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
+		if (want_affine && !affine_sd &&
+		    (tmp->flags & SD_WAKE_AFFINE) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
 			affine_sd = tmp;
-			break;
-		}
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
+		else if (!want_affine || (want_affine && affine_sd))
+			break;
 	}
 
-	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
+		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
+	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		int tmp = select_idle_sibling(p, prev_cpu);
+		if (tmp >= 0) {
+			new_cpu = tmp;
+			goto unlock;
+		}
 	}
 
 	while (sd) {
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-28 14:27 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen

On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
> And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle
> CPUs, it looks for the least loaded CPU.  And it's damn expensive.

Sorry, I was just assuming based on the commit message from when WAKE_IDLE
was removed; this isn't my area.

> Rewriting this entire thing is somewhere on the todo list :/

Thanks, I'm building and deploying this so I can run our perf test; I'll
have results in ~3 hours.

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-05-29 21:03 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
> So maybe you want something like the below; that cures the thing Morten
> raised, and we continue looking for sd even after we found affine_sd.
> [...]

Summarizing what I've found so far:

- We turn on SD_BALANCE_WAKE on our domains for our 3.10 boxes, but not
  for our 4.0 boxes (due to some weird configuration issue).

- Running with this patch is better than plain 4.0 but not as good as my
  patch; running with SD_BALANCE_WAKE set and not set makes no difference
  to the runs.

- I took out the sd = NULL; bit from the affine case like you said on IRC
  and I get similar results as before.

- I'm thoroughly confused as to why my patch did anything, since we
  weren't turning on SD_BALANCE_WAKE on 4.0 in my previous runs (I assume;
  it isn't set now, so I'm pretty sure the problem has always been there).
  We should have always had sd == NULL, which means we would only ever
  have gotten the task cpu, I guess.

Now I'm looking at the code in select_idle_sibling and we do this

	for_each_lower_domain(sd) {
		sg = sd->groups;
		do {
			if (!cpumask_intersects(sched_group_cpus(sg),
						tsk_cpus_allowed(p)))
				goto next;

			for_each_cpu(i, sched_group_cpus(sg)) {
				if (i == target || !idle_cpu(i))
					goto next;
			}

			return cpumask_first_and(sched_group_cpus(sg),
						 tsk_cpus_allowed(p));
next:
			sg = sg->next;
		} while (sg != sd->groups);
	}

We walk all the scheduling groups of the scheduling domain, and if any of
the group's cpus is not idle (or is the target) we skip the whole group.
Isn't a scheduling group a group of CPUs?  Why can't we pick an idle CPU
in a group that also contains a non-idle cpu or the target cpu?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith @ 2015-05-30  3:55 UTC
To: Josef Bacik; Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Fri, 2015-05-29 at 17:03 -0400, Josef Bacik wrote:
> We walk all the scheduling groups of the scheduling domain, and if any
> of the group's cpus is not idle (or is the target) we skip the whole
> group.  Isn't a scheduling group a group of CPUs?  Why can't we pick an
> idle CPU in a group that also contains a non-idle cpu or the target cpu?

We select an idle core if we can get one.  Yes, that leaves a pile of SMT
threads not checked/selected, but if you're gonna do a full search of a
large socket (humongous sparc-thing, shudder), you may as well eat the
BALANCE_WAKE overhead.

	-Mike
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-01 19:38 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 05/29/2015 05:03 PM, Josef Bacik wrote:
> On 05/28/2015 07:05 AM, Peter Zijlstra wrote:
>> So maybe you want something like the below; that cures the thing Morten
>> raised, and we continue looking for sd even after we found affine_sd.
>> [...]

Ok, I got this patch to give me the same performance as all our other crap,
I just need to apply this incremental:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..e11cfec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
-		else if (!want_affine || (want_affine && affine_sd))
-			break;
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
-		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
 	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {

And everything works fine.  Does that seem reasonable?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-06-01 20:42 UTC
To: Josef Bacik; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On Mon, 2015-06-01 at 15:38 -0400, Josef Bacik wrote:
> Ok, I got this patch to give me the same performance as all our other
> crap, I just need to apply this incremental:
>
>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
> -		else if (!want_affine || (want_affine && affine_sd))
> -			break;
>  	}

That bit worries me a bit, because it causes us to have a weird definition
for what sd is.

Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
other sd_flag) set.

But with WAKE_AFFINE, it's the first domain that satisfies the wake affine
constraint of covering both the previous and waking cpu.  It basically
reduces sd to affine_sd.
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-01 21:03 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> That bit worries me a bit, because it causes us to have a weird
> definition for what sd is.
>
> Without WAKE_AFFINE, sd is the biggest domain with BALANCE_WAKE (or any
> other sd_flag) set.
>
> But with WAKE_AFFINE, it's the first domain that satisfies the wake
> affine constraint of covering both the previous and waking cpu.  It
> basically reduces sd to affine_sd.

I'm just giving you what made the performance good for us, I'm relying on
you to make it sane ;).  With that bit we break out, if !want_affine, as
soon as we don't find an sd that matches sd_flag, which seems less than
helpful.  I was going to change it to

	else if (want_affine && affine_sd)

but the thing is we don't do anything with affine_sd unless the check
below matches:

	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))

so it seems like we're leaving ourselves without sd set in a few cases
where we'd actually want it set, so I just nuked the whole thing and
carried on.

I'm all ears for other ideas; I'm just letting you know what my results
are, since you are the guys who actually understand this stuff.  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Josef Bacik @ 2015-06-02 17:12 UTC
To: Peter Zijlstra; Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 04:42 PM, Peter Zijlstra wrote:
> That bit worries me a bit, because it causes us to have a weird
> definition for what sd is.
> [...]

Ok, I took Rik's idea and came out with this:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b71eb2b..75073d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4761,13 +4761,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 		if (tmp->flags & sd_flag)
 			sd = tmp;
-		else if (!want_affine || (want_affine && affine_sd))
+		else if (want_affine && affine_sd) {
+			sd = affine_sd;
 			break;
+		}
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
 		prev_cpu = cpu;
-		sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
 	}
 
 	if (sd_flag & SD_BALANCE_WAKE) {

But now that I re-read your response, I think this is even more what you
were worried about, not less.

Basically it comes down to: if sd isn't set then we get shit performance.
I realize that this magic to find an idle cpu when sd is set is pretty
heavy handed, but it's obviously helpful in our case.

So let me ask this question.  When do we want to do the heavy handed
search and when do we not?  With WAKE_AFFINE, what is our ultimate goal
vs the other SDs?  If we don't have an sd that matches our sd_flags, what
should we be doing?  Should we just go with whatever cpu we're on and
carry on?  Thanks,

Josef
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Rik van Riel @ 2015-06-03 14:12 UTC
To: Josef Bacik, Peter Zijlstra; Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/02/2015 01:12 PM, Josef Bacik wrote:
> But now that I re-read your response, I think this is even more what
> you were worried about, not less.
>
> Basically it comes down to: if sd isn't set then we get shit
> performance.  I realize that this magic to find an idle cpu when sd is
> set is pretty heavy handed, but it's obviously helpful in our case.

Ingo and Peter appear to be worried that searching through too many idle
CPUs leads to bad performance.  Your test results seem to indicate that
finding an idle CPU really, really helps performance.

There is a policy vs mechanism thing here.  Ingo and Peter are worried
about the overhead in the mechanism of finding an idle CPU.  Your
measurements show that the policy of finding an idle CPU is the correct
one.

Can we make the policy change now, and optimize the mechanism later?
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Peter Zijlstra @ 2015-06-03 14:24 UTC
To: Rik van Riel; Cc: Josef Bacik, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
> There is a policy vs mechanism thing here.  Ingo and Peter are worried
> about the overhead in the mechanism of finding an idle CPU.  Your
> measurements show that the policy of finding an idle CPU is the correct
> one.

For his workload; I'm sure I can find a workload where it hurts.

In fact, I'm fairly sure Mike knows one from the top of his head, seeing
how he's the one playing about trying to shrink that idle search :-)
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 14:24 ` Peter Zijlstra @ 2015-06-03 14:49 ` Josef Bacik 2015-06-03 15:30 ` Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 14:49 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/03/2015 10:24 AM, Peter Zijlstra wrote:
> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote:
>
>> There is a policy vs mechanism thing here.  Ingo and Peter
>> are worried about the overhead in the mechanism of finding
>> an idle CPU.  Your measurements show that the policy of
>> finding an idle CPU is the correct one.
>
> For his workload; I'm sure I can find a workload where it hurts.
>
> In fact, I'm fairly sure Mike knows one from the top of his head, seeing
> how he's the one playing about trying to shrink that idle search :-)
>

So the perf bench sched microbenchmarks are a pretty good analog for our
workload.  I run

  perf bench sched messaging -g 100 -l 10000
  perf bench sched pipe

5 times each and average the results to get an answer; the messaging one
is the closest analog and the one I look at.  I get like 56 seconds of
runtime on plain 4.0 and 47 seconds patched, it's how I check my little
experiments before doing the full real workload.

I don't want to tune the scheduler just for our workload, but the
microbenchmarks we have are also showing the same performance
improvements.  I would be super interested in workloads where this patch
doesn't help so we could integrate that workload into perf sched bench
to make us more confident in making policy changes in the scheduler.  So
Mike, if you have something specific in mind please elaborate and I'm
happy to do the legwork to get it into perf bench and to test things
until we're happy.
In the meantime I really want to get this fixed for us, I do not want to pull some weird old patch around for the next year until we rebase again next year, and then do this whole dance again. What would be the way forward for getting this fixed now? Do I need to hide it behind a sysctl or config option? Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
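[Editor's note: the two microbenchmark runs Josef describes above are easy to wrap in a small harness. The sketch below is illustrative only; `avg_runs` is a made-up helper, and the parsing assumes the benchmark prints a `Total time: <secs> [sec]` summary line the way `perf bench` does, so adjust the awk pattern for other tools.]

```shell
# avg_runs N CMD...: run CMD N times and average the "Total time" seconds.
# Sketch only -- assumes a "Total time: <secs> [sec]" output line.
avg_runs() {
    local runs="$1"; shift
    local i t sum=0
    for i in $(seq "$runs"); do
        # run the benchmark once and pull out the seconds field
        t=$("$@" 2>&1 | awk '/Total time:/ {print $3}')
        sum=$(awk -v a="$sum" -v b="$t" 'BEGIN { printf "%.3f", a + b }')
    done
    # print the mean over all runs
    awk -v s="$sum" -v n="$runs" 'BEGIN { printf "%.3f", s / n }'
}

# e.g.: avg_runs 5 perf bench sched messaging -g 100 -l 10000
#       avg_runs 5 perf bench sched pipe
```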
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 14:24 ` Peter Zijlstra 2015-06-03 14:49 ` Josef Bacik @ 2015-06-03 15:30 ` Mike Galbraith 2015-06-03 15:57 ` Josef Bacik 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 15:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, Josef Bacik, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: > On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: > > > There is a policy vs mechanism thing here. Ingo and Peter > > are worried about the overhead in the mechanism of finding > > an idle CPU. Your measurements show that the policy of > > finding an idle CPU is the correct one. > > For his workload; I'm sure I can find a workload where it hurts. > > In fact, I'm fairly sure Mike knows one from the top of his head, seeing > how he's the one playing about trying to shrink that idle search :-) Like anything where scheduling latency doesn't heavily dominate. Even if searching were free, bounces aren't, even for the very light. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 15:30 ` Mike Galbraith @ 2015-06-03 15:57 ` Josef Bacik 2015-06-03 16:53 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-03 15:57 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/03/2015 11:30 AM, Mike Galbraith wrote: > On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: >> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: >> >>> There is a policy vs mechanism thing here. Ingo and Peter >>> are worried about the overhead in the mechanism of finding >>> an idle CPU. Your measurements show that the policy of >>> finding an idle CPU is the correct one. >> >> For his workload; I'm sure I can find a workload where it hurts. >> >> In fact, I'm fairly sure Mike knows one from the top of his head, seeing >> how he's the one playing about trying to shrink that idle search :-) > > Like anything where scheduling latency doesn't heavily dominate. Even > if searching were free, bounces aren't, even for the very light. > If scheduling latency doesn't hurt then making the search shouldn't matter should it? I get that migrations aren't free, but it seems like they can't hurt that much. This application is huge, it's our webserver, we're doing like 400 requests per second on these things, and hands down moving stuff to idle cpus is beating the pants off of staying on the same cpu. Is there a specific workload I could build a test for that you think this approach would hurt? Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 15:57 ` Josef Bacik @ 2015-06-03 16:53 ` Mike Galbraith 2015-06-03 17:16 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 16:53 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote: > On 06/03/2015 11:30 AM, Mike Galbraith wrote: > > On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: > >> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: > >> > >>> There is a policy vs mechanism thing here. Ingo and Peter > >>> are worried about the overhead in the mechanism of finding > >>> an idle CPU. Your measurements show that the policy of > >>> finding an idle CPU is the correct one. > >> > >> For his workload; I'm sure I can find a workload where it hurts. > >> > >> In fact, I'm fairly sure Mike knows one from the top of his head, seeing > >> how he's the one playing about trying to shrink that idle search :-) > > > > Like anything where scheduling latency doesn't heavily dominate. Even > > if searching were free, bounces aren't, even for the very light. > > > > If scheduling latency doesn't hurt then making the search shouldn't > matter should it? I get that migrations aren't free, but it seems like > they can't hurt that much. Nah, they don't hurt much :) commit e0a79f529d5ba2507486d498b25da40911d95cf6 Author: Mike Galbraith <bitbucket@online.de> Date: Mon Jan 28 12:19:25 2013 +0100 sched: Fix select_idle_sibling() bouncing cow syndrome If the previous CPU is cache affine and idle, select it. The current implementation simply traverses the sd_llc domain, taking the first idle CPU encountered, which walks buddy pairs hand in hand over the package, inflicting excruciating pain. 
1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs > This application is huge, it's our > webserver, we're doing like 400 requests per second on these things, and > hands down moving stuff to idle cpus is beating the pants off of staying > on the same cpu. Is there a specific workload I could build a test for > that you think this approach would hurt? Thanks, Search cost hurts fast movers, as does dragging even a small footprint all over the place, as you can see above. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 16:53 ` Mike Galbraith @ 2015-06-03 17:16 ` Josef Bacik 2015-06-03 17:43 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-03 17:16 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/03/2015 12:53 PM, Mike Galbraith wrote: > On Wed, 2015-06-03 at 11:57 -0400, Josef Bacik wrote: >> On 06/03/2015 11:30 AM, Mike Galbraith wrote: >>> On Wed, 2015-06-03 at 16:24 +0200, Peter Zijlstra wrote: >>>> On Wed, 2015-06-03 at 10:12 -0400, Rik van Riel wrote: >>>> >>>>> There is a policy vs mechanism thing here. Ingo and Peter >>>>> are worried about the overhead in the mechanism of finding >>>>> an idle CPU. Your measurements show that the policy of >>>>> finding an idle CPU is the correct one. >>>> >>>> For his workload; I'm sure I can find a workload where it hurts. >>>> >>>> In fact, I'm fairly sure Mike knows one from the top of his head, seeing >>>> how he's the one playing about trying to shrink that idle search :-) >>> >>> Like anything where scheduling latency doesn't heavily dominate. Even >>> if searching were free, bounces aren't, even for the very light. >>> >> >> If scheduling latency doesn't hurt then making the search shouldn't >> matter should it? I get that migrations aren't free, but it seems like >> they can't hurt that much. > > Nah, they don't hurt much :) > > commit e0a79f529d5ba2507486d498b25da40911d95cf6 > Author: Mike Galbraith <bitbucket@online.de> > Date: Mon Jan 28 12:19:25 2013 +0100 > > sched: Fix select_idle_sibling() bouncing cow syndrome > > If the previous CPU is cache affine and idle, select it. > > The current implementation simply traverses the sd_llc domain, > taking the first idle CPU encountered, which walks buddy pairs > hand in hand over the package, inflicting excruciating pain. 
> > 1 tbench pair (worst case) in a 10 core + SMT package: > > pre 15.22 MB/sec 1 procs > post 252.01 MB/sec 1 procs > > >> This application is huge, it's our >> webserver, we're doing like 400 requests per second on these things, and >> hands down moving stuff to idle cpus is beating the pants off of staying >> on the same cpu. Is there a specific workload I could build a test for >> that you think this approach would hurt? Thanks, > > Search cost hurts fast movers, as does dragging even a small footprint > all over the place, as you can see above. > Eesh ok, do you happen to remember how you ran tbench so I can add it to my tests here? In addition to fixing this problem we're also interested in tracking performance of new kernels so we don't have to do this "what the hell went wrong in the last 6 releases" dance every year, so I'm throwing every performance thing we find useful in our test infrastructure. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 17:16 ` Josef Bacik @ 2015-06-03 17:43 ` Mike Galbraith 2015-06-03 20:34 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-03 17:43 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote: > Eesh ok, do you happen to remember how you ran tbench so I can add it to > my tests here? In addition to fixing this problem we're also interested > in tracking performance of new kernels so we don't have to do this "what > the hell went wrong in the last 6 releases" dance every year, so I'm > throwing every performance thing we find useful in our test > infrastructure. Thanks, > > Josef Start a tbench server, then tbench -t 30 1 localhost. You're unlikely to find anything as painful as that bouncing cow bug was, but you won't have to look hard at all to find bounce pain. There are also other loads like your server where waking to an idle cpu dominates all else, pgbench is one of those. In that case, you've got a 1:N waker/wakee relationship, and what matters above ALL else is when the mother of all work (the single server thread) wants a CPU, it had better get it NOW, else the load stalls. Likewise, 'mom' being preempted hurts truckloads. Perhaps your server has a similar thing going on, keeping wakees the hell away from the waker rules all. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
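[Editor's note: Mike's worst-case pair amounts to two commands. Here it is sketched as a small wrapper; `run_pair` is a made-up helper, and it assumes `tbench`/`tbench_srv` (from the dbench package) are in PATH and that tbench prints a `Throughput <n> MB/sec ...` summary line.]

```shell
# run_pair [SECS]: start a loopback tbench server, run a single client
# pair against it, and print just the MB/sec figure.  Sketch only.
run_pair() {
    local secs="${1:-30}" srv
    tbench_srv &                 # loopback server
    srv=$!
    sleep 1                      # give it a moment to start listening
    # one client pair; keep only the throughput number
    tbench -t "$secs" 1 localhost | awk '/^Throughput/ {print $2}'
    kill "$srv" 2>/dev/null
}
```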
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 17:43 ` Mike Galbraith @ 2015-06-03 20:34 ` Josef Bacik 2015-06-04 4:52 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-03 20:34 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On 06/03/2015 01:43 PM, Mike Galbraith wrote:
> On Wed, 2015-06-03 at 13:16 -0400, Josef Bacik wrote:
>
>> Eesh ok, do you happen to remember how you ran tbench so I can add it to
>> my tests here?  In addition to fixing this problem we're also interested
>> in tracking performance of new kernels so we don't have to do this "what
>> the hell went wrong in the last 6 releases" dance every year, so I'm
>> throwing every performance thing we find useful in our test
>> infrastructure.  Thanks,
>>
>> Josef
>
> Start a tbench server, then tbench -t 30 1 localhost.  You're unlikely
> to find anything as painful as that bouncing cow bug was, but you won't
> have to look hard at all to find bounce pain.
>
> There are also other loads like your server where waking to an idle cpu
> dominates all else, pgbench is one of those.  In that case, you've got a
> 1:N waker/wakee relationship, and what matters above ALL else is when
> the mother of all work (the single server thread) wants a CPU, it had
> better get it NOW, else the load stalls.  Likewise, 'mom' being
> preempted hurts truckloads.  Perhaps your server has a similar thing
> going on, keeping wakees the hell away from the waker rules all.
>

Yeah our server has two waker threads (one per numa node) and then the N
number of wakee threads.  I'll run tbench and pgbench with the new
patches and see if there's a degradation.  Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-03 20:34 ` Josef Bacik @ 2015-06-04 4:52 ` Mike Galbraith 0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-06-04 4:52 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, Rik van Riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Wed, 2015-06-03 at 16:34 -0400, Josef Bacik wrote:
> On 06/03/2015 01:43 PM, Mike Galbraith wrote:

> > There are also other loads like your server where waking to an idle cpu
> > dominates all else, pgbench is one of those.  In that case, you've got a
> > 1:N waker/wakee relationship, and what matters above ALL else is when
> > the mother of all work (the single server thread) wants a CPU, it had
> > better get it NOW, else the load stalls.  Likewise, 'mom' being
> > preempted hurts truckloads.  Perhaps your server has a similar thing
> > going on, keeping wakees the hell away from the waker rules all.
> >
>
> Yeah our server has two waker threads (one per numa node) and then the N
> number of wakee threads.  I'll run tbench and pgbench with the new
> patches and see if there's a degradation.  Thanks,

If you look for wake_wide(), it could perhaps be used to select wider
search for only the right flavor load component when BALANCE_WAKE is
set.  That would let the cache lovers in your box continue to perform
while improving the 1:N component.

That wider search still needs to become cheaper though, low hanging
fruit being to stop searching when you find load = 0.. but you may meet
the energy efficient folks, who iirc want to make it even more
expensive.

wake_wide() inadvertently helped another sore spot btw - a gaggle of
pretty light tasks being awakened from an interrupt source tended to
cluster around that source, preventing such loads from being all they
can be in a very similar manner.  Xen (shudder;) showed that nicely in
older kernels, due to the way its weird dom0 gizmo works.
-Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-01 19:38 ` Josef Bacik 2015-06-01 20:42 ` Peter Zijlstra @ 2015-06-01 22:15 ` Rik van Riel 1 sibling, 0 replies; 73+ messages in thread
From: Rik van Riel @ 2015-06-01 22:15 UTC (permalink / raw)
To: Josef Bacik, Peter Zijlstra
Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team

On 06/01/2015 03:38 PM, Josef Bacik wrote:
> Ok I got this patch to give me the same performance as all our other
> crap, just need to apply this incremental
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b71eb2b..e11cfec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4761,13 +4761,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
> -               else if (!want_affine || (want_affine && affine_sd))
> -                       break;
>         }
>
>         if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>                 prev_cpu = cpu;
> -               sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */

Given Peter's worries about wake_affine and affine_sd, should
the above be sd = affine_sd, in case select_idle_sibling cannot
find an idle sibling?

That way we can attempt to at least find an idle cpu inside the
affine_sd.

Of course, there may be subtleties here I am overlooking...

>         }
>
>         if (sd_flag & SD_BALANCE_WAKE) {

--
All rights reversed

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:05 ` Peter Zijlstra 2015-05-28 14:27 ` Josef Bacik 2015-05-29 21:03 ` Josef Bacik @ 2015-06-11 20:33 ` Josef Bacik 2015-06-12 3:42 ` Rik van Riel 2015-06-12 5:35 ` Mike Galbraith 3 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-11 20:33 UTC (permalink / raw) To: Peter Zijlstra Cc: riel, mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team On 05/28/2015 07:05 AM, Peter Zijlstra wrote: > > So maybe you want something like the below; that cures the thing Morten > raised, and we continue looking for sd, even after we found affine_sd. > > It also avoids the pointless idle_cpu() check Mike raised by making > select_idle_sibling() return -1 if it doesn't find anything. > > Then it continues doing the full balance IFF sd was set, which is keyed > off of sd->flags. > > And note (as Mike already said), BALANCE_WAKE does _NOT_ look for idle > CPUs, it looks for the least loaded CPU. And its damn expensive. > > > Rewriting this entire thing is somewhere on the todo list :/ > Ugh I'm sorry, I've been running tests trying to get the numbers to look good when I noticed I was getting some inconsistencies in my results. Turns out I never actually tested your patch just plain, I had been testing it with BALANCE_WAKE, because I was under the assumption that was what was best for our workload. Since then I had fixed all of our scripts and such and noticed that it actually super duper sucks for us. So testing with this original patch everything is significantly better (this is with the default SD flags set, no changes at all). So now that I've wasted a good bit of my time and everybody elses, can we go about pushing this patch upstream? If you are happy with it the way it is I'll go ahead and pull it into our kernels and just watch to make sure it ends upstream at some point. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-11 20:33 ` Josef Bacik @ 2015-06-12 3:42 ` Rik van Riel 0 siblings, 0 replies; 73+ messages in thread From: Rik van Riel @ 2015-06-12 3:42 UTC (permalink / raw) To: Josef Bacik, Peter Zijlstra Cc: mingo, linux-kernel, umgwanakikbuti, morten.rasmussen, kernel-team On 06/11/2015 04:33 PM, Josef Bacik wrote: > Ugh I'm sorry, I've been running tests trying to get the numbers to look > good when I noticed I was getting some inconsistencies in my results. > Turns out I never actually tested your patch just plain, I had been > testing it with BALANCE_WAKE, because I was under the assumption that > was what was best for our workload. Since then I had fixed all of our > scripts and such and noticed that it actually super duper sucks for us. > So testing with this original patch everything is significantly better > (this is with the default SD flags set, no changes at all). > > So now that I've wasted a good bit of my time and everybody elses, can > we go about pushing this patch upstream? If you are happy with it the > way it is I'll go ahead and pull it into our kernels and just watch to > make sure it ends upstream at some point. Thanks, FWIW, Jirka has run some tests with the patch as well, and seen significant performance improvements on various tests, on various systems. Merging the patch makes perfect sense to me. Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:05 ` Peter Zijlstra ` (2 preceding siblings ...) 2015-06-11 20:33 ` Josef Bacik @ 2015-06-12 5:35 ` Mike Galbraith 2015-06-17 18:06 ` Josef Bacik 3 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-06-12 5:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen

On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:

> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>                  * If both cpu and prev_cpu are part of this domain,
>                  * cpu is a valid SD_WAKE_AFFINE target.
>                  */
> -               if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
> -                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
> +               if (want_affine && !affine_sd &&
> +                   (tmp->flags & SD_WAKE_AFFINE) &&
> +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>                         affine_sd = tmp;
> -                       break;
> -               }
>
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
> +               else if (!want_affine || (want_affine && affine_sd))
> +                       break;
>         }

Hm, new_cpu == cpu.

> -       if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
> +       if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) {
>                 prev_cpu = cpu;
> +               sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */
> +       }

If branch above is not taken, new_cpu remains cpu.

>         if (sd_flag & SD_BALANCE_WAKE) {
> -               new_cpu = select_idle_sibling(p, prev_cpu);
> -               goto unlock;
> +               int tmp = select_idle_sibling(p, prev_cpu);
> +               if (tmp >= 0) {
> +                       new_cpu = tmp;
> +                       goto unlock;
> +               }
>         }

If select_idle_sibling() returns -1, new_cpu remains cpu.

>
>         while (sd) {

If sd == NULL, we fall through and try to pull wakee despite nacked-by
tsk_cpus_allowed() or wake_affine().

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-12 5:35 ` Mike Galbraith @ 2015-06-17 18:06 ` Josef Bacik 2015-06-18 0:55 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-06-17 18:06 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen On 06/11/2015 10:35 PM, Mike Galbraith wrote: > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > >> @@ -5022,22 +5026,28 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f >> * If both cpu and prev_cpu are part of this domain, >> * cpu is a valid SD_WAKE_AFFINE target. >> */ >> - if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && >> - cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { >> + if (want_affine && !affine_sd && >> + (tmp->flags & SD_WAKE_AFFINE) && >> + cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) >> affine_sd = tmp; >> - break; >> - } >> >> if (tmp->flags & sd_flag) >> sd = tmp; >> + else if (!want_affine || (want_affine && affine_sd)) >> + break; >> } > > Hm, new_cpu == cpu. > >> - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) >> + if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) { >> prev_cpu = cpu; >> + sd = NULL; /* WAKE_AFFINE trumps BALANCE_WAKE */ >> + } > > If branch above is not taken, new_cpu remains cpu. > >> if (sd_flag & SD_BALANCE_WAKE) { >> - new_cpu = select_idle_sibling(p, prev_cpu); >> - goto unlock; >> + int tmp = select_idle_sibling(p, prev_cpu); >> + if (tmp >= 0) { >> + new_cpu = tmp; >> + goto unlock; >> + } >> } > > If select_idle_sibling() returns -1, new_cpu remains cpu. > >> >> while (sd) { > > If sd == NULL, we fall through and try to pull wakee despite nacked-by > tsk_cpus_allowed() or wake_affine(). 
>

So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
like this

        if (tmp >= 0) {
                new_cpu = tmp;
                goto unlock;
        } else if (!want_affine) {
                new_cpu = prev_cpu;
        }

so we can make sure we're not being pushed onto a cpu that we aren't
allowed on?  Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-17 18:06 ` Josef Bacik @ 2015-06-18 0:55 ` Mike Galbraith 2015-06-18 3:46 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-18 0:55 UTC (permalink / raw) To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: > On 06/11/2015 10:35 PM, Mike Galbraith wrote: > > On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > > If sd == NULL, we fall through and try to pull wakee despite nacked-by > > tsk_cpus_allowed() or wake_affine(). > > > > So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something > like this > > if (tmp >= 0) { > new_cpu = tmp; > goto unlock; > } else if (!want_affine) { > new_cpu = prev_cpu; > } > > so we can make sure we're not being pushed onto a cpu that we aren't > allowed on? Thanks, The buglet is a messenger methinks. You saying the patch helped without SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say that preferring cache is not harming your load after all. It now sounds as though wake_wide() may be what you're squabbling with. Things aren't adding up all that well. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 0:55 ` Mike Galbraith @ 2015-06-18 3:46 ` Josef Bacik 2015-06-18 4:12 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread
From: Josef Bacik @ 2015-06-18 3:46 UTC (permalink / raw)
To: Mike Galbraith
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen

On 06/17/2015 05:55 PM, Mike Galbraith wrote:
> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote:
>> On 06/11/2015 10:35 PM, Mike Galbraith wrote:
>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote:
>
>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by
>>> tsk_cpus_allowed() or wake_affine().
>>>
>>
>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something
>> like this
>>
>> if (tmp >= 0) {
>>         new_cpu = tmp;
>>         goto unlock;
>> } else if (!want_affine) {
>>         new_cpu = prev_cpu;
>> }
>>
>> so we can make sure we're not being pushed onto a cpu that we aren't
>> allowed on?  Thanks,
>
> The buglet is a messenger methinks.  You saying the patch helped without
> SD_BALANCE_WAKE being set is why I looked.  The buglet would seem to say
> that preferring cache is not harming your load after all.  It now sounds
> as though wake_wide() may be what you're squabbling with.
>
> Things aren't adding up all that well.

Yeah I'm horribly confused.  The other thing is I had to switch clusters
(I know, I know, I'm changing the parameters of the test).  So these new
boxes are haswell boxes, but basically the same otherwise, 2 socket 12
core with HT, just newer/faster CPUs.  I'll re-run everything again and
give the numbers so we're all on the same page again, but as it stands
now I think we have this

3.10 with wake_idle forward ported - good
4.0 stock - 20% perf drop
4.0 w/ Peter's patch - good
4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop

I can do all these iterations again to verify, is there any other
permutation you'd like to see?
Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 3:46 ` Josef Bacik @ 2015-06-18 4:12 ` Mike Galbraith 2015-07-02 17:44 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-06-18 4:12 UTC (permalink / raw) To: Josef Bacik; +Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote: > On 06/17/2015 05:55 PM, Mike Galbraith wrote: > > On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: > >> On 06/11/2015 10:35 PM, Mike Galbraith wrote: > >>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: > > > >>> If sd == NULL, we fall through and try to pull wakee despite nacked-by > >>> tsk_cpus_allowed() or wake_affine(). > >>> > >> > >> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something > >> like this > >> > >> if (tmp >= 0) { > >> new_cpu = tmp; > >> goto unlock; > >> } else if (!want_affine) { > >> new_cpu = prev_cpu; > >> } > >> > >> so we can make sure we're not being pushed onto a cpu that we aren't > >> allowed on? Thanks, > > > > The buglet is a messenger methinks. You saying the patch helped without > > SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say > > that preferring cache is not harming your load after all. It now sounds > > as though wake_wide() may be what you're squabbling with. > > > > Things aren't adding up all that well. > > Yeah I'm horribly confused. The other thing is I had to switch clusters > (I know, I know, I'm changing the parameters of the test). So these new > boxes are haswell boxes, but basically the same otherwise, 2 socket 12 > core with HT, just newer/faster CPUs. 
> I'll re-run everything again and
> give the numbers so we're all on the same page again, but as it stands
> now I think we have this
>
> 3.10 with wake_idle forward ported - good
> 4.0 stock - 20% perf drop
> 4.0 w/ Peter's patch - good
> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop
>
> I can do all these iterations again to verify, is there any other
> permutation you'd like to see?  Thanks,

Yeah, after re-baseline, please apply/poke these buttons individually in
4.0-virgin.

(cat /sys/kernel/debug/sched_features, prepend NO_, echo it back)

---
 kernel/sched/fair.c     | 4 ++--
 kernel/sched/features.h | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4506,7 +4506,7 @@ static int wake_affine(struct sched_doma
         * If we wake multiple tasks be careful to not bounce
         * ourselves around too much.
         */
-       if (wake_wide(p))
+       if (sched_feat(WAKE_WIDE) && wake_wide(p))
                return 0;

        idx = sd->wake_idx;
@@ -4682,7 +4682,7 @@ static int select_idle_sibling(struct ta
        struct sched_group *sg;
        int i = task_cpu(p);

-       if (idle_cpu(target))
+       if (!sched_feat(PREFER_IDLE) || idle_cpu(target))
                return target;

        /*
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -59,6 +59,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(PREFER_IDLE, true)
+SCHED_FEAT(WAKE_WIDE, true)

 /*
  * Apply the automatic NUMA scheduling policy. Enabled automatically

^ permalink raw reply [flat|nested] 73+ messages in thread
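[Editor's note: Mike's cat/prepend-NO_/echo recipe can be wrapped as below. The debugfs path is the real knob, but `toggle_feat` is a made-up helper; note that the real knob parses the written token and flips only that feature, whereas a plain stand-in file is simply overwritten. Flipping the real knob needs root and a mounted debugfs.]

```shell
# toggle_feat FEAT off|on: flip a scheduler feature via sched_features.
# Hypothetical helper; FEATURES_FILE defaults to the real debugfs knob.
FEATURES_FILE="${FEATURES_FILE:-/sys/kernel/debug/sched_features}"

toggle_feat() {
    local feat="$1" state="$2"
    case "$state" in
        # the kernel interprets a NO_-prefixed token as "turn it off"
        off) echo "NO_$feat" > "$FEATURES_FILE" ;;
        on)  echo "$feat"    > "$FEATURES_FILE" ;;
        *)   echo "usage: toggle_feat FEAT off|on" >&2; return 1 ;;
    esac
}

# e.g.: cat /sys/kernel/debug/sched_features   # see current flags
#       toggle_feat WAKE_WIDE off              # test NO_WAKE_WIDE
```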
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-06-18 4:12 ` Mike Galbraith @ 2015-07-02 17:44 ` Josef Bacik 2015-07-03 6:40 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-02 17:44 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 06/18/2015 12:12 AM, Mike Galbraith wrote: > On Wed, 2015-06-17 at 20:46 -0700, Josef Bacik wrote: >> On 06/17/2015 05:55 PM, Mike Galbraith wrote: >>> On Wed, 2015-06-17 at 11:06 -0700, Josef Bacik wrote: >>>> On 06/11/2015 10:35 PM, Mike Galbraith wrote: >>>>> On Thu, 2015-05-28 at 13:05 +0200, Peter Zijlstra wrote: >>> >>>>> If sd == NULL, we fall through and try to pull wakee despite nacked-by >>>>> tsk_cpus_allowed() or wake_affine(). >>>>> >>>> >>>> So maybe add a check in the if (sd_flag & SD_BALANCE_WAKE) for something >>>> like this >>>> >>>> if (tmp >= 0) { >>>> new_cpu = tmp; >>>> goto unlock; >>>> } else if (!want_affine) { >>>> new_cpu = prev_cpu; >>>> } >>>> >>>> so we can make sure we're not being pushed onto a cpu that we aren't >>>> allowed on? Thanks, >>> >>> The buglet is a messenger methinks. You saying the patch helped without >>> SD_BALANCE_WAKE being set is why I looked. The buglet would seem to say >>> that preferring cache is not harming your load after all. It now sounds >>> as though wake_wide() may be what you're squabbling with. >>> >>> Things aren't adding up all that well. >> >> Yeah I'm horribly confused. The other thing is I had to switch clusters >> (I know, I know, I'm changing the parameters of the test). So these new >> boxes are haswell boxes, but basically the same otherwise, 2 socket 12 >> core with HT, just newer/faster CPUs. 
I'll re-run everything again and >> give the numbers so we're all on the same page again, but as it stands >> now I think we have this >> >> 3.10 with wake_idle forward ported - good >> 4.0 stock - 20% perf drop >> 4.0 w/ Peter's patch - good >> 4.0 w/ Peter's patch + SD_BALANCE_WAKE - 5% perf drop >> >> I can do all these iterations again to verify, is there any other >> permutation you'd like to see? Thanks, > > Yeah, after re-baseline, please apply/poke these buttons individually in > 4.0-virgin. > > (cat /sys/kernel/debug/sched_features, prepend NO_, echo it back) > Sorry it took me a while to get these numbers to you, migrating the whole fleet to a new setup broke the performance test suite thing so I've only just been able to run tests again. I'll do my best to describe what is going on and hopefully that will make the results make sense. This is on our webservers, which is HHVM. A request comes in for a page and this goes onto one of the two hhvm.node.# threads, one thread per NUMA node. From there it is farmed off to one of the worker threads. If there are no idle workers the request gets put on what is called the "select_queue". Basically the select_queue should never be larger than 0 in a perfect world. If it's more than we've hit latency somewhere and that's not good. The other measurement we care about is how long a thread spends on a request before it sends a response (this would be the actual work being done). Our tester slowly increases load to a group of servers until the select queue is consistently >= 1. That means we've loaded the boxes so high that they can't process the requests as soon as they've come in. Then it backs down and then ramps up a second time. It takes all of these measurements and puts them into these pretty graphs. There are 2 graphs we care about, the duration of the requests vs the requests per second and the probability that our select queue is >= 1 vs requests per second. 
Now for 3.10 vs 4.0 our request duration time is the same if not slightly better on 4.0, so once the workers are doing their job everything is a-ok. The problem is the probability the select queue >= 1 is way different on 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up to some RPS (requests per second) threshold and then shoots up to 100% after the threshold. I'll make a table of these graphs that hopefully makes sense, the numbers are different from run to run because of traffic and such, the test and control are both run at the same time. The header is the probability the select queue >=1 25% 50% 75% 4.0 plain: 371 388 402 control: 386 394 402 difference: 15 6 0 So with 4.0 its basically a straight line, at lower RPS we are getting a higher probability of a select queue >= 1. We are measuring the cpu delay avg ms thing from the scheduler netlink stuff which is how I noticed it was scheduler related, our cpu delay is way higher on 4.0 than it is on 3.10 or 4.0 with the wake idle patch. So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain 25% 50% 75% NO_PREFER_IDLE: 399 401 414 control: 385 408 416 difference: 14 7 2 The numbers don't really show it well, but the graphs are closer together, it's slightly more s shaped, but still not great. Next is NO_WAKE_WIDE, which is horrible 25% 50% 75% NO_WAKE_WIDE: 315 344 369 control: 373 380 388 difference: 58 36 19 This isn't even in the same ballpark, it's a way worse regression than plain. The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad 25% 50% 75% EVERYTHING: 327 360 383 control: 381 390 399 difference: 54 30 19 Hopefully that helps. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-02 17:44 ` Josef Bacik @ 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-03 6:40 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Thu, 2015-07-02 at 13:44 -0400, Josef Bacik wrote: > Now for 3.10 vs 4.0 our request duration time is the same if not > slightly better on 4.0, so once the workers are doing their job > everything is a-ok. > > The problem is the probability the select queue >= 1 is way different on > 4.0 vs 3.10. Normally this graph looks like an S, it's essentially 0 up > to some RPS (requests per second) threshold and then shoots up to 100% > after the threshold. I'll make a table of these graphs that hopefully > makes sense, the numbers are different from run to run because of > traffic and such, the test and control are both run at the same time. > The header is the probability the select queue >=1 > > 25% 50% 75% > 4.0 plain: 371 388 402 > control: 386 394 402 > difference: 15 6 0 So control is 3.10? Virgin? > So with 4.0 its basically a straight line, at lower RPS we are getting a > higher probability of a select queue >= 1. We are measuring the cpu > delay avg ms thing from the scheduler netlink stuff which is how I > noticed it was scheduler related, our cpu delay is way higher on 4.0 > than it is on 3.10 or 4.0 with the wake idle patch. > > So the next test is NO_PREFER_IDLE. This is slightly better than 4.0 plain > 25% 50% 75% > NO_PREFER_IDLE: 399 401 414 > control: 385 408 416 > difference: 14 7 2 Hm. Throttling nohz may make larger delta. But never mind that. > The numbers don't really show it well, but the graphs are closer > together, it's slightly more s shaped, but still not great. 
> > Next is NO_WAKE_WIDE, which is horrible > > 25% 50% 75% > NO_WAKE_WIDE: 315 344 369 > control: 373 380 388 > difference: 58 36 19 > > This isn't even in the same ballpark, it's a way worse regression than > plain. Ok, this jibes perfectly with 1:N waker/wakee thingy. > The next bit is NO_WAKE_WIDE|NO_PREFER_IDLE, which is just as bad > > 25% 50% 75% > EVERYTHING: 327 360 383 > control: 381 390 399 > difference: 54 30 19 Ditto. Hm. Seems what this load should like best is if we detect 1:N, skip all of the routine gyrations, ie move the N (workers) infrequently, expend search cycles frequently only on the 1 (dispatch). Ponder.. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-03 6:40 ` Mike Galbraith @ 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-03 9:29 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote: > Hm. Seems what this load should like best is if we detect 1:N, skip all > of the routine gyrations, ie move the N (workers) infrequently, expend > search cycles frequently only on the 1 (dispatch). > > Ponder.. While taking a refresher peek at the wake_wide() thing, seems it's not really paying attention when the waker of many is awakened. I wonder if your load would see more benefit if it watched like so.. rashly assuming I didn't wreck it completely (iow, completely untested). --- kernel/sched/fair.c | 36 ++++++++++++++++++++++-------------- 1 file changed, 22 insertions(+), 14 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4586,10 +4586,23 @@ static void record_wakee(struct task_str current->wakee_flips >>= 1; current->wakee_flip_decay_ts = jiffies; } + if (time_after(jiffies, p->wakee_flip_decay_ts + HZ)) { + p->wakee_flips >>= 1; + p->wakee_flip_decay_ts = jiffies; + } if (current->last_wakee != p) { current->last_wakee = p; current->wakee_flips++; + /* + * Flip the buddy as well. It's the ratio of flips + * with a socket size decayed cutoff that determines + * whether the pair are considered to be part of 1:N + * or M*N loads of a size that we need to spread, so + * ensure flips of both load components. The waker + * of many will have many more flips than its wakees. 
+ */ + p->wakee_flips++; } } @@ -4732,24 +4745,19 @@ static long effective_load(struct task_g static int wake_wide(struct task_struct *p) { + unsigned long max = max(current->wakee_flips, p->wakee_flips); + unsigned long min = min(current->wakee_flips, p->wakee_flips); int factor = this_cpu_read(sd_llc_size); /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. + * Yeah, it's a switching-frequency heuristic, and could mean the + * intended many wakees/waker relationship, or rapidly switching + * between a few. Use factor to try to automatically adjust such + * that the load spreads when it grows beyond what will fit in llc. */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (min < factor) + return 0; + return max > min * factor; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith @ 2015-07-04 15:57 ` Mike Galbraith 2015-07-05 7:17 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-04 15:57 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-03 at 08:40 +0200, Mike Galbraith wrote: > Hm. Seems what this load should like best is if we detect 1:N, skip all > of the routine gyrations, ie move the N (workers) infrequently, expend > search cycles frequently only on the 1 (dispatch). > > Ponder.. Since it was too hot to do outside chores (any excuse will do;)... If we're (read /me) on track, the below should help. Per my tracing, it may want a wee bit of toning down actually, though when I trace virgin source I expect to see the same, namely Xorg and friends having "wide-load" tattooed across their hindquarters earlier than they should. It doesn't seem to hurt anything, but then demolishing a single llc box is a tad more difficult than demolishing a NUMA box. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when waking one of its N minions. Correct that, and give the user the option to do an expensive balance IFF select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the 1:N dispatcher of work, thus worth some extra effort. 
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> --- kernel/sched/fair.c | 89 +++++++++++++++++++++++++----------------------- kernel/sched/features.h | 6 +++ 2 files changed, 54 insertions(+), 41 deletions(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -666,7 +666,7 @@ static u64 sched_vslice(struct cfs_rq *c } #ifdef CONFIG_SMP -static int select_idle_sibling(struct task_struct *p, int cpu); +static int select_idle_sibling(struct task_struct *p, int cpu, void *clear); static unsigned long task_h_load(struct task_struct *p); static inline void __update_task_entity_contrib(struct sched_entity *se); @@ -1375,7 +1375,7 @@ static void task_numa_compare(struct tas * Call select_idle_sibling to maybe find a better one. */ if (!cur) - env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu); + env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu, NULL); assign: task_numa_assign(env, cur, imp); @@ -4730,26 +4730,30 @@ static long effective_load(struct task_g #endif +/* + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolidating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of llc_size higher frequency in the other. With + * both conditions met, we can be relatively sure that we are seeing a 1:N + * relationship, and that load size exceeds socket size. + */ static int wake_wide(struct task_struct *p) { - int factor = this_cpu_read(sd_llc_size); - - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. 
- */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; + unsigned long waker_flips = current->wakee_flips; + unsigned long wakee_flips = p->wakee_flips; + int factor = this_cpu_read(sd_llc_size), ret = 1; + + if (waker_flips < wakee_flips) { + swap(waker_flips, wakee_flips); + /* Tell the caller that we're waking a 1:N waker */ + ret += sched_feat(WAKE_WIDE_BALANCE); } - - return 0; + if (wakee_flips < factor || waker_flips < wakee_flips * factor) + return 0; + return ret; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. - */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -4935,20 +4932,22 @@ find_idlest_cpu(struct sched_group *grou /* * Try and locate an idle CPU in the sched_domain. */ -static int select_idle_sibling(struct task_struct *p, int target) +static int select_idle_sibling(struct task_struct *p, int target, void *clear) { struct sched_domain *sd; struct sched_group *sg; int i = task_cpu(p); if (idle_cpu(target)) - return target; + goto done; /* * If the prevous cpu is cache affine and idle, don't be stupid. */ - if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) - return i; + if (i != target && cpus_share_cache(i, target) && idle_cpu(i)) { + target = i; + goto done; + } /* * Otherwise, iterate the domains and find an elegible idle cpu. 
@@ -4973,7 +4972,11 @@ static int select_idle_sibling(struct ta sg = sg->next; } while (sg != sd->groups); } + return target; done: + if (clear) + *(void **)clear = 0; + return target; } /* @@ -5021,14 +5024,19 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; - int want_affine = 0; + int new_cpu = prev_cpu; + int want_affine = 0, want_balance = 0; int sync = wake_flags & WF_SYNC; - if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); - rcu_read_lock(); + if (sd_flag & SD_BALANCE_WAKE) { + want_affine = wake_wide(p); + want_balance = want_affine > 1; + want_affine = !want_affine && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + if (!want_affine && !want_balance) + goto select; + } + for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) continue; @@ -5043,23 +5051,23 @@ select_task_rq_fair(struct task_struct * break; } - if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag || want_balance) sd = tmp; } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; +select: + new_cpu = select_idle_sibling(p, new_cpu, &sd); } while (sd) { struct sched_group *group; int weight; - if (!(sd->flags & sd_flag)) { + if (!(sd->flags & sd_flag) && !want_balance) { sd = sd->child; continue; } @@ -5089,7 +5097,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) */ SCHED_FEAT(NUMA_RESIST_LOWER, false) #endif + +/* + * Perform expensive full wake balance for 1:N wakers when the + * selected cpu is not completely idle. 
+ */ +SCHED_FEAT(WAKE_WIDE_BALANCE, false) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-04 15:57 ` Mike Galbraith @ 2015-07-05 7:17 ` Mike Galbraith 2015-07-06 5:13 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-05 7:17 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Sat, 2015-07-04 at 17:57 +0200, Mike Galbraith wrote: > If we're (read /me) on track, the below should help. Per my tracing, > it may want a wee bit of toning down actually, though when I trace > virgin source I expect to see the same, namely Xorg and friends having > "wide-load" tattooed across their hindquarters earlier than they should. That's true, the only difference is that the virgin kernel will still try to pull the multi-waker when it is being awakened, which I can easily imagine going either way performance wise, depending on a pile of variables.. so let's just ask your box. V2: drop select_idle_sibling() changes (ick), try to save some cycles. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions. Correct that, and give the user the option to do an expensive balance IFF select_idle_sibling() doesn't find an idle CPU, and IFF the wakee is the 1:N waker, the dispatcher of work, thus worth some extra effort. Don't drill down through all domains though, stop searching at highest, we'll either have found the desired completely idle CPU, or if heavily loaded, the least loaded CPU of the least loaded group, which should still add up to an average scheduling latency improvement (knock wood). 
Not-Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> --- include/linux/sched.h | 7 ++- kernel/sched/fair.c | 86 ++++++++++++++++++++++++++++-------------------- kernel/sched/features.h | 6 +++ 3 files changed, 61 insertions(+), 38 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -962,7 +962,8 @@ extern void wake_up_q(struct wake_q_head #define SD_BALANCE_EXEC 0x0004 /* Balance on exec */ #define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */ #define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */ -#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ +#define SD_BALANCE_WAKE_IDLE 0x0020 /* Balance on wakeup, searching for an idle CPU */ +#define SD_WAKE_AFFINE 0x0040 /* Wake task to waking CPU */ #define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu power */ #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ @@ -1353,9 +1354,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,30 @@ static long effective_load(struct task_g #endif +/* + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolidating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of llc_size higher frequency in the other. With + * both conditions met, we can be relatively sure that we are seeing a 1:N + * relationship, and that load size exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { - int factor = this_cpu_read(sd_llc_size); - - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; + unsigned int waker_flips = current->wakee_flips; + unsigned int wakee_flips = p->wakee_flips; + int factor = this_cpu_read(sd_llc_size), ret = 1; + + if (waker_flips < wakee_flips) { + swap(waker_flips, wakee_flips); + /* Tell the caller that we're waking a 1:N waker */ + ret += sched_feat(WAKE_WIDE_IDLE); } - - return 0; + if (wakee_flips < factor || waker_flips < wakee_flips * factor) + return 0; + return ret; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4765,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -4865,6 +4862,10 @@ find_idlest_group(struct sched_domain *s load = target_load(i, load_idx); avg_load += load; + + if (sd_flag & SD_BALANCE_WAKE_IDLE && idle_cpu(i) && + cpumask_test_cpu(i, tsk_cpus_allowed(p))) + return group; } /* Adjust by relative CPU capacity of the group */ @@ -5021,14 +5022,21 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; - int want_affine = 0; + int new_cpu = prev_cpu; + int want_affine = 0, want_idle = 0; int sync = wake_flags & WF_SYNC; - if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); - rcu_read_lock(); + if (sd_flag & SD_BALANCE_WAKE) { + want_idle = wake_wide(p); + want_affine = !want_idle && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_idle = want_idle > 1; + if (!want_affine && !want_idle) + goto select; + if (want_idle) + sd_flag |= SD_BALANCE_WAKE_IDLE; + } + for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) continue; @@ -5043,23 +5051,25 @@ select_task_rq_fair(struct task_struct * break; } - if (tmp->flags & sd_flag) + if (tmp->flags & sd_flag || want_idle) sd = tmp; } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; +select: + new_cpu = select_idle_sibling(p, new_cpu); + if (want_idle && (new_cpu != prev_cpu || idle_cpu(prev_cpu))) + sd = NULL; } while (sd) { struct sched_group *group; int weight; - if (!(sd->flags & sd_flag)) { + if (!(sd->flags & sd_flag) && !want_idle) { sd = sd->child; continue; } @@ -5077,6 +5087,13 @@ select_task_rq_fair(struct task_struct * continue; } + /* + * We've either found an idle CPU, or the least loaded CPU in + * the least load group of the highest domain. 
Good enough. + */ + if (want_idle) + break; + /* Now try balancing at a lower domain level of new_cpu */ cpu = new_cpu; weight = sd->span_weight; @@ -5089,7 +5106,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; --- a/kernel/sched/features.h +++ b/kernel/sched/features.h @@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true) */ SCHED_FEAT(NUMA_RESIST_LOWER, false) #endif + +/* + * Perform expensive full wake balance for 1:N wakers when the + * selected cpu is not completely idle. + */ +SCHED_FEAT(WAKE_WIDE_IDLE, false) ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-05 7:17 ` Mike Galbraith @ 2015-07-06 5:13 ` Mike Galbraith 2015-07-06 14:34 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-06 5:13 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team Hm. Piddling with pgbench, which doesn't seem to collapse into a quivering heap when load exceeds cores these days, deltas weren't all that impressive, but it does appreciate the extra effort a bit, and a bit more when clients receive it as well. If you test, and have time to piddle, you could try letting wake_wide() return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is the dispatcher. Numbers from my little desktop box. NO_WAKE_WIDE_IDLE postgres@homer:~> pgbench.sh clients 8 tps = 116697.697662 clients 12 tps = 115160.230523 clients 16 tps = 115569.804548 clients 20 tps = 117879.230514 clients 24 tps = 118281.753040 clients 28 tps = 116974.796627 clients 32 tps = 119082.163998 avg 117092.239 1.000 WAKE_WIDE_IDLE postgres@homer:~> pgbench.sh clients 8 tps = 124351.735754 clients 12 tps = 124419.673135 clients 16 tps = 125050.716498 clients 20 tps = 124813.042352 clients 24 tps = 126047.442307 clients 28 tps = 125373.719401 clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 WAKE_WIDE_IDLE (clients as well as server) postgres@homer:~> pgbench.sh clients 8 tps = 130539.795246 clients 12 tps = 128984.648554 clients 16 tps = 130564.386447 clients 20 tps = 129149.693118 clients 24 tps = 130211.119780 clients 28 tps = 130325.355433 clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 5:13 ` Mike Galbraith @ 2015-07-06 14:34 ` Josef Bacik 2015-07-06 18:36 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-06 14:34 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/06/2015 01:13 AM, Mike Galbraith wrote: > Hm. Piddling with pgbench, which doesn't seem to collapse into a > quivering heap when load exceeds cores these days, deltas weren't all > that impressive, but it does appreciate the extra effort a bit, and a > bit more when clients receive it as well. > > If you test, and have time to piddle, you could try letting wake_wide() > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is > the dispatcher. > > Numbers from my little desktop box. > > NO_WAKE_WIDE_IDLE > postgres@homer:~> pgbench.sh > clients 8 tps = 116697.697662 > clients 12 tps = 115160.230523 > clients 16 tps = 115569.804548 > clients 20 tps = 117879.230514 > clients 24 tps = 118281.753040 > clients 28 tps = 116974.796627 > clients 32 tps = 119082.163998 avg 117092.239 1.000 > > WAKE_WIDE_IDLE > postgres@homer:~> pgbench.sh > clients 8 tps = 124351.735754 > clients 12 tps = 124419.673135 > clients 16 tps = 125050.716498 > clients 20 tps = 124813.042352 > clients 24 tps = 126047.442307 > clients 28 tps = 125373.719401 > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 > > WAKE_WIDE_IDLE (clients as well as server) > postgres@homer:~> pgbench.sh > clients 8 tps = 130539.795246 > clients 12 tps = 128984.648554 > clients 16 tps = 130564.386447 > clients 20 tps = 129149.693118 > clients 24 tps = 130211.119780 > clients 28 tps = 130325.355433 > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 > I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get this solved upstream and then I'll rip out the old and put in the new, I'm happy to 
screw around until we're all happy. I'll throw this in a kernel this morning and run stuff today. Barring any issues with the testing infrastructure I should have results today. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 14:34 ` Josef Bacik @ 2015-07-06 18:36 ` Mike Galbraith 2015-07-06 19:41 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-06 18:36 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote: > On 07/06/2015 01:13 AM, Mike Galbraith wrote: > > Hm. Piddling with pgbench, which doesn't seem to collapse into a > > quivering heap when load exceeds cores these days, deltas weren't all > > that impressive, but it does appreciate the extra effort a bit, and a > > bit more when clients receive it as well. > > > > If you test, and have time to piddle, you could try letting wake_wide() > > return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is > > the dispatcher. > > > > Numbers from my little desktop box. > > > > NO_WAKE_WIDE_IDLE > > postgres@homer:~> pgbench.sh > > clients 8 tps = 116697.697662 > > clients 12 tps = 115160.230523 > > clients 16 tps = 115569.804548 > > clients 20 tps = 117879.230514 > > clients 24 tps = 118281.753040 > > clients 28 tps = 116974.796627 > > clients 32 tps = 119082.163998 avg 117092.239 1.000 > > > > WAKE_WIDE_IDLE > > postgres@homer:~> pgbench.sh > > clients 8 tps = 124351.735754 > > clients 12 tps = 124419.673135 > > clients 16 tps = 125050.716498 > > clients 20 tps = 124813.042352 > > clients 24 tps = 126047.442307 > > clients 28 tps = 125373.719401 > > clients 32 tps = 126711.243383 avg 125252.510 1.069 1.000 > > > > WAKE_WIDE_IDLE (clients as well as server) > > postgres@homer:~> pgbench.sh > > clients 8 tps = 130539.795246 > > clients 12 tps = 128984.648554 > > clients 16 tps = 130564.386447 > > clients 20 tps = 129149.693118 > > clients 24 tps = 130211.119780 > > clients 28 tps = 130325.355433 > > clients 32 tps = 129585.656963 avg 129908.665 1.109 1.037 I had a typo in my script, 
so those desktop box numbers were all doing the same number of clients. It doesn't invalidate anything, but the individual deltas are just run to run variance.. not to mention that single cache box is not all that interesting for this anyway. That happens when interconnect becomes a player. > I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get > this solved upstream and then I'll rip out the old and put in the new, > I'm happy to screw around until we're all happy. I'll throw this in a > kernel this morning and run stuff today. Barring any issues with the > testing infrastructure I should have results today. Thanks, I'll be interested in your results. Taking pgbench to a little NUMA box, I'm seeing _nada_ outside of variance with master (crap). I have a way to win significantly for _older_ kernels, and that win over master _may_ provide some useful insight, but I don't trust postgres/pgbench as far as I can toss the planet, so don't have a warm fuzzy about trying to use it to approximate your real world load. BTW, what's your topology look like (numactl --hardware). -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 18:36 ` Mike Galbraith @ 2015-07-06 19:41 ` Josef Bacik 2015-07-07 4:01 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-06 19:41 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/06/2015 02:36 PM, Mike Galbraith wrote: > On Mon, 2015-07-06 at 10:34 -0400, Josef Bacik wrote: >> On 07/06/2015 01:13 AM, Mike Galbraith wrote: >>> Hm. Piddling with pgbench, which doesn't seem to collapse into a >>> quivering heap when load exceeds cores these days, deltas weren't all >>> that impressive, but it does appreciate the extra effort a bit, and a >>> bit more when clients receive it as well. >>> >>> If you test, and have time to piddle, you could try letting wake_wide() >>> return 1 + sched_feat(WAKE_WIDE_IDLE) instead of adding only if wakee is >>> the dispatcher. >>> >>> Numbers from my little desktop box. 
>>>
>>> NO_WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 116697.697662
>>> clients 12 tps = 115160.230523
>>> clients 16 tps = 115569.804548
>>> clients 20 tps = 117879.230514
>>> clients 24 tps = 118281.753040
>>> clients 28 tps = 116974.796627
>>> clients 32 tps = 119082.163998  avg 117092.239 1.000
>>>
>>> WAKE_WIDE_IDLE
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 124351.735754
>>> clients 12 tps = 124419.673135
>>> clients 16 tps = 125050.716498
>>> clients 20 tps = 124813.042352
>>> clients 24 tps = 126047.442307
>>> clients 28 tps = 125373.719401
>>> clients 32 tps = 126711.243383  avg 125252.510 1.069 1.000
>>>
>>> WAKE_WIDE_IDLE (clients as well as server)
>>> postgres@homer:~> pgbench.sh
>>> clients 8  tps = 130539.795246
>>> clients 12 tps = 128984.648554
>>> clients 16 tps = 130564.386447
>>> clients 20 tps = 129149.693118
>>> clients 24 tps = 130211.119780
>>> clients 28 tps = 130325.355433
>>> clients 32 tps = 129585.656963  avg 129908.665 1.109 1.037
>
> I had a typo in my script, so those desktop box numbers were all doing
> the same number of clients. It doesn't invalidate anything, but the
> individual deltas are just run to run variance.. not to mention that
> single cache box is not all that interesting for this anyway. That
> happens when interconnect becomes a player.
>
>> I have time for twiddling, we're carrying ye olde WAKE_IDLE until we get
>> this solved upstream and then I'll rip out the old and put in the new,
>> I'm happy to screw around until we're all happy. I'll throw this in a
>> kernel this morning and run stuff today. Barring any issues with the
>> testing infrastructure I should have results today. Thanks,
>
> I'll be interested in your results. Taking pgbench to a little NUMA
> box, I'm seeing _nada_ outside of variance with master (crap). I have a
> way to win significantly for _older_ kernels, and that win over master
> _may_ provide some useful insight, but I don't trust postgres/pgbench as
> far as I can toss the planet, so don't have a warm fuzzy about trying to
> use it to approximate your real world load.
>
> BTW, what's your topology look like (numactl --hardware).

So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the baseline, with a slight regression at lower RPS and a slight improvement at high RPS. I'm running with WAKE_WIDE_IDLE set now; that should be done soonish, and then I'll do the 1 + sched_feat(WAKE_WIDE_IDLE) thing next, so those results should come in the morning.

Here is the NUMA information from one of the boxes in the test cluster:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 15890 MB
node 0 free: 2651 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 16125 MB
node 1 free: 2063 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Thanks,

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-06 19:41 ` Josef Bacik @ 2015-07-07 4:01 ` Mike Galbraith 2015-07-07 9:43 ` [patch] " Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-07 4:01 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the > baseline with a slight regression at lower RPS and a slight improvement > at high RPS. Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm not real keen on the feature unless your numbers are really good, and odds are that ain't gonna happen. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 4:01 ` Mike Galbraith
@ 2015-07-07 9:43 ` Mike Galbraith
2015-07-07 13:40 ` Josef Bacik
2015-07-07 17:06 ` Josef Bacik
0 siblings, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 9:43 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>
>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
>> baseline with a slight regression at lower RPS and a slight improvement
>> at high RPS.
>
> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm
> not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.

More extensive testing in pedantic-man mode increased my confidence in that enough to sign off and ship the dirt-simple version. Any further twiddles should grow their own wings if they want to fly anyway; the simplest form helps your real-world load, as well as the not-so-real pgbench. My numbers for that are below.
virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12 tps =  96233.854271 1.000
clients 24 tps = 142234.686166 1.000
clients 36 tps = 148433.534531 1.000
clients 48 tps = 133105.634302 1.000
clients 60 tps = 128903.080371 1.000
clients 72 tps = 128591.821782 1.000
clients 84 tps = 114445.967116 1.000
clients 96 tps = 109803.557524 1.000  avg 125219.017 1.000

V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637 1.255
clients 24 tps = 144668.961468 1.017
clients 36 tps = 156705.239251 1.055
clients 48 tps = 152004.886893 1.141
clients 60 tps = 138582.113864 1.075
clients 72 tps = 136286.891104 1.059
clients 84 tps = 137420.986043 1.200
clients 96 tps = 135199.060242 1.231  avg 140207.645 1.119 1.000

V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12 tps = 121821.966162 1.265
clients 24 tps = 146446.388366 1.029
clients 36 tps = 151373.362190 1.019
clients 48 tps = 156806.730746 1.178
clients 60 tps = 133933.491567 1.039
clients 72 tps = 131460.489424 1.022
clients 84 tps = 130859.340261 1.143
clients 96 tps = 130787.476584 1.191  avg 137936.155 1.101 0.983

V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12 tps = 121297.791570
clients 24 tps = 145939.488312
clients 36 tps = 155336.090263
clients 48 tps = 149018.245323
clients 60 tps = 136730.079391
clients 72 tps = 134886.116831
clients 84 tps = 130493.283398
clients 96 tps = 126043.336074

sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions. Correct that, and don't bother doing domain traversal when we know that all we need to do is check for an idle cpu.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 	int wake_cpu;
 #endif

Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-07 9:43 ` [patch] " Mike Galbraith @ 2015-07-07 13:40 ` Josef Bacik 2015-07-07 15:24 ` Mike Galbraith 2015-07-07 17:06 ` Josef Bacik 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-07 13:40 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/07/2015 05:43 AM, Mike Galbraith wrote: > On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote: >> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: >> >>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the >>> baseline with a slight regression at lower RPS and a slight improvement >>> at high RPS. >> >> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm >> not real keen on the feature unless your numbers are really good, and >> odds are that ain't gonna happen. > > More extensive testing in pedantic-man mode increased my confidence of > that enough to sign off and ship the dirt simple version. Any further > twiddles should grow their own wings if they want to fly anyway, the > simplest form helps your real world load, as well as the not so real > pgbench, my numbers for that below. > The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS version. I'll run that through the load tests this morning and let you know how it goes. I'm still seeing a slight regression at lower RPS, but it's like 1-2%, compared to ~15%. Once load ramps up we're good to go, not sure why that is but it may also be my sample size (the cluster is only 32 boxes, the minimum for decent results). Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
2015-07-07 13:40 ` Josef Bacik
@ 2015-07-07 15:24 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-07 15:24 UTC (permalink / raw)
To: Josef Bacik
Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Tue, 2015-07-07 at 09:40 -0400, Josef Bacik wrote:

> The WAKE_WIDE_IDLE run was basically the same so I'm good with the KISS
> version. I'll run that through the load tests this morning and let you
> know how it goes. I'm still seeing a slight regression at lower RPS,
> but it's like 1-2%, compared to ~15%.

If I'm to believe pgbench, and configs really are as close as they should be, there may be something between 3.12 and master which could account for your delta and then some. I may go hunting... heck, if I trusted pgbench I would be hunting.

patched SLE (resembles 3.12 if you squint)
postgres@nessler:~> pgbench.sh
clients 12 tps = 134795.901777
clients 24 tps = 156955.584523
clients 36 tps = 161450.044316
clients 48 tps = 171181.069044
clients 60 tps = 157896.858179
clients 72 tps = 155816.051709
clients 84 tps = 152456.003468
clients 96 tps = 121129.848008

patched master
postgres@nessler:~> pgbench.sh
clients 12 tps = 120793.023637
clients 24 tps = 144668.961468
clients 36 tps = 156705.239251
clients 48 tps = 152004.886893
clients 60 tps = 138582.113864
clients 72 tps = 136286.891104
clients 84 tps = 137420.986043
clients 96 tps = 135199.060242

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-07-07 9:43 ` [patch] " Mike Galbraith 2015-07-07 13:40 ` Josef Bacik @ 2015-07-07 17:06 ` Josef Bacik 2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-07 17:06 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/07/2015 05:43 AM, Mike Galbraith wrote: > On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote: >> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote: >> >>> So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the >>> baseline with a slight regression at lower RPS and a slight improvement >>> at high RPS. >> >> Good. I can likely drop the rest then (I like dinky, so do CPUs;). I'm >> not real keen on the feature unless your numbers are really good, and >> odds are that ain't gonna happen. > > More extensive testing in pedantic-man mode increased my confidence of > that enough to sign off and ship the dirt simple version. Any further > twiddles should grow their own wings if they want to fly anyway, the > simplest form helps your real world load, as well as the not so real > pgbench, my numbers for that below. 
>
> virgin master, 2 socket box
> postgres@nessler:~> pgbench.sh
> clients 12 tps =  96233.854271 1.000
> clients 24 tps = 142234.686166 1.000
> clients 36 tps = 148433.534531 1.000
> clients 48 tps = 133105.634302 1.000
> clients 60 tps = 128903.080371 1.000
> clients 72 tps = 128591.821782 1.000
> clients 84 tps = 114445.967116 1.000
> clients 96 tps = 109803.557524 1.000  avg 125219.017 1.000
>
> V3 (KISS, below)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 120793.023637 1.255
> clients 24 tps = 144668.961468 1.017
> clients 36 tps = 156705.239251 1.055
> clients 48 tps = 152004.886893 1.141
> clients 60 tps = 138582.113864 1.075
> clients 72 tps = 136286.891104 1.059
> clients 84 tps = 137420.986043 1.200
> clients 96 tps = 135199.060242 1.231  avg 140207.645 1.119 1.000
>
> V2 NO_WAKE_WIDE_IDLE
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121821.966162 1.265
> clients 24 tps = 146446.388366 1.029
> clients 36 tps = 151373.362190 1.019
> clients 48 tps = 156806.730746 1.178
> clients 60 tps = 133933.491567 1.039
> clients 72 tps = 131460.489424 1.022
> clients 84 tps = 130859.340261 1.143
> clients 96 tps = 130787.476584 1.191  avg 137936.155 1.101 0.983
>
> V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
> postgres@nessler:~> pgbench.sh
> clients 12 tps = 121297.791570
> clients 24 tps = 145939.488312
> clients 36 tps = 155336.090263
> clients 48 tps = 149018.245323
> clients 60 tps = 136730.079391
> clients 72 tps = 134886.116831
> clients 84 tps = 130493.283398
> clients 96 tps = 126043.336074
>
>
> sched: beef up wake_wide()
>
> Josef Bacik reported that Facebook sees better performance with their
> 1:N load (1 dispatch/node, N workers/node) when carrying an old patch
> to try very hard to wake to an idle CPU. While looking at wake_wide(),
> I noticed that it doesn't pay attention to wakeup of the 1:N waker,
> returning 1 only when the 1:N waker is waking one of its minions.
>
> Correct that, and don't bother doing domain traversal when we know
> that all we need to do is check for an idle cpu.
>
> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>

Ok, you can add

Tested-by: Josef Bacik <jbacik@fb.com>

The new patch is the best across the board: there are no regressions, and there's about a 5% improvement for the bulk of the run (25th percentile and up). Thanks for your help on this!

Josef

^ permalink raw reply [flat|nested] 73+ messages in thread
* [patch] sched: beef up wake_wide()
2015-07-07 17:06 ` Josef Bacik
@ 2015-07-08 6:13 ` Mike Galbraith
2015-07-09 13:26 ` Peter Zijlstra
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-08 6:13 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to wakeup of the 1:N waker, returning 1 only when the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know that all we need to do is check for an idle cpu. While at it, adjust task_struct bits; we don't need a 64-bit counter.

Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Josef Bacik <jbacik@fb.com>
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 	int wake_cpu;
 #endif

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees. In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other. With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx = sd->wake_idx;
 	this_cpu = smp_processor_id();
 	prev_cpu = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith
@ 2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
2015-07-10 5:19 ` Mike Galbraith
0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2015-07-09 13:26 UTC (permalink / raw)
To: Mike Galbraith
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>
> +/*
> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
> + * A waker of many should wake a different task than the one last awakened
> + * at a frequency roughly N times higher than one of its wakees. In order
> + * to determine whether we should let the load spread vs consolidating to
> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
> + * partner, and a factor of llc_size higher frequency in the other. With
> + * both conditions met, we can be relatively sure that we are seeing a 1:N
> + * relationship, and that load size exceeds socket size.
> + */
> static int wake_wide(struct task_struct *p)
> {
> +	unsigned int waker_flips = current->wakee_flips;
> +	unsigned int wakee_flips = p->wakee_flips;
> 	int factor = this_cpu_read(sd_llc_size);
>
> +	if (waker_flips < wakee_flips)
> +		swap(waker_flips, wakee_flips);

This makes the wakee/waker names useless; the end result is more like wakee_flips := client_flips, waker_flips := server_flips.

> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
> +		return 0;

I don't get the first condition... why would the client ever flip? It only talks to that one server.

> +	return 1;
> }

> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
> {
> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> 	int cpu = smp_processor_id();
> +	int new_cpu = prev_cpu;
> 	int want_affine = 0;
> 	int sync = wake_flags & WF_SYNC;
>
> 	rcu_read_lock();
> +	if (sd_flag & SD_BALANCE_WAKE) {
> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +		if (!want_affine)
> +			goto select_idle;
> +	}

So this preserves/makes worse the bug Morten spotted: even without want_affine we should still attempt SD_BALANCE_WAKE if set.

> +
> 	for_each_domain(cpu, tmp) {
> 		if (!(tmp->flags & SD_LOAD_BALANCE))
> 			continue;

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
@ 2015-07-09 14:07 ` Mike Galbraith
2015-07-09 14:46 ` Mike Galbraith
0 siblings, 1 reply; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>
>> +/*
>> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
>> + * A waker of many should wake a different task than the one last awakened
>> + * at a frequency roughly N times higher than one of its wakees. In order
>> + * to determine whether we should let the load spread vs consolidating to
>> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
>> + * partner, and a factor of llc_size higher frequency in the other. With
>> + * both conditions met, we can be relatively sure that we are seeing a 1:N
>> + * relationship, and that load size exceeds socket size.
>> + */
>> static int wake_wide(struct task_struct *p)
>> {
>> +	unsigned int waker_flips = current->wakee_flips;
>> +	unsigned int wakee_flips = p->wakee_flips;
>> 	int factor = this_cpu_read(sd_llc_size);
>>
>> +	if (waker_flips < wakee_flips)
>> +		swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

True, perhaps a rename is in order.

>> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>> +		return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

So I was thinking too, and I initially cemented the relationship by flipping both. However, the thing works in virgin source, i.e. clients do in fact flip, so I removed that cementing based on the hard evidence.

>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>> {
>> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> 	int cpu = smp_processor_id();
>> +	int new_cpu = prev_cpu;
>> 	int want_affine = 0;
>> 	int sync = wake_flags & WF_SYNC;
>>
>> 	rcu_read_lock();
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +		if (!want_affine)
>> +			goto select_idle;
>> +	}
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Yeah. I can redo it if you want, but it seems a shame to traverse for nothing given we know SD_BALANCE_WAKE is so painful that nobody really really wants to do that. One has to override the other in any case, no?

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-09 14:46 ` Mike Galbraith
0 siblings, 0 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-09 14:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 16:07 +0200, Mike Galbraith wrote:
> On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
>> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>>>
>>> +/*
>>> + * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
>>> + * A waker of many should wake a different task than the one last awakened
>>> + * at a frequency roughly N times higher than one of its wakees. In order
>>> + * to determine whether we should let the load spread vs consolidating to
>>> + * shared cache, we look for a minimum 'flip' frequency of llc_size in one
>>> + * partner, and a factor of llc_size higher frequency in the other. With
>>> + * both conditions met, we can be relatively sure that we are seeing a 1:N
>>> + * relationship, and that load size exceeds socket size.
>>> + */
>>> static int wake_wide(struct task_struct *p)
>>> {
>>> +	unsigned int waker_flips = current->wakee_flips;
>>> +	unsigned int wakee_flips = p->wakee_flips;
>>> 	int factor = this_cpu_read(sd_llc_size);
>>>
>>> +	if (waker_flips < wakee_flips)
>>> +		swap(waker_flips, wakee_flips);
>>
>> This makes the wakee/waker names useless, the end result is more like
>> wakee_flips := client_flips, waker_flips := server_flips.
>
> True, perhaps a rename is in order.

I should perhaps add that waker/wakee_flips does make sense to me, in the sense that the task whose flips end up in the waker_flips bucket is our waker of many, vs being one of its many wakees.

-Mike

^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide()
2015-07-09 13:26 ` Peter Zijlstra
2015-07-09 14:07 ` Mike Galbraith
@ 2015-07-10 5:19 ` Mike Galbraith
2015-07-10 13:41 ` Josef Bacik
2015-07-10 20:59 ` Josef Bacik
1 sibling, 2 replies; 73+ messages in thread
From: Mike Galbraith @ 2015-07-10 5:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team

On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote:
> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote:
>> static int wake_wide(struct task_struct *p)
>> {
>> +	unsigned int waker_flips = current->wakee_flips;
>> +	unsigned int wakee_flips = p->wakee_flips;
>> 	int factor = this_cpu_read(sd_llc_size);
>>
>> +	if (waker_flips < wakee_flips)
>> +		swap(waker_flips, wakee_flips);
>
> This makes the wakee/waker names useless, the end result is more like
> wakee_flips := client_flips, waker_flips := server_flips.

I settled on master/slave plus hopefully improved comment block.

>> +	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
>> +		return 0;
>
> I don't get the first condition... why would the client ever flip? It
> only talks to that one server.

(tightening the heuristic up a bit by one means or another would be good, but "if it ain't broke, don't fix it" applies for this patchlet)

>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
>> {
>> 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> 	int cpu = smp_processor_id();
>> +	int new_cpu = prev_cpu;
>> 	int want_affine = 0;
>> 	int sync = wake_flags & WF_SYNC;
>>
>> 	rcu_read_lock();
>> +	if (sd_flag & SD_BALANCE_WAKE) {
>> +		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> +		if (!want_affine)
>> +			goto select_idle;
>> +	}
>
> So this preserves/makes worse the bug Morten spotted, even without
> want_affine we should still attempt SD_BALANCE_WAKE if set.

Fixed.
wake_wide() may override want_affine as before, want_affine may override other ->flags as before, but a surviving domain selection now results in a full balance instead of a select_idle_sibling() call. sched: beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> --- include/linux/sched.h | 4 +-- kernel/sched/fair.c | 57 ++++++++++++++++++++++---------------------------- 2 files changed, 28 insertions(+), 33 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. 
With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. + */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5021,12 +5017,12 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { @@ -5040,6 +5036,8 @@ select_task_rq_fair(struct task_struct * if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { affine_sd = tmp; + /* Prefer affinity over any other flags */ + sd = NULL; break; } @@ -5048,12 +5046,10 @@ select_task_rq_fair(struct task_struct * } if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + new_cpu = cpu; - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; - } + if ((sd_flag & SD_BALANCE_WAKE) && (!sd || (!(sd->flags & SD_BALANCE_WAKE)))) + new_cpu = select_idle_sibling(p, new_cpu); while (sd) { struct sched_group *group; @@ -5089,7 +5085,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
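The flip heuristic in the patch above is compact enough to model outside the kernel. A standalone C sketch (illustrative only, not kernel code: `wake_wide_sketch` is an invented name, and the `llc_size` parameter stands in for `this_cpu_read(sd_llc_size)`):

```c
#include <assert.h>

/*
 * Standalone model of the wake_wide() heuristic (not kernel code).
 * The larger flip count becomes "master"; we spread ("wake wide") only
 * when the smaller count reaches llc_size AND the larger count is at
 * least llc_size times the smaller, i.e. the apparent partner count
 * exceeds the socket size.
 */
static int wake_wide_sketch(unsigned int waker_flips,
                            unsigned int wakee_flips,
                            unsigned int llc_size)
{
    unsigned int master = waker_flips;
    unsigned int slave = wakee_flips;
    unsigned int tmp;

    if (master < slave) {           /* open-coded swap() */
        tmp = master;
        master = slave;
        slave = tmp;
    }
    if (slave < llc_size || master < slave * llc_size)
        return 0;
    return 1;
}
```

With an 8-core LLC, a dispatcher flipping between dozens of workers trips the heuristic, while a 1:1 client/server pair never does, regardless of which side is current.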
* Re: [patch] sched: beef up wake_wide() 2015-07-10 5:19 ` Mike Galbraith @ 2015-07-10 13:41 ` Josef Bacik 2015-07-10 20:59 ` Josef Bacik 1 sibling, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-10 13:41 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 01:19 AM, Mike Galbraith wrote: > On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote: >> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote: >>> static int wake_wide(struct task_struct *p) >>> { >>> + unsigned int waker_flips = current->wakee_flips; >>> + unsigned int wakee_flips = p->wakee_flips; >>> int factor = this_cpu_read(sd_llc_size); >>> >>> + if (waker_flips < wakee_flips) >>> + swap(waker_flips, wakee_flips); >> >> This makes the wakee/waker names useless, the end result is more like >> wakee_flips := client_flips, waker_flips := server_flips. > > I settled on master/slave plus hopefully improved comment block. > >>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor) >>> + return 0; >> >> I don't get the first condition... why would the client ever flip? It >> only talks to that one server. > > (tightening heuristic up a bit by one means or another would be good, > but "if it ain't broke, don't fix it" applies for this patchlet) > >>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct * >>> { >>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; >>> int cpu = smp_processor_id(); >>> + int new_cpu = prev_cpu; >>> int want_affine = 0; >>> int sync = wake_flags & WF_SYNC; >>> >>> rcu_read_lock(); >>> + if (sd_flag & SD_BALANCE_WAKE) { >>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); >>> + if (!want_affine) >>> + goto select_idle; >>> + } >> >> So this preserves/makes worse the bug Morten spotted, even without >> want_affine we should still attempt SD_BALANCE_WAKE if set. > > Fixed. 
wake_wide() may override want_affine as before, want_affine may > override other ->flags as before, but a surviving domain selection now > results in a full balance instead of a select_idle_sibling() call. > > sched: beef up wake_wide() > > Josef Bacik reported that Facebook sees better performance with their > 1:N load (1 dispatch/node, N workers/node) when carrying an old patch > to try very hard to wake to an idle CPU. While looking at wake_wide(), > I noticed that it doesn't pay attention to the wakeup of a many partner > waker, returning 1 only when waking one of its many partners. > > Correct that, letting explicit domain flags override the heuristic. > > While at it, adjust task_struct bits, we don't need a 64bit counter. > > Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> > Tested-by: Josef Bacik <jbacik@fb.com> I'll give this new one a whirl and let you know how it goes. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-10 5:19 ` Mike Galbraith 2015-07-10 13:41 ` Josef Bacik @ 2015-07-10 20:59 ` Josef Bacik 2015-07-11 3:11 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-10 20:59 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 01:19 AM, Mike Galbraith wrote: > On Thu, 2015-07-09 at 15:26 +0200, Peter Zijlstra wrote: >> On Wed, Jul 08, 2015 at 08:13:46AM +0200, Mike Galbraith wrote: >>> static int wake_wide(struct task_struct *p) >>> { >>> + unsigned int waker_flips = current->wakee_flips; >>> + unsigned int wakee_flips = p->wakee_flips; >>> int factor = this_cpu_read(sd_llc_size); >>> >>> + if (waker_flips < wakee_flips) >>> + swap(waker_flips, wakee_flips); >> >> This makes the wakee/waker names useless, the end result is more like >> wakee_flips := client_flips, waker_flips := server_flips. > > I settled on master/slave plus hopefully improved comment block. > >>> + if (wakee_flips < factor || waker_flips < wakee_flips * factor) >>> + return 0; >> >> I don't get the first condition... why would the client ever flip? It >> only talks to that one server. > > (tightening heuristic up a bit by one means or another would be good, > but "if it ain't broke, don't fix it" applies for this patchlet) > >>> @@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct * >>> { >>> struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; >>> int cpu = smp_processor_id(); >>> + int new_cpu = prev_cpu; >>> int want_affine = 0; >>> int sync = wake_flags & WF_SYNC; >>> >>> rcu_read_lock(); >>> + if (sd_flag & SD_BALANCE_WAKE) { >>> + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); >>> + if (!want_affine) >>> + goto select_idle; >>> + } >> >> So this preserves/makes worse the bug Morten spotted, even without >> want_affine we should still attempt SD_BALANCE_WAKE if set. > > Fixed. 
wake_wide() may override want_affine as before, want_affine may > override other ->flags as before, but a surviving domain selection now > results in a full balance instead of a select_idle_sibling() call. > > sched: beef up wake_wide() > > Josef Bacik reported that Facebook sees better performance with their > 1:N load (1 dispatch/node, N workers/node) when carrying an old patch > to try very hard to wake to an idle CPU. While looking at wake_wide(), > I noticed that it doesn't pay attention to the wakeup of a many partner > waker, returning 1 only when waking one of its many partners. > > Correct that, letting explicit domain flags override the heuristic. > > While at it, adjust task_struct bits, we don't need a 64bit counter. > > Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> > Tested-by: Josef Bacik <jbacik@fb.com> Not quite as awesome but still better than the baseline so we're good. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-10 20:59 ` Josef Bacik @ 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-11 3:11 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > Not quite as awesome but still better than the baseline so we're good. > Thanks, I personally like the other much better. We're not doing the user any favor by making the thing balance when SD_BALANCE_WAKE is set. Until such time as it ceases to be horribly painful, we're just paying a few cycles to help a theoretically existing masochist torture his box. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-11 3:11 ` Mike Galbraith @ 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 1 sibling, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-13 13:53 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/10/2015 11:11 PM, Mike Galbraith wrote: > On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > >> Not quite as awesome but still better than the baseline so we're good. >> Thanks, > > I personally like the other much better. We're not doing the user any > favor by making the thing balance when SD_BALANCE_WAKE is set. Until > such time as it ceases to be horribly painful, we're just paying a few > cycles to help a theoretically existing masochist torture his box. > Agreed, I'm all for the faster version ;), Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik @ 2015-07-14 11:19 ` Peter Zijlstra 2015-07-14 13:49 ` Mike Galbraith 1 sibling, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 11:19 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Sat, Jul 11, 2015 at 05:11:51AM +0200, Mike Galbraith wrote: > On Fri, 2015-07-10 at 16:59 -0400, Josef Bacik wrote: > > > Not quite as awesome but still better than the baseline so we're good. > > Thanks, > > I personally like the other much better. We're not doing the user any > favor by making the thing balance when SD_BALANCE_WAKE is set. Until > such time as it ceases to be horribly painful, we're just paying a few > cycles to help a theoretically existing masochist torture his box. OK, how about something like the below; it tightens things up by applying two rules: - We really should not continue looking for a balancing domain once SD_LOAD_BALANCE is not set. - SD (balance) flags should really be set in a single contiguous range, always starting at the bottom. The latter means what if !want_affine and the (first) sd doesn't have BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration junk you didn't like.. Hmm? --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 72 +++++++++++++++++++++++--------------------------- 2 files changed, 36 insertions(+), 40 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5035,20 +5031,14 @@ select_task_rq_fair(struct task_struct * */ if (want_affine && (tmp->flags & SD_WAKE_AFFINE) && cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) { - affine_sd = tmp; - break; + /* Prefer affinity over any other flags */ + goto do_affine; } if (tmp->flags & sd_flag) sd = tmp; - } - - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + else if (!want_affine) + break; } while (sd) { @@ -5087,8 +5077,14 @@ select_task_rq_fair(struct task_struct * } unlock: rcu_read_unlock(); - return new_cpu; + +do_affine: + if (cpu != prev_cpu && wake_affine(tmp, p, sync)) + new_cpu = cpu; + + new_cpu = select_idle_sibling(p, new_cpu); + goto unlock; } /* ^ permalink raw reply [flat|nested] 73+ messages in thread
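The two rules Peter applies (stop once SD_LOAD_BALANCE is clear; balance flags contiguous from the bottom) can be modeled as a walk over an array of per-domain flag words. A toy sketch (not kernel code: the flag values are invented, `walk_domains` and the `pick` enum are made up for illustration, and the `prev_cpu` span test guarding the affine case is elided):

```c
#include <assert.h>

/* Flag bits: values are invented for this sketch, not the kernel's. */
#define SD_LOAD_BALANCE 0x1
#define SD_WAKE_AFFINE  0x2
#define SD_BALANCE_WAKE 0x4

enum pick { PICK_AFFINE, PICK_BALANCE, PICK_PREV };

/*
 * Model of the tightened walk: domains are ordered bottom-up, the walk
 * stops at the first domain with SD_LOAD_BALANCE clear, an affine domain
 * wins outright, and since balance flags are assumed contiguous from the
 * bottom, a !want_affine walk can stop at the first domain missing the
 * requested flag.
 */
static enum pick walk_domains(const unsigned int *flags, int n,
                              int want_affine, unsigned int sd_flag)
{
    int i, have_sd = 0;

    for (i = 0; i < n; i++) {
        if (!(flags[i] & SD_LOAD_BALANCE))
            break;
        if (want_affine && (flags[i] & SD_WAKE_AFFINE))
            return PICK_AFFINE;     /* prefer affinity over other flags */
        if (flags[i] & sd_flag)
            have_sd = 1;
        else if (!want_affine)
            break;                  /* flags contiguous from the bottom */
    }
    return have_sd ? PICK_BALANCE : PICK_PREV;
}

/* Example two-level hierarchies, bottom (e.g. SMT) first. */
static const unsigned int sd_wake_all[] = {
    SD_LOAD_BALANCE | SD_WAKE_AFFINE | SD_BALANCE_WAKE,
    SD_LOAD_BALANCE | SD_BALANCE_WAKE,
};
static const unsigned int sd_no_wake[] = {
    SD_LOAD_BALANCE | SD_WAKE_AFFINE,
    SD_LOAD_BALANCE,
};
static const unsigned int sd_no_lb[] = {
    0,
    SD_LOAD_BALANCE | SD_BALANCE_WAKE,
};
```

Note how `sd_no_wake` with `!want_affine` falls through with no domain selected: that case is exactly where the follow-up messages find the remaining trouble.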
* Re: [patch] sched: beef up wake_wide() 2015-07-14 11:19 ` Peter Zijlstra @ 2015-07-14 13:49 ` Mike Galbraith 2015-07-14 14:07 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 13:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > OK, how about something like the below; it tightens things up by > applying two rules: > > - We really should not continue looking for a balancing domain once > SD_LOAD_BALANCE is not set. > > - SD (balance) flags should really be set in a single contiguous range, > always starting at the bottom. > > The latter means what if !want_affine and the (first) sd doesn't have > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > junk you didn't like.. > > Hmm? Yeah, that's better. It's not big hairy deal either way, it just bugged me to knowingly toss those cycles out the window ;-) select_idle_sibling() looks kinda funny down there, but otoh when the day comes (hah) that we can just balance, it's closer to the exit. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 13:49 ` Mike Galbraith @ 2015-07-14 14:07 ` Peter Zijlstra 2015-07-14 14:17 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 14:07 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote: > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > > > OK, how about something like the below; it tightens things up by > > applying two rules: > > > > - We really should not continue looking for a balancing domain once > > SD_LOAD_BALANCE is not set. > > > > - SD (balance) flags should really be set in a single contiguous range, > > always starting at the bottom. > > > > The latter means what if !want_affine and the (first) sd doesn't have > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > > junk you didn't like.. > > > > Hmm? > > Yeah, that's better. It's not big hairy deal either way, it just bugged > me to knowingly toss those cycles out the window ;-) > > select_idle_sibling() looks kinda funny down there, but otoh when the > day comes (hah) that we can just balance, it's closer to the exit. Right, not too pretty, does this look beter? --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: mingo@redhat.com Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 68 +++++++++++++++++++++++--------------------------- 2 files changed, 34 insertions(+), 38 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; + if (affine_sd) { /* Prefer affinity over any other flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; - } + new_cpu = select_idle_sibling(p, new_cpu); - while (sd) { + } else while (sd) { struct sched_group *group; int weight; @@ -5085,10 +5081,10 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: - rcu_read_unlock(); + rcu_read_unlock(); return new_cpu; + } /* ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 14:07 ` Peter Zijlstra @ 2015-07-14 14:17 ` Mike Galbraith 2015-07-14 15:04 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 14:17 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 16:07 +0200, Peter Zijlstra wrote: > On Tue, Jul 14, 2015 at 03:49:17PM +0200, Mike Galbraith wrote: > > On Tue, 2015-07-14 at 13:19 +0200, Peter Zijlstra wrote: > > > > > OK, how about something like the below; it tightens things up by > > > applying two rules: > > > > > > - We really should not continue looking for a balancing domain once > > > SD_LOAD_BALANCE is not set. > > > > > > - SD (balance) flags should really be set in a single contiguous range, > > > always starting at the bottom. > > > > > > The latter means what if !want_affine and the (first) sd doesn't have > > > BALANCE_WAKE set, we're done. Getting rid of (most of) that iteration > > > junk you didn't like.. > > > > > > Hmm? > > > > Yeah, that's better. It's not big hairy deal either way, it just bugged > > me to knowingly toss those cycles out the window ;-) > > > > select_idle_sibling() looks kinda funny down there, but otoh when the > > day comes (hah) that we can just balance, it's closer to the exit. > > Right, not too pretty, does this look beter? There's a buglet, I was just about to mention the inverse in the other. 
> @@ -5041,17 +5037,17 @@ select_task_rq_fair(struct task_struct * > > if (tmp->flags & sd_flag) > sd = tmp; > + else if (!want_affine) > + break; > } > > - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) > - prev_cpu = cpu; > + if (affine_sd) { /* Prefer affinity over any other flags */ > + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) > + new_cpu = cpu; > > - if (sd_flag & SD_BALANCE_WAKE) { > - new_cpu = select_idle_sibling(p, prev_cpu); > - goto unlock; > - } > + new_cpu = select_idle_sibling(p, new_cpu); We'll not look for a idle cpu when wake_wide() naks want_affine. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
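The buglet is easy to make concrete with toy dispatch models for the tail of select_task_rq_fair() (illustrative only, not kernel code: the function names, enum, and boolean summaries of the domain walk are all invented for the sketch):

```c
#include <assert.h>

enum action { USE_PREV, IDLE_SIBLING, FULL_BALANCE };

/*
 * 'have_affine_sd' and 'have_balance_sd' summarize the domain walk;
 * 'is_wakeup' models SD_BALANCE_WAKE being set in sd_flag.
 */

/* Buggy version: the idle-cpu search is only reachable via affine_sd. */
static enum action dispatch_buggy(int have_affine_sd, int have_balance_sd,
                                  int is_wakeup)
{
    (void)is_wakeup;
    if (have_affine_sd)
        return IDLE_SIBLING;        /* wake_affine() + select_idle_sibling() */
    if (have_balance_sd)
        return FULL_BALANCE;
    return USE_PREV;                /* wake_wide() wakeup: no idle search */
}

/* Fixed version: wakeups without a balance domain still search for idle. */
static enum action dispatch_fixed(int have_affine_sd, int have_balance_sd,
                                  int is_wakeup)
{
    if (have_affine_sd)
        have_balance_sd = 0;        /* prefer wake_affine over balance flags */
    if (!have_balance_sd)
        return is_wakeup ? IDLE_SIBLING : USE_PREV;
    return FULL_BALANCE;
}
```

The interesting case is a wake_wide() wakeup on a box without SD_BALANCE_WAKE set anywhere: no affine domain, no balance domain, yet the whole point of the patch is to still hunt for an idle CPU.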
* Re: [patch] sched: beef up wake_wide() 2015-07-14 14:17 ` Mike Galbraith @ 2015-07-14 15:04 ` Peter Zijlstra 2015-07-14 15:39 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Peter Zijlstra @ 2015-07-14 15:04 UTC (permalink / raw) To: Mike Galbraith Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: > There's a buglet, > We'll not look for a idle cpu when wake_wide() naks want_affine. *sigh* indeed.. fixing that'll bring us very close to what we started out wiht.. The one XXX there raises the question on whether we should always so select_idle_sibling() if we do not have a suitable balance flag, or only on wakeups. --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 69 ++++++++++++++++++++++++-------------------------- 2 files changed, 36 insertions(+), 37 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5015,19 +5011,19 @@ static int get_cpu_usage(int cpu) static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { - struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; + struct sched_domain *tmp, affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flags & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 15:04 ` Peter Zijlstra @ 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik 2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith 0 siblings, 2 replies; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 15:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Josef Bacik, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote: > On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: > > There's a buglet, > > > We'll not look for an idle cpu when wake_wide() naks want_affine. > > *sigh* indeed.. fixing that'll bring us very close to what we started > out with.. > > The one XXX there raises the question on whether we should always do > select_idle_sibling() if we do not have a suitable balance flag, or only > on wakeups. That's what I've been sitting here waffling over, finally convinced myself that should the user turn FORK/EXEC off, he shouldn't find that a substitute quietly slipped in.. though otoh.. crap, guess I'm not done waffling after all. Yeah, this will work just fine ;-) (typos fixed) --- Subject: sched: Beef up wake_wide() From: Mike Galbraith <umgwanakikbuti@gmail.com> Date: Fri, 10 Jul 2015 07:19:26 +0200 Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many-partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64bit counter. 
Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Cc: mingo@redhat.com Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Tested-by: Josef Bacik <jbacik@fb.com> [peterz: frobbings] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1436505566.5715.50.camel@gmail.com --- include/linux/sched.h | 4 +- kernel/sched/fair.c | 67 ++++++++++++++++++++++++-------------------------- 2 files changed, 35 insertions(+), 36 deletions(-) --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1351,9 +1351,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4730,26 +4730,29 @@ static long effective_load(struct task_g #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. 
+ */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4761,13 +4764,6 @@ static int wake_affine(struct sched_doma unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5021,17 +5017,17 @@ select_task_rq_fair(struct task_struct * { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5045,17 +5041,21 @@ select_task_rq_fair(struct task_struct * if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5089,7 +5089,6 @@ select_task_rq_fair(struct task_struct * } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 15:39 ` Mike Galbraith @ 2015-07-14 16:01 ` Josef Bacik 2015-07-14 17:59 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Josef Bacik @ 2015-07-14 16:01 UTC (permalink / raw) To: Mike Galbraith, Peter Zijlstra Cc: riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/14/2015 11:39 AM, Mike Galbraith wrote: > On Tue, 2015-07-14 at 17:04 +0200, Peter Zijlstra wrote: >> On Tue, Jul 14, 2015 at 04:17:46PM +0200, Mike Galbraith wrote: >>> There's a buglet, >> >>> We'll not look for an idle cpu when wake_wide() naks want_affine. >> >> *sigh* indeed.. fixing that'll bring us very close to what we started >> out with.. >> >> The one XXX there raises the question on whether we should always do >> select_idle_sibling() if we do not have a suitable balance flag, or only >> on wakeups. > > That's what I've been sitting here waffling over, finally convinced > myself that should the user turn FORK/EXEC off, he shouldn't find that a > substitute quietly slipped in.. though otoh.. crap, guess I'm not done > waffling after all. Yeah, this will work just fine ;-) > > (typos fixed) > We happy with this or should I wait for more patches to fly by before I test something ;)? Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 16:01 ` Josef Bacik @ 2015-07-14 17:59 ` Mike Galbraith 2015-07-15 17:11 ` Josef Bacik 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-07-14 17:59 UTC (permalink / raw) To: Josef Bacik Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote: > We happy with this or should I wait for more patches to fly by before I > test something ;)? Yeah, it should be The End time. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [patch] sched: beef up wake_wide() 2015-07-14 17:59 ` Mike Galbraith @ 2015-07-15 17:11 ` Josef Bacik 0 siblings, 0 replies; 73+ messages in thread From: Josef Bacik @ 2015-07-15 17:11 UTC (permalink / raw) To: Mike Galbraith Cc: Peter Zijlstra, riel, mingo, linux-kernel, morten.rasmussen, kernel-team On 07/14/2015 01:59 PM, Mike Galbraith wrote: > On Tue, 2015-07-14 at 12:01 -0400, Josef Bacik wrote: > >> We happy with this or should I wait for more patches to fly by before I >> test something ;)? > > Yeah, it should be The End time. It's showing the same thing as one of the many earlier patches: it's better at high RPS, but at low RPS there's a 3% regression. Thanks, Josef ^ permalink raw reply [flat|nested] 73+ messages in thread
* [tip:sched/core] sched/fair: Beef up wake_wide() 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik @ 2015-08-03 17:07 ` tip-bot for Mike Galbraith 1 sibling, 0 replies; 73+ messages in thread From: tip-bot for Mike Galbraith @ 2015-08-03 17:07 UTC (permalink / raw) To: linux-tip-commits Cc: tglx, linux-kernel, hpa, Kernel-team, jbacik, mingo, peterz, torvalds, umgwanakikbuti, efault Commit-ID: 63b0e9edceec10fa41ec33393a1515a5ff444277 Gitweb: http://git.kernel.org/tip/63b0e9edceec10fa41ec33393a1515a5ff444277 Author: Mike Galbraith <umgwanakikbuti@gmail.com> AuthorDate: Tue, 14 Jul 2015 17:39:50 +0200 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Mon, 3 Aug 2015 12:21:23 +0200 sched/fair: Beef up wake_wide() Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a many partner waker, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust task_struct bits, we don't need a 64-bit counter. Tested-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> [ Tidy things up. 
] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team<Kernel-team@fb.com> Cc: morten.rasmussen@arm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> --- include/linux/sched.h | 4 +-- kernel/sched/fair.c | 67 +++++++++++++++++++++++++-------------------------- 2 files changed, 35 insertions(+), 36 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 7412070..65a8a86 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1359,9 +1359,9 @@ struct task_struct { #ifdef CONFIG_SMP struct llist_node wake_entry; int on_cpu; - struct task_struct *last_wakee; - unsigned long wakee_flips; + unsigned int wakee_flips; unsigned long wakee_flip_decay_ts; + struct task_struct *last_wakee; int wake_cpu; #endif diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8b384b8d..ea23f9f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4726,26 +4726,29 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg) #endif +/* + * Detect M:N waker/wakee relationships via a switching-frequency heuristic. + * A waker of many should wake a different task than the one last awakened + * at a frequency roughly N times higher than one of its wakees. In order + * to determine whether we should let the load spread vs consolodating to + * shared cache, we look for a minimum 'flip' frequency of llc_size in one + * partner, and a factor of lls_size higher frequency in the other. With + * both conditions met, we can be relatively sure that the relationship is + * non-monogamous, with partner count exceeding socket size. 
Waker/wakee + * being client/server, worker/dispatcher, interrupt source or whatever is + * irrelevant, spread criteria is apparent partner count exceeds socket size. + */ static int wake_wide(struct task_struct *p) { + unsigned int master = current->wakee_flips; + unsigned int slave = p->wakee_flips; int factor = this_cpu_read(sd_llc_size); - /* - * Yeah, it's the switching-frequency, could means many wakee or - * rapidly switch, use factor here will just help to automatically - * adjust the loose-degree, so bigger node will lead to more pull. - */ - if (p->wakee_flips > factor) { - /* - * wakee is somewhat hot, it needs certain amount of cpu - * resource, so if waker is far more hot, prefer to leave - * it alone. - */ - if (current->wakee_flips > (factor * p->wakee_flips)) - return 1; - } - - return 0; + if (master < slave) + swap(master, slave); + if (slave < factor || master < slave * factor) + return 0; + return 1; } static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) @@ -4757,13 +4760,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync) unsigned long weight; int balanced; - /* - * If we wake multiple tasks be careful to not bounce - * ourselves around too much. 
- */ - if (wake_wide(p)) - return 0; - idx = sd->wake_idx; this_cpu = smp_processor_id(); prev_cpu = task_cpu(p); @@ -5017,17 +5013,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f { struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL; int cpu = smp_processor_id(); - int new_cpu = cpu; + int new_cpu = prev_cpu; int want_affine = 0; int sync = wake_flags & WF_SYNC; if (sd_flag & SD_BALANCE_WAKE) - want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p)); rcu_read_lock(); for_each_domain(cpu, tmp) { if (!(tmp->flags & SD_LOAD_BALANCE)) - continue; + break; /* * If both cpu and prev_cpu are part of this domain, @@ -5041,17 +5037,21 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (tmp->flags & sd_flag) sd = tmp; + else if (!want_affine) + break; } - if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync)) - prev_cpu = cpu; - - if (sd_flag & SD_BALANCE_WAKE) { - new_cpu = select_idle_sibling(p, prev_cpu); - goto unlock; + if (affine_sd) { + sd = NULL; /* Prefer wake_affine over balance flags */ + if (cpu != prev_cpu && wake_affine(affine_sd, p, sync)) + new_cpu = cpu; } - while (sd) { + if (!sd) { + if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ + new_cpu = select_idle_sibling(p, new_cpu); + + } else while (sd) { struct sched_group *group; int weight; @@ -5085,7 +5085,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f } /* while loop will break here if sd == NULL */ } -unlock: rcu_read_unlock(); return new_cpu; ^ permalink raw reply related [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 10:21 ` Peter Zijlstra 2015-05-28 11:05 ` Peter Zijlstra @ 2015-05-28 11:16 ` Mike Galbraith 2015-05-28 11:49 ` Ingo Molnar 1 sibling, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-05-28 11:16 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Josef Bacik, riel, mingo, linux-kernel On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > In fact, select_idle_sibling() is already too expensive on current > server hardware (far too damn many cpus in a LLC domain). Yup. I've played with rate limiting motion per task because of that. Packages have gotten way too damn big. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith @ 2015-05-28 11:49 ` Ingo Molnar 2015-05-28 12:15 ` Mike Galbraith 0 siblings, 1 reply; 73+ messages in thread From: Ingo Molnar @ 2015-05-28 11:49 UTC (permalink / raw) To: Mike Galbraith; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel * Mike Galbraith <mgalbraith@novell.com> wrote: > On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > > > In fact, select_idle_sibling() is already too expensive on current > > server hardware (far too damn many cpus in a LLC domain). > > Yup. I've played with rate limiting motion per task because of that. > Packages have gotten way too damn big. What's the biggest you've seen? Thanks, Ingo ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 11:49 ` Ingo Molnar @ 2015-05-28 12:15 ` Mike Galbraith 2015-05-28 12:19 ` Peter Zijlstra 0 siblings, 1 reply; 73+ messages in thread From: Mike Galbraith @ 2015-05-28 12:15 UTC (permalink / raw) To: Ingo Molnar; +Cc: Peter Zijlstra, Josef Bacik, riel, mingo, linux-kernel On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > * Mike Galbraith <mgalbraith@novell.com> wrote: > > > On Thu, 2015-05-28 at 12:21 +0200, Peter Zijlstra wrote: > > > > > In fact, select_idle_sibling() is already too expensive on current > > > server hardware (far too damn many cpus in a LLC domain). > > > > Yup. I've played with rate limiting motion per task because of that. > > Packages have gotten way too damn big. > > What's the biggest you've seen? 15 cores so far. It'll no doubt grow. -Mike ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:15 ` Mike Galbraith @ 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 0 siblings, 2 replies; 73+ messages in thread From: Peter Zijlstra @ 2015-05-28 12:19 UTC (permalink / raw) To: Mike Galbraith; +Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel > On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > > What's the biggest you've seen? Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 Tells us HSW-E[PX] have 18-core / 36-thread SKUs. But yes, what Mike says, it's bound to only get bigger. ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:19 ` Peter Zijlstra @ 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 1 sibling, 0 replies; 73+ messages in thread From: Ingo Molnar @ 2015-05-28 12:29 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Mike Galbraith, Josef Bacik, riel, mingo, linux-kernel * Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > > > > What's the biggest you've seen? > > Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 > > Tells us HSW-E[PX] have 18-core / 36-thread SKUs. > > But yes, what Mike says, it's bound to only get bigger. So it's starting to get big enough to warrant an optimization of the way we account and discover idle CPUs: So when a CPU goes idle, it has idle cycles it could spend on registering itself in either an idle-CPUs bitmap, or in an idle-CPUs queue. The queue (or bitmap) would strictly be only shared between CPUs within the same domain, so the cache bouncing cost from that is still small and package-local. (We have remote access overhead in select_idle_sibling() already, due to having to access half of all remote rqs on average.) Such an approach would make select_idle_sibling() independent of the size of the cores domain, it would make it essentially O(1). ( There's a bit of a complication with rq->wake_list, but I think it would be good enough to just register/unregister from the idle handler, if something is idle only short term it should probably not be considered for SMP balancing. ) But I'd definitely not go towards making our SMP balancing macro idle selection decisions poorer, just because our internal implementation is O(nr_cores_per_package) ... Agreed? Thanks, Ingo ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar @ 2015-05-28 15:22 ` David Ahern 1 sibling, 0 replies; 73+ messages in thread From: David Ahern @ 2015-05-28 15:22 UTC (permalink / raw) To: Peter Zijlstra, Mike Galbraith Cc: Ingo Molnar, Josef Bacik, riel, mingo, linux-kernel On 5/28/15 6:19 AM, Peter Zijlstra wrote: >> On Thu, 2015-05-28 at 13:49 +0200, Ingo Molnar wrote: > >>> What's the biggest you've seen? > > Wikipedia here: http://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29 > > Tell us HSW-E[PX] have 18 cores 36 thread SKUs. > > But yes, what Mike says, its bound to only get bigger. sparc M7: 8 threads per core, 32 cores per socket = 256 cpus per socket ^ permalink raw reply [flat|nested] 73+ messages in thread
* Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE 2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik 2015-05-28 3:46 ` Mike Galbraith 2015-05-28 10:21 ` Peter Zijlstra @ 2015-05-28 11:55 ` Srikar Dronamraju 2 siblings, 0 replies; 73+ messages in thread From: Srikar Dronamraju @ 2015-05-28 11:55 UTC (permalink / raw) To: Josef Bacik; +Cc: riel, mingo, peterz, linux-kernel > At Facebook we have a pretty heavily multi-threaded application that is > sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code > because it gives us a pretty significant performance gain (like 20%). It turns > out this is because there are cases where the scheduler puts our task on a busy > CPU when there are idle CPU's in the system. We verify this by reading the > cpu_delay_req_avg_us from the scheduler netlink stuff. With our crappy patch we > get much lower numbers vs baseline. > Was this application run under a cpu cgroup? Because we were seeing bursty workloads exhibiting this behaviour esp when run under cpu cgroups. http://mid.gmane.org/53A11A89.5000602@linux.vnet.ibm.com -- Thanks and Regards Srikar ^ permalink raw reply [flat|nested] 73+ messages in thread
end of thread, other threads:[~2015-08-03 17:08 UTC | newest] Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-05-27 21:22 [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Josef Bacik 2015-05-28 3:46 ` Mike Galbraith 2015-05-28 9:49 ` Morten Rasmussen 2015-05-28 10:57 ` Mike Galbraith 2015-05-28 11:48 ` Morten Rasmussen 2015-05-28 11:49 ` Mike Galbraith 2015-05-28 10:21 ` Peter Zijlstra 2015-05-28 11:05 ` Peter Zijlstra 2015-05-28 14:27 ` Josef Bacik 2015-05-29 21:03 ` Josef Bacik 2015-05-30 3:55 ` Mike Galbraith 2015-06-01 19:38 ` Josef Bacik 2015-06-01 20:42 ` Peter Zijlstra 2015-06-01 21:03 ` Josef Bacik 2015-06-02 17:12 ` Josef Bacik 2015-06-03 14:12 ` Rik van Riel 2015-06-03 14:24 ` Peter Zijlstra 2015-06-03 14:49 ` Josef Bacik 2015-06-03 15:30 ` Mike Galbraith 2015-06-03 15:57 ` Josef Bacik 2015-06-03 16:53 ` Mike Galbraith 2015-06-03 17:16 ` Josef Bacik 2015-06-03 17:43 ` Mike Galbraith 2015-06-03 20:34 ` Josef Bacik 2015-06-04 4:52 ` Mike Galbraith 2015-06-01 22:15 ` Rik van Riel 2015-06-11 20:33 ` Josef Bacik 2015-06-12 3:42 ` Rik van Riel 2015-06-12 5:35 ` Mike Galbraith 2015-06-17 18:06 ` Josef Bacik 2015-06-18 0:55 ` Mike Galbraith 2015-06-18 3:46 ` Josef Bacik 2015-06-18 4:12 ` Mike Galbraith 2015-07-02 17:44 ` Josef Bacik 2015-07-03 6:40 ` Mike Galbraith 2015-07-03 9:29 ` Mike Galbraith 2015-07-04 15:57 ` Mike Galbraith 2015-07-05 7:17 ` Mike Galbraith 2015-07-06 5:13 ` Mike Galbraith 2015-07-06 14:34 ` Josef Bacik 2015-07-06 18:36 ` Mike Galbraith 2015-07-06 19:41 ` Josef Bacik 2015-07-07 4:01 ` Mike Galbraith 2015-07-07 9:43 ` [patch] " Mike Galbraith 2015-07-07 13:40 ` Josef Bacik 2015-07-07 15:24 ` Mike Galbraith 2015-07-07 17:06 ` Josef Bacik 2015-07-08 6:13 ` [patch] sched: beef up wake_wide() Mike Galbraith 2015-07-09 13:26 ` Peter Zijlstra 2015-07-09 14:07 ` Mike Galbraith 2015-07-09 14:46 ` Mike Galbraith 2015-07-10 5:19 ` Mike Galbraith 
2015-07-10 13:41 ` Josef Bacik 2015-07-10 20:59 ` Josef Bacik 2015-07-11 3:11 ` Mike Galbraith 2015-07-13 13:53 ` Josef Bacik 2015-07-14 11:19 ` Peter Zijlstra 2015-07-14 13:49 ` Mike Galbraith 2015-07-14 14:07 ` Peter Zijlstra 2015-07-14 14:17 ` Mike Galbraith 2015-07-14 15:04 ` Peter Zijlstra 2015-07-14 15:39 ` Mike Galbraith 2015-07-14 16:01 ` Josef Bacik 2015-07-14 17:59 ` Mike Galbraith 2015-07-15 17:11 ` Josef Bacik 2015-08-03 17:07 ` [tip:sched/core] sched/fair: Beef " tip-bot for Mike Galbraith 2015-05-28 11:16 ` [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE Mike Galbraith 2015-05-28 11:49 ` Ingo Molnar 2015-05-28 12:15 ` Mike Galbraith 2015-05-28 12:19 ` Peter Zijlstra 2015-05-28 12:29 ` Ingo Molnar 2015-05-28 15:22 ` David Ahern 2015-05-28 11:55 ` Srikar Dronamraju