From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755587AbbGGJn4 (ORCPT ); Tue, 7 Jul 2015 05:43:56 -0400
Received: from mail-wg0-f51.google.com ([74.125.82.51]:32773 "EHLO
	mail-wg0-f51.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752350AbbGGJnq (ORCPT ); Tue, 7 Jul 2015 05:43:46 -0400
Message-ID: <1436262224.1836.74.camel@gmail.com>
Subject: [patch] Re: [PATCH RESEND] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE
From: Mike Galbraith
To: Josef Bacik
Cc: Peter Zijlstra, riel@redhat.com, mingo@redhat.com,
	linux-kernel@vger.kernel.org, morten.rasmussen@arm.com, kernel-team
Date: Tue, 07 Jul 2015 11:43:44 +0200
In-Reply-To: <1436241678.1836.29.camel@gmail.com>
References: <1432761736-22093-1-git-send-email-jbacik@fb.com>
	<20150528102127.GD3644@twins.programming.kicks-ass.net>
	<20150528110514.GR18673@twins.programming.kicks-ass.net>
	<1434087305.3674.26.camel@gmail.com>
	<5581B70D.2000800@fb.com>
	<1434588939.3444.25.camel@gmail.com>
	<55823F33.7040005@fb.com>
	<1434600765.3393.9.camel@gmail.com>
	<55957871.7080906@fb.com>
	<1435905658.6418.52.camel@gmail.com>
	<1436025462.17152.37.camel@gmail.com>
	<1436080661.22930.22.camel@gmail.com>
	<1436159590.5850.27.camel@gmail.com>
	<559A91F4.7000903@fb.com>
	<1436207790.2940.30.camel@gmail.com>
	<559AD9CE.4090309@fb.com>
	<1436241678.1836.29.camel@gmail.com>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.12.11
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2015-07-07 at 06:01 +0200, Mike Galbraith wrote:
> On Mon, 2015-07-06 at 15:41 -0400, Josef Bacik wrote:
>
> > So the NO_WAKE_WIDE_IDLE results are very good, almost the same as the
> > baseline with a slight regression at lower RPS and a slight improvement
> > at high RPS.
>
> Good.  I can likely drop the rest then (I like dinky, so do CPUs;).
> I'm not real keen on the feature unless your numbers are really good, and
> odds are that ain't gonna happen.

More extensive testing in pedantic-man mode increased my confidence of
that enough to sign off and ship the dirt simple version.  Any further
twiddles should grow their own wings if they want to fly anyway, the
simplest form helps your real world load, as well as the not so real
pgbench, my numbers for that below.

virgin master, 2 socket box
postgres@nessler:~> pgbench.sh
clients 12  tps =  96233.854271   1.000
clients 24  tps = 142234.686166   1.000
clients 36  tps = 148433.534531   1.000
clients 48  tps = 133105.634302   1.000
clients 60  tps = 128903.080371   1.000
clients 72  tps = 128591.821782   1.000
clients 84  tps = 114445.967116   1.000
clients 96  tps = 109803.557524   1.000
avg               125219.017      1.000

V3 (KISS, below)
postgres@nessler:~> pgbench.sh
clients 12  tps = 120793.023637   1.255
clients 24  tps = 144668.961468   1.017
clients 36  tps = 156705.239251   1.055
clients 48  tps = 152004.886893   1.141
clients 60  tps = 138582.113864   1.075
clients 72  tps = 136286.891104   1.059
clients 84  tps = 137420.986043   1.200
clients 96  tps = 135199.060242   1.231
avg               140207.645      1.119   1.000

V2 NO_WAKE_WIDE_IDLE
postgres@nessler:~> pgbench.sh
clients 12  tps = 121821.966162   1.265
clients 24  tps = 146446.388366   1.029
clients 36  tps = 151373.362190   1.019
clients 48  tps = 156806.730746   1.178
clients 60  tps = 133933.491567   1.039
clients 72  tps = 131460.489424   1.022
clients 84  tps = 130859.340261   1.143
clients 96  tps = 130787.476584   1.191
avg               137936.155      1.101   0.983

V2 WAKE_WIDE_IDLE (crawl in a hole feature, you're dead)
postgres@nessler:~> pgbench.sh
clients 12  tps = 121297.791570
clients 24  tps = 145939.488312
clients 36  tps = 155336.090263
clients 48  tps = 149018.245323
clients 60  tps = 136730.079391
clients 72  tps = 134886.116831
clients 84  tps = 130493.283398
clients 96  tps = 126043.336074

sched: beef up wake_wide()

Josef Bacik reported that Facebook sees better performance with their 1:N load (1
dispatch/node, N workers/node) when carrying an old patch to try very hard
to wake to an idle CPU.  While looking at wake_wide(), I noticed that it
doesn't pay attention to wakeup of the 1:N waker, returning 1 only when
the 1:N waker is waking one of its minions.

Correct that, and don't bother doing domain traversal when we know
that all we need to do is check for an idle cpu.

Signed-off-by: Mike Galbraith
---
 include/linux/sched.h |    4 +--
 kernel/sched/fair.c   |   56 ++++++++++++++++++++++++--------------------------
 2 files changed, 29 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1351,9 +1351,9 @@ struct task_struct {
 #ifdef CONFIG_SMP
 	struct llist_node wake_entry;
 	int on_cpu;
-	struct task_struct *last_wakee;
-	unsigned long wakee_flips;
+	unsigned int wakee_flips;
 	unsigned long wakee_flip_decay_ts;
+	struct task_struct *last_wakee;
 
 	int wake_cpu;
 #endif
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -4730,26 +4730,27 @@ static long effective_load(struct task_g
 
 #endif
 
+/*
+ * Detect 1:N waker/wakee relationship via a switching-frequency heuristic.
+ * A waker of many should wake a different task than the one last awakened
+ * at a frequency roughly N times higher than one of its wakees.  In order
+ * to determine whether we should let the load spread vs consolidating to
+ * shared cache, we look for a minimum 'flip' frequency of llc_size in one
+ * partner, and a factor of llc_size higher frequency in the other.  With
+ * both conditions met, we can be relatively sure that we are seeing a 1:N
+ * relationship, and that load size exceeds socket size.
+ */
 static int wake_wide(struct task_struct *p)
 {
+	unsigned int waker_flips = current->wakee_flips;
+	unsigned int wakee_flips = p->wakee_flips;
 	int factor = this_cpu_read(sd_llc_size);
 
-	/*
-	 * Yeah, it's the switching-frequency, could means many wakee or
-	 * rapidly switch, use factor here will just help to automatically
-	 * adjust the loose-degree, so bigger node will lead to more pull.
-	 */
-	if (p->wakee_flips > factor) {
-		/*
-		 * wakee is somewhat hot, it needs certain amount of cpu
-		 * resource, so if waker is far more hot, prefer to leave
-		 * it alone.
-		 */
-		if (current->wakee_flips > (factor * p->wakee_flips))
-			return 1;
-	}
-
-	return 0;
+	if (waker_flips < wakee_flips)
+		swap(waker_flips, wakee_flips);
+	if (wakee_flips < factor || waker_flips < wakee_flips * factor)
+		return 0;
+	return 1;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
@@ -4761,13 +4762,6 @@ static int wake_affine(struct sched_doma
 	unsigned long weight;
 	int balanced;
 
-	/*
-	 * If we wake multiple tasks be careful to not bounce
-	 * ourselves around too much.
-	 */
-	if (wake_wide(p))
-		return 0;
-
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
@@ -5021,14 +5015,17 @@ select_task_rq_fair(struct task_struct *
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = cpu;
+	int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
 
-	if (sd_flag & SD_BALANCE_WAKE)
-		want_affine = cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
-
 	rcu_read_lock();
+	if (sd_flag & SD_BALANCE_WAKE) {
+		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		if (!want_affine)
+			goto select_idle;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			continue;
@@ -5048,10 +5045,11 @@ select_task_rq_fair(struct task_struct *
 	}
 
 	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
-		prev_cpu = cpu;
+		new_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+select_idle:
+		new_cpu = select_idle_sibling(p, new_cpu);
 		goto unlock;
 	}
 
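
To poke at the heuristic outside the kernel, the patched wake_wide() test can be modeled in plain userspace C.  This is only a sketch under stated assumptions, not the kernel code: wake_wide_model() and its three parameters are hypothetical stand-ins for current->wakee_flips, p->wakee_flips and this_cpu_read(sd_llc_size), and nothing here touches scheduler state.

```c
/*
 * Userspace model of the patched wake_wide() decision (illustrative only).
 *
 * waker_flips, wakee_flips: hypothetical stand-ins for the wakee_flips
 *                           counters of the waking task and the task
 *                           being woken.
 * llc_size:                 hypothetical stand-in for sd_llc_size, the
 *                           number of CPUs sharing the last level cache.
 *
 * Returns 1 when the pair looks like a 1:N relationship that is too big
 * for one LLC domain (let the load spread), 0 otherwise (stay affine).
 */
static int wake_wide_model(unsigned int waker_flips,
			   unsigned int wakee_flips,
			   unsigned int llc_size)
{
	/* Order the pair so waker_flips holds the larger flip count. */
	if (waker_flips < wakee_flips) {
		unsigned int tmp = waker_flips;

		waker_flips = wakee_flips;
		wakee_flips = tmp;
	}

	/*
	 * Both conditions must hold to go wide: the quieter partner flips
	 * at least llc_size times, and the busier partner flips at least
	 * llc_size times more often than the quieter one.
	 */
	if (wakee_flips < llc_size || waker_flips < wakee_flips * llc_size)
		return 0;
	return 1;
}
```

With llc_size = 8, flip counts of 100 and 10 return 1 no matter which side is the waker, which is the symmetry the old code lacked.  Counts of 100 and 4 return 0 because the quieter side is below llc_size, and 50 and 10 return 0 because the ratio falls short of llc_size.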