Date: Mon, 22 Mar 2021 11:03:06 +0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Rik van Riel
Cc: linux-kernel@vger.kernel.org, kernel-team@fb.com,
	"Peter Zijlstra (Intel)", Ingo Molnar, Vincent Guittot,
	Valentin Schneider
Subject: Re: [PATCH v2] sched/fair: bring back select_idle_smt, but differently
Message-ID: <20210322110306.GE3697@techsingularity.net>
References: <20210321150358.71ef52b1@imladris.surriel.com>
In-Reply-To: <20210321150358.71ef52b1@imladris.surriel.com>

On Sun, Mar 21, 2021 at 03:03:58PM -0400, Rik van Riel wrote:
> Mel Gorman did some nice work in 9fe1f127b913
> ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel
> being more efficient at finding an idle CPU, and in tasks spending less
> time waiting to be run, both according to the schedstats run_delay
> numbers, and according to measured application latencies. Yay.
> 

Other than the unusual indentation of the changelog, yay for part 1.

> The flip side of this is that we see more task migrations (about
> 30% more), higher cache misses, higher memory bandwidth utilization,
> and higher CPU use, for the same number of requests/second.
> 

I am having difficulty with this part, and specifically with whether
this patch affects task migrations in particular.

> This is most pronounced on a memcache type workload, which saw
> a consistent 1-3% increase in total CPU use on the system, due
> to those increased task migrations leading to higher L2 cache
> miss numbers, and higher memory utilization. The exclusive L3
> cache on Skylake does us no favors there.
> 

Out of curiosity, what is the load generator for memcache, or is this
based on analysis of a production workload?
I ask because mutilate (https://github.com/leverich/mutilate) is
allegedly a load generator that can simulate Facebook traffic patterns,
but it is old. I would be interested in hearing whether mutilate is
used and, if so, what parameters the load generator is given.

> On our web serving workload, that effect is usually negligible,
> but occasionally as large as a 9% regression in the number of
> requests served, due to some interaction between scheduler latency
> and the web load balancer.
> 
> It appears that the increased number of CPU migrations is generally
> a good thing, since it leads to lower cpu_delay numbers, reflecting
> the fact that tasks get to run faster. However, the reduced locality
> and the corresponding increase in L2 cache misses hurts a little.
> 

This is understandable, because finding an idle CPU and incurring the
migration cost can be better than stacking a task on a busy CPU for
workloads that are sensitive to wakeup latency.

> The patch below appears to fix the regression, while keeping the
> benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt
> with a twist: when a socket has no idle cores, check to see if the
> sibling of "prev" is idle, before searching all the other CPUs.
> 

But this is the part I'm having a problem with. If the sibling of prev
is idle, then a task migration cost is still incurred. The main
difference is that it is more likely to share an L1 or L2 cache and,
with no idle cores, some sharing of resources between HT siblings is
inevitable. Can you confirm that task migrations are still higher with
this patch?

> This fixes both the occasional 9% regression on the web serving
> workload, and the continuous 2% CPU use regression on the memcache
> type workload.
> 
> With Mel's patches and this patch together, the p95 and p99 response
> times for the memcache type application improve by about 20% over what
> they were before Mel's patches got merged.
> 

Again, I would be very interested in hearing how this conclusion was
reached, because I can update mmtests accordingly and wire it into the
"standard battery of scheduler workloads".

> Signed-off-by: Rik van Riel
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..0c986972f4cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6098,6 +6098,28 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>  	return -1;
>  }
>  
> +/*
> + * Scan the local SMT mask for idle CPUs.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	int cpu;
> +
> +	if (!static_branch_likely(&sched_smt_present))
> +		return -1;
> +
> +	for_each_cpu(cpu, cpu_smt_mask(target)) {
> +		if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
> +		    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
> +			continue;
> +		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> +			return cpu;
> +	}
> +
> +	return -1;
> +}
> +
>  #else /* CONFIG_SCHED_SMT */
> 

Take a look at Barry's patch "!static_branch_likely(&sched_smt_present)".
If that is applied first, then you can remove the static_branch_likely
check here, as it will be covered by an earlier test_idle_cores() call.
As the ordering is not known, just note that it is a potential follow-up
if Barry's patch is merged after yours.
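For illustration only, the helper would reduce to roughly the following
if the static_branch_likely check were dropped on that basis. This is
an untested sketch of the suggested simplification, not part of Rik's
patch:

	/*
	 * Scan the local SMT mask for idle CPUs. The sched_smt_present
	 * check is gone on the assumption that the caller has already
	 * bailed out via an earlier test_idle_cores()/SMT check.
	 */
	static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
	{
		int cpu;

		for_each_cpu(cpu, cpu_smt_mask(target)) {
			/* Skip CPUs outside the task's affinity or the domain span. */
			if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
			    !cpumask_test_cpu(cpu, sched_domain_span(sd)))
				continue;
			if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
				return cpu;
		}

		return -1;
	}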
>  static inline void set_idle_cores(int cpu, int val)
> @@ -6114,6 +6136,11 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
>  	return __select_idle_cpu(core);
>  }
>  
> +static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> +	return -1;
> +}
> +
>  #endif /* CONFIG_SCHED_SMT */
> 
>  /*
> @@ -6121,7 +6148,7 @@ static inline int select_idle_core(struct task_struct *p, int core, struct cpuma
>   * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>   * average idle time for this rq (as found in rq->avg_idle).
>   */
> -static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
> +static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int prev, int target)
>  {
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
>  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> @@ -6155,6 +6182,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  		time = cpu_clock(this);
>  	}
>  
> +	if (!smt && cpus_share_cache(prev, target)) {
> +		/* No idle core. Check if prev has an idle sibling. */
> +		i = select_idle_smt(p, sd, prev);
> +		if ((unsigned int)i < nr_cpumask_bits)
> +			return i;
> +	}
> +
>  	for_each_cpu_wrap(cpu, cpus, target) {
>  		if (smt) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);

Please consider moving this block within the SIS_PROP && !smt check
above. It could become something like

	if (!smt) {
		/* No idle core. Check if prev has an idle sibling. */
		i = select_idle_smt(p, sd, prev);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;

		if (sched_feat(SIS_PROP)) {
			SIS_PROP STUFF
		}
	}

That way you avoid some calculations and a clock read if the HT sibling
is idle.

Second, select_idle_smt() does not use the cpus mask, so consider moving
the cpus initialisation after select_idle_smt() has been called.
Specifically, this initialisation:

	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

Alternatively, clear the bits in the SMT sibling scan to avoid checking
the siblings twice. It's a tradeoff, because initialising and clearing
bits is not free and the cost is wasted if a sibling is free.

A third concern, although it is mild, is that the SMT scan ignores the
SIS_PROP limits on the depth of the search, so this patch may increase
the scan depth as a result. It's only a mild concern, as limiting the
depth of a search is a magic number anyway.

Otherwise, the biggest difference from historical behaviour is that we
are explicitly favouring SMT siblings when !idle_cores, and !idle_cores
can be inaccurate. That is going to be a "win some, lose some" scenario
and might show up in overloaded workloads that rapidly idle
(e.g. hackbench). I don't have a strong opinion on this one, because
*some* HT sibling is going to be used and it is workload- and
hardware-dependent what the impact is.

I think I'll run this through a short run of some scheduler loads, but
I'll wait to see if you decide to create another version based on this
review.

-- 
Mel Gorman
SUSE Labs
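As an untested illustration of the shape suggested in the review above
(the prev-sibling check folded under the !smt branch ahead of the
SIS_PROP accounting, and the cpus mask built only after that cheap check
fails), the top of the scan in select_idle_cpu() might look roughly like
the sketch below. The commented elisions stand in for the existing code;
none of this is taken from Rik's patch:

	if (!smt) {
		if (cpus_share_cache(prev, target)) {
			/* No idle core. Check if prev has an idle sibling. */
			i = select_idle_smt(p, sd, prev);
			if ((unsigned int)i < nr_cpumask_bits)
				return i;
		}

		if (sched_feat(SIS_PROP)) {
			/* existing SIS_PROP scan-cost and clock accounting */
		}
	}

	/* select_idle_smt() does not need the mask, so only build it now. */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	for_each_cpu_wrap(cpu, cpus, target) {
		/* existing idle core/CPU scan */
	}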