From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752463AbdF3Rzq (ORCPT ); Fri, 30 Jun 2017 13:55:46 -0400
Received: from mail-qk0-f174.google.com ([209.85.220.174]:35387 "EHLO mail-qk0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751657AbdF3Rzp (ORCPT ); Fri, 30 Jun 2017 13:55:45 -0400
Date: Fri, 30 Jun 2017 13:55:42 -0400
From: Josef Bacik
To: Mike Galbraith
Cc: Josef Bacik , Joel Fernandes , Peter Zijlstra , LKML , Juri Lelli , Dietmar Eggemann , Patrick Bellasi , Brendan Jackman , Chris Redpath
Subject: Re: wake_wide mechanism clarification
Message-ID: <20170630175540.GA2097@destiny>
References: <20170630004912.GA2457@destiny> <20170630142815.GA9743@destiny> <1498842140.15161.66.camel@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1498842140.15161.66.camel@gmail.com>
User-Agent: Mutt/1.8.0 (2017-02-23)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> >
> > > That makes sense that we multiply slave's flips by a factor because
> > > its low, but I still didn't get why the factor is chosen to be
> > > llc_size instead of something else for the multiplication with slave
> > > (slave * factor).
>
> > Yeah I don't know why llc_size was chosen...
>
> static void update_top_cache_domain(int cpu)
> {
>         struct sched_domain_shared *sds = NULL;
>         struct sched_domain *sd;
>         int id = cpu;
>         int size = 1;
>
>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>                 sds = sd->shared;
>         }
>
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
>
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling.  'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into scaling wall)
> before it needing to turn its minions loose to conquer the world.
>
> Something else to consider: network interrupt waking multiple workers
> at high frequency.  If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
>
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup).
>

Heuristics are hard, news at 11.  I think messing with wake_wide() itself is
too big of a hammer; we probably need a middle ground.  I'm messing with it
right now so it's too early to say for sure, but I _suspect_ the bigger
latencies we see are not because we overload the cpu we're trying to pull to,
but because when we fail to do the wake_affine() we only look at siblings of
the affine_sd instead of doing the full "find the idlest cpu in the land!"
thing.
I _think_ the answer is to make select_idle_sibling() try less hard to find
something workable and only use obviously idle cpus in the affine sd, and
otherwise fall back to the full load-balance-esque search.  This would make
affine misses really expensive, but we can probably negate this by tracking
per task how often it misses the target, and using that to adjust when we do
wake_affine in the future for that task.

Still experimenting some; I just found out a few hours ago I need to rework
some of this to fix my cpu imbalance problem with cgroups, so once I get
something working I'll throw it your way to take a look.  Thanks,

Josef
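
To make the idea above a bit more concrete, here is a rough userspace sketch of
it, not a patch and not the code I'm experimenting with.  Everything in it is
made up for illustration: struct task_model, the nr_running[] toy machine, the
helper names, the "more than half of the tries missed" threshold and the
halving used to age the history are all assumptions.  It only models the shape
of the logic: a cheap scan that accepts nothing but a completely idle cpu in
the affine span, an expensive global fallback, and a per-task miss count that
decides whether the cheap scan is worth attempting next time.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical per-task bookkeeping; these are not real task_struct fields. */
struct task_model {
	unsigned int affine_tries;	/* cheap affine scans attempted */
	unsigned int affine_misses;	/* scans that found no idle cpu */
};

/* Toy machine state: number of runnable tasks on each cpu. */
static unsigned int nr_running[NR_CPUS];

/* Cheap path: accept only a completely idle cpu in the span [first, last]. */
static int find_obviously_idle(int first, int last)
{
	for (int cpu = first; cpu <= last; cpu++)
		if (nr_running[cpu] == 0)
			return cpu;
	return -1;
}

/* Expensive path: scan every cpu and take the least loaded one. */
static int find_idlest_anywhere(void)
{
	int best = 0;

	for (int cpu = 1; cpu < NR_CPUS; cpu++)
		if (nr_running[cpu] < nr_running[best])
			best = cpu;
	return best;
}

/*
 * Assumed policy: always try until there is some history (4 tries), then
 * keep trying only while no more than half of the tries have missed.
 */
static bool should_try_affine(const struct task_model *t)
{
	return t->affine_tries < 4 || t->affine_misses * 2 <= t->affine_tries;
}

/* Pick a cpu for a wakeup whose affine span is [first, last]. */
static int select_cpu(struct task_model *t, int first, int last)
{
	assert(first >= 0 && first <= last && last < NR_CPUS);

	if (should_try_affine(t)) {
		int cpu = find_obviously_idle(first, last);

		t->affine_tries++;
		if (cpu >= 0)
			return cpu;
		t->affine_misses++;	/* miss: pay for the full search */
	} else {
		/* Age the history so the task gets another try later. */
		t->affine_tries /= 2;
		t->affine_misses /= 2;
	}
	return find_idlest_anywhere();
}
```

A task that keeps missing stops paying for the affine scan and goes straight
to the global search; the aging step lets it probe the affine span again after
a while.  Whether a simple ratio like this is the right signal is exactly the
part that still needs experimenting.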