Re: wake_wide mechanism clarification

From: Josef Bacik <josef@toxicpanda.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>,
	Joel Fernandes <joelaf@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Juri Lelli <Juri.Lelli@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Patrick Bellasi <patrick.bellasi@arm.com>,
	Brendan Jackman <brendan.jackman@arm.com>,
	Chris Redpath <Chris.Redpath@arm.com>
Subject: Re: wake_wide mechanism clarification
Date: Fri, 30 Jun 2017 13:55:42 -0400
Message-ID: <20170630175540.GA2097@destiny>
In-Reply-To: <1498842140.15161.66.camel@gmail.com>

On Fri, Jun 30, 2017 at 07:02:20PM +0200, Mike Galbraith wrote:
> On Fri, 2017-06-30 at 10:28 -0400, Josef Bacik wrote:
> > On Thu, Jun 29, 2017 at 08:04:59PM -0700, Joel Fernandes wrote:
> > 
> > > That makes sense that we multiply slave's flips by a factor because
> > > it's low, but I still didn't get why the factor is chosen to be
> > > llc_size instead of something else for the multiplication with slave
> > > (slave * factor).
> 
> > Yeah I don't know why llc_size was chosen...
> 
> static void update_top_cache_domain(int cpu)
> {
>         struct sched_domain_shared *sds = NULL;
>         struct sched_domain *sd;
>         int id = cpu;
>         int size = 1;
> 
>         sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
>         if (sd) {
>                 id = cpumask_first(sched_domain_span(sd));
>                 size = cpumask_weight(sched_domain_span(sd));
>                 sds = sd->shared;
>         }
> 
>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>         per_cpu(sd_llc_size, cpu) = size;
> 
> The goal of wake wide was to approximate when pulling would be a futile
> consolidation effort and counterproductive to scaling.  'course with
> ever increasing socket size, any 1:N waker is ever more likely to run
> out of CPU for its one and only self (slamming into scaling wall)
> before it needing to turn its minions loose to conquer the world.
> 
> Something else to consider: network interrupt waking multiple workers
> at high frequency.  If the waking CPU is idle, do you really want to
> place a worker directly in front of a tattoo artist, or is it better
> off nearly anywhere but there?
> 
> If the box is virtual, with no topology exposed (or real but ancient)
> to let select_idle_sibling() come to the rescue, two workers can even
> get tattooed simultaneously (see sync wakeup). 
> 

Heuristics are hard, news at 11.  I think messing with wake_wide() itself is too
big of a hammer; we probably need a middle ground.  I'm messing with it right
now so it's too early to say for sure, but I _suspect_ the bigger latencies we
see are not because we overload the CPU we're trying to pull to, but because
when we fail to do the wake_affine() we only look at siblings of the affine_sd
instead of doing the full "find the idlest cpu in the land!" thing.  I _think_
the answer is to make select_idle_sibling() try less hard to find something
workable, only use obviously idle CPUs in the affine sd, and otherwise fall
back to the full load-balance-esque search.
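
To make the two-stage idea concrete, here is a rough userspace toy, not kernel
code.  Everything in it (struct cpu_model, its fields, and the three function
names) is made up for illustration, and "obviously idle" is assumed to mean
nothing runnable on the CPU.  It only shows the shape of the search: a cheap
pass over the affine domain, then a full scan if that pass finds nothing.

```c
#include <stddef.h>

/* Hypothetical model of a CPU; these fields do not exist in the kernel. */
struct cpu_model {
	unsigned int nr_running;	/* runnable tasks on this CPU */
	unsigned long load;		/* abstract load figure, lower is better */
	int in_affine_sd;		/* nonzero if inside the waker's affine domain */
};

/* Stage 1: accept only CPUs in the affine domain with nothing running. */
static int find_idle_in_affine(const struct cpu_model *cpus, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cpus[i].in_affine_sd && cpus[i].nr_running == 0)
			return (int)i;
	return -1;
}

/* Stage 2: full scan for the least loaded CPU anywhere. */
static int find_idlest_anywhere(const struct cpu_model *cpus, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++)
		if (best < 0 || cpus[i].load < cpus[best].load)
			best = (int)i;
	return best;
}

/* Try the cheap affine pass first, fall back to the expensive search. */
static int pick_cpu(const struct cpu_model *cpus, size_t n)
{
	int cpu = find_idle_in_affine(cpus, n);

	return cpu >= 0 ? cpu : find_idlest_anywhere(cpus, n);
}
```

The point of keeping stage 1 this strict is that a busy affine CPU is never
picked just because it was the first workable thing found; anything short of
idle goes through the wider search.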

This would make affine misses really expensive, but we can probably offset this
by tracking per task how often it misses the target, and using that to adjust
when we do wake_affine in the future for that task.  Still experimenting some;
I just found out a few hours ago that I need to rework some of this to fix my
cpu imbalance problem with cgroups, so once I get something working I'll throw
it your way to take a look.  Thanks,

Josef
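
For the per-task miss tracking mentioned above, a toy sketch of what the
bookkeeping might look like.  This is an assumption-laden illustration, not a
patch: struct affine_stats, both helpers, the sample floor of 8, the decay
point of 64, and the "more than half missed" cutoff are all invented here and
would need real tuning.

```c
/* Hypothetical per-task counters; task_struct has no such fields. */
struct affine_stats {
	unsigned int attempts;	/* affine wakeups tried for this task */
	unsigned int misses;	/* attempts that did not land on the target */
};

#define AFFINE_MIN_SAMPLES	8u	/* assumed: too few samples to judge */
#define AFFINE_DECAY_AT		64u	/* assumed: when to age the history */

/* Record the outcome of one affine attempt. */
static void affine_record(struct affine_stats *s, int missed)
{
	s->attempts++;
	if (missed)
		s->misses++;

	/* Halve both counters periodically so old behaviour ages out. */
	if (s->attempts >= AFFINE_DECAY_AT) {
		s->attempts /= 2;
		s->misses /= 2;
	}
}

/* Return 1 if an affine wakeup still looks worth trying for this task. */
static int affine_worth_trying(const struct affine_stats *s)
{
	if (s->attempts < AFFINE_MIN_SAMPLES)
		return 1;

	/* Skip affine once more than half of the recent attempts missed. */
	return s->misses * 2 <= s->attempts;
}
```

Halving instead of resetting keeps the ratio roughly intact while letting a
task that changes behaviour earn its way back to affine wakeups.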