From: Chris Wilson <chris@chris-wilson.co.uk>
To: Peter Zijlstra <peterz@infradead.org>,
"Josef Bacik" <josef@toxicpanda.com>
Cc: "Rik van Riel" <riel@redhat.com>,
linux-kernel@vger.kernel.org, jhladky@redhat.com,
mingo@kernel.org, mgorman@suse.de
Subject: Re: [PATCH] sched/fair: Fix wake_affine() for !NUMA_BALANCING
Date: Thu, 24 Aug 2017 23:29:55 +0100 [thread overview]
Message-ID: <150361379597.22505.2216754249003350303@mail.alporthouse.com> (raw)
In-Reply-To: <20170801214307.ig62jhe7smtisvlr@hirez.programming.kicks-ass.net>
We've stumbled across the same regression on Broxton/Apollolake (J3455).
Quoting Peter Zijlstra (2017-08-01 22:43:07)
> On Tue, Aug 01, 2017 at 03:26:51PM -0400, Josef Bacik wrote:
> > > @@ -7574,6 +7607,18 @@ static inline void update_sd_lb_stats(st
> > > env->dst_rq->rd->overload = overload;
> > > }
> > >
> > > + /*
> > > + * Since these are sums over groups they can contain some CPUs
> > > + * multiple times for the NUMA domains.
> > > + *
> > > + * Currently only wake_affine_llc() and find_busiest_group()
> > > + * use these numbers; only the latter is affected by this problem.
> > > + *
> > > + * XXX fix that.
> > > + */
> > > + WRITE_ONCE(shared->nr_running, sds->total_running);
> > > + WRITE_ONCE(shared->load, sds->total_load);
> > > + WRITE_ONCE(shared->capacity, sds->total_capacity);
> >
> > This panics on boot for me because shared is NULL. The same happens in
> > select_task_rq_fair when it tries to do the READ_ONCE. Here is my .config in
> > case it's something strange with my config. Thanks,
>
> Nah, it's just me being an idiot and booting the wrong kernel. Unless I
> messed up again, this one boots.
>
> There is state during boot and hotplug where there are no domains, and
> thus also no shared. Simply ignoring things when that happens should be
> good enough I think.
This is still not as effective as the previous code in spreading across
siblings.
> +/*
> + * Can a task be moved from prev_cpu to this_cpu without causing a load
> + * imbalance that would trigger the load balancer?
> + *
> + * Since we're running on 'stale' values, we might in fact create an imbalance,
> + * but recomputing these values is expensive, as that would mean iterating over
> + * two cache domains' worth of CPUs.
> + */
> +static bool
> +wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
> + int this_cpu, int prev_cpu, int sync)
> +{
> + struct llc_stats prev_stats, this_stats;
> + s64 this_eff_load, prev_eff_load;
> + unsigned long task_load;
> +
> + get_llc_stats(&prev_stats, prev_cpu);
> + get_llc_stats(&this_stats, this_cpu);
> +
> + /*
> + * If sync wakeup then subtract the (maximum possible)
> + * effect of the currently running task from the load
> + * of the current LLC.
> + */
> + if (sync) {
> + unsigned long current_load = task_h_load(current);
> +
> + /* in this case load hits 0 and this LLC is considered 'idle' */
> + if (current_load > this_stats.load)
> + return true;
> +
> + this_stats.load -= current_load;
> + }
> +
> + /*
> + * The has_capacity stuff is not SMT aware, but by trying to balance
> + * the nr_running on both ends we try and fill the domain at equal
> + * rates, thereby first consuming cores before siblings.
> + */
> +
> + /* if the old cache has capacity, stay there */
> + if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> + return false;
> +
> + /* if this cache has capacity, come here */
> + if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
I think you mean

	if (this_stats.has_capacity && this_stats.nr_running + 1 < prev_stats.nr_running)

and with that our particular graphics benchmark behaves much as it did
before (the regression appears fixed). But I'll let Eero double-check.
> + return true;
> +
> +
> + /*
> + * Check to see if we can move the load without causing too much
> + * imbalance.
> + */
> + task_load = task_h_load(p);
> +
> + this_eff_load = 100;
> + this_eff_load *= prev_stats.capacity;
> +
> + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
> + prev_eff_load *= this_stats.capacity;
> +
> + this_eff_load *= this_stats.load + task_load;
> + prev_eff_load *= prev_stats.load - task_load;
> +
> + return this_eff_load <= prev_eff_load;
> +}