From: Chris Wilson <chris@chris-wilson.co.uk>
To: Peter Zijlstra <peterz@infradead.org>,
"Josef Bacik" <josef@toxicpanda.com>
Cc: "Rik van Riel" <riel@redhat.com>,
linux-kernel@vger.kernel.org, jhladky@redhat.com,
mingo@kernel.org, mgorman@suse.de
Subject: Re: [PATCH] sched/fair: Fix wake_affine() for !NUMA_BALANCING
Date: Thu, 24 Aug 2017 23:29:55 +0100 [thread overview]
Message-ID: <150361379597.22505.2216754249003350303@mail.alporthouse.com> (raw)
In-Reply-To: <20170801214307.ig62jhe7smtisvlr@hirez.programming.kicks-ass.net>
We've stumbled across the same regression on Broxton/Apollolake (J3455).
Quoting Peter Zijlstra (2017-08-01 22:43:07)
> On Tue, Aug 01, 2017 at 03:26:51PM -0400, Josef Bacik wrote:
> > > @@ -7574,6 +7607,18 @@ static inline void update_sd_lb_stats(st
> > > env->dst_rq->rd->overload = overload;
> > > }
> > >
> > > + /*
> > > + * Since these are sums over groups they can contain some CPUs
> > > + * multiple times for the NUMA domains.
> > > + *
> > > + * Currently only wake_affine_llc() and find_busiest_group()
> > > + * use these numbers; only the latter is affected by this problem.
> > > + *
> > > + * XXX fix that.
> > > + */
> > > + WRITE_ONCE(shared->nr_running, sds->total_running);
> > > + WRITE_ONCE(shared->load, sds->total_load);
> > > + WRITE_ONCE(shared->capacity, sds->total_capacity);
> >
> > This panics on boot for me because shared is NULL. The same happens in
> > select_task_rq_fair when it tries to do the READ_ONCE. Here is my .config in
> > case it's something strange with my config. Thanks,
>
> Nah, it's just me being an idiot and booting the wrong kernel. Unless I
> messed up again, this one boots.
>
> There is state during boot and hotplug where there are no domains, and
> thus also no shared. Simply ignoring things when that happens should be
> good enough I think.
This is still not as effective as the previous code in spreading across
siblings.
> +/*
> + * Can a task be moved from prev_cpu to this_cpu without causing a load
> + * imbalance that would trigger the load balancer?
> + *
> + * Since we're running on 'stale' values, we might in fact create an imbalance,
> + * but recomputing these values is expensive, as that would mean iterating over
> + * two cache domains' worth of CPUs.
> + */
> +static bool
> +wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
> + int this_cpu, int prev_cpu, int sync)
> +{
> + struct llc_stats prev_stats, this_stats;
> + s64 this_eff_load, prev_eff_load;
> + unsigned long task_load;
> +
> + get_llc_stats(&prev_stats, prev_cpu);
> + get_llc_stats(&this_stats, this_cpu);
> +
> + /*
> + * If sync wakeup then subtract the (maximum possible)
> + * effect of the currently running task from the load
> + * of the current LLC.
> + */
> + if (sync) {
> + unsigned long current_load = task_h_load(current);
> +
> + /* in this case load hits 0 and this LLC is considered 'idle' */
> + if (current_load > this_stats.load)
> + return true;
> +
> + this_stats.load -= current_load;
> + }
> +
> + /*
> + * The has_capacity stuff is not SMT aware, but by trying to balance
> + * the nr_running on both ends we try and fill the domain at equal
> + * rates, thereby first consuming cores before siblings.
> + */
> +
> + /* if the old cache has capacity, stay there */
> + if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> + return false;
> +
> + /* if this cache has capacity, come here */
> + if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
I think you mean

	if (this_stats.has_capacity && this_stats.nr_running + 1 < prev_stats.nr_running)

and with that our particular graphics benchmark behaves much as it did
before (the regression appears fixed). But I'll let Eero double-check.
> + return true;
> +
> +
> + /*
> + * Check to see if we can move the load without causing too much
> + * imbalance.
> + */
> + task_load = task_h_load(p);
> +
> + this_eff_load = 100;
> + this_eff_load *= prev_stats.capacity;
> +
> + prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
> + prev_eff_load *= this_stats.capacity;
> +
> + this_eff_load *= this_stats.load + task_load;
> + prev_eff_load *= prev_stats.load - task_load;
> +
> + return this_eff_load <= prev_eff_load;
> +}