From: Mel Gorman <mgorman@techsingularity.net>
To: Hillf Danton <hdanton@sina.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Phil Auld <pauld@redhat.com>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 06/13] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity
Date: Mon, 17 Feb 2020 15:06:15 +0000
Message-ID: <20200217150615.GJ3466@techsingularity.net>
In-Reply-To: <20200217132019.6684-1-hdanton@sina.com>

On Mon, Feb 17, 2020 at 09:20:19PM +0800, Hillf Danton wrote:
> 
> On Mon, 17 Feb 2020 10:43:55 +0000 Mel Gorman wrote:
> > 
> > The standard load balancer generally tries to keep the number of running
> > tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows
> > tasks to move if there is spare capacity but this causes a conflict and
> > utilisation between NUMA nodes gets badly skewed. This patch uses similar
> > logic between the NUMA balancer and load balancer when deciding if a task
> > migrating to its preferred node can use an idle CPU.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.com>
> > ---
> >  kernel/sched/fair.c | 76 +++++++++++++++++++++++++++++++----------------------
> >  1 file changed, 45 insertions(+), 31 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 52e74b53d6e7..ae7870f71457 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1520,6 +1520,7 @@ struct task_numa_env {
> >  
> >  static unsigned long cpu_load(struct rq *rq);
> >  static unsigned long cpu_util(int cpu);
> > +static inline long adjust_numa_imbalance(int imbalance, int src_nr_running);
> >  
> >  static inline enum
> >  numa_type numa_classify(unsigned int imbalance_pct,
> > @@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> >  	long orig_src_load, orig_dst_load;
> >  	long src_capacity, dst_capacity;
> >  
> > -
> > -	/* If dst node has spare capacity, there is no real load imbalance */
> > -	if (env->dst_stats.node_type == node_has_spare)
> > -		return false;
> > -
>
> The current logic is: move if dst node has spare capacity,
> 

The load balancer takes either idle CPUs or running tasks into account
depending on the SD flags. Unless this check is removed, CPU usage
across nodes becomes imbalanced, the load balancer takes action, and
testing indicates that performance degrades severely. Both the overall
impact and the effect of this patch in isolation are illustrated in the
leader of the patch series.
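
To make the conflict concrete, here is a minimal standalone sketch (the
name is illustrative, it is not the kernel code itself) of the question
the hunk further down ends up asking before using an idle CPU on a node
with spare capacity:

/*
 * Illustrative only: would moving one task leave the destination node
 * busier than the source, i.e. create an imbalance that the load
 * balancer would immediately try to undo?
 */
static int numa_move_creates_imbalance(int src_nr_running, int dst_nr_running)
{
	int src_after = src_nr_running - 1;	/* task leaves the source node */
	int dst_after = dst_nr_running + 1;	/* task arrives at the destination */
	int imbalance = dst_after - src_after;

	if (imbalance < 0)
		imbalance = 0;	/* destination still no busier than source */

	return imbalance > 0;
}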

> >  	/*
> >  	 * The load is corrected for the CPU capacity available on each node.
> >  	 *
> > @@ -1757,19 +1753,37 @@ static void task_numa_compare(struct task_numa_env *env,
> >  static void task_numa_find_cpu(struct task_numa_env *env,
> >  				long taskimp, long groupimp)
> >  {
> > -	long src_load, dst_load, load;
> >  	bool maymove = false;
> >  	int cpu;
> >  
> > -	load = task_h_load(env->p);
> > -	dst_load = env->dst_stats.load + load;
> > -	src_load = env->src_stats.load - load;
> > -
> >  	/*
> > -	 * If the improvement from just moving env->p direction is better
> > -	 * than swapping tasks around, check if a move is possible.
> > +	 * If dst node has spare capacity, then check if there is an
> > +	 * imbalance that would be overruled by the load balancer.
> >  	 */
> > -	maymove = !load_too_imbalanced(src_load, dst_load, env);
> > +	if (env->dst_stats.node_type == node_has_spare) {
> 
> so maymove should be true here.
> 

Performance suffers on numerous workloads that way.

> > +		unsigned int imbalance;
> > +		int src_running, dst_running;
> > +
> > +		/* Would movement cause an imbalance? */
> > +		src_running = env->src_stats.nr_running - 1;
> > +		dst_running = env->dst_stats.nr_running + 1;
> > +		imbalance = max(0, dst_running - src_running);
> > +		imbalance = adjust_numa_imbalance(imbalance, src_running);
> > +
> The imbalance could be ignored if src domain is idle enough, and no move
> could be expected.
> 

Again, that hits corner cases. While there is scope for allowing some
degree of imbalance, it needs to be a separate patch on top of this one.
It's something I intend to examine, but only once this series is out of
the way, because the NUMA balancer and load balancer need to be using
similar logic first or it gets a bit fragile.
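
Purely to sketch the direction (the function name and threshold below
are made up, not something in this series): the idea would be to let the
NUMA balancer tolerate a small imbalance when the source node is nearly
idle, rather than refusing the move outright.

/*
 * Hypothetical follow-up, not part of this series: tolerate a small
 * imbalance when the source node is nearly idle so that, for example,
 * a pair of communicating tasks is not forced onto separate nodes.
 */
static long allow_small_numa_imbalance(long imbalance, int src_nr_running)
{
	int threshold = 2;	/* made-up value, for illustration only */

	if (imbalance <= threshold && src_nr_running <= threshold)
		return 0;	/* treat as balanced so the migration is permitted */

	return imbalance;
}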

With this patch, and the series in general, some tasks do fail to
migrate to a CPU local to the memory being accessed even though CPUs
are available there, but having the NUMA balancer and load balancer
override each other is not free either.

Thanks.

-- 
Mel Gorman
SUSE Labs
