linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Phil Auld <pauld@redhat.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Hillf Danton <hdanton@sina.com>, Parth Shah <parth@linux.ibm.com>,
	Rik van Riel <riel@surriel.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
Date: Fri, 3 Jan 2020 14:31:21 +0000	[thread overview]
Message-ID: <20200103143051.GA3027@techsingularity.net> (raw)
In-Reply-To: <CAKfTPtDp624geHEnPmHki70L=ZrBuz6zJG3zW0VFy+_S064Etw@mail.gmail.com>

On Mon, Dec 23, 2019 at 02:31:40PM +0100, Vincent Guittot wrote:
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_adj, imbalance_max;
> > +
> > +                       /*
> > +                        * imbalance_adj is the allowable degree of imbalance
> > +                        * to exist between two NUMA domains. It's calculated
> > +                        * relative to imbalance_pct with a minimum of two
> > +                        * tasks or idle CPUs. The choice of two is due to
> > +                        * the most basic case of two communicating tasks
> > +                        * that should remain on the same NUMA node after
> > +                        * wakeup.
> > +                        */
> > +                       imbalance_adj = max(2U, (busiest->group_weight *
> > +                               (env->sd->imbalance_pct - 100) / 100) >> 1);
> > +
> > +                       /*
> > +                        * Ignore small imbalances unless the busiest sd has
> > +                        * almost half as many busy CPUs as there are
> > +                        * available CPUs in the busiest group. Note that
> > +                        * it is not exactly half as imbalance_adj must be
> > +                        * accounted for or the two domains do not converge
> > +                        * as equally balanced if the number of busy tasks is
> > +                        * roughly the size of one NUMA domain.
> > +                        */
> > +                       imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +                       if (env->imbalance <= imbalance_adj &&
> 
> AFAICT, env->imbalance is undefined there. I have tried your patch
> with the below instead
> 

Even when fixed, other corner cases were hit for parallelised loads that
benefit from spreading early. At the moment I'm testing a variant of the
first approach except it adjust small balances at low utilisation and
otherwise balances at normal. It appears to work for basic communicating
tasks at relatively low utitisation that fits within a node without
obviously impacting higher utilisation non-communicating workloads but
more testing time will be needed to be sure.

It's still based on sum_nr_running as a cut-off instead of idle_cpus as
at low utilisation, there is not much of a material difference in the
cut-offs given that either approach can be misleading depending on the
load of the individual tasks, any cpu binding configurations and the
degree the tasks are memory-bound.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..58ba2f0c6363 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			long imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. imbalance_pct
+			 * is used to estimate the number of active tasks
+			 * needed before memory bandwidth may be as important
+			 * as memory locality.
+			 */
+			imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
+
+			/*
+			 * Allow small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			imbalance_max = imbalance_adj << 1;
+			if (busiest->sum_nr_running < imbalance_max)
+				env->imbalance -= min(env->imbalance, imbalance_adj);
+		}
+
 		return;
 	}
 

-- 
Mel Gorman
SUSE Labs

  parent reply	other threads:[~2020-01-03 14:31 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
2019-12-20 12:40 ` Valentin Schneider
2019-12-20 14:22   ` Mel Gorman
2019-12-20 15:32     ` Valentin Schneider
2019-12-21 11:25   ` Mel Gorman
2019-12-22 12:00 ` Srikar Dronamraju
2019-12-23 13:31 ` Vincent Guittot
2019-12-23 13:41   ` Vincent Guittot
2020-01-03 14:31   ` Mel Gorman [this message]
2020-01-06 13:55     ` Vincent Guittot
2020-01-06 14:52       ` Mel Gorman
2020-01-07  8:38         ` Vincent Guittot
2020-01-07  9:56           ` Mel Gorman
2020-01-07 11:17             ` Vincent Guittot
2020-01-07 11:56               ` Mel Gorman
2020-01-07 16:00                 ` Vincent Guittot
2020-01-07 20:24                   ` Mel Gorman
2020-01-08  8:25                     ` Vincent Guittot
2020-01-08  8:49                       ` Mel Gorman
2020-01-08 13:18                     ` Peter Zijlstra
2020-01-08 14:03                       ` Mel Gorman
2020-01-08 16:46                         ` Vincent Guittot
2020-01-08 18:03                           ` Mel Gorman
2020-01-07 11:22             ` Peter Zijlstra
2020-01-07 11:42               ` Mel Gorman
2020-01-07 12:29                 ` Peter Zijlstra
2020-01-07 12:28               ` Peter Zijlstra
2020-01-07 19:26             ` Phil Auld

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200103143051.GA3027@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=Morten.Rasmussen@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=hdanton@sina.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=parth@linux.ibm.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=riel@surriel.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).