* [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Mel Gorman @ 2020-01-06 14:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML, Mel Gorman

Changelog since V2
o Only allow a small imbalance when utilisation is low to address reports that
  higher utilisation workloads were hitting corner cases.

Changelog since V1
o Alter code flow 						vincent.guittot
o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
o Note that the division is still in place. Without it and taking
  imbalance_adj into account before the cutoff, two NUMA domains
  do not converge as being equally balanced when the number of
  busy tasks equals the size of one domain (50% of the sum).

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, each load balancer decision is fine, but it ignores the
tasks' data locality, so the wakeup and load-balance paths continually
conflict. NUMA balancing is also a factor, but it too simply conflicts
with the load balancer.

This patch allows a degree of imbalance to exist between NUMA domains based
on the imbalance_pct defined by the scheduler domain when utilisation is
very low. This slight imbalance is allowed until the number of running
tasks reaches the point where imbalance_pct may be noticeable.
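
As an illustration of the cutoff this produces -- not part of the patch,
assuming the default SD_NUMA imbalance_pct of 125 and C integer division:

#include <stdio.h>

int main(void)
{
	int imbalance_pct = 125;	/* default for SD_NUMA domains */

	/* Same arithmetic as the hunk below, integer division throughout */
	long imbalance_adj = (100 / (imbalance_pct - 100)) - 1;	/* 3 */
	long imbalance_max = imbalance_adj << 1;		/* 6 */

	printf("adj=%ld max=%ld\n", imbalance_adj, imbalance_max);
	return 0;
}

In other words, an imbalance of up to 3 tasks is tolerated as long as the
busiest group has fewer than 6 running tasks.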

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                       	      baseline              lbnuma-v3
Hmean     64         666.67 (   0.00%)      676.19 *   1.43%*
Hmean     128       1276.33 (   0.00%)     1301.66 *   1.98%*
Hmean     256       2394.38 (   0.00%)     2426.15 *   1.33%*
Hmean     1024      8163.19 (   0.00%)     8510.80 *   4.26%*
Hmean     2048     13102.01 (   0.00%)    13356.81 *   1.94%*
Hmean     3312     18072.51 (   0.00%)    17343.81 *  -4.03%*
Hmean     4096     19967.79 (   0.00%)    19451.95 *  -2.58%*
Hmean     8192     28458.73 (   0.00%)    28000.46 *  -1.61%*
Hmean     16384    30295.77 (   0.00%)    29145.59 *  -3.80%*
Stddev    64           5.07 (   0.00%)        0.33 (  93.55%)
Stddev    128         15.24 (   0.00%)        2.45 (  83.90%)
Stddev    256         23.90 (   0.00%)       10.62 (  55.58%)
Stddev    1024        57.66 (   0.00%)       19.20 (  66.70%)
Stddev    2048        97.52 (   0.00%)       43.19 (  55.71%)
Stddev    3312       344.31 (   0.00%)       86.02 (  75.02%)
Stddev    4096       177.54 (   0.00%)       60.12 (  66.13%)
Stddev    8192       423.47 (   0.00%)      239.01 (  43.56%)
Stddev    16384      183.18 (   0.00%)       96.91 (  47.09%)

Fairly small impact on average performance but note how much the standard
deviation is reduced.

Ops NUMA base-page range updates       22576.00         336.00
Ops NUMA PTE updates                   22576.00         336.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                   18487.00         162.00
Ops NUMA hint local faults %           10662.00          96.00
Ops NUMA hint local percent               57.67          59.26
Ops NUMA pages migrated                 4365.00          66.00

Without the patch, only 57.67% of sampled accesses are local.  In an
earlier changelog, 100% of sampled accesses were local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rate was 59.26%, but note the "base-page range updates"
and "PTE updates".  The activity with the patch is negligible, as was
the number of faults. The small number of pages migrated were related to
shared libraries.  A 2-socket Broadwell showed better results on average,
but they are not presented for brevity as the performance was similar
except that 100% of the sampled NUMA hints were local. The patch holds up
for a 4-socket Haswell box and an AMD Epyc 2 based machine.

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.

Lat 50.00th-qrtle-1        44.00 (   0.00%)       36.00 (  18.18%)
Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
Lat 99.00th-qrtle-1        76.00 (   0.00%)       52.00 (  31.58%)
Lat 99.50th-qrtle-1        89.00 (   0.00%)       53.00 (  40.45%)
Lat 99.90th-qrtle-1        98.00 (   0.00%)       59.00 (  39.80%)
Lat 50.00th-qrtle-2        42.00 (   0.00%)       41.00 (   2.38%)
Lat 75.00th-qrtle-2        48.00 (   0.00%)       45.00 (   6.25%)
Lat 90.00th-qrtle-2        53.00 (   0.00%)       51.00 (   3.77%)
Lat 95.00th-qrtle-2        55.00 (   0.00%)       52.00 (   5.45%)
Lat 99.00th-qrtle-2        62.00 (   0.00%)       59.00 (   4.84%)
Lat 99.50th-qrtle-2        63.00 (   0.00%)       61.00 (   3.17%)
Lat 99.90th-qrtle-2        68.00 (   0.00%)       65.00 (   4.41%)

For higher numbers of worker threads, the differences become negligible,
but it is interesting to note the difference in wakeup latency at low
utilisation; mpstat confirms that activity was almost all on one node
until the number of worker threads increased.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch, which allowed
imbalances at higher degrees of utilisation. perf bench pipe showed
negligible differences in overall performance.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is not detected as the number of threads used was always high enough
that normal load balancing was applied. Similarly, there were reports of
regressions for the autonuma benchmark against earlier versions but again,
normal load balancing now applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations when a machine is lightly loaded, where the impact of load
balancing across NUMA nodes can be noticeable. For low utilisation
workloads, this patch
generally behaves better with less NUMA balancing activity. For high
utilisation, there is no change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..e541b700e339 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8681,7 +8677,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 
 			return;
 		}
-
 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
 			unsigned int nr_diff = busiest->sum_nr_running;
 			/*
@@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			long imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. imbalance_pct
+			 * is used to estimate the number of active tasks
+			 * needed before memory bandwidth may be as important
+			 * as memory locality.
+			 */
+			imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
+
+			/*
+			 * Allow small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			imbalance_max = imbalance_adj << 1;
+			if (busiest->sum_nr_running < imbalance_max)
+				env->imbalance -= min(env->imbalance, imbalance_adj);
+		}
+
 		return;
 	}
 

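The effect of the hunk above can be reproduced outside the kernel with a
small harness (a sketch only: struct group_stats and numa_imbalance() are
stand-in names, not kernel code):

#include <stdio.h>

/* Stand-in for the sg_lb_stats fields the hunk above reads */
struct group_stats {
	unsigned int sum_nr_running;
	unsigned int idle_cpus;
};

/*
 * Mirrors the group_has_spare path for a NUMA domain with group_weight > 1:
 * even out the idle CPUs, then forgive a small imbalance when the busiest
 * group has low utilisation.
 */
static long numa_imbalance(const struct group_stats *local,
			   const struct group_stats *busiest,
			   int imbalance_pct)
{
	long imbalance = (long)local->idle_cpus - (long)busiest->idle_cpus;
	long imbalance_adj, imbalance_max;

	imbalance = imbalance > 0 ? imbalance >> 1 : 0;

	imbalance_adj = (100 / (imbalance_pct - 100)) - 1;
	imbalance_max = imbalance_adj << 1;
	if (busiest->sum_nr_running < imbalance_max)
		imbalance -= imbalance < imbalance_adj ? imbalance : imbalance_adj;

	return imbalance;
}

int main(void)
{
	/* Two communicating tasks on one node of a 2-socket, 24 CPUs-per-node box */
	struct group_stats local   = { .sum_nr_running = 0, .idle_cpus = 24 };
	struct group_stats busiest = { .sum_nr_running = 2, .idle_cpus = 22 };

	/* Prints imbalance=0: the two tasks are left on their node */
	printf("imbalance=%ld\n", numa_imbalance(&local, &busiest, 125));
	return 0;
}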

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Rik van Riel @ 2020-01-06 15:47 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, LKML


On Mon, 2020-01-06 at 14:42 +0000, Mel Gorman wrote:
> 
> +
> +		/* Consider allowing a small imbalance between NUMA
> groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			long imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of
> imbalance
> +			 * to exist between two NUMA domains.
> imbalance_pct
> +			 * is used to estimate the number of active
> tasks
> +			 * needed before memory bandwidth may be as
> important
> +			 * as memory locality.
> +			 */
> +			imbalance_adj = (100 / (env->sd->imbalance_pct
> - 100)) - 1;
> +
> +			/*
> +			 * Allow small imbalances when the busiest
> group has
> +			 * low utilisation.
> +			 */
> +			imbalance_max = imbalance_adj << 1;
> +			if (busiest->sum_nr_running < imbalance_max)
> +				env->imbalance -= min(env->imbalance,
> imbalance_adj);
> +		}
> +

Wait, so imbalance_max is a function only of
env->sd->imbalance_pct, and it gets compared
against busiest->sum_nr_running, which is related
to the number of CPUs in the node?

Let's see how this works out for different numbers
of CPUs in each node.

With imbalance_pct == 115, we end up with 
imbalance_adj = (100 / (115 - 100)) - 1 = 0.18,
which gets rounded down to 0.

Wait, never mind the different numbers of CPUs.

Am I overlooking something, or did you mean to have
a factor of number of CPUs in the imbalance_adj
calculation?

kind regards,

Rik
-- 
All Rights Reversed.



* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Mel Gorman @ 2020-01-06 16:33 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, LKML

On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > +			imbalance_adj = (100 / (env->sd->imbalance_pct
> > - 100)) - 1;
> > +
> > +			/*
> > +			 * Allow small imbalances when the busiest
> > group has
> > +			 * low utilisation.
> > +			 */
> > +			imbalance_max = imbalance_adj << 1;
> > +			if (busiest->sum_nr_running < imbalance_max)
> > +				env->imbalance -= min(env->imbalance,
> > imbalance_adj);
> > +		}
> > +
> 
> Wait, so imbalance_max is a function only of
> env->sd->imbalance_pct, and it gets compared
> against busiest->sum_nr_running, which is related
> to the number of CPUs in the node?
> 

It's not directly related to the number of CPUs in the node. Are you
thinking of busiest->group_weight?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Rik van Riel @ 2020-01-06 16:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, LKML


On Mon, 2020-01-06 at 16:33 +0000, Mel Gorman wrote:
> On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > > +			imbalance_adj = (100 / (env->sd->imbalance_pct
> > > - 100)) - 1;
> > > +
> > > +			/*
> > > +			 * Allow small imbalances when the busiest
> > > group has
> > > +			 * low utilisation.
> > > +			 */
> > > +			imbalance_max = imbalance_adj << 1;
> > > +			if (busiest->sum_nr_running < imbalance_max)
> > > +				env->imbalance -= min(env->imbalance,
> > > imbalance_adj);
> > > +		}
> > > +
> > 
> > Wait, so imbalance_max is a function only of
> > env->sd->imbalance_pct, and it gets compared
> > against busiest->sum_nr_running, which is related
> > to the number of CPUs in the node?
> > 
> 
> It's not directly related to the number of CPUs in the node. Are you
> thinking of busiest->group_weight?

I am, because as it is right now that if condition
looks like it might never be true for imbalance_pct 115.

Presumably you put that check there for a reason, and
would like it to trigger when the amount by which a node
is busy is less than 2 * (imbalance_pct - 100).

-- 
All Rights Reversed.



* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Mel Gorman @ 2020-01-06 17:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, LKML

On Mon, Jan 06, 2020 at 11:44:57AM -0500, Rik van Riel wrote:
> On Mon, 2020-01-06 at 16:33 +0000, Mel Gorman wrote:
> > On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > > > +			imbalance_adj = (100 / (env->sd->imbalance_pct
> > > > - 100)) - 1;
> > > > +
> > > > +			/*
> > > > +			 * Allow small imbalances when the busiest
> > > > group has
> > > > +			 * low utilisation.
> > > > +			 */
> > > > +			imbalance_max = imbalance_adj << 1;
> > > > +			if (busiest->sum_nr_running < imbalance_max)
> > > > +				env->imbalance -= min(env->imbalance,
> > > > imbalance_adj);
> > > > +		}
> > > > +
> > > 
> > > Wait, so imbalance_max is a function only of
> > > env->sd->imbalance_pct, and it gets compared
> > > against busiest->sum_nr_running, which is related
> > > to the number of CPUs in the node?
> > > 
> > 
> > It's not directly related to the number of CPUs in the node. Are you
> > thinking of busiest->group_weight?
> 
> I am, because as it is right now that if condition
> looks like it might never be true for imbalance_pct 115.
> 

True but while imbalance_pct has the possibility of being something
other than 125 for SD_NUMA, I'm not aware of a case where it happens.
If/when it does, it would be worth reconsidering the threshold.
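
For reference, a quick back-of-envelope check of the cutoff for other
imbalance_pct values, assuming the C integer division the patch uses
(illustrative only):

#include <stdio.h>

int main(void)
{
	/* How imbalance_adj and imbalance_max vary with imbalance_pct */
	int pcts[] = { 115, 125, 150, 200 };

	for (int i = 0; i < 4; i++) {
		long adj = (100 / (pcts[i] - 100)) - 1;

		/*
		 * 115 -> 5/10, 125 -> 3/6, 150 -> 1/2, 200 -> 0/0; the
		 * cutoff only collapses to zero for imbalance_pct > 150.
		 */
		printf("pct=%d adj=%ld max=%ld\n", pcts[i], adj, adj << 1);
	}
	return 0;
}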

> Presumably you put that check there for a reason, and
> would like it to trigger when the amount by which a node
> is busy is less than 2 * (imbalance_pct - 100).

Yes, it's there for a reason. The intent is to only allow the imbalance for
low utilisation. Too many corner cases were hit otherwise -- utilisation
near a node's capacity, highly parallelised workloads wanting to balance
as quickly as possible etc. In this version, the only case that really
is being handled is one where the utilisation of a NUMA machine is low
which happens often enough to be interesting.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Vincent Guittot @ 2020-01-07  8:44 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Rik van Riel, Mel Gorman, Ingo Molnar, Peter Zijlstra, Phil Auld,
	Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Parth Shah, LKML

On Tue, 7 Jan 2020 at 02:51, Hillf Danton <hdanton@sina.com> wrote:
>
>
> Hi Folks
>
> On Mon, 06 Jan 2020 11:44:57 -0500 Rik van Riel wrote:
> > On Mon, 2020-01-06 at 16:33 +0000, Mel Gorman wrote:
> > > On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > > > > +                       imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> > > > > +
> > > > > +                       /*
> > > > > +                        * Allow small imbalances when the busiest group has
> > > > > +                        * low utilisation.
> > > > > +                        */
> > > > > +                       imbalance_max = imbalance_adj << 1;
> > > > > +                       if (busiest->sum_nr_running < imbalance_max)
> > > > > +                               env->imbalance -= min(env->imbalance, imbalance_adj);
> > > > > +               }
> > > > > +
> > > >
> > > > Wait, so imbalance_max is a function only of
> > > > env->sd->imbalance_pct, and it gets compared
> > > > against busiest->sum_nr_running, which is related
> > > > to the number of CPUs in the node?
> > > >
> > >
> > > It's not directly related to the number of CPUs in the node. Are you
> > > thinking of busiest->group_weight?
> >
> > I am, because as it is right now that if condition
> > looks like it might never be true for imbalance_pct 115.
> >
> > Presumably you put that check there for a reason, and
> > would like it to trigger when the amount by which a node
> > is busy is less than 2 * (imbalance_pct - 100).
>
>
> If three per cent can make any sense in helping determine utilisation
> low then the busy load has to meet
>
>         busiest->sum_nr_running < max(3, cpus in the node / 32);

I agree that we should take into account the size or the utilization
of the node in some way

>
> And we can't skip pulling tasks from a numa node without comparing it
> to the local load
>
>         local->sum_nr_running * env->sd->imbalance_pct <
>         busiest->sum_nr_running * 100;

A similar test has already been done in find_busiest_group() and we call
calculate_imbalance() only if busiest is busier than local.

>
> with imbalance_pct taken into account.
>
> Hillf
>


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Mel Gorman @ 2020-01-07  9:12 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Rik van Riel, Vincent Guittot, Ingo Molnar, Peter Zijlstra,
	pauld, valentin.schneider, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, parth, LKML

On Tue, Jan 07, 2020 at 09:51:11AM +0800, Hillf Danton wrote:
> 
> Hi Folks
> 
> On Mon, 06 Jan 2020 11:44:57 -0500 Rik van Riel wrote:
> > On Mon, 2020-01-06 at 16:33 +0000, Mel Gorman wrote:
> > > On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > > > > +			imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> > > > > +
> > > > > +			/*
> > > > > +			 * Allow small imbalances when the busiest group has
> > > > > +			 * low utilisation.
> > > > > +			 */
> > > > > +			imbalance_max = imbalance_adj << 1;
> > > > > +			if (busiest->sum_nr_running < imbalance_max)
> > > > > +				env->imbalance -= min(env->imbalance, imbalance_adj);
> > > > > +		}
> > > > > +
> > > >
> > > > Wait, so imbalance_max is a function only of
> > > > env->sd->imbalance_pct, and it gets compared
> > > > against busiest->sum_nr_running, which is related
> > > > to the number of CPUs in the node?
> > > >
> > >
> > > It's not directly related to the number of CPUs in the node. Are you
> > > thinking of busiest->group_weight?
> > 
> > I am, because as it is right now that if condition
> > looks like it might never be true for imbalance_pct 115.
> > 
> > Presumably you put that check there for a reason, and
> > would like it to trigger when the amount by which a node
> > is busy is less than 2 * (imbalance_pct - 100).
> 
> 
> If three per cent can make any sense in helping determine utilisation
> low then the busy load has to meet
> 
> 	busiest->sum_nr_running < max(3, cpus in the node / 32);
> 

Why 3% and why would the low utilisation cut-off depend on the number of
CPUs in the node? That simply means that the cut-off scales to machine
size and does not take into account any trade-off between local memory
latency and memory bandwidth.

> And we can't skip pulling tasks from a numa node without comparing it
> to the local load
> 
> 	local->sum_nr_running * env->sd->imbalance_pct <
> 	busiest->sum_nr_running * 100;
> 
> with imbalance_pct taken into account.
> 

Again, why? In this context, an imbalance has already been calculated
and whether based on running tasks or idle CPUs, it's not a negative
number. The imbalance_adj used has already accounted for imbalance_pct,
albeit not as a ratio as it is normally used.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Vincent Guittot @ 2020-01-07  9:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hillf Danton, Rik van Riel, Ingo Molnar, Peter Zijlstra,
	Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Parth Shah, LKML

On Tue, 7 Jan 2020 at 10:12, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 07, 2020 at 09:51:11AM +0800, Hillf Danton wrote:
> >
> > Hi Folks
> >
> > On Mon, 06 Jan 2020 11:44:57 -0500 Rik van Riel wrote:
> > > On Mon, 2020-01-06 at 16:33 +0000, Mel Gorman wrote:
> > > > On Mon, Jan 06, 2020 at 10:47:18AM -0500, Rik van Riel wrote:
> > > > > > +                     imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> > > > > > +
> > > > > > +                     /*
> > > > > > +                      * Allow small imbalances when the busiest group has
> > > > > > +                      * low utilisation.
> > > > > > +                      */
> > > > > > +                     imbalance_max = imbalance_adj << 1;
> > > > > > +                     if (busiest->sum_nr_running < imbalance_max)
> > > > > > +                             env->imbalance -= min(env->imbalance, imbalance_adj);
> > > > > > +             }
> > > > > > +
> > > > >
> > > > > Wait, so imbalance_max is a function only of
> > > > > env->sd->imbalance_pct, and it gets compared
> > > > > against busiest->sum_nr_running, which is related
> > > > > to the number of CPUs in the node?
> > > > >
> > > >
> > > > It's not directly related to the number of CPUs in the node. Are you
> > > > thinking of busiest->group_weight?
> > >
> > > I am, because as it is right now that if condition
> > > looks like it might never be true for imbalance_pct 115.
> > >
> > > Presumably you put that check there for a reason, and
> > > would like it to trigger when the amount by which a node
> > > is busy is less than 2 * (imbalance_pct - 100).
> >
> >
> > If three per cent can make any sense in helping determine utilisation
> > low then the busy load has to meet
> >
> >       busiest->sum_nr_running < max(3, cpus in the node / 32);
> >
>
> Why 3% and why would the low utilisation cut-off depend on the number of

But in the same way, why only 6 tasks, which is the value with the
default imbalance_pct?
I expect a machine with 128 CPUs to have more bandwidth than a machine
with only 32 CPUs and, as a result, to allow more imbalance.

Maybe the number of running tasks (or idle CPUs) is not the right
metric for deciding whether we can allow a small degree of imbalance,
because it doesn't take into account whether the tasks are long running
or short running ones.

> CPUs in the node? That simply means that the cut-off scales to machine
> size and does not take into account any consideration between local memory
> latency and memory bandwidth.
>
> > And we can't skip pulling tasks from a numa node without comparing it
> > to the local load
> >
> >       local->sum_nr_running * env->sd->imbalance_pct <
> >       busiest->sum_nr_running * 100;
> >
> > with imbalance_pct taken into account.
> >
>
> Again, why? In this context, an imbalance has already been calculated
> and whether based on running tasks or idle CPUs, it's not a negative
> number. The imbalance_adj used as already accounted for imbalance_pct
> albeit not as a ratio as it's normally used.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Mel Gorman @ 2020-01-07 10:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Hillf Danton, Rik van Riel, Ingo Molnar, Peter Zijlstra,
	Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret,
	Dietmar Eggemann, Morten Rasmussen, Parth Shah, LKML

On Tue, Jan 07, 2020 at 10:43:08AM +0100, Vincent Guittot wrote:
> > > > > It's not directly related to the number of CPUs in the node. Are you
> > > > > thinking of busiest->group_weight?
> > > >
> > > > I am, because as it is right now that if condition
> > > > looks like it might never be true for imbalance_pct 115.
> > > >
> > > > Presumably you put that check there for a reason, and
> > > > would like it to trigger when the amount by which a node
> > > > is busy is less than 2 * (imbalance_pct - 100).
> > >
> > >
> > > If three per cent can make any sense in helping determine utilisation
> > > low then the busy load has to meet
> > >
> > >       busiest->sum_nr_running < max(3, cpus in the node / 32);
> > >
> >
> > Why 3% and why would the low utilisation cut-off depend on the number of
> 
> But in the same way, why only 6 tasks ? which is the value with
> default imbalance_pct ?

I laid this out in another mail sent based on timing so I would repeat
myself other than to say it's predictable across machines.

> I expect a machine with 128 CPUs to have more bandwidth than a machine
> with only 32 CPUs and as a result to allow more imbalance
> 

I would expect so too with the caveat that there can be more memory
channels within a node so positioning does matter but we can't take
everything into account without creating a convoluted mess. Worse, we have
no decent method for estimating bandwidth as it depends on the reference
pattern and scheduler domains do not currently take memory channels into
account. Maybe they should but that's a whole different discussion that
we do not want to get into right now.

> Maybe the number of running tasks (or idle cpus) is not the right
> metrics to choose if we can allow a small degree of imbalance because
> this doesn't take into account it wether the tasks are long running or
> short running ones
> 

I think running tasks is at least the least bad metric. Idle CPUs get
caught up in corner cases with bindings and util_avg can be skewed by
outliers. Running tasks is a sensible starting point until there is a
concrete use case that shows it is unworkable. Let's see what you think of
the other untested patch I posted that takes the group weight and child
domain weight into account.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3
From: Valentin Schneider @ 2020-01-08 15:49 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Hillf Danton, Rik van Riel, Ingo Molnar, Peter Zijlstra,
	Phil Auld, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Parth Shah, LKML

On 07/01/2020 10:16, Mel Gorman wrote:
> I think running tasks at least is the least bad metric. idle CPUs gets
> caught up in corner cases with bindings and util_avg can be skewed by
> outliers. Running tasks is a sensible starting point until there is a
> concrete use case that shows it is unworkable.

I'd tend to agree with you here. Also, this being in the group_has_spare
imbalance type, we have some guarantee that the group is not overutilized.

If we keep some threshold of 'sum_nr_running < group_weight / 2', then this
"naturally" puts a hard limit of 50% total group utilization (corner case
where all tasks are 100% util).
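
A sketch of that alternative cutoff (hypothetical code with a made-up
helper name, just to make the bound explicit):

#include <stdio.h>

/*
 * Scale the low-utilisation cutoff with the group size rather than
 * with imbalance_pct: tolerate the imbalance only while fewer than
 * half of the CPUs in the group are busy.
 */
static int allow_numa_imbalance(unsigned int sum_nr_running,
				unsigned int group_weight)
{
	return sum_nr_running < group_weight / 2;
}

int main(void)
{
	/* On a 24 CPU node the imbalance is tolerated up to 11 running tasks */
	printf("%d\n", allow_numa_imbalance(2, 24));	/* prints 1 */
	return 0;
}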

> Let's see what you think of
> the other untested patch I posted that takes the group weight and child
> domain weight into account.
> 

