* [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
@ 2018-09-07 10:11 Mel Gorman
  2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman

Srikar had an automatic NUMA balancing series merged during the 4.19 window
and there are some issues I missed during review that this series addresses.

Patches 1-2 are simply removing redundant code and calculations that are
never used.

Patch 3 makes the observation that we can call select_idle_sibling during
NUMA placement several times for each idle CPU on a socket.  The patch
stops the search on the first idle CPU. While there is the risk that
parallel callers will stack on the same idle CPU, the current code has
the same problem.

Patch 4 is the one that needs the most examination. I believe the intent of
load_too_imbalanced was to stop automatic NUMA balancing from fighting the load
balancer. Unfortunately, when a machine is lightly utilised there are idle
CPUs and tasks are allowed to migrate even though the load is imbalanced.
In some cases, this limits memory bandwidth and can perform worse even if
data locality is fine. The patch corrects the condition.

If they pass review, I would suggest considering them as fixes for 4.19
instead of deferring them to 4.20.

 kernel/sched/fair.c | 35 ++++++++---------------------------
 1 file changed, 8 insertions(+), 27 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field
  2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
  2018-09-07 10:11 ` [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman

The nr_running field has not been used since commit 2d4056fafa19
("sched/numa: Remove numa_has_capacity()") so remove it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..2472aeaff92e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1454,8 +1454,6 @@ struct numa_stats {
 
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
-
-	unsigned int nr_running;
 };
 
 /*
@@ -1470,7 +1468,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
-		ns->nr_running += rq->nr_running;
 		ns->load += weighted_cpuload(rq);
 		ns->compute_capacity += capacity_of(cpu);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats
  2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
  2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
  2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman

Commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") removed
the has_free_capacity field but did not remove calculations
related to it in update_numa_stats. This patch removes the unused
code.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2472aeaff92e..5b2f1684e96e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1461,8 +1461,7 @@ struct numa_stats {
  */
 static void update_numa_stats(struct numa_stats *ns, int nid)
 {
-	int smt, cpu, cpus = 0;
-	unsigned long capacity;
+	int cpu;
 
 	memset(ns, 0, sizeof(*ns));
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
@@ -1470,26 +1469,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 
 		ns->load += weighted_cpuload(rq);
 		ns->compute_capacity += capacity_of(cpu);
-
-		cpus++;
 	}
-
-	/*
-	 * If we raced with hotplug and there are no CPUs left in our mask
-	 * the @ns structure is NULL'ed and task_numa_compare() will
-	 * not find this node attractive.
-	 *
-	 * We'll detect a huge imbalance and bail there.
-	 */
-	if (!cpus)
-		return;
-
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
-	capacity = cpus / smt; /* cores */
-
-	capacity = min_t(unsigned, capacity,
-		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
 }
 
 struct task_numa_env {
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
  2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
  2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
  2018-09-07 10:11 ` [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
  2018-09-07 13:05   ` Srikar Dronamraju
  2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
  2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
  4 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman

task_numa_migrate is responsible for finding a core on a preferred NUMA
node for a task. As part of this, task_numa_find_cpu iterates through
the CPUs of a node and evaluates CPUs, both idle and with running tasks,
as placement candidates. Generally though, any idle CPU is equivalent in
terms of improving imbalances and a search after finding one is pointless.
This patch stops examining CPUs on a node if an idle CPU is considered
suitable.

While there are some workloads that show minor gains and losses, they are
mostly within the noise with the exception of specjbb whether running
as one large VM or one VM per socket. The following was reported on a
two-socket Haswell machine with 24 cores per socket.

specjbb, one JVM per socket (2 total)
                              4.19.0-rc1             4.19.0-rc1
                                 vanilla           oneselect-v1
Hmean     tput-1     42258.43 (   0.00%)    43692.10 (   3.39%)
Hmean     tput-2     87811.26 (   0.00%)    93719.52 (   6.73%)
Hmean     tput-3    138100.56 (   0.00%)   143484.08 (   3.90%)
Hmean     tput-4    181061.51 (   0.00%)   191292.99 (   5.65%)
Hmean     tput-5    225577.34 (   0.00%)   233439.58 (   3.49%)
Hmean     tput-6    264763.44 (   0.00%)   270634.50 (   2.22%)
Hmean     tput-7    301458.48 (   0.00%)   314133.32 (   4.20%)
Hmean     tput-8    348364.50 (   0.00%)   358445.76 (   2.89%)
Hmean     tput-9    382129.65 (   0.00%)   403288.75 (   5.54%)
Hmean     tput-10   403566.70 (   0.00%)   444592.51 (  10.17%)
Hmean     tput-11   456967.43 (   0.00%)   483300.45 (   5.76%)
Hmean     tput-12   502295.98 (   0.00%)   526281.53 (   4.78%)
Hmean     tput-13   441284.41 (   0.00%)   535507.75 (  21.35%)
Hmean     tput-14   461478.57 (   0.00%)   542068.97 (  17.46%)
Hmean     tput-15   489725.29 (   0.00%)   545033.17 (  11.29%)
Hmean     tput-16   503726.56 (   0.00%)   549738.23 (   9.13%)
Hmean     tput-17   528650.57 (   0.00%)   550849.00 (   4.20%)
Hmean     tput-18   518065.41 (   0.00%)   550018.29 (   6.17%)
Hmean     tput-19   527412.99 (   0.00%)   550652.26 (   4.41%)
Hmean     tput-20   528166.25 (   0.00%)   545783.85 (   3.34%)
Hmean     tput-21   524669.70 (   0.00%)   544848.37 (   3.85%)
Hmean     tput-22   519010.38 (   0.00%)   539603.70 (   3.97%)
Hmean     tput-23   514947.43 (   0.00%)   534714.32 (   3.84%)
Hmean     tput-24   517953.29 (   0.00%)   531783.24 (   2.67%)

Coefficient of variance is roughly 0-3% depending on the warehouse count
so these results are generally outside of the noise. Note that the biggest
improvements are when a socket would be roughly half loaded. It's not
especially obvious why this would be true given that without the patch the
socket is scanned anyway but it may be cache miss related. On a 2-socket
Broadwell machine, the same observation was made in that the biggest
benefit was when a socket was half-loaded. If a single JVM is used for
the entire machine, the biggest benefit was also when the machine was
half-utilised.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b2f1684e96e..d59d3e00a480 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,7 +1535,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
 			      long taskimp, long groupimp, bool maymove)
 {
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
@@ -1545,6 +1545,7 @@ static void task_numa_compare(struct task_numa_env *env,
 	long imp = env->p->numa_group ? groupimp : taskimp;
 	long moveimp = imp;
 	int dist = env->dist;
+	bool dst_idle = false;
 
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
@@ -1638,11 +1639,13 @@ static void task_numa_compare(struct task_numa_env *env,
 		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
 						   env->dst_cpu);
 		local_irq_enable();
+		dst_idle = true;
 	}
 
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
+	return dst_idle;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1668,7 +1671,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		if (task_numa_compare(env, taskimp, groupimp, maymove))
+			break;
 	}
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
                   ` (2 preceding siblings ...)
  2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
  2018-09-07 11:33   ` Peter Zijlstra
  2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
  4 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman

Commit 305c1fac3225 ("sched/numa: Evaluate move once per node")
restructured how task_numa_compare evaluates load but there is an anomaly.
task_numa_find_cpu() checks if the load balance between two nodes is too
imbalanced with the intent of only swapping tasks if it would improve
the balance overall. However, if an idle CPU is encountered, the task is
still moved if it's the best improvement, but an idle CPU is always going
to appear to be the best improvement.
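
To make the condition concrete, the pre- and post-patch behaviour for an
idle destination CPU can be modelled with a small standalone C sketch
(illustrative only; struct env_model and the helper names are invented for
the example, the real logic lives in task_numa_compare(), see the hunk
below):

  #include <stdbool.h>

  struct env_model {
          long best_imp;  /* best improvement found so far on this node */
  };

  /*
   * Pre-patch: an idle CPU is taken either when the move keeps the nodes
   * balanced (maymove) or when imp beats best_imp.  As noted above, an
   * idle CPU always appears to be the best improvement, so the imbalance
   * check is effectively bypassed.
   */
  static bool take_idle_cpu_before(const struct env_model *env,
                                   bool maymove, long imp)
  {
          return maymove || imp > env->best_imp;
  }

  /* Post-patch: only move to the idle CPU if the move keeps the load
   * between the two nodes balanced. */
  static bool take_idle_cpu_after(const struct env_model *env,
                                  bool maymove, long imp)
  {
          (void)env;
          (void)imp;
          return maymove;
  }

  int main(void)
  {
          struct env_model env = { .best_imp = 0 };

          /* Lightly loaded machine: idle CPUs exist but a move would
           * leave the nodes imbalanced, so maymove is false. */
          bool moved_before = take_idle_cpu_before(&env, false, 100); /* true  */
          bool moved_after  = take_idle_cpu_after(&env, false, 100);  /* false */

          return (moved_before && !moved_after) ? 0 : 1;
  }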

If a machine is lightly loaded such that all tasks can fit on one node then
the idle CPUs are found and the tasks migrate to one socket.  From a NUMA
perspective, this seems intuitively great because memory accesses are all
local but there are two counter-intuitive effects.

First, the load balancer may move tasks so the machine is more evenly
utilised and conflict with automatic NUMA balancing which may respond by
scanning more frequently and increasing overhead.  Second, sockets usually
have their own memory channels so using one socket means that fewer
channels are available yielding less memory bandwidth overall. For
memory-bound tasks, it can be beneficial to migrate to another socket and
migrate the data to increase bandwidth even though the accesses are remote
in the short term.

The second observation is not universally true for all workloads but some
of the computational kernels of NAS benefit when parallelised with OpenMP.

NAS C class 2-socket
                         4.19.0-rc1             4.19.0-rc1
                    oneselect-v1r18           nomove-v1r19
Amean     bt       62.26 (   0.00%)       53.03 (  14.83%)
Amean     cg       27.85 (   0.00%)       27.82 (   0.09%)
Amean     ep        8.94 (   0.00%)        8.58 (   4.09%)
Amean     ft       11.89 (   0.00%)       12.00 (  -0.93%)
Amean     is        0.87 (   0.00%)        0.86 (   0.92%)
Amean     lu       41.77 (   0.00%)       38.95 (   6.76%)
Amean     mg        5.30 (   0.00%)        5.26 (   0.64%)
Amean     sp      105.39 (   0.00%)       63.80 (  39.46%)
Amean     ua       47.42 (   0.00%)       43.99 (   7.24%)

Active balancing for NUMA still happens but it is greatly reduced. When
running with D class (so it runs longer), the relevant unpatched stats are

           3773.21	Elapsed time in seconds
            489.24	Mops/sec/thread
            38,918      cpu-migrations
         3,817,238      page-faults
            11,197      sched:sched_move_numa
                 0      sched:sched_stick_numa
                23      sched:sched_swap_numa

With the patch applied

           2037.92      Elapsed time in seconds
            905.83      Mops/sec/thread
               147      cpu-migrations
           552,529      page-faults
                26      sched:sched_move_numa
                 0      sched:sched_stick_numa
                16      sched:sched_swap_numa

Note the large drop in CPU migrations, the calls to sched_move_numa and
page faults.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59d3e00a480..d4c289c11012 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 	if (!cur) {
-		if (maymove || imp > env->best_imp)
+		if (maymove)
 			goto assign;
 		else
 			goto unlock;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
  2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
                   ` (3 preceding siblings ...)
  2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
@ 2018-09-07 11:24 ` Peter Zijlstra
  2018-09-07 12:29   ` Mel Gorman
  4 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML, Vincent Guittot

On Fri, Sep 07, 2018 at 11:11:35AM +0100, Mel Gorman wrote:
> Srikar had an automatic NUMA balancing series merged during the 4.19 window
> and there are some issues I missed during review that this series addresses.
> 
> Patches 1-2 are simply removing redundant code and calculations that are
> never used.

I already have those queued, from Vincent:

  https://lkml.kernel.org/r/1535548752-4434-1-git-send-email-vincent.guittot@linaro.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
@ 2018-09-07 11:33   ` Peter Zijlstra
  2018-09-07 12:37     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:33 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 11:11:39AM +0100, Mel Gorman wrote:
> Commit 305c1fac3225 ("sched/numa: Evaluate move once per node")
> restructured how task_numa_compare evaluates load but there is an anomaly.
> task_numa_find_cpu() checks if the load balance between two nodes is too
> imbalanced with the intent of only swapping tasks if it would improve
> the balance overall. However, if an idle CPU is encountered, the task is
> still moved if it's the best improvement, but an idle CPU is always going
> to appear to be the best improvement.
> 
> If a machine is lightly loaded such that all tasks can fit on one node then
> the idle CPUs are found and the tasks migrate to one socket.  From a NUMA
> perspective, this seems intuitively great because memory accesses are all
> local but there are two counter-intuitive effects.
> 
> First, the load balancer may move tasks so the machine is more evenly
> utilised and conflict with automatic NUMA balancing which may respond by
> scanning more frequently and increasing overhead.  Second, sockets usually
> have their own memory channels so using one socket means that fewer
> channels are available yielding less memory bandwidth overall. For
> memory-bound tasks, it can be beneficial to migrate to another socket and
> migrate the data to increase bandwidth even though the accesses are remote
> in the short term.

> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d59d3e00a480..d4c289c11012 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
>  		goto unlock;
>  
>  	if (!cur) {
> -		if (maymove || imp > env->best_imp)
> +		if (maymove)
>  			goto assign;
>  		else
>  			goto unlock;

Srikar's patch here:

  http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com

Also frobs this condition, but in a less radical way. Does that yield
similar results?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
  2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
@ 2018-09-07 12:29   ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML, Vincent Guittot

On Fri, Sep 07, 2018 at 01:24:55PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 11:11:35AM +0100, Mel Gorman wrote:
> > Srikar had an automatic NUMA balancing series merged during the 4.19 window
> > and there are some issues I missed during review that this series addresses.
> > 
> > Patches 1-2 are simply removing redundant code and calculations that are
> > never used.
> 
> I already have those queued, from Vincent:
> 
>   https://lkml.kernel.org/r/1535548752-4434-1-git-send-email-vincent.guittot@linaro.org

They are exactly equivalent. Consider them acked!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 11:33   ` Peter Zijlstra
@ 2018-09-07 12:37     ` Mel Gorman
  2018-09-07 12:44       ` Peter Zijlstra
  2018-09-10  9:41       ` Mel Gorman
  0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 12:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d59d3e00a480..d4c289c11012 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> >  		goto unlock;
> >  
> >  	if (!cur) {
> > -		if (maymove || imp > env->best_imp)
> > +		if (maymove)
> >  			goto assign;
> >  		else
> >  			goto unlock;
> 
> Srikar's patch here:
> 
>   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 
> Also frobs this condition, but in a less radical way. Does that yield
> similar results?

I can check. I do wonder of course if the less radical approach just means
that automatic NUMA balancing and the load balancer simply disagree about
placement at a different time. It'll take a few days to have an answer as
the battery of workloads to check this take ages.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 12:37     ` Mel Gorman
@ 2018-09-07 12:44       ` Peter Zijlstra
  2018-09-07 13:42         ` Srikar Dronamraju
  2018-09-10  9:41       ` Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > > ---
> > >  kernel/sched/fair.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d59d3e00a480..d4c289c11012 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > >  		goto unlock;
> > >  
> > >  	if (!cur) {
> > > -		if (maymove || imp > env->best_imp)
> > > +		if (maymove)
> > >  			goto assign;
> > >  		else
> > >  			goto unlock;
> > 
> > Srikar's patch here:
> > 
> >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > 
> > Also frobs this condition, but in a less radical way. Does that yield
> > similar results?
> 
> I can check. I do wonder of course if the less radical approach just means
> that automatic NUMA balancing and the load balancer simply disagree about
> placement at a different time. It'll take a few days to have an answer as
> the battery of workloads to check this take ages.

Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
we'll have to pick one of these two patches.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
  2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
@ 2018-09-07 13:05   ` Srikar Dronamraju
  2018-09-07 14:20     ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-07 13:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2018-09-07 11:11:38]:

> task_numa_migrate is responsible for finding a core on a preferred NUMA
> node for a task. As part of this, task_numa_find_cpu iterates through
> the CPUs of a node and evaluates CPUs, both idle and with running tasks,
> as placement candidates. Generally though, any idle CPU is equivalent in
> terms of improving imbalances and a search after finding one is pointless.
> This patch stops examining CPUs on a node if an idle CPU is considered
> suitable.
> 

However there can be a thread on the destination node that might benefit
from swapping with the current thread. Don't we lose that opportunity to
swap if we skip checking for other threads?

To articulate:
Thread A currently running on node 0 wants to move to node 1.
Thread B currently running on node 1 is better off if it ran on node 0.

Thread A sees an idle CPU before seeing Thread B; it skips and loses
an opportunity to swap.

Eventually thread B will get an opportunity to move to node 0 when it
calls task_numa_placement, but we are probably stopping it from achieving
that earlier.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 12:44       ` Peter Zijlstra
@ 2018-09-07 13:42         ` Srikar Dronamraju
  2018-09-07 14:28           ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-07 13:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML

* Peter Zijlstra <peterz@infradead.org> [2018-09-07 14:44:32]:

> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > > > ---
> > > >  kernel/sched/fair.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index d59d3e00a480..d4c289c11012 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > > >  		goto unlock;
> > > >  
> > > >  	if (!cur) {
> > > > -		if (maymove || imp > env->best_imp)
> > > > +		if (maymove)
> > > >  			goto assign;
> > > >  		else
> > > >  			goto unlock;
> > > 
> > > Srikar's patch here:
> > > 
> > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > 
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> > 
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
> 
> Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
> we'll have to pick one of these two patches.

I can surely run some benchmarks between the two patches.
However comparing  Mel's patch with
http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com

Mel's patch

  	if (!cur) {
 -		if (maymove || imp > env->best_imp)
 +		if (maymove)
  			goto assign;
  		else
http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com


 	if (!cur) {
-		if (maymove || imp > env->best_imp)
+		if (maymove && moveimp >= env->best_imp)
 			goto assign;
 		else

In Mel's fix, if we have already found a candidate task to swap and then encounter an
idle CPU, we go ahead and overwrite the swap candidate. There is
always a chance that the swap candidate is a better fit than moving to an idle CPU.

In the patch which is in your queue, we only move if it is better than the
swap candidate. So this is in no way less radical than Mel's patch and is probably
more correct.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
  2018-09-07 13:05   ` Srikar Dronamraju
@ 2018-09-07 14:20     ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 14:20 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 06:35:53PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2018-09-07 11:11:38]:
> 
> > task_numa_migrate is responsible for finding a core on a preferred NUMA
> > node for a task. As part of this, task_numa_find_cpu iterates through
> > the CPUs of a node and evaluates CPUs, both idle and with running tasks,
> > as placement candidates. Generally though, any idle CPU is equivalent in
> > terms of improving imbalances and a search after finding one is pointless.
> > This patch stops examining CPUs on a node if an idle CPU is considered
> > suitable.
> > 
> 
> However there can be a thread on the destination node that might benefit
> from swapping with the current thread. Don't we lose that opportunity to
> swap if we skip checking for other threads?
> 
> To articulate:
> Thread A currently running on node 0 wants to move to node 1.
> Thread B currently running on node 1 is better off if it ran on node 0.
> 
> Thread A sees an idle CPU before seeing Thread B; it skips and loses
> an opportunity to swap.
> 
> Eventually thread B will get an opportunity to move to node 0 when it
> calls task_numa_placement, but we are probably stopping it from achieving
> that earlier.
> 

Potentially this opportunity is missed but I think the only case where
swapping is better than an idle CPU is when both tasks are not running
on their preferred node. For that to happen, it would likely require that
the machine be heavily saturated (or both would just find idle cores). I
would think that's the rare case and it's better just to save the cycles
searching through runqueues and examining tasks and just take the idle
CPU. Furthermore, swapping is guaranteed to disrupt two tasks as they
have to be dequeued, migrated and requeued for what may or may not be an
overall performance gain. Lastly, even if it's the case that there is a
swap candidate out there, that does not justify calling select_idle_sibling
for every idle CPU encountered which is what happens currently.

I think the patch I have is almost certainly a win (reduced search costs)
and continuing the search just in case there is a good swap candidate
out there is often going to cost more than it saves.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 13:42         ` Srikar Dronamraju
@ 2018-09-07 14:28           ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 14:28 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 07:12:01PM +0530, Srikar Dronamraju wrote:
> > Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
> > we'll have to pick one of these two patches.
> 
> I can surely run some benchmarks between the two patches.
> However comparing  Mel's patch with
> http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 
> Mel's patch
> 
>   	if (!cur) {
>  -		if (maymove || imp > env->best_imp)
>  +		if (maymove)
>   			goto assign;
>   		else
> http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 
> 
>  	if (!cur) {
> -		if (maymove || imp > env->best_imp)
> +		if (maymove && moveimp >= env->best_imp)
>  			goto assign;
>  		else
> 
> In Mel's fix, if we have already found a candidate task to swap and then encounter an
> idle CPU, we go ahead and overwrite the swap candidate. There is
> always a chance that the swap candidate is a better fit than moving to an idle CPU.
> 

There is a chance, but to find out, the task has to be dequeued and requeued
on a maybe. An idle CPU is less disruptive and the only task affected is the one
migrating to the preferred node which, based on previous fault behaviour,
should have better locality. It's also a simpler patch, but I'm going to
be biased towards my own; the tests will decide one way or the other.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-07 12:37     ` Mel Gorman
  2018-09-07 12:44       ` Peter Zijlstra
@ 2018-09-10  9:41       ` Mel Gorman
  2018-09-12  6:54         ` Srikar Dronamraju
  1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-10  9:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML

On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d59d3e00a480..d4c289c11012 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > >  		goto unlock;
> > >  
> > >  	if (!cur) {
> > > -		if (maymove || imp > env->best_imp)
> > > +		if (maymove)
> > >  			goto assign;
> > >  		else
> > >  			goto unlock;
> > 
> > Srikar's patch here:
> > 
> >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > 
> > Also frobs this condition, but in a less radical way. Does that yield
> > similar results?
> 
> I can check. I do wonder of course if the less radical approach just means
> that automatic NUMA balancing and the load balancer simply disagree about
> placement at a different time. It'll take a few days to have an answer as
> the battery of workloads to check this take ages.
> 

Tests completed over the weekend and I've found that the performance of
both patches are very similar for two machines (both 2 socket) running a
variety of workloads. Hence, I'm not worried about which patch gets picked
up. However, I would prefer my own on the grounds that the additional
complexity does not appear to get us anything. Of course, that changes if
Srikar's tests on his larger ppc64 machines show the more complex approach
is justified.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-10  9:41       ` Mel Gorman
@ 2018-09-12  6:54         ` Srikar Dronamraju
  2018-09-12  9:36           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12  6:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:

> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > Srikar's patch here:
> > > 
> > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > 
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> > 
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
> > 
> 
> Tests completed over the weekend and I've found that the performance of
> both patches are very similar for two machines (both 2 socket) running a
> variety of workloads. Hence, I'm not worried about which patch gets picked
> up. However, I would prefer my own on the grounds that the additional
> complexity does not appear to get us anything. Of course, that changes if
> Srikar's tests on his larger ppc64 machines show the more complex approach
> is justified.
> 

Running SPECJbb2005. Higher bops are better.

Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
	+ 2 patches from Mel
	(Do not move imbalanced load purely)
	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
	(Stop comparing tasks for NUMA placement)
	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net

To me, Kernel B, which is the 13 patches accepted in v4.19-rc1 + 6 patches
posted for review, seems to be giving better performance.

The numbers are compared to the previous kernel, i.e.
for Kernel A, v4.18 is prev
for Kernel B, Kernel A is prev
for Kernel C, Kernel B is prev
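
(As a worked example of how the %Change column appears to be computed:
Kernel A's 4-JVM figure on the 2-node x86 box below is
(209790 - 203769) / 203769 * 100 ~= 2.95, matching the reported 2.95482.)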

2 node x86 Haswell

v4.18 or 94710cac0ef4
JVMS  Prev    Current  %Change
4     203769
1     316734

Kernel A
JVMS  Prev    Current  %Change
4     203769  209790   2.95482
1     316734  312377   -1.3756

Kernel B
JVMS  Prev    Current  %Change
4     209790  202059   -3.68511
1     312377  326987   4.67704

Kernel C
JVMS  Prev    Current  %Change
4     202059  200681   -0.681979
1     326987  316715   -3.14141

================================================


4 Node / 2 Socket PowerNV / Power 8

v4.18 or 94710cac0ef4
JVMS  Prev    Current  %Change
8     88411.9
1     222075

Kernel A
JVMS  Prev     Current  %Change
8     88411.9  88733.5  0.363752
1     222075   214607   -3.36283

Kernel B
JVMS  Prev     Current  %Change
8     88733.5  89952    1.37321
1     214607   217226   1.22037

Kernel C
JVMS  Prev    Current  %Change
8     89952   89912.9  -0.0434676
1     217226  219281   0.946019


================================================


2 Node / 2 Socket Power 9 / PowerNV

v4.18 or 94710cac0ef4
JVMS  Prev    Current  %Change
4     195989
1     202854

Kernel A
JVMS  Prev    Current  %Change
4     195989  193108   -1.46998
1     202854  204042   0.585643

Kernel B
JVMS  Prev    Current  %Change
4     193108  196422   1.71614
1     204042  211219   3.51741

Kernel C
JVMS  Prev    Current  %Change
4     196422  195052   -0.697478
1     211219  207854   -1.59313


================================================

4 Node / 4 Socket Power 7 PhyP LPAR.

v4.18 or 94710cac0ef4
JVMS  Prev    Current  %Change
8     52826.9
1     103103

Kernel A
JVMS  Prev     Current  %Change
8     52826.9  59504.4  12.6403
1     103103   102542   -0.544116

Kernel B
JVMS  Prev     Current  %Change
8     59504.4  61674.8  3.64746
1     102542   108211   5.52847

Kernel C
JVMS  Prev     Current  %Change
8     61674.8  57946.5  -6.04509
1     108211   104533   -3.39892


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  6:54         ` Srikar Dronamraju
@ 2018-09-12  9:36           ` Peter Zijlstra
  2018-09-12 10:45             ` Srikar Dronamraju
  2018-09-12  9:57           ` Ingo Molnar
  2018-09-12 10:52           ` Mel Gorman
  2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-12  9:36 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML

On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:

> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 	+ 2 patches from Mel
> 	(Do not move imbalanced load purely)
> 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> 	(Stop comparing tasks for NUMA placement)
> 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net

But that's not a fair comparison :/ You've complicated things by adding
that second patch from Mel.

Now you cannot (unambiguously) tell what the cause for the performance
difference is.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  6:54         ` Srikar Dronamraju
  2018-09-12  9:36           ` Peter Zijlstra
@ 2018-09-12  9:57           ` Ingo Molnar
  2018-09-12 10:27             ` Srikar Dronamraju
  2018-09-12 10:57             ` Mel Gorman
  2018-09-12 10:52           ` Mel Gorman
  2 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2018-09-12  9:57 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Mel Gorman, Peter Zijlstra, Rik van Riel, LKML


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> * Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:
> 
> > On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > > Srikar's patch here:
> > > > 
> > > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > > 
> > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > similar results?
> > > 
> > > I can check. I do wonder of course if the less radical approach just means
> > > that automatic NUMA balancing and the load balancer simply disagree about
> > > placement at a different time. It'll take a few days to have an answer as
> > > the battery of workloads to check this take ages.
> > > 
> > 
> > Tests completed over the weekend and I've found that the performance of
> > both patches are very similar for two machines (both 2 socket) running a
> > variety of workloads. Hence, I'm not worried about which patch gets picked
> > up. However, I would prefer my own on the grounds that the additional
> > complexity does not appear to get us anything. Of course, that changes if
> > Srikar's tests on his larger ppc64 machines show the more complex approach
> > is justified.
> > 
> 
> Running SPECJbb2005. Higher bops are better.
> 
> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 	+ 2 patches from Mel
> 	(Do not move imbalanced load purely)
> 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> 	(Stop comparing tasks for NUMA placement)
> 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net

We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it 
vanilla v4.17?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  9:57           ` Ingo Molnar
@ 2018-09-12 10:27             ` Srikar Dronamraju
  2018-09-12 10:57             ` Mel Gorman
  1 sibling, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 10:27 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Mel Gorman, Peter Zijlstra, Rik van Riel, LKML

> > Running SPECJbb2005. Higher bops are better.
> > 
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > 	+ 2 patches from Mel
> > 	(Do not move imbalanced load purely)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > 	(Stop comparing tasks for NUMA placement)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
> 
> We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it 
> vanilla v4.17?
> 

I kept the baseline as 4.18.
The only revert I know of since 4.16 on the NUMA balancing front is the one
where we unintentionally skipped task migrations.  It did somehow give good
results on a set of benchmarks but it would completely circumvent the task
migration code. I am not sure if Jirka was taking the numbers from that
kernel. From what I remember, we pulled it out before 4.16 stabilized.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  9:36           ` Peter Zijlstra
@ 2018-09-12 10:45             ` Srikar Dronamraju
  0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 10:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML

* Peter Zijlstra <peterz@infradead.org> [2018-09-12 11:36:21]:

> On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
> 
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > 	+ 2 patches from Mel
> > 	(Do not move imbalanced load purely)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > 	(Stop comparing tasks for NUMA placement)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
> 
> But that's not a fair comparison :/ You've complicated things by adding
> that second patch from Mel.
> 

One patch does choose an idle thread over a possible swap. This is the one
that conflicts with "Avoid task migration for small numa improvement".

The other one detects if we have already found an idle thread and stops
iterating. So if we have already chosen an idle CPU, then don't iterate. So
this patch should ideally help the above patch.

My take on both the patches:
If there is a possible swap candidate (i.e. if choosing a thread over idle is
better from a NUMA score improvement perspective), then it means the swap
candidate will benefit from movement. It means the swap candidate is not
running on its preferred node at this moment. If we don't do it, the swap
candidate will later on try migrating to its preferred node. We could avoid
that if we choose the swap candidate over the idle thread. Please note if the idle
and swap candidates yield the same NUMA score, then we still choose the idle
thread and not the swap candidate.

> Now you cannot (unambiguously) tell what the cause for the performance
> difference is.
> 

The difference, as I listed above, is whether we should choose a swap candidate
with a higher NUMA score over an idle thread.

My patch would hurt if there was an idle thread and we still iterate for
possible swap candidates and don't find any. But we can't speculate on that.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  6:54         ` Srikar Dronamraju
  2018-09-12  9:36           ` Peter Zijlstra
  2018-09-12  9:57           ` Ingo Molnar
@ 2018-09-12 10:52           ` Mel Gorman
  2 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-12 10:52 UTC (permalink / raw)
  To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML

On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
> > > > Srikar's patch here:
> > > > 
> > > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > > 
> > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > similar results?
> > > 
> > > I can check. I do wonder of course if the less radical approach just means
> > > that automatic NUMA balancing and the load balancer simply disagree about
> > > placement at a different time. It'll take a few days to have an answer as
> > > the battery of workloads to check this take ages.
> > > 
> > 
> > Tests completed over the weekend and I've found that the performance of
> > both patches are very similar for two machines (both 2 socket) running a
> > variety of workloads. Hence, I'm not worried about which patch gets picked
> > up. However, I would prefer my own on the grounds that the additional
> > complexity does not appear to get us anything. Of course, that changes if
> > Srikar's tests on his larger ppc64 machines show the more complex approach
> > is justified.
> > 
> 
> Running SPECJbb2005. Higher bops are better.
> 
> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> 	+ 2 patches from Mel
> 	(Do not move imbalanced load purely)
> 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> 	(Stop comparing tasks for NUMA placement)
> 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
> 

We ended up comparing different things. I started with 4.19-rc1 with
patches 1-3 from my series. I then compared my patch "Do not move
imbalanced load" with just yours so there was one point of variability.
You compared one patch of yours against two of mine so we ended up
looking at different things. That aside;

> 2 node x86 Haswell
> 
> v4.18 or 94710cac0ef4
> JVMS  Prev    Current  %Change
> 4     203769
> 1     316734
> 
> Kernel A
> JVMS  Prev    Current  %Change
> 4     203769  209790   2.95482
> 1     316734  312377   -1.3756
> 
> Kernel B
> JVMS  Prev    Current  %Change
> 4     209790  202059   -3.68511
> 1     312377  326987   4.67704
> 
> Kernel C
> JVMS  Prev    Current  %Change
> 4     202059  200681   -0.681979
> 1     326987  316715   -3.14141

Overall, this is actually not that promising. With Kernel B, we lose
almost as much as we gain depending on the thread count but the data
presented is very limited.

I can correlate though that the Haswell results make some sense. The
baseline here is my patch and it's comparing against yours

2 socket Haswell machine
1 JVM -- 1-3.5% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-single/marvin5/
2 JVM -- 1-6.3% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-multi/marvin5/

Results are similarly good for Broadwell. But it's not universal. It's
both good and bad with specjbb2015 depending on the CPU family. For
example, on Haswell with one JVM per load I see a 5% gain with your
patch and with Broadwell, I see an 8% loss.

For tbench, yours was a marginal loss, autonuma benchmark was a win on
one machine, a loss on another. hackbench showed both gains and losses.
All the results I had were marginal which is why I was not convinced the
complexity was justified.

Your ppc64 figures look a bit more convincing and while I'm disappointed
that you did not make a like-for-like comparison, I'm happy enough to go with
your version. I can re-evaluate "Stop comparing tasks for NUMA placement"
on its own later as well as the fast-migrate patches.

Thanks Srikar.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
  2018-09-12  9:57           ` Ingo Molnar
  2018-09-12 10:27             ` Srikar Dronamraju
@ 2018-09-12 10:57             ` Mel Gorman
  1 sibling, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-12 10:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, LKML

On Wed, Sep 12, 2018 at 11:57:42AM +0200, Ingo Molnar wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:
> > 
> > > On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > > > Srikar's patch here:
> > > > > 
> > > > >   http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > > > 
> > > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > > similar results?
> > > > 
> > > > I can check. I do wonder of course if the less radical approach just means
> > > > that automatic NUMA balancing and the load balancer simply disagree about
> > > > placement at a different time. It'll take a few days to have an answer as
> > > > the battery of workloads to check this take ages.
> > > > 
> > > 
> > > Tests completed over the weekend and I've found that the performance of
> > > both patches are very similar for two machines (both 2 socket) running a
> > > variety of workloads. Hence, I'm not worried about which patch gets picked
> > > up. However, I would prefer my own on the grounds that the additional
> > > complexity does not appear to get us anything. Of course, that changes if
> > > Srikar's tests on his larger ppc64 machines show the more complex approach
> > > is justified.
> > > 
> > 
> > Running SPECJbb2005. Higher bops are better.
> > 
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > 	http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > 	+ 2 patches from Mel
> > 	(Do not move imbalanced load purely)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > 	(Stop comparing tasks for NUMA placement)
> > 	http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
> 
> We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it 
> vanilla v4.17?
> 

That will hit a separate problem -- the scheduler patches that prefer
keeping new children local instead of spreading wide prematurely. The patch
in question favours communicating tasks (e.g. short-lived communicating
processes from shell scripts) but hurts loads that prefer spreading early
(e.g. STREAM). So while a comparison tells us something, it tells us
relatively little about this series in isolation.

The comparison with 4.17 is expected to be resolved by allowing data to
migrate faster if the load balancer spreads the load without wake-affine
disagreeing about placement. Patches for that exist but they were confirmed
to be working correctly on top of an old version of Srikar's series
based on 4.18. If we get this series resolved, I can rebase the old
series and, barring any major surprises, that should improve things
overall while mitigating the STREAM regression against 4.17.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread
