* [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
@ 2018-09-07 10:11 Mel Gorman
2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
To: Srikar Dronamraju, Peter Zijlstra
Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman
Srikar had an automatic NUMA balancing series merged during the 4.19 window
and there were some issues I missed during review that this series addresses.
Patches 1-2 are simply removing redundant code and calculations that are
never used.
Patch 3 makes the observation that we can call select_idle_sibling during
NUMA placement several times for each idle CPU on a socket. The patch
stops the search on the first idle CPU. While there is the risk that
parallel callers will stack on the same idle CPU, the current code has
the same problem.
Patch 4 is the one that needs the most examination. I believe the intent of
load_too_imbalanced was to stop automatic NUMA balancing from fighting the
load balancer. Unfortunately, when a machine is lightly utilised there are
idle CPUs and tasks are allowed to migrate even though the load is imbalanced.
In some cases, this limits memory bandwidth and can perform worse even if
data locality is fine. The patch corrects the condition.
If they pass review, I would suggest considering them as fixes for
4.19 instead of deferring them to 4.20.
kernel/sched/fair.c | 35 ++++++++---------------------------
1 file changed, 8 insertions(+), 27 deletions(-)
--
2.16.4
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
2018-09-07 10:11 ` [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats Mel Gorman
` (3 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
To: Srikar Dronamraju, Peter Zijlstra
Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman
The nr_running field has not been used since commit 2d4056fafa19
("sched/numa: Remove numa_has_capacity()") so remove it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..2472aeaff92e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1454,8 +1454,6 @@ struct numa_stats {
/* Total compute capacity of CPUs on a node */
unsigned long compute_capacity;
-
- unsigned int nr_running;
};
/*
@@ -1470,7 +1468,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
for_each_cpu(cpu, cpumask_of_node(nid)) {
struct rq *rq = cpu_rq(cpu);
- ns->nr_running += rq->nr_running;
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
--
2.16.4
* [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
` (2 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
To: Srikar Dronamraju, Peter Zijlstra
Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman
Commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") removed
the has_free_capacity field but did not remove calculations
related to it in update_numa_stats. This patch removes the unused
code.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 22 +---------------------
1 file changed, 1 insertion(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2472aeaff92e..5b2f1684e96e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1461,8 +1461,7 @@ struct numa_stats {
*/
static void update_numa_stats(struct numa_stats *ns, int nid)
{
- int smt, cpu, cpus = 0;
- unsigned long capacity;
+ int cpu;
memset(ns, 0, sizeof(*ns));
for_each_cpu(cpu, cpumask_of_node(nid)) {
@@ -1470,26 +1469,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
-
- cpus++;
}
-
- /*
- * If we raced with hotplug and there are no CPUs left in our mask
- * the @ns structure is NULL'ed and task_numa_compare() will
- * not find this node attractive.
- *
- * We'll detect a huge imbalance and bail there.
- */
- if (!cpus)
- return;
-
- /* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
- smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
- capacity = cpus / smt; /* cores */
-
- capacity = min_t(unsigned, capacity,
- DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
}
struct task_numa_env {
--
2.16.4
* [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
2018-09-07 10:11 ` [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
2018-09-07 13:05 ` Srikar Dronamraju
2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
4 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
To: Srikar Dronamraju, Peter Zijlstra
Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman
task_numa_migrate is responsible for finding a core on a preferred NUMA
node for a task. As part of this, task_numa_find_cpu iterates through
the CPUs of a node and evaluates CPUs, both idle and with running tasks,
as placement candidates. Generally though, any idle CPU is equivalent in
terms of improving imbalances and a search after finding one is pointless.
This patch stops examining CPUs on a node if an idle CPU is considered
suitable.
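The search change can be sketched outside the kernel as follows. This is a simplified, hypothetical model of the idea only; the `cpu` struct and `find_cpu` helper are illustrative and are not the actual task_numa_find_cpu code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model of the search change: stop scanning
 * the node's CPUs as soon as one idle CPU is found, rather than
 * evaluating every CPU on the node. Not actual kernel code. */
struct cpu {
	int id;
	bool idle;
};

/* Returns the id of the first idle CPU, or -1 if none; *examined
 * reports how many CPUs were looked at before the search stopped. */
static int find_cpu(const struct cpu *cpus, size_t ncpus, size_t *examined)
{
	*examined = 0;
	for (size_t i = 0; i < ncpus; i++) {
		(*examined)++;
		if (cpus[i].idle)
			return cpus[i].id; /* any idle CPU is good enough */
	}
	return -1; /* no idle CPU; task comparison would still be needed */
}
```

On a 24-core socket where the third CPU scanned happens to be idle, the old behaviour examined all 24 candidates while this early exit stops after three.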
While there are some workloads that show minor gains and losses, they are
mostly within the noise, with the exception of specjbb, whether running
as one large JVM or one JVM per socket. The following was reported on a
two-socket Haswell machine with 24 cores per socket.
specjbb, one JVM per socket (2 total)
4.19.0-rc1 4.19.0-rc1
vanilla oneselect-v1
Hmean tput-1 42258.43 ( 0.00%) 43692.10 ( 3.39%)
Hmean tput-2 87811.26 ( 0.00%) 93719.52 ( 6.73%)
Hmean tput-3 138100.56 ( 0.00%) 143484.08 ( 3.90%)
Hmean tput-4 181061.51 ( 0.00%) 191292.99 ( 5.65%)
Hmean tput-5 225577.34 ( 0.00%) 233439.58 ( 3.49%)
Hmean tput-6 264763.44 ( 0.00%) 270634.50 ( 2.22%)
Hmean tput-7 301458.48 ( 0.00%) 314133.32 ( 4.20%)
Hmean tput-8 348364.50 ( 0.00%) 358445.76 ( 2.89%)
Hmean tput-9 382129.65 ( 0.00%) 403288.75 ( 5.54%)
Hmean tput-10 403566.70 ( 0.00%) 444592.51 ( 10.17%)
Hmean tput-11 456967.43 ( 0.00%) 483300.45 ( 5.76%)
Hmean tput-12 502295.98 ( 0.00%) 526281.53 ( 4.78%)
Hmean tput-13 441284.41 ( 0.00%) 535507.75 ( 21.35%)
Hmean tput-14 461478.57 ( 0.00%) 542068.97 ( 17.46%)
Hmean tput-15 489725.29 ( 0.00%) 545033.17 ( 11.29%)
Hmean tput-16 503726.56 ( 0.00%) 549738.23 ( 9.13%)
Hmean tput-17 528650.57 ( 0.00%) 550849.00 ( 4.20%)
Hmean tput-18 518065.41 ( 0.00%) 550018.29 ( 6.17%)
Hmean tput-19 527412.99 ( 0.00%) 550652.26 ( 4.41%)
Hmean tput-20 528166.25 ( 0.00%) 545783.85 ( 3.34%)
Hmean tput-21 524669.70 ( 0.00%) 544848.37 ( 3.85%)
Hmean tput-22 519010.38 ( 0.00%) 539603.70 ( 3.97%)
Hmean tput-23 514947.43 ( 0.00%) 534714.32 ( 3.84%)
Hmean tput-24 517953.29 ( 0.00%) 531783.24 ( 2.67%)
Coefficient of variance is roughly 0-3% depending on the warehouse count
so these results are generally outside of the noise. Note that the biggest
improvements are when a socket would be roughly half loaded. It's not
especially obvious why this would be true given that without the patch the
socket is scanned anyway, but it may be cache miss related. On a two-socket
Broadwell machine, the same observation was made in that the biggest
benefit was when a socket was half-loaded. If a single JVM is used for
the entire machine, the biggest benefit was also when the machine was
half-utilised.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b2f1684e96e..d59d3e00a480 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,7 +1535,7 @@ static bool load_too_imbalanced(long src_load, long dst_load,
* into account that it might be best if task running on the dst_cpu should
* be exchanged with the source task
*/
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
long taskimp, long groupimp, bool maymove)
{
struct rq *dst_rq = cpu_rq(env->dst_cpu);
@@ -1545,6 +1545,7 @@ static void task_numa_compare(struct task_numa_env *env,
long imp = env->p->numa_group ? groupimp : taskimp;
long moveimp = imp;
int dist = env->dist;
+ bool dst_idle = false;
rcu_read_lock();
cur = task_rcu_dereference(&dst_rq->curr);
@@ -1638,11 +1639,13 @@ static void task_numa_compare(struct task_numa_env *env,
env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
env->dst_cpu);
local_irq_enable();
+ dst_idle = true;
}
task_numa_assign(env, cur, imp);
unlock:
rcu_read_unlock();
+ return dst_idle;
}
static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1668,7 +1671,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
continue;
env->dst_cpu = cpu;
- task_numa_compare(env, taskimp, groupimp, maymove);
+ if (task_numa_compare(env, taskimp, groupimp, maymove))
+ break;
}
}
--
2.16.4
* [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
` (2 preceding siblings ...)
2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
@ 2018-09-07 10:11 ` Mel Gorman
2018-09-07 11:33 ` Peter Zijlstra
2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
4 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 10:11 UTC (permalink / raw)
To: Srikar Dronamraju, Peter Zijlstra
Cc: Ingo Molnar, Rik van Riel, LKML, Mel Gorman
Commit 305c1fac3225 ("sched/numa: Evaluate move once per node")
restructured how task_numa_compare evaluates load but there is an anomaly.
task_numa_find_cpu() checks if the load balance between two nodes is too
imbalanced with the intent of only swapping tasks if it would improve
the balance overall. However, if an idle CPU is encountered, the task is
still moved if it's the best improvement, but an idle CPU is always going
to appear to be the best improvement.
If a machine is lightly loaded such that all tasks can fit on one node then
the idle CPUs are found and the tasks migrate to one socket. From a NUMA
perspective, this seems intuitively great because memory accesses are all
local but there are two counter-intuitive effects.
First, the load balancer may move tasks so the machine is more evenly
utilised and conflict with automatic NUMA balancing which may respond by
scanning more frequently and increasing overhead. Second, sockets usually
have their own memory channels so using one socket means that fewer
channels are available yielding less memory bandwidth overall. For
memory-bound tasks, it can be beneficial to migrate to another socket and
migrate the data to increase bandwidth even though the accesses are remote
in the short term.
The second observation is not universally true for all workloads, but some
of the computational kernels of NAS benefit when parallelised with OpenMP.
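The condition change itself is tiny; a hypothetical stand-alone model of the before and after predicates (the function names here are illustrative, not kernel code) makes the behavioural difference explicit:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the !cur (idle destination CPU) branch in
 * task_numa_compare(). Before the patch, an idle CPU is taken whenever
 * the improvement beats the best candidate so far, which an idle CPU
 * effectively always does; after it, the move must also respect the
 * node load balance (maymove). Illustrative only, not kernel code. */
static bool assign_before(bool maymove, long imp, long best_imp)
{
	return maymove || imp > best_imp; /* idle CPU almost always wins */
}

static bool assign_after(bool maymove)
{
	return maymove; /* only move if the load balance allows it */
}
```

With `maymove == false` (the move would worsen the imbalance) and any positive improvement, `assign_before` still allows the migration while `assign_after` refuses it.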
NAS C class 2-socket
4.19.0-rc1 4.19.0-rc1
oneselect-v1r18 nomove-v1r19
Amean bt 62.26 ( 0.00%) 53.03 ( 14.83%)
Amean cg 27.85 ( 0.00%) 27.82 ( 0.09%)
Amean ep 8.94 ( 0.00%) 8.58 ( 4.09%)
Amean ft 11.89 ( 0.00%) 12.00 ( -0.93%)
Amean is 0.87 ( 0.00%) 0.86 ( 0.92%)
Amean lu 41.77 ( 0.00%) 38.95 ( 6.76%)
Amean mg 5.30 ( 0.00%) 5.26 ( 0.64%)
Amean sp 105.39 ( 0.00%) 63.80 ( 39.46%)
Amean ua 47.42 ( 0.00%) 43.99 ( 7.24%)
Active balancing for NUMA still happens but it is greatly reduced. When
running with D class (so it runs longer), the relevant unpatched stats are
3773.21 Elapsed time in seconds
489.24 Mops/sec/thread
38,918 cpu-migrations
3,817,238 page-faults
11,197 sched:sched_move_numa
0 sched:sched_stick_numa
23 sched:sched_swap_numa
With the patch applied
2037.92 Elapsed time in seconds
905.83 Mops/sec/thread
147 cpu-migrations
552,529 page-faults
26 sched:sched_move_numa
0 sched:sched_stick_numa
16 sched:sched_swap_numa
Note the large drop in CPU migrations, calls to sched_move_numa, and
page faults.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59d3e00a480..d4c289c11012 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
goto unlock;
if (!cur) {
- if (maymove || imp > env->best_imp)
+ if (maymove)
goto assign;
else
goto unlock;
--
2.16.4
* Re: [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
` (3 preceding siblings ...)
2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
@ 2018-09-07 11:24 ` Peter Zijlstra
2018-09-07 12:29 ` Mel Gorman
4 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:24 UTC (permalink / raw)
To: Mel Gorman
Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML, Vincent Guittot
On Fri, Sep 07, 2018 at 11:11:35AM +0100, Mel Gorman wrote:
> Srikar had an automatic NUMA balancing series merged during the 4.19 window
> and there were some issues I missed during review that this series addresses.
>
> Patches 1-2 are simply removing redundant code and calculations that are
> never used.
I already have those queued, from Vincent:
https://lkml.kernel.org/r/1535548752-4434-1-git-send-email-vincent.guittot@linaro.org
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
@ 2018-09-07 11:33 ` Peter Zijlstra
2018-09-07 12:37 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 11:33 UTC (permalink / raw)
To: Mel Gorman; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 11:11:39AM +0100, Mel Gorman wrote:
> Commit 305c1fac3225 ("sched/numa: Evaluate move once per node")
> restructured how task_numa_compare evaluates load but there is an anomaly.
> task_numa_find_cpu() checks if the load balance between two nodes is too
> imbalanced with the intent of only swapping tasks if it would improve
> the balance overall. However, if an idle CPU is encountered, the task is
> still moved if it's the best improvement, but an idle CPU is always going
> to appear to be the best improvement.
>
> If a machine is lightly loaded such that all tasks can fit on one node then
> the idle CPUs are found and the tasks migrate to one socket. From a NUMA
> perspective, this seems intuitively great because memory accesses are all
> local but there are two counter-intuitive effects.
>
> First, the load balancer may move tasks so the machine is more evenly
> utilised and conflict with automatic NUMA balancing which may respond by
> scanning more frequently and increasing overhead. Second, sockets usually
> have their own memory channels so using one socket means that fewer
> channels are available yielding less memory bandwidth overall. For
> memory-bound tasks, it can be beneficial to migrate to another socket and
> migrate the data to increase bandwidth even though the accesses are remote
> in the short term.
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d59d3e00a480..d4c289c11012 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> goto unlock;
>
> if (!cur) {
> - if (maymove || imp > env->best_imp)
> + if (maymove)
> goto assign;
> else
> goto unlock;
Srikar's patch here:
http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
Also frobs this condition, but in a less radical way. Does that yield
similar results?
* Re: [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing
2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
@ 2018-09-07 12:29 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 12:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML, Vincent Guittot
On Fri, Sep 07, 2018 at 01:24:55PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 07, 2018 at 11:11:35AM +0100, Mel Gorman wrote:
> > Srikar had an automatic NUMA balancing series merged during the 4.19 window
> > and there were some issues I missed during review that this series addresses.
> >
> > Patches 1-2 are simply removing redundant code and calculations that are
> > never used.
>
> I already have those queued, from Vincent:
>
> https://lkml.kernel.org/r/1535548752-4434-1-git-send-email-vincent.guittot@linaro.org
They are exactly equivalent. Consider them acked!
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 11:33 ` Peter Zijlstra
@ 2018-09-07 12:37 ` Mel Gorman
2018-09-07 12:44 ` Peter Zijlstra
2018-09-10 9:41 ` Mel Gorman
0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 12:37 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d59d3e00a480..d4c289c11012 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > goto unlock;
> >
> > if (!cur) {
> > - if (maymove || imp > env->best_imp)
> > + if (maymove)
> > goto assign;
> > else
> > goto unlock;
>
> Srikar's patch here:
>
> http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
>
> Also frobs this condition, but in a less radical way. Does that yield
> similar results?
I can check. I do wonder of course if the less radical approach just means
that automatic NUMA balancing and the load balancer simply disagree about
placement at a different time. It'll take a few days to have an answer as
the battery of workloads to check this take ages.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 12:37 ` Mel Gorman
@ 2018-09-07 12:44 ` Peter Zijlstra
2018-09-07 13:42 ` Srikar Dronamraju
2018-09-10 9:41 ` Mel Gorman
1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-07 12:44 UTC (permalink / raw)
To: Mel Gorman; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d59d3e00a480..d4c289c11012 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > > goto unlock;
> > >
> > > if (!cur) {
> > > - if (maymove || imp > env->best_imp)
> > > + if (maymove)
> > > goto assign;
> > > else
> > > goto unlock;
> >
> > Srikar's patch here:
> >
> > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> >
> > Also frobs this condition, but in a less radical way. Does that yield
> > similar results?
>
> I can check. I do wonder of course if the less radical approach just means
> that automatic NUMA balancing and the load balancer simply disagree about
> placement at a different time. It'll take a few days to have an answer as
> the battery of workloads to check this take ages.
Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
we'll have to pick one of these two patches.
* Re: [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
@ 2018-09-07 13:05 ` Srikar Dronamraju
2018-09-07 14:20 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-07 13:05 UTC (permalink / raw)
To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2018-09-07 11:11:38]:
> task_numa_migrate is responsible for finding a core on a preferred NUMA
> node for a task. As part of this, task_numa_find_cpu iterates through
> the CPUs of a node and evaluates CPUs, both idle and with running tasks,
> as placement candidates. Generally though, any idle CPU is equivalent in
> terms of improving imbalances and a search after finding one is pointless.
> This patch stops examining CPUs on a node if an idle CPU is considered
> suitable.
>
However, there can be a thread on the destination node that might benefit
from swapping with the current thread. Don't we lose that opportunity to
swap if we skip checking for other threads?
To articulate:
Thread A currently running on node 0 wants to move to node 1.
Thread B currently running on node 1 is better off if it ran on node 0.
Thread A sees an idle CPU before seeing Thread B; it skips the comparison
and loses an opportunity to swap.
Eventually thread B will get an opportunity to move to node 0, when thread B
calls task_numa_placement, but we are probably stopping it from achieving
this earlier.
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 12:44 ` Peter Zijlstra
@ 2018-09-07 13:42 ` Srikar Dronamraju
2018-09-07 14:28 ` Mel Gorman
0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-07 13:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML
* Peter Zijlstra <peterz@infradead.org> [2018-09-07 14:44:32]:
> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > On Fri, Sep 07, 2018 at 01:33:09PM +0200, Peter Zijlstra wrote:
> > > > ---
> > > > kernel/sched/fair.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index d59d3e00a480..d4c289c11012 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > > > goto unlock;
> > > >
> > > > if (!cur) {
> > > > - if (maymove || imp > env->best_imp)
> > > > + if (maymove)
> > > > goto assign;
> > > > else
> > > > goto unlock;
> > >
> > > Srikar's patch here:
> > >
> > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > >
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> >
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
>
> Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
> we'll have to pick one of these two patches.
I can surely run some benchmarks between the two patches.
However comparing Mel's patch with
http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
Mel's patch
if (!cur) {
- if (maymove || imp > env->best_imp)
+ if (maymove)
goto assign;
else
http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
if (!cur) {
- if (maymove || imp > env->best_imp)
+ if (maymove && moveimp >= env->best_imp)
goto assign;
else
In Mel's fix, if we have already found a candidate task to swap and then
encounter an idle CPU, we go ahead and overwrite the swap candidate. There is
always a chance that the swap candidate is a better fit than moving to the
idle CPU.
In the patch which is in your queue, we are saying move only if it is better
than the swap candidate. So this is in no way less radical than Mel's patch
and is probably more correct.
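The difference between the two proposed conditions can be modelled in isolation; the function names below are illustrative, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two proposed !cur conditions. Mel's version
 * takes the idle CPU whenever the load balance allows the move; the
 * queued version additionally requires the move improvement to be at
 * least as good as the best swap candidate found so far. */
static bool cond_mel(bool maymove, long moveimp, long best_imp)
{
	(void)moveimp; /* improvement is not consulted in this version */
	(void)best_imp;
	return maymove;
}

static bool cond_queued(bool maymove, long moveimp, long best_imp)
{
	return maymove && moveimp >= best_imp;
}
```

If a swap candidate with `best_imp == 100` has already been found and the move to the idle CPU is only worth `moveimp == 10`, `cond_mel` overwrites the swap candidate while `cond_queued` keeps it.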
--
Thanks and Regards
Srikar Dronamraju
>
* Re: [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core
2018-09-07 13:05 ` Srikar Dronamraju
@ 2018-09-07 14:20 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 14:20 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 06:35:53PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2018-09-07 11:11:38]:
>
> > task_numa_migrate is responsible for finding a core on a preferred NUMA
> > node for a task. As part of this, task_numa_find_cpu iterates through
> > the CPUs of a node and evaluates CPUs, both idle and with running tasks,
> > as placement candidates. Generally though, any idle CPU is equivalent in
> > terms of improving imbalances and a search after finding one is pointless.
> > This patch stops examining CPUs on a node if an idle CPU is considered
> > suitable.
> >
>
> However there can be a thread on the destination node that might benefit
> from swapping with the current thread. Don't we loose that opportunity to
> swap if skip checking for other threads?
>
> To articulate.
> Thread A currently running on node 0 wants to move to node 1.
> Thread B currently running on node 1 is better of if it ran on node 0.
>
> Thread A seems idle cpu before seeing Thread B; skips and looses
> an opportunity to swap.
>
> Eventually thread B will get an opportunity to move to node 0, when thread B
> calls task_numa_placement but we are probably stopping it from achieving
> earlier.
>
Potentially this opportunity is missed but I think the only case where
swapping is better than an idle CPU is when both tasks are not running
on their preferred node. For that to happen, it would likely require that
the machine be heavily saturated (or both would just find idle cores). I
would think that's the rare case and it's better just to save the cycles
searching through runqueues and examining tasks and just take the idle
CPU. Furthermore, swapping is guaranteed to disrupt two tasks as they
have to be dequeued, migrated and requeued for what may or may not be an
overall performance gain. Lastly, even if it's the case that there is a
swap candidate out there, that does not justify calling select_idle_sibling
for every idle CPU encountered, which is what happens currently.
I think the patch I have is almost certainly a win (reduced search costs)
and continuing the search just in case there is a good swap candidate
out there is often going to cost more than it saves.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 13:42 ` Srikar Dronamraju
@ 2018-09-07 14:28 ` Mel Gorman
0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-07 14:28 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 07:12:01PM +0530, Srikar Dronamraju wrote:
> > Yeah, I was afraid it would.. Srikar, can you also evaluate, I suspect
> > we'll have to pick one of these two patches.
>
> I can surely run some benchmarks between the two patches.
> However comparing Mel's patch with
> http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
>
> Mel's patch
>
> if (!cur) {
> - if (maymove || imp > env->best_imp)
> + if (maymove)
> goto assign;
> else
> http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
>
>
> if (!cur) {
> - if (maymove || imp > env->best_imp)
> + if (maymove && moveimp >= env->best_imp)
> goto assign;
> else
>
> In Mel's fix, if we have already found a candidate task to swap and then
> encounter an idle CPU, we go ahead and overwrite the swap candidate. There is
> always a chance that the swap candidate is a better fit than moving to the
> idle CPU.
>
There is a chance, but to find out, the task has to be dequeued and requeued
on a maybe. An idle CPU is less disruptive and the only task affected is
migrating to the preferred node where, based on previous fault behaviour, it
should have better locality. It's also a simpler patch, but I'm going to
be biased towards my own patch; the tests will decide one way or the other.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-07 12:37 ` Mel Gorman
2018-09-07 12:44 ` Peter Zijlstra
@ 2018-09-10 9:41 ` Mel Gorman
2018-09-12 6:54 ` Srikar Dronamraju
1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-09-10 9:41 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Srikar Dronamraju, Ingo Molnar, Rik van Riel, LKML
On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d59d3e00a480..d4c289c11012 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
> > > goto unlock;
> > >
> > > if (!cur) {
> > > - if (maymove || imp > env->best_imp)
> > > + if (maymove)
> > > goto assign;
> > > else
> > > goto unlock;
> >
> > Srikar's patch here:
> >
> > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> >
> > Also frobs this condition, but in a less radical way. Does that yield
> > similar results?
>
> I can check. I do wonder of course if the less radical approach just means
> that automatic NUMA balancing and the load balancer simply disagree about
> placement at a different time. It'll take a few days to have an answer as
> the battery of workloads to check this take ages.
>
Tests completed over the weekend and I've found that the performance of
both patches is very similar for two machines (both 2-socket) running a
variety of workloads. Hence, I'm not worried about which patch gets picked
up. However, I would prefer my own on the grounds that the additional
complexity does not appear to get us anything. Of course, that changes if
Srikar's tests on his larger ppc64 machines show the more complex approach
is justified.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-10 9:41 ` Mel Gorman
@ 2018-09-12 6:54 ` Srikar Dronamraju
2018-09-12 9:36 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 6:54 UTC (permalink / raw)
To: Mel Gorman; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:
> On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > Srikar's patch here:
> > >
> > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > >
> > > Also frobs this condition, but in a less radical way. Does that yield
> > > similar results?
> >
> > I can check. I do wonder of course if the less radical approach just means
> > that automatic NUMA balancing and the load balancer simply disagree about
> > placement at a different time. It'll take a few days to have an answer as
> > the battery of workloads to check this take ages.
> >
>
> Tests completed over the weekend and I've found that the performance of
> both patches are very similar for two machines (both 2 socket) running a
> variety of workloads. Hence, I'm not worried about which patch gets picked
> up. However, I would prefer my own on the grounds that the additional
> complexity does not appear to get us anything. Of course, that changes if
> Srikar's tests on his larger ppc64 machines show the more complex approach
> is justified.
>
Running SPECJbb2005. Higher bops are better.
Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
+ 2 patches from Mel
(Do not move imbalanced load purely)
http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
(Stop comparing tasks for NUMA placement)
http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
To me, Kernel B, which is the 13 patches accepted in v4.19-rc1 plus the 6
patches posted for review, seems to give better performance.
The numbers are compared to the previous kernel, i.e.
for Kernel A, v4.18 is prev;
for Kernel B, Kernel A is prev;
for Kernel C, Kernel B is prev.
2 node x86 Haswell
v4.18 or 94710cac0ef4
JVMS Prev Current %Change
4 203769
1 316734
Kernel A
JVMS Prev Current %Change
4 203769 209790 2.95482
1 316734 312377 -1.3756
Kernel B
JVMS Prev Current %Change
4 209790 202059 -3.68511
1 312377 326987 4.67704
Kernel C
JVMS Prev Current %Change
4 202059 200681 -0.681979
1 326987 316715 -3.14141
================================================
4 Node / 2 Socket PowerNV / Power 8
v4.18 or 94710cac0ef4
JVMS Prev Current %Change
8 88411.9
1 222075
Kernel A
JVMS Prev Current %Change
8 88411.9 88733.5 0.363752
1 222075 214607 -3.36283
Kernel B
JVMS Prev Current %Change
8 88733.5 89952 1.37321
1 214607 217226 1.22037
Kernel C
JVMS Prev Current %Change
8 89952 89912.9 -0.0434676
1 217226 219281 0.946019
================================================
2 Node / 2 Socket Power 9 / PowerNV
v4.18 or 94710cac0ef4
JVMS Prev Current %Change
4 195989
1 202854
Kernel A
JVMS Prev Current %Change
4 195989 193108 -1.46998
1 202854 204042 0.585643
Kernel B
JVMS Prev Current %Change
4 193108 196422 1.71614
1 204042 211219 3.51741
Kernel C
JVMS Prev Current %Change
4 196422 195052 -0.697478
1 211219 207854 -1.59313
================================================
4 Node / 4 Socket Power 7 PhyP LPAR.
v4.18 or 94710cac0ef4
JVMS Prev Current %Change
8 52826.9
1 103103
Kernel A
JVMS Prev Current %Change
8 52826.9 59504.4 12.6403
1 103103 102542 -0.544116
Kernel B
JVMS Prev Current %Change
8 59504.4 61674.8 3.64746
1 102542 108211 5.52847
Kernel C
JVMS Prev Current %Change
8 61674.8 57946.5 -6.04509
1 108211 104533 -3.39892
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 6:54 ` Srikar Dronamraju
@ 2018-09-12 9:36 ` Peter Zijlstra
2018-09-12 10:45 ` Srikar Dronamraju
2018-09-12 9:57 ` Ingo Molnar
2018-09-12 10:52 ` Mel Gorman
2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-09-12 9:36 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML
On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> + 2 patches from Mel
> (Do not move imbalanced load purely)
> http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> (Stop comparing tasks for NUMA placement)
> http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
But that's not a fair comparison :/ You've complicated things by adding
that second patch from Mel.
Now you cannot (unambiguously) tell what the cause for the performance
difference is.
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 6:54 ` Srikar Dronamraju
2018-09-12 9:36 ` Peter Zijlstra
@ 2018-09-12 9:57 ` Ingo Molnar
2018-09-12 10:27 ` Srikar Dronamraju
2018-09-12 10:57 ` Mel Gorman
2018-09-12 10:52 ` Mel Gorman
2 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2018-09-12 9:57 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: Mel Gorman, Peter Zijlstra, Rik van Riel, LKML
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:
>
> > On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > > Srikar's patch here:
> > > >
> > > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > >
> > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > similar results?
> > >
> > > I can check. I do wonder of course if the less radical approach just means
> > > that automatic NUMA balancing and the load balancer simply disagree about
> > > placement at a different time. It'll take a few days to have an answer as
> > > the battery of workloads to check this take ages.
> > >
> >
> > Tests completed over the weekend and I've found that the performance of
> > both patches are very similar for two machines (both 2 socket) running a
> > variety of workloads. Hence, I'm not worried about which patch gets picked
> > up. However, I would prefer my own on the grounds that the additional
> > complexity does not appear to get us anything. Of course, that changes if
> > Srikar's tests on his larger ppc64 machines show the more complex approach
> > is justified.
> >
>
> Running SPECJbb2005. Higher bops are better.
>
> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> + 2 patches from Mel
> (Do not move imbalanced load purely)
> http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> (Stop comparing tasks for NUMA placement)
> http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it
vanilla v4.17?
Thanks,
Ingo
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 9:57 ` Ingo Molnar
@ 2018-09-12 10:27 ` Srikar Dronamraju
2018-09-12 10:57 ` Mel Gorman
1 sibling, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 10:27 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Mel Gorman, Peter Zijlstra, Rik van Riel, LKML
> > Running SPECJbb2005. Higher bops are better.
> >
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > + 2 patches from Mel
> > (Do not move imbalanced load purely)
> > http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > (Stop comparing tasks for NUMA placement)
> > http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
>
> We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it
> vanilla v4.17?
>
I kept the baseline as 4.18.
The only revert I know of since 4.16 on the NUMA balancing front is the one
where we unintentionally skipped task migrations. It did somehow give good
results for a set of benchmarks but it completely circumvented the task
migration code. I am not sure if Jirka was taking his numbers from that
kernel. From what I remember, we pulled it out before 4.16 stabilized.
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 9:36 ` Peter Zijlstra
@ 2018-09-12 10:45 ` Srikar Dronamraju
0 siblings, 0 replies; 22+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 10:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mel Gorman, Ingo Molnar, Rik van Riel, LKML
* Peter Zijlstra <peterz@infradead.org> [2018-09-12 11:36:21]:
> On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
>
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > + 2 patches from Mel
> > (Do not move imbalanced load purely)
> > http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > (Stop comparing tasks for NUMA placement)
> > http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
>
> But that's not a fair comparison :/ You've complicated things by adding
> that second patch from Mel.
>
One patch chooses an idle thread over a possible swap. This is the one
that conflicts with "Avoid task migration for small numa improvement".
The other detects that we have already found an idle thread and stops
iterating; once an idle CPU has been chosen, don't keep searching. That
patch should ideally help the one above.
My take on both the patches:
If there is a possible swap candidate (i.e. if choosing a thread over idle is
better from a NUMA score improvement perspective), then it means the swap
candidate will benefit from movement; it is not running on its preferred node
at this moment. If we don't do the swap, the swap candidate will later try
migrating to its preferred node on its own. We could avoid that by choosing
the swap candidate over the idle thread. Please note that if the idle and
swap candidates yield the same NUMA score, we still choose the idle thread
and not the swap candidate.
> Now you cannot (unambiguously) tell what the cause for the performance
> difference is.
>
The difference, as I listed above, is whether we should choose a swap
candidate with a better NUMA score over an idle thread.
My patch would hurt if there was an idle thread and we still iterated looking
for possible swap candidates without finding any, but we can't speculate
about that.
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 6:54 ` Srikar Dronamraju
2018-09-12 9:36 ` Peter Zijlstra
2018-09-12 9:57 ` Ingo Molnar
@ 2018-09-12 10:52 ` Mel Gorman
2 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-12 10:52 UTC (permalink / raw)
To: Srikar Dronamraju; +Cc: Peter Zijlstra, Ingo Molnar, Rik van Riel, LKML
On Wed, Sep 12, 2018 at 12:24:10PM +0530, Srikar Dronamraju wrote:
> > > > Srikar's patch here:
> > > >
> > > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > >
> > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > similar results?
> > >
> > > I can check. I do wonder of course if the less radical approach just means
> > > that automatic NUMA balancing and the load balancer simply disagree about
> > > placement at a different time. It'll take a few days to have an answer as
> > > the battery of workloads to check this take ages.
> > >
> >
> > Tests completed over the weekend and I've found that the performance of
> > both patches are very similar for two machines (both 2 socket) running a
> > variety of workloads. Hence, I'm not worried about which patch gets picked
> > up. However, I would prefer my own on the grounds that the additional
> > complexity does not appear to get us anything. Of course, that changes if
> > Srikar's tests on his larger ppc64 machines show the more complex approach
> > is justified.
> >
>
> Running SPECJbb2005. Higher bops are better.
>
> Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> + 2 patches from Mel
> (Do not move imbalanced load purely)
> http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> (Stop comparing tasks for NUMA placement)
> http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
>
We ended up comparing different things. I started with 4.19-rc1 with
patches 1-3 from my series and then compared my patch "Do not move
imbalanced load" with just yours so there was one point of variability.
You compared one patch of yours against two of mine, so we were looking
at different things. That aside:
> 2 node x86 Haswell
>
> v4.18 or 94710cac0ef4
> JVMS Prev Current %Change
> 4 203769
> 1 316734
>
> Kernel A
> JVMS Prev Current %Change
> 4 203769 209790 2.95482
> 1 316734 312377 -1.3756
>
> Kernel B
> JVMS Prev Current %Change
> 4 209790 202059 -3.68511
> 1 312377 326987 4.67704
>
> Kernel C
> JVMS Prev Current %Change
> 4 202059 200681 -0.681979
> 1 326987 316715 -3.14141
Overall, this is actually not that promising. With Kernel B, we lose
almost as much as we gain depending on the thread count but the data
presented is very limited.
I can correlate the Haswell results though, and they make some sense. The
baseline here is my patch and it's compared against yours:
2 socket Haswell machine
1 JVM -- 1-3.5% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-single/marvin5/
2 JVM -- 1-6.3% gain -- http://www.skynet.ie/~mel/postings/numab-20180912/global-dhp__jvm-specjbb2005-multi/marvin5/
Results are similarly good for Broadwell. But it's not universal. It's
both good and bad with specjbb2015 depending on the CPU family. For
example, on Haswell with one JVM per load I see a 5% gain with your
patch and with Broadwell, I see an 8% loss.
For tbench, yours was a marginal loss; the autonuma benchmark was a win on
one machine and a loss on another; hackbench showed both gains and losses.
All the results I had were marginal, which is why I was not convinced the
complexity was justified.
Your ppc64 figures look a bit more convincing and while I'm disappointed
that you did not make a like-for-like comparison, I'm happy enough to go with
your version. I can re-evaluate "Stop comparing tasks for NUMA placement"
on its own later, as well as the fast-migrate patches.
Thanks Srikar.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU
2018-09-12 9:57 ` Ingo Molnar
2018-09-12 10:27 ` Srikar Dronamraju
@ 2018-09-12 10:57 ` Mel Gorman
1 sibling, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-09-12 10:57 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Srikar Dronamraju, Peter Zijlstra, Rik van Riel, LKML
On Wed, Sep 12, 2018 at 11:57:42AM +0200, Ingo Molnar wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2018-09-10 10:41:47]:
> >
> > > On Fri, Sep 07, 2018 at 01:37:39PM +0100, Mel Gorman wrote:
> > > > > Srikar's patch here:
> > > > >
> > > > > http://lkml.kernel.org/r/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > > > >
> > > > > Also frobs this condition, but in a less radical way. Does that yield
> > > > > similar results?
> > > >
> > > > I can check. I do wonder of course if the less radical approach just means
> > > > that automatic NUMA balancing and the load balancer simply disagree about
> > > > placement at a different time. It'll take a few days to have an answer as
> > > > the battery of workloads to check this take ages.
> > > >
> > >
> > > Tests completed over the weekend and I've found that the performance of
> > > both patches are very similar for two machines (both 2 socket) running a
> > > variety of workloads. Hence, I'm not worried about which patch gets picked
> > > up. However, I would prefer my own on the grounds that the additional
> > > complexity does not appear to get us anything. Of course, that changes if
> > > Srikar's tests on his larger ppc64 machines show the more complex approach
> > > is justified.
> > >
> >
> > Running SPECJbb2005. Higher bops are better.
> >
> > Kernel A = 4.18+ 13 sched patches part of v4.19-rc1.
> > Kernel B = Kernel A + 6 patches (http://lore.kernel.org/lkml/1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com)
> > Kernel C = Kernel B - (Avoid task migration for small numa improvement) i.e
> > http://lore.kernel.org/lkml/1533276841-16341-4-git-send-email-srikar@linux.vnet.ibm.com
> > + 2 patches from Mel
> > (Do not move imbalanced load purely)
> > http://lore.kernel.org/lkml/20180907101139.20760-5-mgorman@techsingularity.net
> > (Stop comparing tasks for NUMA placement)
> > http://lore.kernel.org/lkml/20180907101139.20760-4-mgorman@techsingularity.net
>
> We absolutely need the 'best' pre-regression baseline kernel measurements as well - was it
> vanilla v4.17?
>
That will hit a separate problem -- the scheduler patches that prefer
keeping new children local instead of spreading wide prematurely. The patch
in question favours communicating tasks (e.g. short-lived communicating
processes from shell scripts) but hurts loads that prefer spreading early
(e.g. STREAM). So while a comparison tells us something, it tells us
relatively little about this series in isolation.
The comparison with 4.17 is expected to be resolved by allowing data to
migrate faster if the load balancer spreads the load without wake-affine
disagreeing about placement. Patches for that exist but they were confirmed
to be working correctly on top of an old version of Srikar's series
based on 4.18. If we can get this series resolved, I can rebase the old
series and, barring any major surprises, that should improve things
overall while mitigating the STREAM regression against 4.17.
--
Mel Gorman
SUSE Labs
Thread overview: 22+ messages
2018-09-07 10:11 [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Mel Gorman
2018-09-07 10:11 ` [PATCH 1/4] sched/numa: Remove redundant numa_stats nr_running field Mel Gorman
2018-09-07 10:11 ` [PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats Mel Gorman
2018-09-07 10:11 ` [PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core Mel Gorman
2018-09-07 13:05 ` Srikar Dronamraju
2018-09-07 14:20 ` Mel Gorman
2018-09-07 10:11 ` [PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU Mel Gorman
2018-09-07 11:33 ` Peter Zijlstra
2018-09-07 12:37 ` Mel Gorman
2018-09-07 12:44 ` Peter Zijlstra
2018-09-07 13:42 ` Srikar Dronamraju
2018-09-07 14:28 ` Mel Gorman
2018-09-10 9:41 ` Mel Gorman
2018-09-12 6:54 ` Srikar Dronamraju
2018-09-12 9:36 ` Peter Zijlstra
2018-09-12 10:45 ` Srikar Dronamraju
2018-09-12 9:57 ` Ingo Molnar
2018-09-12 10:27 ` Srikar Dronamraju
2018-09-12 10:57 ` Mel Gorman
2018-09-12 10:52 ` Mel Gorman
2018-09-07 11:24 ` [PATCH 0/4] Follow-up fixes for v4.19-rc1 NUMA balancing Peter Zijlstra
2018-09-07 12:29 ` Mel Gorman