* [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
@ 2020-01-14 10:13 Mel Gorman
2020-01-16 16:35 ` Mel Gorman
` (6 more replies)
0 siblings, 7 replies; 24+ messages in thread
From: Mel Gorman @ 2020-01-14 10:13 UTC (permalink / raw)
To: Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML,
Mel Gorman
Changelog since V3
o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
turned out to be as good or better than allowing an imbalance based on
the group weight, without worrying about potential spillover into the
lower scheduler domains.
Changelog since V2
o Only allow a small imbalance when utilisation is low to address reports that
higher utilisation workloads were hitting corner cases.
Changelog since V1
o Alter code flow vincent.guittot
o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
o Note that the division is still in place. Without it and taking
imbalance_adj into account before the cutoff, two NUMA domains
do not converge as being equally balanced when the number of
busy tasks equals the size of one domain (50% of the sum).
The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.
This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.
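As an example, consider a 2-socket machine where a pair of communicating
tasks runs on one node and the other node is idle. The idle-CPU comparison
below would compute an imbalance of one task and pull one of the pair
across; with the patch, busiest->sum_nr_running is 2, within the allowed
imbalance, so the imbalance is treated as zero and the pair stays together.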
The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.
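For reference, a roughly equivalent standalone setup outside of mmtests
(flags illustrative; the config varies the message size shown in the
left-hand column below) would be along the lines of:
  netserver
  netperf -t TCP_STREAM -l 60 -H 127.0.0.1 -- -m 2048 -M 2048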
2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
baseline lbnuma-v3
Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not universal.
For most machines, the impact was roughly a 3% performance gain.
Ops NUMA base-page range updates 19796.00 292.00
Ops NUMA PTE updates 19796.00 292.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 16113.00 143.00
Ops NUMA hint local faults % 8407.00 142.00
Ops NUMA hint local percent 52.18 99.30
Ops NUMA pages migrated 4244.00 1.00
Without the patch, only 52.18% of sampled accesses are local. In an
earlier changelog, 100% of sampled accesses were local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rate was 99.3% but note the "base-page range updates"
and "PTE updates". The activity with the patch is negligible, as was
the number of faults. The small number of pages migrated was related to
shared libraries. A 2-socket Broadwell showed better results on average
but they are not presented for brevity as the performance was similar,
except that it showed 100% of the sampled NUMA hints were local. The patch
holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.
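(For reference, a standalone run with illustrative flags would be along
the lines of "dbench -t 180 -D /mnt/test 8" for 8 clients against the
target filesystem.)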
A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.
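For reference, an invocation along these lines (flags illustrative, with
-m messaging threads and -t worker threads per messaging thread) maps to
the setup above on a 2-node machine:
  schbench -m 2 -t <workers> -r 30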
Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
For higher worker threads, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increased.
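(Per-node utilisation can be watched during a run with something like
"mpstat -N ALL 1", assuming a sysstat recent enough to report node-level
statistics.)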
Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.
An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the patch
cares about. Similarly, there were reports of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.
In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small. For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
1 file changed, 29 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..ade7a8dca5e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
/*
* Try to use spare capacity of local group without overloading it or
* emptying busiest.
- * XXX Spreading tasks across NUMA nodes is not always the best policy
- * and special care should be taken for SD_NUMA domain level before
- * spreading the tasks. For now, load_balance() fully relies on
- * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
*/
if (local->group_type == group_has_spare) {
if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
env->migration_type = migrate_task;
lsub_positive(&nr_diff, local->sum_nr_running);
env->imbalance = nr_diff >> 1;
- return;
- }
+ } else {
- /*
- * If there is no overload, we just want to even the number of
- * idle cpus.
- */
- env->migration_type = migrate_task;
- env->imbalance = max_t(long, 0, (local->idle_cpus -
+ /*
+ * If there is no overload, we just want to even the number of
+ * idle cpus.
+ */
+ env->migration_type = migrate_task;
+ env->imbalance = max_t(long, 0, (local->idle_cpus -
busiest->idle_cpus) >> 1);
+ }
+
+ /* Consider allowing a small imbalance between NUMA groups */
+ if (env->sd->flags & SD_NUMA) {
+ unsigned int imbalance_min;
+
+ /*
+ * Compute an allowed imbalance based on a simple
+ * pair of communicating tasks that should remain
+ * local and ignore them.
+ *
+ * NOTE: Generally this would have been based on
+ * the domain size and this was evaluated. However,
+ * the benefit is similar across a range of workloads
+ * and machines but scaling by the domain size adds
+ * the risk that lower domains have to be rebalanced.
+ */
+ imbalance_min = 2;
+ if (busiest->sum_nr_running <= imbalance_min)
+ env->imbalance = 0;
+ }
+
return;
}
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
@ 2020-01-16 16:35 ` Mel Gorman
2020-01-17 13:08 ` Vincent Guittot
2020-01-17 14:37 ` Valentin Schneider
2020-01-17 13:16 ` Vincent Guittot
` (5 subsequent siblings)
6 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2020-01-16 16:35 UTC (permalink / raw)
To: Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> turned out to be as good or better than allowing an imbalance based on
> the group weight, without worrying about potential spillover into the
> lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
> higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> o Note that the division is still in place. Without it and taking
> imbalance_adj into account before the cutoff, two NUMA domains
> do not converge as being equally balanced when the number of
> busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
Any thoughts on whether this is ok for tip or are there suggestions on
an alternative approach?
--
Mel Gorman
SUSE Labs
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-16 16:35 ` Mel Gorman
@ 2020-01-17 13:08 ` Vincent Guittot
2020-01-17 14:15 ` Mel Gorman
2020-01-17 14:23 ` Phil Auld
2020-01-17 14:37 ` Valentin Schneider
1 sibling, 2 replies; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 13:08 UTC (permalink / raw)
To: Mel Gorman, Phil Auld
Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
Hi Mel,
On Thu, 16 Jan 2020 at 17:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> > Changelog since V3
> > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> > turned out to be as good or better than allowing an imbalance based on
> > the group weight, without worrying about potential spillover into the
> > lower scheduler domains.
> >
> > Changelog since V2
> > o Only allow a small imbalance when utilisation is low to address reports that
> > higher utilisation workloads were hitting corner cases.
> >
> > Changelog since V1
> > o Alter code flow vincent.guittot
> > o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> > o Note that the division is still in place. Without it and taking
> > imbalance_adj into account before the cutoff, two NUMA domains
> > do not converge as being equally balanced when the number of
> > busy tasks equals the size of one domain (50% of the sum).
> >
> > The CPU load balancer balances between different domains to spread load
> > and strives to have equal balance everywhere. Communicating tasks can
> > migrate so they are topologically close to each other but these decisions
> > are independent. On a lightly loaded NUMA machine, two communicating tasks
> > pulled together at wakeup time can be pushed apart by the load balancer.
> > In isolation, the load balancer decision is fine but it ignores the tasks'
> > data locality and the wakeup/LB paths continually conflict. NUMA balancing
> > is also a factor but it also simply conflicts with the load balancer.
> >
> > This patch allows a fixed degree of imbalance of two tasks to exist
> > between NUMA domains regardless of utilisation levels. In many cases,
> > this prevents communicating tasks being pulled apart. It was evaluated
> > whether the imbalance should be scaled to the domain size. However, no
> > additional benefit was measured across a range of workloads and machines
> > and scaling adds the risk that lower domains have to be rebalanced. While
> > this could change again in the future, such a change should specify the
> > use case and benefit.
> >
>
> Any thoughts on whether this is ok for tip or are there suggestions on
> an alternative approach?
I have just finished running some tests on my system with your patch
and I haven't seen any noticeable changes so far, which was a bit
expected. The tests that I usually run use more than 4 tasks on my
2-node system; the only exception is perf sched pipe and the results
for this test stay the same with and without your patch. I'm curious
whether this impacts Phil's tests which run the LU.c benchmark with some
CPU-burning tasks.
>
> --
> Mel Gorman
> SUSE Labs
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
2020-01-16 16:35 ` Mel Gorman
@ 2020-01-17 13:16 ` Vincent Guittot
2020-01-17 14:26 ` Mel Gorman
2020-01-17 15:09 ` Vincent Guittot
` (4 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 13:16 UTC (permalink / raw)
To: Mel Gorman
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> turned out to be as good or better than allowing an imbalance based on
> the group weight, without worrying about potential spillover into the
> lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
> higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> o Note that the division is still in place. Without it and taking
> imbalance_adj into account before the cutoff, two NUMA domains
> do not converge as being equally balanced when the number of
> busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
> baseline lbnuma-v3
> Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
> Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
> Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
> Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
> Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
> Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
> Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
> Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
> Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
> Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
> Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
> Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
> Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
> Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
> Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
> Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
> Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
> Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
>
> In this case, there was a sizable improvement to performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
>
> Ops NUMA base-page range updates 19796.00 292.00
> Ops NUMA PTE updates 19796.00 292.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 16113.00 143.00
> Ops NUMA hint local faults % 8407.00 142.00
> Ops NUMA hint local percent 52.18 99.30
> Ops NUMA pages migrated 4244.00 1.00
>
> Without the patch, only 52.18% of sampled accesses are local. In an
> earlier changelog, 100% of sampled accesses were local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3% but note the "base-page range updates"
> and "PTE updates". The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated was related to
> shared libraries. A 2-socket Broadwell showed better results on average
> but they are not presented for brevity as the performance was similar,
> except that it showed 100% of the sampled NUMA hints were local. The patch
> holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
>
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
>
> A more interesting example is the Facebook schbench which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number of
> worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> for response latency are then reported.
>
> Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
> Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
> Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
> Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
> Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
> Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
> Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
Which parameter changes between the above and below tests?
> Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
> Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
> Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
> Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
> Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
> Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
> Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
>
> For higher worker threads, the differences become negligible but it's
> interesting to note the difference in wakeup latency at low utilisation
> and mpstat confirms that activity was almost all on one node until
> the number of worker threads increased.
>
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small. For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
> 1 file changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..ade7a8dca5e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> /*
> * Try to use spare capacity of local group without overloading it or
> * emptying busiest.
> - * XXX Spreading tasks across NUMA nodes is not always the best policy
> - * and special care should be taken for SD_NUMA domain level before
> - * spreading the tasks. For now, load_balance() fully relies on
> - * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
> */
> if (local->group_type == group_has_spare) {
> if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> env->migration_type = migrate_task;
> lsub_positive(&nr_diff, local->sum_nr_running);
> env->imbalance = nr_diff >> 1;
> - return;
> - }
> + } else {
>
> - /*
> - * If there is no overload, we just want to even the number of
> - * idle cpus.
> - */
> - env->migration_type = migrate_task;
> - env->imbalance = max_t(long, 0, (local->idle_cpus -
> + /*
> + * If there is no overload, we just want to even the number of
> + * idle cpus.
> + */
> + env->migration_type = migrate_task;
> + env->imbalance = max_t(long, 0, (local->idle_cpus -
> busiest->idle_cpus) >> 1);
> + }
> +
> + /* Consider allowing a small imbalance between NUMA groups */
> + if (env->sd->flags & SD_NUMA) {
> + unsigned int imbalance_min;
> +
> + /*
> + * Compute an allowed imbalance based on a simple
> + * pair of communicating tasks that should remain
> + * local and ignore them.
> + *
> + * NOTE: Generally this would have been based on
> + * the domain size and this was evaluated. However,
> + * the benefit is similar across a range of workloads
> + * and machines but scaling by the domain size adds
> + * the risk that lower domains have to be rebalanced.
> + */
> + imbalance_min = 2;
> + if (busiest->sum_nr_running <= imbalance_min)
> + env->imbalance = 0;
Out of curiosity why have you decided to use the above instead of
env->imbalance -= min(env->imbalance, imbalance_adj);
Have you seen a perf regression with the min?
That being said, the proposal looks good to me. It is self-contained
and provides a perf improvement for some targeted use cases.
> + }
> +
> return;
> }
>
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 13:08 ` Vincent Guittot
@ 2020-01-17 14:15 ` Mel Gorman
2020-01-17 14:32 ` Phil Auld
2020-01-17 14:23 ` Phil Auld
1 sibling, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 14:15 UTC (permalink / raw)
To: Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, Jan 17, 2020 at 02:08:13PM +0100, Vincent Guittot wrote:
> > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > between NUMA domains regardless of utilisation levels. In many cases,
> > > this prevents communicating tasks being pulled apart. It was evaluated
> > > whether the imbalance should be scaled to the domain size. However, no
> > > additional benefit was measured across a range of workloads and machines
> > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > this could change again in the future, such a change should specify the
> > > use case and benefit.
> > >
> >
> > Any thoughts on whether this is ok for tip or are there suggestions on
> > an alternative approach?
>
> I have just finished running some tests on my system with your patch
> and I haven't seen any noticeable changes so far, which was a bit
> expected. The tests that I usually run use more than 4 tasks on my
> 2-node system;
This is indeed expected. With more active tasks, normal load balancing
applies.
> the only exception is perf sched pipe and the results
> for this test stay the same with and without your patch.
I never saw much difference with perf sched pipe either. It was
generally within the noise.
> I'm curious
> whether this impacts Phil's tests which run the LU.c benchmark with some
> CPU-burning tasks
I didn't see any problem with LU.c whether parallelised by openMPI or
openMP but an independent check would be nice.
--
Mel Gorman
SUSE Labs
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 13:08 ` Vincent Guittot
2020-01-17 14:15 ` Mel Gorman
@ 2020-01-17 14:23 ` Phil Auld
1 sibling, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 14:23 UTC (permalink / raw)
To: Vincent Guittot
Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
Hi,
On Fri, Jan 17, 2020 at 02:08:13PM +0100 Vincent Guittot wrote:
> Hi Mel,
>
>
> On Thu, 16 Jan 2020 at 17:35, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> > > Changelog since V3
> > > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> > > turned out to be as good or better than allowing an imbalance based on
> > > the group weight, without worrying about potential spillover into the
> > > lower scheduler domains.
> > >
> > > Changelog since V2
> > > o Only allow a small imbalance when utilisation is low to address reports that
> > > higher utilisation workloads were hitting corner cases.
> > >
> > > Changelog since V1
> > > o Alter code flow vincent.guittot
> > > o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> > > o Note that the division is still in place. Without it and taking
> > > imbalance_adj into account before the cutoff, two NUMA domains
> > > do not converge as being equally balanced when the number of
> > > busy tasks equals the size of one domain (50% of the sum).
> > >
> > > The CPU load balancer balances between different domains to spread load
> > > and strives to have equal balance everywhere. Communicating tasks can
> > > migrate so they are topologically close to each other but these decisions
> > > are independent. On a lightly loaded NUMA machine, two communicating tasks
> > > pulled together at wakeup time can be pushed apart by the load balancer.
> > > In isolation, the load balancer decision is fine but it ignores the tasks'
> > > data locality and the wakeup/LB paths continually conflict. NUMA balancing
> > > is also a factor but it also simply conflicts with the load balancer.
> > >
> > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > between NUMA domains regardless of utilisation levels. In many cases,
> > > this prevents communicating tasks being pulled apart. It was evaluated
> > > whether the imbalance should be scaled to the domain size. However, no
> > > additional benefit was measured across a range of workloads and machines
> > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > this could change again in the future, such a change should specify the
> > > use case and benefit.
> > >
> >
> > Any thoughts on whether this is ok for tip or are there suggestions on
> > an alternative approach?
>
> I have just finished running some tests on my system with your patch
> and I haven't seen any noticeable changes so far, which was a bit
> expected. The tests that I usually run use more than 4 tasks on my
> 2-node system; the only exception is perf sched pipe and the results
> for this test stay the same with and without your patch. I'm curious
> whether this impacts Phil's tests which run the LU.c benchmark with some
> CPU-burning tasks.
>
I'm not seeing much meaningful difference with this real v4 versus what I
posted earlier. It seems to have tightened up the range in some cases, but
otherwise it's hard to see much difference, which is a good thing. I'll
put the eye chart below for the curious.
I'll see if the perf team has time to run a full suite on it. But for this
case it looks fine.
Cheers,
Phil
The lbv4* one is the non-official v4 from the email thread. The other v4
is the real posted one. Otherwise the rest of this is the same as I posted
the other day, which described the test. https://lkml.org/lkml/2020/1/7/840
GROUP - LU.c and cpu hogs in separate cgroups
Mop/s - Higher is better
============76_GROUP========Mop/s===================================
min q1 median q3 max
5.4.0 1671.8 4211.2 6103.0 6934.1 7865.4
5.4.0 1777.1 3719.9 4861.8 5822.5 13479.6
5.4.0 2015.3 2716.2 5007.1 6214.5 9491.7
5.5-rc2 27641.0 30684.7 32091.8 33417.3 38118.1
5.5-rc2 27386.0 29795.2 32484.1 36004.0 37704.3
5.5-rc2 26649.6 29485.0 30379.7 33116.0 36832.8
lbv3 28496.3 29716.0 30634.8 32998.4 40945.2
lbv3 27294.7 29336.4 30186.0 31888.3 35839.1
lbv3 27099.3 29325.3 31680.1 35973.5 39000.0
lbv4* 27936.4 30109.0 31724.8 33150.7 35905.1
lbv4* 26431.0 29355.6 29850.1 32704.4 36060.3
lbv4* 27436.6 29945.9 31076.9 32207.8 35401.5
lbv4 28006.3 29861.1 31993.1 33469.3 34060.7
lbv4 28468.2 30057.7 31606.3 31963.5 35348.5
lbv4 25016.3 28897.5 29274.4 30229.2 36862.7
Runtime - Lower is better
============76_GROUP========time====================================
min q1 median q3 max
5.4.0 259.2 294.92 335.39 484.33 1219.61
5.4.0 151.3 351.1 419.4 551.99 1147.3
5.4.0 214.8 328.16 407.27 751.03 1011.77
5.5-rc2 53.49 61.03 63.56 66.46 73.77
5.5-rc2 54.08 56.67 62.78 68.44 74.45
5.5-rc2 55.36 61.61 67.14 69.16 76.51
lbv3 49.8 61.8 66.59 68.62 71.55
lbv3 56.89 63.95 67.55 69.51 74.7
lbv3 52.28 56.68 64.38 69.54 75.24
lbv4* 56.79 61.52 64.3 67.73 72.99
lbv4* 56.54 62.36 68.31 69.47 77.14
lbv4* 57.6 63.33 65.64 68.11 74.32
lbv4 59.86 60.93 63.74 68.28 72.81
lbv4 57.68 63.79 64.52 67.84 71.62
lbv4 55.31 67.46 69.66 70.56 81.51
NORMAL - LU.c and cpu hogs all in one cgroup
Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
min q1 median q3 max
5.4.0 32912.6 34047.5 36739.4 39124.1 41592.5
5.4.0 29937.7 33060.6 34860.8 39528.8 43328.1
5.4.0 31851.2 34281.1 35284.4 36016.8 38847.4
5.5-rc2 30475.6 32505.1 33977.3 34876 36233.8
5.5-rc2 30657.7 31301.1 32059.4 34396.7 38661.8
5.5-rc2 31022 32247.6 32628.9 33245 38572.3
lbv3 30606.4 32794.4 34258.6 35699 38669.2
lbv3 29722.7 30558.9 32731.2 36412 40752.3
lbv3 30297.7 32568.3 36654.6 38066.2 38988.3
lbv4* 30084.9 31227.5 32312.8 33222.8 36039.7
lbv4* 29875.9 32903.6 33803.1 34519.3 38663.5
lbv4* 27923.3 30631.1 32666.9 33516.7 36663.4
lbv4 30401.4 32559.5 33268.3 35012.9 35953.9
lbv4 29372.5 30677 32081.7 33734.2 39326.8
lbv4 29583.7 30432.5 32542.9 33170.5 34123.1
Runtime - Lower is better
============76_NORMAL========time====================================
min q1 median q3 max
5.4.0 49.02 52.115 55.58 59.89 61.95
5.4.0 47.06 51.615 58.57 61.68 68.11
5.4.0 52.49 56.615 57.795 59.48 64.02
5.5-rc2 56.27 58.47 60.02 62.735 66.91
5.5-rc2 52.74 59.295 63.605 65.145 66.51
5.5-rc2 52.86 61.335 62.495 63.23 65.73
lbv3 52.73 57.12 59.52 62.19 66.62
lbv3 50.03 56.02 62.39 66.725 68.6
lbv3 52.3 53.565 55.65 62.645 67.3
lbv4* 56.58 61.375 63.135 65.3 67.77
lbv4* 52.74 59.07 60.335 61.97 68.25
lbv4* 55.61 60.835 62.42 66.635 73.02
lbv4 56.71 58.235 61.295 62.63 67.07
lbv4 51.85 60.535 63.56 66.54 69.42
lbv4 59.75 61.48 62.655 67 68.92
> >
> > --
> > Mel Gorman
> > SUSE Labs
>
--
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 13:16 ` Vincent Guittot
@ 2020-01-17 14:26 ` Mel Gorman
2020-01-17 14:29 ` Vincent Guittot
0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 14:26 UTC (permalink / raw)
To: Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, Jan 17, 2020 at 02:16:15PM +0100, Vincent Guittot wrote:
> > A more interesting example is the Facebook schbench which uses a
> > number of messaging threads to communicate with worker threads. In this
> > configuration, one messaging thread is used per NUMA node and the number of
> > worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> > for response latency are then reported.
> >
> > Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
> > Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
> > Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
> > Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
> > Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
> > Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
> > Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
>
> > Which parameter changes between the above and below tests?
>
> > Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
> > Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
> > Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
> > Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
> > Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
> > Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
> > Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
> >
The number of worker pool threads. Above is 1 worker thread, below is 2.
> > @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > env->migration_type = migrate_task;
> > lsub_positive(&nr_diff, local->sum_nr_running);
> > env->imbalance = nr_diff >> 1;
> > - return;
> > - }
> > + } else {
> >
> > - /*
> > - * If there is no overload, we just want to even the number of
> > - * idle cpus.
> > - */
> > - env->migration_type = migrate_task;
> > - env->imbalance = max_t(long, 0, (local->idle_cpus -
> > + /*
> > + * If there is no overload, we just want to even the number of
> > + * idle cpus.
> > + */
> > + env->migration_type = migrate_task;
> > + env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > + }
> > +
> > + /* Consider allowing a small imbalance between NUMA groups */
> > + if (env->sd->flags & SD_NUMA) {
> > + unsigned int imbalance_min;
> > +
> > + /*
> > + * Compute an allowed imbalance based on a simple
> > + * pair of communicating tasks that should remain
> > + * local and ignore them.
> > + *
> > + * NOTE: Generally this would have been based on
> > + * the domain size and this was evaluated. However,
> > + * the benefit is similar across a range of workloads
> > + * and machines but scaling by the domain size adds
> > + * the risk that lower domains have to be rebalanced.
> > + */
> > + imbalance_min = 2;
> > + if (busiest->sum_nr_running <= imbalance_min)
> > + env->imbalance = 0;
>
> Out of curiosity why have you decided to use the above instead of
> env->imbalance -= min(env->imbalance, imbalance_adj);
>
> Have you seen a perf regression with the min?
>
I didn't see a regression with min() but at this point, we're only
dealing with the case of ignoring a small imbalance when the busiest
group is almost completely idle. The distinction between using min and
just ignoring the imbalance is almost irrelevant in that case.
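To make that concrete, a toy userspace sketch (invented values, not
kernel code) of the two variants when the busiest group holds only the
communicating pair:

  #include <stdio.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  int main(void)
  {
          /* e.g. 24 CPUs per node, the communicating pair on busiest */
          long local_idle = 24, busiest_idle = 22;
          long busiest_nr_running = 2, imbalance_min = 2;

          /* imbalance as computed from the idle CPU counts, halved */
          long imbalance = (local_idle - busiest_idle) >> 1;

          /* v4 as posted: ignore the small imbalance outright */
          long v4 = busiest_nr_running <= imbalance_min ? 0 : imbalance;

          /* suggested alternative: subtract the allowance instead */
          long alt = imbalance - MIN(imbalance, imbalance_min);

          /* both print 0, so no task would be pulled either way */
          printf("imbalance=%ld v4=%ld alt=%ld\n", imbalance, v4, alt);
          return 0;
  }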
--
Mel Gorman
SUSE Labs
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 14:26 ` Mel Gorman
@ 2020-01-17 14:29 ` Vincent Guittot
0 siblings, 0 replies; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 14:29 UTC (permalink / raw)
To: Mel Gorman
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, 17 Jan 2020 at 15:26, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jan 17, 2020 at 02:16:15PM +0100, Vincent Guittot wrote:
> > > A more interesting example is the Facebook schbench which uses a
> > > number of messaging threads to communicate with worker threads. In this
> > > configuration, one messaging thread is used per NUMA node and the number of
> > > worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> > > for response latency are then reported.
> > >
> > > Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
> > > Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
> > > Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
> > > Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
> > > Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
> > > Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
> > > Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
> >
> > Which parameter changes between the above and below tests?
> >
> > > Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
> > > Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
> > > Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
> > > Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
> > > Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
> > > Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
> > > Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
> > >
>
> The number of worker pool threads. Above is 1 worker thread, below is 2.
>
> > > @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > env->migration_type = migrate_task;
> > > lsub_positive(&nr_diff, local->sum_nr_running);
> > > env->imbalance = nr_diff >> 1;
> > > - return;
> > > - }
> > > + } else {
> > >
> > > - /*
> > > - * If there is no overload, we just want to even the number of
> > > - * idle cpus.
> > > - */
> > > - env->migration_type = migrate_task;
> > > - env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > + /*
> > > + * If there is no overload, we just want to even the number of
> > > + * idle cpus.
> > > + */
> > > + env->migration_type = migrate_task;
> > > + env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > busiest->idle_cpus) >> 1);
> > > + }
> > > +
> > > + /* Consider allowing a small imbalance between NUMA groups */
> > > + if (env->sd->flags & SD_NUMA) {
> > > + unsigned int imbalance_min;
> > > +
> > > + /*
> > > + * Compute an allowed imbalance based on a simple
> > > + * pair of communicating tasks that should remain
> > > + * local and ignore them.
> > > + *
> > > + * NOTE: Generally this would have been based on
> > > + * the domain size and this was evaluated. However,
> > > + * the benefit is similar across a range of workloads
> > > + * and machines but scaling by the domain size adds
> > > + * the risk that lower domains have to be rebalanced.
> > > + */
> > > + imbalance_min = 2;
> > > + if (busiest->sum_nr_running <= imbalance_min)
> > > + env->imbalance = 0;
> >
> > Out of curiosity why have you decided to use the above instead of
> > env->imbalance -= min(env->imbalance, imbalance_adj);
> >
> > Have you seen a perf regression with the min?
> >
>
> I didn't see a regression with min() but at this point, we're only
> dealing with the case of ignoring a small imbalance when the busiest
> group is almost completely idle. The distinction between using min and
> just ignoring the imbalance is almost irrelevant in that case.
yes you're right
>
> --
> Mel Gorman
> SUSE Labs
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 14:15 ` Mel Gorman
@ 2020-01-17 14:32 ` Phil Auld
0 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 14:32 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, Jan 17, 2020 at 02:15:03PM +0000 Mel Gorman wrote:
> On Fri, Jan 17, 2020 at 02:08:13PM +0100, Vincent Guittot wrote:
> > > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > > between NUMA domains regardless of utilisation levels. In many cases,
> > > > this prevents communicating tasks being pulled apart. It was evaluated
> > > > whether the imbalance should be scaled to the domain size. However, no
> > > > additional benefit was measured across a range of workloads and machines
> > > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > > this could change again in the future, such a change should specify the
> > > > use case and benefit.
> > > >
> > >
> > > Any thoughts on whether this is ok for tip or are there suggestions on
> > > an alternative approach?
> >
> > I have just finished running some tests on my system with your patch
> > and I haven't seen any noticeable changes so far, which was a bit
> > expected. The tests that I usually run use more than 4 tasks on my
> > 2-node system;
>
> This is indeed expected. With more active tasks, normal load balancing
> applies.
>
> > the only exception is perf sched pipe and the results
> > for this test stay the same with and without your patch.
>
> I never saw much difference with perf sched pipe either. It was
> generally within the noise.
>
> > I'm curious
> > whether this impacts Phil's tests which run the LU.c benchmark with some
> > CPU-burning tasks
>
> I didn't see any problem with LU.c whether parallelised by openMPI or
> openMP but an independent check would be nice.
>
My particular case is not straight up LU.c. It's the group imbalance
setup which was totally borken before Vincent's work. The test setup
is designed to show how the load balancer (used to) fail by using group
scaled "load" at larger (NUMA) domain levels. It's very susceptible to
imbalances so I wanted to make sure your patch allowing imbalances
didn't re-break it.
Cheers,
Phil
> --
> Mel Gorman
> SUSE Labs
>
--
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-16 16:35 ` Mel Gorman
2020-01-17 13:08 ` Vincent Guittot
@ 2020-01-17 14:37 ` Valentin Schneider
1 sibling, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2020-01-17 14:37 UTC (permalink / raw)
To: Mel Gorman, Vincent Guittot
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton,
Parth Shah, Rik van Riel, LKML
On 16/01/2020 16:35, Mel Gorman wrote:
> Any thoughts on whether this is ok for tip or are there suggestions on
> an alternative approach?
>
My main concern was about using the number of tasks instead of the number
of busy CPUs, which you're doing here, so I'm happy with that side of things.
As for the simpler imbalance heuristic, I don't have any issue with it
either. It's obvious that it caters to pairs of communicating tasks, and
we can try to extend it later on if required.
So yeah, FWIW:
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
2020-01-16 16:35 ` Mel Gorman
2020-01-17 13:16 ` Vincent Guittot
@ 2020-01-17 15:09 ` Vincent Guittot
2020-01-17 15:11 ` Peter Zijlstra
2020-01-17 15:21 ` Phil Auld
` (3 subsequent siblings)
6 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 15:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> turned out to be as good or better than allowing an imbalance based on
> the group weight, without worrying about potential spillover into the
> lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
> higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> o Note that the division is still in place. Without it and taking
> imbalance_adj into account before the cutoff, two NUMA domains
> do not converge as being equally balanced when the number of
> busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
> baseline lbnuma-v3
> Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
> Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
> Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
> Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
> Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
> Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
> Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
> Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
> Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
> Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
> Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
> Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
> Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
> Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
> Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
> Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
> Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
> Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
>
> In this case, there was a sizable improvement to performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
>
> Ops NUMA base-page range updates 19796.00 292.00
> Ops NUMA PTE updates 19796.00 292.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 16113.00 143.00
> Ops NUMA hint local faults % 8407.00 142.00
> Ops NUMA hint local percent 52.18 99.30
> Ops NUMA pages migrated 4244.00 1.00
>
> Without the patch, only 52.18% of sampled accesses are local. In an
> earlier changelog, 100% of sampled accesses were local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3% but note the "base-page range updates"
> and "PTE updates". The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated was related to
> shared libraries. A 2-socket Broadwell showed better results on average
> but they are not presented for brevity as the performance was similar,
> except that it showed 100% of the sampled NUMA hints were local. The patch
> holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
>
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
>
> A more interesting example is the Facebook schbench which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number of
> worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> for response latency are then reported.
>
> Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
> Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
> Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
> Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
> Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
> Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
> Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
> Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
> Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
> Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
> Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
> Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
> Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
> Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
>
> For higher worker threads, the differences become negligible but it's
> interesting to note the difference in wakeup latency at low utilisation
> and mpstat confirms that activity was almost all on one node until
> the number of worker threads increased.
>
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small. For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 15:09 ` Vincent Guittot
@ 2020-01-17 15:11 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2020-01-17 15:11 UTC (permalink / raw)
To: Vincent Guittot
Cc: Mel Gorman, Phil Auld, Ingo Molnar, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, Jan 17, 2020 at 04:09:33PM +0100, Vincent Guittot wrote:
> On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:
> > In general, the patch simply seeks to avoid unnecessary cross-node
> > migrations in the basic case where imbalances are very small. For low
> > utilisation communicating workloads, this patch generally behaves better
> > with less NUMA balancing activity. For high utilisation, there is no
> > change in behaviour.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Thanks all!
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
` (2 preceding siblings ...)
2020-01-17 15:09 ` Vincent Guittot
@ 2020-01-17 15:21 ` Phil Auld
2020-01-17 17:56 ` Srikar Dronamraju
` (2 subsequent siblings)
6 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 15:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, Jan 14, 2020 at 10:13:20AM +0000 Mel Gorman wrote:
> Changelog since V3
> o Allow a fixed imbalance a basic comparison with 2 tasks. This turned out to
> be as good or better than allowing an imbalance based on the group weight
> without worrying about potential spillover of the lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
> higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running vincent.guittot
> o Note that the division is still in place. Without it and taking
> imbalance_adj into account before the cutoff, two NUMA domains
> do not converge as being equally balanced when the number of
> busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
> baseline lbnuma-v3
> Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
> Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
> Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
> Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
> Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
> Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
> Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
> Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
> Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
> Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
> Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
> Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
> Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
> Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
> Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
> Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
> Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
> Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
>
> In this case, there was a sizable improvement to performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
>
> Ops NUMA base-page range updates 19796.00 292.00
> Ops NUMA PTE updates 19796.00 292.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 16113.00 143.00
> Ops NUMA hint local faults % 8407.00 142.00
> Ops NUMA hint local percent 52.18 99.30
> Ops NUMA pages migrated 4244.00 1.00
>
> Without the patch, only 52.18% of sampled accesses are local. In an
> earlier changelog, 100% of sampled accesses are local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3%, but note the "base-page range updates"
> and "PTE updates". The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated were related to
> shared libraries. A 2-socket Broadwell showed better results on average,
> but they are not presented for brevity as the performance was similar
> except that it showed 100% of the sampled NUMA hints were local. The patch
> holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
>
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
>
> A more interesting example is the Facebook schbench which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number of
> worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> for response latency are then reported.
>
> Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
> Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
> Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
> Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
> Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
> Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
> Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
> Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
> Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
> Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
> Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
> Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
> Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
> Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
>
> For higher worker-thread counts, the differences become negligible, but
> it's interesting to note the difference in wakeup latency at low
> utilisation; mpstat confirms that activity was almost all on one node
> until the number of worker threads increases.
>
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small. For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
> 1 file changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..ade7a8dca5e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> /*
> * Try to use spare capacity of local group without overloading it or
> * emptying busiest.
> - * XXX Spreading tasks across NUMA nodes is not always the best policy
> - * and special care should be taken for SD_NUMA domain level before
> - * spreading the tasks. For now, load_balance() fully relies on
> - * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
> */
> if (local->group_type == group_has_spare) {
> if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> env->migration_type = migrate_task;
> lsub_positive(&nr_diff, local->sum_nr_running);
> env->imbalance = nr_diff >> 1;
> - return;
> - }
> + } else {
>
> - /*
> - * If there is no overload, we just want to even the number of
> - * idle cpus.
> - */
> - env->migration_type = migrate_task;
> - env->imbalance = max_t(long, 0, (local->idle_cpus -
> + /*
> + * If there is no overload, we just want to even the number of
> + * idle cpus.
> + */
> + env->migration_type = migrate_task;
> + env->imbalance = max_t(long, 0, (local->idle_cpus -
> busiest->idle_cpus) >> 1);
> + }
> +
> + /* Consider allowing a small imbalance between NUMA groups */
> + if (env->sd->flags & SD_NUMA) {
> + unsigned int imbalance_min;
> +
> + /*
> + * Compute an allowed imbalance based on a simple
> + * pair of communicating tasks that should remain
> + * local and ignore them.
> + *
> + * NOTE: Generally this would have been based on
> + * the domain size and this was evaluated. However,
> + * the benefit is similar across a range of workloads
> + * and machines but scaling by the domain size adds
> + * the risk that lower domains have to be rebalanced.
> + */
> + imbalance_min = 2;
> + if (busiest->sum_nr_running <= imbalance_min)
> + env->imbalance = 0;
> + }
> +
> return;
> }
>
>
Works for me. I like this simplified version.
Acked-by: Phil Auld <pauld@redhat.com>
and/or
Tested-by: Phil Auld <pauld@redhat.com>
--
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
` (3 preceding siblings ...)
2020-01-17 15:21 ` Phil Auld
@ 2020-01-17 17:56 ` Srikar Dronamraju
2020-01-17 21:58 ` Mel Gorman
2020-01-21 9:59 ` Srikar Dronamraju
2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
6 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-17 17:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> Changelog since V3
> o Allow a fixed imbalance a basic comparison with 2 tasks. This turned out to
> be as good or better than allowing an imbalance based on the group weight
> without worrying about potential spillover of the lower scheduler domains.
>
We certainly are seeing better results than v1.
However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
the others are improving.
While numa04 improves by 14%, numa02 regresses by around 12%.
Setup:
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 8
Core(s) per socket: 1
Socket(s): 32
NUMA node(s): 8
Model: 2.1 (pvr 004b 0201)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: pHyp
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
L2 cache: 512K
L3 cache: 8192K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127
NUMA node4 CPU(s): 128-159
NUMA node5 CPU(s): 160-191
NUMA node6 CPU(s): 192-223
NUMA node7 CPU(s): 224-255
numa01 is a set of 2 processes, each running 128 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa02 is a single process with 256 threads;
each thread doing 800 loops on 32MB thread local memory operations.
numa03 is a single process with 256 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa05 is a set of 16 processes (twice as many as there are nodes), each running 16 threads;
each thread doing 50 loops on 3GB process shared memory operations.
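For illustration, a minimal sketch of what a numa02-style worker could look
like (this is not the actual benchmark source; the thread count, buffer size
and loop count are taken from the description above):

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_THREADS	256
	#define BUF_SIZE	(32UL << 20)	/* 32MB thread-local buffer */
	#define NR_LOOPS	800

	/* Each thread touches only its own buffer, so there is no sharing
	 * between threads; any cross-node traffic comes from placement alone.
	 */
	static void *worker(void *arg)
	{
		char *buf = malloc(BUF_SIZE);
		int i;

		for (i = 0; i < NR_LOOPS; i++)
			memset(buf, i, BUF_SIZE);

		free(buf);
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NR_THREADS];
		int i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&threads[i], NULL, worker, NULL);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}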
Details below:
Testcase Time: Min Max Avg StdDev
./numa01.sh Real: 513.12 547.37 530.25 17.12
./numa01.sh Sys: 107.73 146.26 127.00 19.26
./numa01.sh User: 122812.39 129136.61 125974.50 3162.11
./numa02.sh Real: 68.23 72.44 70.34 2.10
./numa02.sh Sys: 52.35 55.65 54.00 1.65
./numa02.sh User: 14334.37 14907.14 14620.76 286.38
./numa03.sh Real: 471.36 485.19 478.27 6.92
./numa03.sh Sys: 74.91 77.03 75.97 1.06
./numa03.sh User: 118197.30 121238.68 119717.99 1520.69
./numa04.sh Real: 450.35 454.93 452.64 2.29
./numa04.sh Sys: 362.49 397.95 380.22 17.73
./numa04.sh User: 93150.82 93300.60 93225.71 74.89
./numa05.sh Real: 361.18 366.32 363.75 2.57
./numa05.sh Sys: 678.72 726.32 702.52 23.80
./numa05.sh User: 82634.58 85103.97 83869.27 1234.70
Testcase Time: Min Max Avg StdDev %Change
./numa01.sh Real: 485.45 530.20 507.83 22.37 4.41486%
./numa01.sh Sys: 123.45 130.62 127.03 3.59 -0.0236165%
./numa01.sh User: 119152.08 127121.14 123136.61 3984.53 2.30467%
./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
./numa03.sh Real: 477.20 528.12 502.66 25.46 -4.85219%
./numa03.sh Sys: 88.93 115.36 102.15 13.21 -25.629%
./numa03.sh User: 119120.73 129829.89 124475.31 5354.58 -3.8219%
./numa04.sh Real: 374.70 414.76 394.73 20.03 14.6708%
./numa04.sh Sys: 357.14 379.20 368.17 11.03 3.27294%
./numa04.sh User: 87830.73 88547.21 88188.97 358.24 5.7113%
./numa05.sh Real: 369.50 401.56 385.53 16.03 -5.64937%
./numa05.sh Sys: 718.99 741.02 730.00 11.01 -3.76438%
./numa05.sh User: 84989.07 85271.75 85130.41 141.34 -1.48142%
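(For reference, the %Change column appears to be computed as
(base_avg - patched_avg) / patched_avg * 100, so a positive value means less
time, i.e. an improvement. For example, numa02 Real:
(70.34 - 80.59) / 80.59 * 100 = -12.72%, matching the -12.7187% above.)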
vmstat for numa01
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 2170524 2021927 -6.84613%
numa_hint_faults_local 376099 337768 -10.1917%
numa_hit 1177785 1149206 -2.4265%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 1176900 1149095 -2.36256%
numa_miss 0 0 NA
numa_other 885 111 -87.4576%
numa_pages_migrated 304670 292963 -3.84252%
numa_pte_updates 2171627 2022996 -6.84422%
pgfault 4469999 4266785 -4.54618%
pgmajfault 280 247 -11.7857%
pgmigrate_fail 1 0 -100%
pgmigrate_success 304670 292963 -3.84252%
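(A derived figure worth watching in these tables is the local hint-fault
ratio, numa_hint_faults_local / numa_hint_faults. For numa01 above it is
376099 / 2170524 ~= 17.3% before and 337768 / 2021927 ~= 16.7% after, so
locality barely moved despite the patch.)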
vmstat for numa02
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 496508 508975 2.51094%
numa_hint_faults_local 295974 282634 -4.50715%
numa_hit 585706 642712 9.73287%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 585700 642677 9.72802%
numa_miss 0 0 NA
numa_other 6 35 483.333%
numa_pages_migrated 199884 224448 12.2891%
numa_pte_updates 513146 525354 2.37905%
pgfault 1111950 1238982 11.4243%
pgmajfault 121 141 16.5289%
pgmigrate_fail 0 0 NA
pgmigrate_success 199884 224448 12.2891%
vmstat for numa03
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 863404 951850 10.2439%
numa_hint_faults_local 108422 120466 11.1084%
numa_hit 612432 592068 -3.3251%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 612384 592059 -3.319%
numa_miss 0 0 NA
numa_other 48 9 -81.25%
numa_pages_migrated 118517 121945 2.89241%
numa_pte_updates 865936 952055 9.94519%
pgfault 2291712 2325598 1.47863%
pgmajfault 155 113 -27.0968%
pgmigrate_fail 0 2 NA
pgmigrate_success 118517 121945 2.89241%
vmstat for numa04
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 8122814 7678754 -5.46682%
numa_hint_faults_local 3965028 4202779 5.9962%
numa_hit 2453692 2412929 -1.66129%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 2453668 2412815 -1.66498%
numa_miss 0 0 NA
numa_other 24 114 375%
numa_pages_migrated 1302687 1249958 -4.04771%
numa_pte_updates 8139895 7683560 -5.60615%
pgfault 10420191 10002382 -4.00961%
pgmajfault 145 166 14.4828%
pgmigrate_fail 0 1 NA
pgmigrate_success 1302687 1249958 -4.04771%
vmstat for numa05
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 252995 NA
numa_hint_faults 16968844 16706026 -1.54883%
numa_hint_faults_local 10525364 10167507 -3.39995%
numa_hit 4354639 3947252 -9.35524%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 4354568 3947234 -9.35418%
numa_miss 0 252995 NA
numa_other 71 253013 356256%
numa_pages_migrated 2398713 2288409 -4.59847%
numa_pte_updates 16997456 16760448 -1.39437%
pgfault 20471213 19945264 -2.56921%
pgmajfault 166 261 57.2289%
pgmigrate_fail 4 2 -50%
pgmigrate_success 2398713 2288409 -4.59847%
numa06 is a set of 2 processes, each running 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa07 is a single process with 32 threads;
each thread doing 800 loops on 32MB thread local memory operations.
numa08 is a single process with 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
each thread doing 50 loops on 3GB process shared memory operations.
numa10 is a set of 16 processes (twice as many as there are nodes), each running 2 threads;
each thread doing 50 loops on 3GB process shared memory operations.
Testcase Time: Min Max Avg StdDev
./numa06.sh Real: 81.30 85.29 83.30 2.00
./numa06.sh Sys: 6.15 8.64 7.40 1.24
./numa06.sh User: 2493.87 2499.31 2496.59 2.72
./numa07.sh Real: 17.01 18.47 17.74 0.73
./numa07.sh Sys: 2.08 2.33 2.21 0.13
./numa07.sh User: 396.38 427.87 412.12 15.74
./numa08.sh Real: 77.89 79.05 78.47 0.58
./numa08.sh Sys: 3.76 4.66 4.21 0.45
./numa08.sh User: 2396.50 2443.64 2420.07 23.57
./numa09.sh Real: 60.64 65.37 63.01 2.37
./numa09.sh Sys: 31.28 33.10 32.19 0.91
./numa09.sh User: 1666.04 1685.55 1675.80 9.75
./numa10.sh Real: 56.48 56.64 56.56 0.08
./numa10.sh Sys: 56.59 63.25 59.92 3.33
./numa10.sh User: 1487.83 1492.53 1490.18 2.35
Testcase Time: Min Max Avg StdDev %Change
./numa06.sh Real: 74.43 79.30 76.87 2.43 8.36477%
./numa06.sh Sys: 8.64 9.16 8.90 0.26 -16.8539%
./numa06.sh User: 2278.98 2376.25 2327.61 48.64 7.25981%
./numa07.sh Real: 14.32 14.59 14.46 0.14 22.6833%
./numa07.sh Sys: 2.02 2.09 2.05 0.04 7.80488%
./numa07.sh User: 338.27 349.57 343.92 5.65 19.8302%
./numa08.sh Real: 75.19 81.25 78.22 3.03 0.319611%
./numa08.sh Sys: 3.92 3.98 3.95 0.03 6.58228%
./numa08.sh User: 2320.61 2509.58 2415.10 94.48 0.205789%
./numa09.sh Real: 64.44 64.65 64.55 0.10 -2.38575%
./numa09.sh Sys: 32.11 39.12 35.61 3.51 -9.60404%
./numa09.sh User: 1700.54 1771.65 1736.10 35.56 -3.4733%
./numa10.sh Real: 56.78 57.61 57.20 0.42 -1.11888%
./numa10.sh Sys: 67.30 67.82 67.56 0.26 -11.3085%
./numa10.sh User: 1502.38 1502.95 1502.66 0.29 -0.830527%
vmstat for numa06
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 1401846 1317738 -5.9998%
numa_hint_faults_local 291501 254441 -12.7135%
numa_hit 490509 495083 0.932501%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 490506 495068 0.93006%
numa_miss 0 0 NA
numa_other 3 15 400%
numa_pages_migrated 224869 237124 5.44984%
numa_pte_updates 1401947 1317899 -5.99509%
pgfault 1817481 1775118 -2.33086%
pgmajfault 175 178 1.71429%
pgmigrate_fail 0 0 NA
pgmigrate_success 224869 237124 5.44984%
vmstat for numa07
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 90935 87129 -4.18541%
numa_hint_faults_local 52864 49110 -7.10124%
numa_hit 94632 91902 -2.88486%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 94632 91902 -2.88486%
numa_miss 0 0 NA
numa_other 0 0 NA
numa_pages_migrated 37232 37744 1.37516%
numa_pte_updates 92987 89177 -4.09735%
pgfault 171811 177212 3.14357%
pgmajfault 65 72 10.7692%
pgmigrate_fail 0 0 NA
pgmigrate_success 37232 37744 1.37516%
vmstat for numa08
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 656205 578320 -11.869%
numa_hint_faults_local 77425 85553 10.4979%
numa_hit 262903 246913 -6.08209%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 262902 246902 -6.08592%
numa_miss 0 0 NA
numa_other 1 11 1000%
numa_pages_migrated 115615 94939 -17.8835%
numa_pte_updates 656300 578399 -11.8697%
pgfault 1000775 879013 -12.1668%
pgmajfault 80 173 116.25%
pgmigrate_fail 0 0 NA
pgmigrate_success 115615 94939 -17.8835%
vmstat for numa09
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 0 NA
numa_hint_faults 5292059 5086197 -3.89002%
numa_hint_faults_local 2771125 2463519 -11.1004%
numa_hit 1993632 2043106 2.4816%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 1993631 2043076 2.48015%
numa_miss 0 0 NA
numa_other 1 30 2900%
numa_pages_migrated 1154157 1223564 6.01365%
numa_pte_updates 5313698 5098234 -4.05488%
pgfault 6531964 6196370 -5.13772%
pgmajfault 83 121 45.7831%
pgmigrate_fail 0 0 NA
pgmigrate_success 1154157 1223564 6.01365%
vmstat for numa10
param last_patch with_patch %Change
----- ---------- ---------- -------
numa_foreign 0 195343 NA
numa_hint_faults 9745914 10968959 12.5493%
numa_hint_faults_local 6331681 7146416 12.8676%
numa_hit 3533392 3466916 -1.88136%
numa_huge_pte_updates 0 0 NA
numa_interleave 0 0 NA
numa_local 3533392 3466908 -1.88159%
numa_miss 0 195343 NA
numa_other 0 195351 NA
numa_pages_migrated 1930180 2050279 6.22217%
numa_pte_updates 9798861 11018095 12.4426%
pgfault 11544963 12744348 10.3888%
pgmajfault 83 154 85.5422%
pgmigrate_fail 0 0 NA
pgmigrate_success 1930180 2050279 6.22217%
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 17:56 ` Srikar Dronamraju
@ 2020-01-17 21:58 ` Mel Gorman
2020-01-20 8:09 ` Srikar Dronamraju
0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 21:58 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
>
> > Changelog since V3
> > o Allow a fixed imbalance a basic comparison with 2 tasks. This turned out to
> > be as good or better than allowing an imbalance based on the group weight
> > without worrying about potential spillover of the lower scheduler domains.
> >
>
> We certainly are seeing better results than v1.
> However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> the others are improving.
>
> While numa04 improves by 14%, numa02 regresses by around 12%.
>
Ok, so it's both a win and a loss. It is curious that this patch would be
the primary factor given that the logic only triggers when the local group
has spare capacity and the busiest group is nearly idle. The test cases
you describe should have fairly busy local groups.
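To make that concrete, the behavioural change condenses to roughly the
following (a paraphrase of the hunk quoted later in the thread, not new
logic):

	/* Only case the patch alters: balancing at the NUMA level when the
	 * local group has spare capacity and the busiest group is running
	 * no more than a pair of (presumably communicating) tasks.
	 */
	if (local->group_type == group_has_spare &&
	    (env->sd->flags & SD_NUMA) &&
	    busiest->sum_nr_running <= 2)
		env->imbalance = 0;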
> Setup:
> Architecture: ppc64le
> Byte Order: Little Endian
> CPU(s): 256
> On-line CPU(s) list: 0-255
> Thread(s) per core: 8
> Core(s) per socket: 1
> Socket(s): 32
> NUMA node(s): 8
> Model: 2.1 (pvr 004b 0201)
> Model name: POWER8 (architected), altivec supported
> Hypervisor vendor: pHyp
> Virtualization type: para
> L1d cache: 64K
> L1i cache: 32K
> L2 cache: 512K
> L3 cache: 8192K
> NUMA node0 CPU(s): 0-31
> NUMA node1 CPU(s): 32-63
> NUMA node2 CPU(s): 64-95
> NUMA node3 CPU(s): 96-127
> NUMA node4 CPU(s): 128-159
> NUMA node5 CPU(s): 160-191
> NUMA node6 CPU(s): 192-223
> NUMA node7 CPU(s): 224-255
>
> numa01 is a set of 2 processes, each running 128 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
Are the shared operations shared between the 2 processes? 256 threads
in total would more than exceed the capacity of a local group; even 128
threads per process would exceed the capacity of the local group. In such
a situation, much would depend on the locality of the accesses as well
as any shared accesses.
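For the machine described above, assuming each NUMA-level group spans a
single node: 8 nodes x 32 CPUs = 256 CPUs, so a local group has 32 CPUs;
128 threads per process is 4x that and 256 threads saturate the machine.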
> numa02 is a single process with 256 threads;
> each thread doing 800 loops on 32MB thread local memory operations.
>
This one is more interesting. False sharing shouldn't be an issue so the
threads should be independent.
> numa03 is a single process with 256 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
>
Similar.
> numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
>
Less clear as you don't say what is sharing the memory operations.
> numa05 is a set of 16 processes (twice as many as there are nodes), each running 16 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
>
Again, hard to tell because the shared memory operations are not described.
Of all of these, numa02 is the most interesting as it's the simplest
case showing a problem.
> Details below:
How many iterations for each test?
> Testcase Time: Min Max Avg StdDev
> ./numa01.sh Real: 513.12 547.37 530.25 17.12
> ./numa01.sh Sys: 107.73 146.26 127.00 19.26
> ./numa01.sh User: 122812.39 129136.61 125974.50 3162.11
> ./numa02.sh Real: 68.23 72.44 70.34 2.10
> ./numa02.sh Sys: 52.35 55.65 54.00 1.65
> ./numa02.sh User: 14334.37 14907.14 14620.76 286.38
> ./numa03.sh Real: 471.36 485.19 478.27 6.92
> ./numa03.sh Sys: 74.91 77.03 75.97 1.06
> ./numa03.sh User: 118197.30 121238.68 119717.99 1520.69
> ./numa04.sh Real: 450.35 454.93 452.64 2.29
> ./numa04.sh Sys: 362.49 397.95 380.22 17.73
> ./numa04.sh User: 93150.82 93300.60 93225.71 74.89
> ./numa05.sh Real: 361.18 366.32 363.75 2.57
> ./numa05.sh Sys: 678.72 726.32 702.52 23.80
> ./numa05.sh User: 82634.58 85103.97 83869.27 1234.70
> Testcase Time: Min Max Avg StdDev %Change
> ./numa01.sh Real: 485.45 530.20 507.83 22.37 4.41486%
> ./numa01.sh Sys: 123.45 130.62 127.03 3.59 -0.0236165%
> ./numa01.sh User: 119152.08 127121.14 123136.61 3984.53 2.30467%
The number of iterations is unknown in general but there is a lot of
overlap between the min and max ranges and the range is wide. It may or
may not be a gain overall.
Before range: 513 to 547
After range: 485 to 530
> ./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
> ./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
> ./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
Before range: 68 to 72
After range: 78 to 82
This one is more interesting in general. Can you add trace_printks to
the check for SD_NUMA the patch introduces and dump the sum_nr_running
for both local and busiest when the imbalance is ignored please? That
might give some hint as to the improper conditions where imbalance is
ignored.
However, knowing the number of iterations would be helpful. Can you also
tell me whether this is consistent between boots or always roughly a 12%
regression regardless of the number of iterations?
> ./numa03.sh Real: 477.20 528.12 502.66 25.46 -4.85219%
> ./numa03.sh Sys: 88.93 115.36 102.15 13.21 -25.629%
> ./numa03.sh User: 119120.73 129829.89 124475.31 5354.58 -3.8219%
Range before: 471 to 485
Range after: 477 to 528
> ./numa04.sh Real: 374.70 414.76 394.73 20.03 14.6708%
> ./numa04.sh Sys: 357.14 379.20 368.17 11.03 3.27294%
> ./numa04.sh User: 87830.73 88547.21 88188.97 358.24 5.7113%
Range before: 450 to 454
Range after: 374 to 414
Big gain there but the fact the range changed so much is a concern and
makes me wonder if this case is stable from boot to boot.
> ./numa05.sh Real: 369.50 401.56 385.53 16.03 -5.64937%
> ./numa05.sh Sys: 718.99 741.02 730.00 11.01 -3.76438%
> ./numa05.sh User: 84989.07 85271.75 85130.41 141.34 -1.48142%
>
Big range changes again but the shared memory operations complicate
matters. I think it's best to focus on numa02 and identify whether there
is an improper condition where the patch has an impact: the local group
has high utilisation but spare capacity while the busiest group is
almost completely idle.
> vmstat for numa01
I'm not going to comment in detail on these other than noting that NUMA
balancing is heavily active in all cases which may be masking any effect
of the patch and may have unstable results in general.
> <SNIP vmstat>
> <SNIP description of loads that showed gains>
>
> numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
>
No description of shared operations but NUMA balancing is very active so
sharing is probably between processes.
> numa10 is a set of 16 processes (twice as many as there are nodes), each running 2 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
>
Again, shared accesses without description and heavy NUMA balancing
activity.
So bottom line, a lot of these cases have shared operations where NUMA
balancing decisions should dominate and make it hard to detect any impact
from the patch. The exception is numa02 so please add tracing and dump
out local and busiest sum_nr_running when the imbalance is ignored. I
want to see if it's as simple as the local group being very busy but having
capacity while the busiest group is almost idle. I also want to see how
many times over the course of the numa02 workload that the conditions
for the patch are even met.
Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-17 21:58 ` Mel Gorman
@ 2020-01-20 8:09 ` Srikar Dronamraju
2020-01-20 8:33 ` Mel Gorman
0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-20 8:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2020-01-17 21:58:53]:
> On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> >
> > We certainly are seeing better results than v1.
> > However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> > the others are improving.
> >
> > While numa04 improves by 14%, numa02 regresses by around 12%.
> >
> Ok, so it's both a win and a loss. It is curious that this patch would be
> the primary factor given that the logic only triggers when the local group
> has spare capacity and the busiest group is nearly idle. The test cases
> you describe should have fairly busy local groups.
>
Right, your code only seems to affect when the local group has spare
capacity and the busiest->sum_nr_running <=2
> >
> > numa01 is a set of 2 processes, each running 128 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
>
> Are the shared operations shared between the 2 processes? 256 threads
> in total would more than exceed the capacity of a local group; even 128
> threads per process would exceed the capacity of the local group. In such
> a situation, much would depend on the locality of the accesses as well
> as any shared accesses.
Except for numa02 and numa07 (both do thread-local memory operations), all
shared operations are within the process, i.e. per-process sharing.
>
> > numa02 is a single process with 256 threads;
> > each thread doing 800 loops on 32MB thread local memory operations.
> >
>
> This one is more interesting. False sharing shouldn't be an issue so the
> threads should be independent.
>
> > numa03 is a single process with 256 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Similar.
This is similar to numa01, except now all threads belong to just one
process.
>
> > numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Less clear as you don't say what is sharing the memory operations.
All sharing is within the process. In numa04/numa09, I spawn as many
processes as there are nodes; other than that it's the same as numa02.
>
> > numa05 is a set of 16 processes (twice as many as there are nodes), each running 16 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> > Details below:
>
> How many iterations for each test?
I run 5 iterations. Want me to run with more iterations?
>
>
> > ./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
> > ./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
> > ./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
>
> Before range: 68 to 72
> After range: 78 to 82
>
> This one is more interesting in general. Can you add trace_printks to
> the check for SD_NUMA the patch introduces and dump the sum_nr_running
> for both local and busiest when the imbalance is ignored please? That
> might give some hint as to the improper conditions where imbalance is
> ignored.
Can be done. Will get back with the results. But do let me know if you want
me to run with more iterations or rerun the tests.
>
> However, knowing the number of iterations would be helpful. Can you also
> tell me if this is consistent between boots or is it always roughly 12%
> regression regardless of the number of iterations?
>
I have only measured for 5 iterations and I haven't repeated the runs to
see whether the numbers are consistent.
> > ./numa03.sh Real: 477.20 528.12 502.66 25.46 -4.85219%
> > ./numa03.sh Sys: 88.93 115.36 102.15 13.21 -25.629%
> > ./numa03.sh User: 119120.73 129829.89 124475.31 5354.58 -3.8219%
>
> Range before: 471 to 485
> Range after: 477 to 528
>
> > ./numa04.sh Real: 374.70 414.76 394.73 20.03 14.6708%
> > ./numa04.sh Sys: 357.14 379.20 368.17 11.03 3.27294%
> > ./numa04.sh User: 87830.73 88547.21 88188.97 358.24 5.7113%
>
> Range before: 450 to 454
> Range after: 374 to 414
>
> Big gain there but the fact the range changed so much is a concern and
> makes me wonder if this case is stable from boot to boot.
>
> > ./numa05.sh Real: 369.50 401.56 385.53 16.03 -5.64937%
> > ./numa05.sh Sys: 718.99 741.02 730.00 11.01 -3.76438%
> > ./numa05.sh User: 84989.07 85271.75 85130.41 141.34 -1.48142%
> >
>
> Big range changes again but the shared memory operations complicate
> matters. I think it's best to focus on numa02 and identify whether there
> is an improper condition where the patch has an impact: the local group
> has high utilisation but spare capacity while the busiest group is
> almost completely idle.
>
> > vmstat for numa01
>
> I'm not going to comment in detail on these other than noting that NUMA
> balancing is heavily active in all cases which may be masking any effect
> of the patch and may have unstable results in general.
>
> > <SNIP vmstat>
> > <SNIP description of loads that showed gains>
> >
> > numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> No description of shared operations but NUMA balancing is very active so
> sharing is probably between processes.
>
> > numa10 is a set of 16 processes (twice as many as there are nodes), each running 2 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> >
>
> Again, shared accesses without description and heavy NUMA balancing
> activity.
>
> So bottom line, a lot of these cases have shared operations where NUMA
> balancing decisions should dominate and make it hard to detect any impact
> from the patch. The exception is numa02 so please add tracing and dump
> out local and busiest sum_nr_running when the imbalance is ignored. I
> want to see if it's as simple as the local group being very busy but having
> capacity while the busiest group is almost idle. I also want to see how
> many times over the course of the numa02 workload that the conditions
> for the patch are even met.
>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-20 8:09 ` Srikar Dronamraju
@ 2020-01-20 8:33 ` Mel Gorman
2020-01-20 17:27 ` Srikar Dronamraju
0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-20 8:33 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Mon, Jan 20, 2020 at 01:39:35PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-17 21:58:53]:
>
> > On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > > * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> > >
> > > We certainly are seeing better results than v1.
> > > However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> > > the others are improving.
> > >
> > > While numa04 improves by 14%, numa02 regresses by around 12%.
> > >
>
> > Ok, so it's both a win and a loss. It is curious that this patch would be
> > the primary factor given that the logic only triggers when the local group
> > has spare capacity and the busiest group is nearly idle. The test cases
> > you describe should have fairly busy local groups.
> >
>
> Right, your code only seems to affect when the local group has spare
> capacity and the busiest->sum_nr_running <=2
>
And this is why I'm curious as to why your workload is affected at all
because it uses many tasks. I stopped allowing an imbalance for higher
task counts partially on the basis of your previous report.
> > This one is more interesting. False sharing shouldn't be an issue so the
> > threads should be independent.
> >
> > > numa03 is a single process with 256 threads;
> > > each thread doing 50 loops on 3GB process shared memory operations.
> > >
> >
> > Similar.
>
> This is similar to numa01, except now all threads belong to just one
> process.
>
My concern is that the shared memory operations mean that NUMA balancing
and false sharing can dominate and hide any impact of the patch itself.
Whether it has good or bad results may be partially down to luck.
> >
> > > numa05 is a set of 16 processes (twice as many as there are nodes), each running 16 threads;
> > > each thread doing 50 loops on 3GB process shared memory operations.
> > >
> >
> > > Details below:
> >
> > How many iterations for each test?
>
> I run 5 iterations. Want me to run with more iterations?
>
5 should be enough for now. I'm more interested in hearing if the
regression/gain is consistent when the patch is applied and a confirmation
that the patch really makes a difference to this set of workloads.
> >
> >
> > > ./numa02.sh Real: 78.87 82.31 80.59 1.72 -12.7187%
> > > ./numa02.sh Sys: 81.18 85.07 83.12 1.94 -35.0337%
> > > ./numa02.sh User: 16303.70 17122.14 16712.92 409.22 -12.5182%
> >
> > Before range: 68 to 72
> > After range: 78 to 82
> >
> > This one is more interesting in general. Can you add trace_printks to
> > the check for SD_NUMA the patch introduces and dump the sum_nr_running
> > for both local and busiest when the imbalance is ignored please? That
> > might give some hint as to the improper conditions where imbalance is
> > ignored.
>
> Can be done. Will get back with the results. But do let me know if you want
> me to run with more iterations or rerun the tests.
>
The results of this will be interesting in themselves. I'm particularly
interested in seeing what the traces look like for a good and a bad result.
> >
> > However, knowing the number of iterations would be helpful. Can you also
> > tell me whether this is consistent between boots or always roughly a 12%
> > regression regardless of the number of iterations?
> >
>
> I have only measured for 5 iterations and I haven't repeated the runs to
> see whether the numbers are consistent.
>
Ok, that is quite a problem as the assertion at the moment is that the
patch causes a mix of regressions/gains. It's currently unclear to me
why the patch would have a major impact on this workload at all given the
number of active tasks and the nature of the patch. I'm concerned that
the workload may be naturally unstable but tracing may be able to help.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-20 8:33 ` Mel Gorman
@ 2020-01-20 17:27 ` Srikar Dronamraju
2020-01-20 18:21 ` Mel Gorman
0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-20 17:27 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
> And this is why I'm curious as to why your workload is affected at all
> because it uses many tasks. I stopped allowing an imbalance for higher
> task counts partially on the basis of your previous report.
>
With this hunk on top of your patch and 5 runs of numa02, there were 0
traces.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ade7a8dca5e4..7506cf67bde8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8714,8 +8714,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* the risk that lower domains have to be rebalanced.
*/
imbalance_min = 2;
- if (busiest->sum_nr_running <= imbalance_min)
+ if (busiest->sum_nr_running <= imbalance_min) {
+ trace_printk("Resetting imbalance: busiest->sum_nr_running=%d, local->sum_nr_running=%d\n", busiest->sum_nr_running, local->sum_nr_running);
env->imbalance = 0;
+ }
}
return;
perf stat for the 5 iterations this time shows:
77.817 +- 0.995 seconds time elapsed ( +- 1.28% )
which I think is significantly less than last time around.
So I think some other noise may have contributed to the jump last time. Also,
since the runtime of numa02 is very small, a small disturbance can show up as
a big number in percentage terms.
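(For anyone reproducing this, the summary line above is the format printed
by perf's repeat mode, i.e. a hypothetical invocation along the lines of

	perf stat -r 5 -- ./numa02.sh

where -r 5 runs the command five times and reports the mean +- stddev.)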
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-20 17:27 ` Srikar Dronamraju
@ 2020-01-20 18:21 ` Mel Gorman
2020-01-21 8:55 ` Srikar Dronamraju
0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-20 18:21 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Mon, Jan 20, 2020 at 10:57:06PM +0530, Srikar Dronamraju wrote:
> > And this is why I'm curious as to why your workload is affected at all
> > because it uses many tasks. I stopped allowing an imbalance for higher
> > task counts partially on the basis of your previous report.
> >
>
> With this hunk on top of your patch and 5 runs of numa02, there were 0
> traces.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ade7a8dca5e4..7506cf67bde8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8714,8 +8714,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> * the risk that lower domains have to be rebalanced.
> */
> imbalance_min = 2;
> - if (busiest->sum_nr_running <= imbalance_min)
> + if (busiest->sum_nr_running <= imbalance_min) {
> + trace_printk("Resetting imbalance: busiest->sum_nr_running=%d, local->sum_nr_running=%d\n", busiest->sum_nr_running, local->sum_nr_running);
> env->imbalance = 0;
> + }
> }
>
> return;
>
Ok, thanks. The absence of traces indicates that the patch should have no
effect at all and that any difference in performance is a coincidence. What
about the other test programs?
>
> perf stat for the 5 iterations this time shows:
> 77.817 +- 0.995 seconds time elapsed ( +- 1.28% )
> which I think is significantly less than last time around.
>
> So I think some other noise may have contributed to the jump last time. Also,
> since the runtime of numa02 is very small, a small disturbance can show up as
> a big number in percentage terms.
Understood. At the moment, I'm going to assume that the patch has zero
impact on your workload but confirmation that the other test programs
trigger no traces would be appreciated.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-20 18:21 ` Mel Gorman
@ 2020-01-21 8:55 ` Srikar Dronamraju
2020-01-21 9:11 ` Mel Gorman
0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-21 8:55 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:
> Understood. At the moment, I'm going to assume that the patch has zero
> impact on your workload but confirmation that the other test programs
> trigger no traces would be appreciated.
>
Yes, I confirm there were no traces when run with other test programs too.
> --
> Mel Gorman
> SUSE Labs
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-21 8:55 ` Srikar Dronamraju
@ 2020-01-21 9:11 ` Mel Gorman
2020-01-21 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-21 9:11 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, Jan 21, 2020 at 02:25:01PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:
>
> > Understood. At the moment, I'm going to assume that the patch has zero
> > impact on your workload but confirmation that the other test programs
> > trigger no traces would be appreciated.
> >
>
> Yes, I confirm there were no traces when run with other test programs too.
>
Ok, great, thanks for confirming that!
Peter or Ingo, I think at this point all review comments have been
addressed. Is there anything else you'd like before picking the patch
up?
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
` (4 preceding siblings ...)
2020-01-17 17:56 ` Srikar Dronamraju
@ 2020-01-21 9:59 ` Srikar Dronamraju
2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
6 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-21 9:59 UTC (permalink / raw)
To: Mel Gorman
Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
* Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> Changelog since V3
> o Allow a fixed imbalance a basic comparison with 2 tasks. This turned out to
> be as good or better than allowing an imbalance based on the group weight
> without worrying about potential spillover of the lower scheduler domains.
>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
--
Thanks and Regards
Srikar Dronamraju
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
2020-01-21 9:11 ` Mel Gorman
@ 2020-01-21 10:42 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2020-01-21 10:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Srikar Dronamraju, Vincent Guittot, Phil Auld, Ingo Molnar,
Valentin Schneider, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
On Tue, Jan 21, 2020 at 09:11:48AM +0000, Mel Gorman wrote:
> On Tue, Jan 21, 2020 at 02:25:01PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:
> >
> > > Understood. At the moment, I'm going to assume that the patch has zero
> > > impact on your workload but confirmation that the other test programs
> > > trigger no traces would be appreciated.
> > >
> >
> > Yes, I confirm there were no traces when run with other test programs too.
> >
>
> Ok, great, thanks for confirming that!
>
> Peter or Ingo, I think at this point all review comments have been
> addressed. Is there anything else you'd like before picking the patch
> up?
I've already queued it a few days ago, should show up in tip soonish :-)
^ permalink raw reply [flat|nested] 24+ messages in thread
* [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
` (5 preceding siblings ...)
2020-01-21 9:59 ` Srikar Dronamraju
@ 2020-01-29 11:32 ` tip-bot2 for Mel Gorman
6 siblings, 0 replies; 24+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-01-29 11:32 UTC (permalink / raw)
To: linux-tip-commits
Cc: Mel Gorman, Peter Zijlstra (Intel),
Ingo Molnar, Valentin Schneider, Vincent Guittot,
Srikar Dronamraju, Phil Auld, x86, LKML
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b396f52326de20ec974471b7b19168867b365cbf
Gitweb: https://git.kernel.org/tip/b396f52326de20ec974471b7b19168867b365cbf
Author: Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 14 Jan 2020 10:13:20
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 28 Jan 2020 21:36:55 +01:00
sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.
This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.
The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.
2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
baseline lbnuma-v3
Hmean 64 568.73 ( 0.00%) 577.56 * 1.55%*
Hmean 128 1089.98 ( 0.00%) 1128.06 * 3.49%*
Hmean 256 2061.72 ( 0.00%) 2104.39 * 2.07%*
Hmean 1024 7254.27 ( 0.00%) 7557.52 * 4.18%*
Hmean 2048 11729.20 ( 0.00%) 13350.67 * 13.82%*
Hmean 3312 15309.08 ( 0.00%) 18058.95 * 17.96%*
Hmean 4096 17338.75 ( 0.00%) 20483.66 * 18.14%*
Hmean 8192 25047.12 ( 0.00%) 27806.84 * 11.02%*
Hmean 16384 27359.55 ( 0.00%) 33071.88 * 20.88%*
Stddev 64 2.16 ( 0.00%) 2.02 ( 6.53%)
Stddev 128 2.31 ( 0.00%) 2.19 ( 5.05%)
Stddev 256 11.88 ( 0.00%) 3.22 ( 72.88%)
Stddev 1024 23.68 ( 0.00%) 7.24 ( 69.43%)
Stddev 2048 79.46 ( 0.00%) 71.49 ( 10.03%)
Stddev 3312 26.71 ( 0.00%) 57.80 (-116.41%)
Stddev 4096 185.57 ( 0.00%) 96.15 ( 48.19%)
Stddev 8192 245.80 ( 0.00%) 100.73 ( 59.02%)
Stddev 16384 207.31 ( 0.00%) 141.65 ( 31.67%)
In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not universal.
For most machines, the impact was roughly a 3% performance gain.
Ops NUMA base-page range updates 19796.00 292.00
Ops NUMA PTE updates 19796.00 292.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 16113.00 143.00
Ops NUMA hint local faults % 8407.00 142.00
Ops NUMA hint local percent 52.18 99.30
Ops NUMA pages migrated 4244.00 1.00
Without the patch, only 52.18% of sampled accesses are local. In an
earlier changelog, 100% of sampled accesses are local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rate was 99.3%, but note the "base-page range updates"
and "PTE updates". The activity with the patch is negligible, as was
the number of faults. The small number of pages migrated were related to
shared libraries. A 2-socket Broadwell showed better results on average,
but they are not presented for brevity as the performance was similar
except that it showed 100% of the sampled NUMA hints were local. The patch
holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.
A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.
Lat 50.00th-qrtle-1 44.00 ( 0.00%) 37.00 ( 15.91%)
Lat 75.00th-qrtle-1 53.00 ( 0.00%) 41.00 ( 22.64%)
Lat 90.00th-qrtle-1 57.00 ( 0.00%) 42.00 ( 26.32%)
Lat 95.00th-qrtle-1 63.00 ( 0.00%) 43.00 ( 31.75%)
Lat 99.00th-qrtle-1 76.00 ( 0.00%) 51.00 ( 32.89%)
Lat 99.50th-qrtle-1 89.00 ( 0.00%) 52.00 ( 41.57%)
Lat 99.90th-qrtle-1 98.00 ( 0.00%) 55.00 ( 43.88%)
Lat 50.00th-qrtle-2 42.00 ( 0.00%) 42.00 ( 0.00%)
Lat 75.00th-qrtle-2 48.00 ( 0.00%) 47.00 ( 2.08%)
Lat 90.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 95.00th-qrtle-2 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-2 62.00 ( 0.00%) 60.00 ( 3.23%)
Lat 99.50th-qrtle-2 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.90th-qrtle-2 68.00 ( 0.00%) 66.00 ( 2.94%)
For higher worker-thread counts, the differences become negligible, but
it's interesting to note the difference in wakeup latency at low
utilisation; mpstat confirms that activity was almost all on one node
until the number of worker threads increases.
Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.
An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the patch
cares about. Similarly, there were reports of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.
In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small. For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
1 file changed, 29 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d7..25dffc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,10 +8658,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8701,16 +8697,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_min;
+
+			/*
+			 * Compute an allowed imbalance based on a simple
+			 * pair of communicating tasks that should remain
+			 * local and ignore them.
+			 *
+			 * NOTE: Generally this would have been based on
+			 * the domain size and this was evaluated. However,
+			 * the benefit is similar across a range of workloads
+			 * and machines but scaling by the domain size adds
+			 * the risk that lower domains have to be rebalanced.
+			 */
+			imbalance_min = 2;
+			if (busiest->sum_nr_running <= imbalance_min)
+				env->imbalance = 0;
+		}
+
 		return;
 	}