* [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs

From: Mel Gorman <mgorman@techsingularity.net>
Date: 2021-12-10 9:33 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
    LKML, Mel Gorman

Changelog since V3
o Calculate imb_numa_nr for multiple SD_NUMA domains
o Restore behaviour where communicating pairs remain on the same node

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent use of scheduler domain
weights and suboptimal performance when there are many LLCs per NUMA node.

Mel Gorman (2):
  sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans
    multiple LLCs

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 +++++++++++++++++----------------
 kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 17 deletions(-)

--
2.31.1

* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group

From: Mel Gorman <mgorman@techsingularity.net>
Date: 2021-12-10 9:33 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
    LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/* Consider allowing a small imbalance between NUMA groups */
 	if (env->sd->flags & SD_NUMA) {
 		env->imbalance = adjust_numa_imbalance(env->imbalance,
-			busiest->sum_nr_running, busiest->group_weight);
+			busiest->sum_nr_running, env->sd->span_weight);
	}
 
 	return;
--
2.31.1

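[For context: the check gating this path allows an imbalance only while the
destination runs fewer tasks than a quarter of the weight it is handed; the
review below quotes the same "4 * dst_running >= dst_weight" condition. A
minimal userspace sketch, assuming a hypothetical 16-CPU LLC group and a
128-CPU SD_NUMA span (illustrative numbers, not a measured machine), shows
how much the choice of weight moves that cutoff:]

#include <stdio.h>

/* Sketch of the 5.16-era gate: an imbalance is only considered while
 * dst_running is below a quarter of the supplied weight. */
static int allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

int main(void)
{
	int group_weight = 16;	/* child (LLC) group weight, hypothetical */
	int span_weight = 128;	/* SD_NUMA domain weight, hypothetical */

	/* Cutoff with the child group's weight: 16 >> 2 = 4 tasks */
	printf("group cutoff: %d tasks\n", group_weight >> 2);
	/* Cutoff with the SD_NUMA span, as find_idlest_group uses: 32 tasks */
	printf("span cutoff:  %d tasks\n", span_weight >> 2);

	/* At 4 running tasks the group-based gate is already closed while
	 * the span-based gate still permits packing. */
	printf("allow at 4 vs group: %d\n", allow_numa_imbalance(4, group_weight));
	printf("allow at 4 vs span:  %d\n", allow_numa_imbalance(4, span_weight));
	return 0;
}
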
* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group

From: Vincent Guittot
Date: 2021-12-21 10:53 UTC
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML

On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.

I agree that find_busiest_group and find_idlest_group should be
consistent and use the same parameters, but I wonder if the
sched_domain's weight is the right one to use instead of the target
group's weight.

IIRC, the goal of adjust_numa_imbalance is to keep some threads on
the same node as long as we consider that there is no performance
impact from sharing resources; they can even take advantage of
locality if they interact. So we consider that tasks will not be
impacted by sharing resources if they use less than 25% of the CPUs
of a node. If we use sd->span_weight instead, we consider that we can
pack threads in the same node as long as they use less than 25% of
the CPUs in all nodes.

> This patch uses the SD_NUMA weight in both.
>
> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/* Consider allowing a small imbalance between NUMA groups */
>  	if (env->sd->flags & SD_NUMA) {
>  		env->imbalance = adjust_numa_imbalance(env->imbalance,
> -			busiest->sum_nr_running, busiest->group_weight);
> +			busiest->sum_nr_running, env->sd->span_weight);
>  	}
>
>  	return;
> --
> 2.31.1

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group

From: Mel Gorman <mgorman@techsingularity.net>
Date: 2021-12-21 11:32 UTC
To: Vincent Guittot
Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML

On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
>
> I agree that find_busiest_group and find_idlest_group should be
> consistent and use the same parameters, but I wonder if the
> sched_domain's weight is the right one to use instead of the target
> group's weight.

Ok

> IIRC, the goal of adjust_numa_imbalance is to keep some threads on
> the same node as long as we consider that there is no performance
> impact from sharing resources; they can even take advantage of
> locality if they interact.

Yes.

> So we consider that tasks will not be impacted by sharing resources
> if they use less than 25% of the CPUs of a node. If we use
> sd->span_weight instead, we consider that we can pack threads in the
> same node as long as they use less than 25% of the CPUs in all nodes.

I assume you mean the target group weight instead of the node. The
primary resource we are concerned with is memory bandwidth, and it's a
guess because we do not know for sure where memory channels are or how
they are configured in this context, and it may or may not be correlated
with groups. I think using the group instead would deserve a series on
its own after settling on an imbalance number when there are multiple
LLCs per node.

--
Mel Gorman
SUSE Labs

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group

From: Vincent Guittot
Date: 2021-12-21 13:05 UTC
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML

On Tue, 21 Dec 2021 at 12:32, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> > On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > find_busiest_group uses the child domain's group weight instead of
> > > the sched_domain's weight that has SD_NUMA set when calculating the
> > > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > > with find_idlest_group.
> >
> > I agree that find_busiest_group and find_idlest_group should be
> > consistent and use the same parameters, but I wonder if the
> > sched_domain's weight is the right one to use instead of the target
> > group's weight.
>
> Ok
>
> > IIRC, the goal of adjust_numa_imbalance is to keep some threads on
> > the same node as long as we consider that there is no performance
> > impact from sharing resources; they can even take advantage of
> > locality if they interact.
>
> Yes.
>
> > So we consider that tasks will not be impacted by sharing resources
> > if they use less than 25% of the CPUs of a node. If we use
> > sd->span_weight instead, we consider that we can pack threads in the
> > same node as long as they use less than 25% of the CPUs in all nodes.
>
> I assume you mean the target group weight instead of the node. The

I wanted to say that with this patch, we consider the imbalance
acceptable if the number of threads in a node is less than 25% of all
CPUs of all nodes (for this numa level), but 25% of all CPUs of all
nodes can be more than the number of CPUs in the group. So I would
have changed find_idlest_group instead of changing find_busiest_group.

> primary resource we are concerned with is memory bandwidth, and it's a
> guess because we do not know for sure where memory channels are or how
> they are configured in this context, and it may or may not be correlated
> with groups. I think using the group instead would deserve a series on
> its own after settling on an imbalance number when there are multiple
> LLCs per node.

I haven't looked yet at patch 2 for multiple LLCs per node.

> --
> Mel Gorman
> SUSE Labs

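[To put numbers on that concern, a quick sketch with a hypothetical
topology that is not one measured in this thread: at a NUMA level spanning
eight 16-CPU nodes, 25% of the span is 32 tasks, twice the capacity of any
single group, so the span-based check can keep packing a node that is
already full:]

#include <stdio.h>

int main(void)
{
	/* Hypothetical: a NUMA level spanning 8 nodes of 16 CPUs each */
	int group_weight = 16;		/* CPUs in the target group/node */
	int span_weight = 8 * 16;	/* CPUs across all nodes at this level */

	int span_cutoff = span_weight >> 2;	/* 32 tasks */
	int group_cutoff = group_weight >> 2;	/*  4 tasks */

	printf("25%% of span:  %d tasks\n", span_cutoff);
	printf("25%% of group: %d tasks\n", group_cutoff);
	/* span_cutoff (32) exceeds group_weight (16): packing remains
	 * "allowed" past the point where the group has any CPUs left. */
	return 0;
}
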
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Mel Gorman <mgorman@techsingularity.net>
Date: 2021-12-10 9:33 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
    Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
    LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and, due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one
instance per LLC and without binding, the results are

                              5.16.0-rc1             5.16.0-rc1
                                 vanilla       sched-numaimb-v4
MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per instance, the vanilla kernel
allows threads to contend on cache. With the patch:

                                  5.16.0-rc1             5.16.0-rc1
                                     vanilla    sched-numaimb-v4r24
Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla       sched-numaimb-v4
Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 +++++++++++++++++----------------
 kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..972ba586b113 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int dst_weight,
+					int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-						  env->dst_stats.weight);
+						  env->dst_stats.weight,
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task on this node to stay local
+			 * to its wakeup source if the number of running tasks
+			 * are below the allowed imbalance. If there is a real
+			 * need of migration, periodic load balance will take
+			 * care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
 				return NULL;
 		}
 
@@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-#define NUMA_IMBALANCE_MIN 2
-
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int dst_weight,
+				int imb_numa_nr)
 {
 	if (!allow_numa_imbalance(dst_running, dst_weight))
 		return imbalance;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the destination is lightly loaded.
-	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/* Consider allowing a small imbalance between NUMA groups */
 	if (env->sd->flags & SD_NUMA) {
 		env->imbalance = adjust_numa_imbalance(env->imbalance,
-			busiest->sum_nr_running, env->sd->span_weight);
+			busiest->sum_nr_running, env->sd->span_weight,
+			env->sd->imb_numa_nr);
 	}
 
 	return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..bacec575ade2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top = sd;
+				unsigned int llc_sq;
+
+				/*
+				 * nr_llcs = (top->span_weight / llc_weight);
+				 * imb = (child_weight / nr_llcs) >> 2
+				 *
+				 * is equivalent to
+				 *
+				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 *
+				 */
+				llc_sq = child->span_weight * child->span_weight;
+
+				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
+				imb_span = sd->span_weight;
+
+				sd->imb_numa_nr = imb;
+			} else {
+				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
--
2.31.1

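[To make the arithmetic in the topology.c hunk concrete, a small standalone
userspace sketch (not kernel code; the Zen3-like weights of 16 CPUs per
LLC, a 128-CPU DIE and a 256-CPU NUMA domain are assumptions for
illustration) mirrors the imb/imb_numa_nr calculation. It produces
imb_numa_nr = 2 at the DIE level and 4 at the 256-CPU NUMA level, matching
the NPS=1 values derived in the review below:]

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* Hypothetical Zen3-like NPS=1 topology: 16 CPUs per LLC,
	 * DIE spans 128 CPUs, the parent NUMA domain spans 256 CPUs. */
	unsigned int llc_weight = 16;
	unsigned int die_weight = 128;	/* first domain above the LLC */
	unsigned int numa_weight = 256;	/* parent NUMA domain */

	/* imb = (llc_weight^2 / span_weight) >> 2, floored at 2 */
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = max_u(2U, (llc_sq / die_weight) >> 2);
	unsigned int imb_span = die_weight;

	/* Parent domains scale imb by how many imb_spans they contain */
	unsigned int numa_imb = imb * (numa_weight / imb_span);

	printf("DIE  imb_numa_nr = %u\n", imb);      /* 2 */
	printf("NUMA imb_numa_nr = %u\n", numa_imb); /* 4 */
	return 0;
}
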
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Gautham R. Shenoy
Date: 2021-12-13 8:28 UTC
To: Mel Gorman
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
    Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and, due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one
> instance per LLC and without binding, the results are
>
>                               5.16.0-rc1             5.16.0-rc1
>                                  vanilla       sched-numaimb-v4
> MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)

Could you please share the size of the stream array? These numbers
are higher than what I am observing.

> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
>
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per instance, the vanilla kernel
> allows threads to contend on cache. With the patch:
>
>                                   5.16.0-rc1             5.16.0-rc1
>                                      vanilla    sched-numaimb-v4r24
> Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
> Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
> Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
> Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
> CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
>
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla       sched-numaimb-v4
> Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
> Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
> Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
> Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
> Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
> Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
> Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
> Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
> Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
> Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
> Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
> Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
> Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
> Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
> Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
> Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
> Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 36 +++++++++++++++++----------------
>  kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
>  3 files changed, 57 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
>
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..972ba586b113 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>
>  	struct numa_stats src_stats, dst_stats;
>
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int dst_weight,
> +					int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -						  env->dst_stats.weight);
> +						  env->dst_stats.weight,
> +						  env->imb_numa_nr);
>
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>
>  	/*
> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;
>  		}
>
> @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  }
>
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int dst_weight,
> +				int imb_numa_nr)
>  {
>  	if (!allow_numa_imbalance(dst_running, dst_weight))
>  		return imbalance;
>

If (4 * dst_running >= dst_weight) we return imbalance here. The
dst_weight here corresponds to the span of the domain, while
dst_running is the nr_running in busiest.

On Zen3, at the topmost NUMA domain, the dst_weight = 256 across all
the configurations of Nodes Per Socket (NPS) = 1/2/4. There are two
groups, where each group is a socket. So, unless there are at least 64
tasks running in one of the sockets, we would not return imbalance
here and go to the next step.

> -	/*
> -	 * Allow a small imbalance based on a simple pair of communicating
> -	 * tasks that remain local when the destination is lightly loaded.
> -	 */
> -	if (imbalance <= NUMA_IMBALANCE_MIN)
> +	if (imbalance <= imb_numa_nr)

In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have
PREFER_SIBLING, we would be balancing the number of idle CPUs. We will
end up allowing the imbalance as long as the difference between the
idle CPUs is at least 8.

In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
will not rebalance unless the difference between the idle CPUs is 16.

In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
threshold is now bumped up to 32.

>  		return 0;
>
>  	return imbalance;
> @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/* Consider allowing a small imbalance between NUMA groups */
>  	if (env->sd->flags & SD_NUMA) {
>  		env->imbalance = adjust_numa_imbalance(env->imbalance,
> -			busiest->sum_nr_running, env->sd->span_weight);
> +			busiest->sum_nr_running, env->sd->span_weight,
> +			env->sd->imb_numa_nr);
>  	}
>
>  	return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..bacec575ade2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top = sd;

We don't seem to be using top anywhere where sd may not be used, since
we already have the variables imb and imb_span to record the
top->imb_numa_nr and top->span_weight.

> +				unsigned int llc_sq;
> +
> +				/*
> +				 * nr_llcs = (top->span_weight / llc_weight);
> +				 * imb = (child_weight / nr_llcs) >> 2

child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2?

> +				 *
> +				 * is equivalent to
> +				 *
> +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 *
> +				 */
> +				llc_sq = child->span_weight * child->span_weight;
> +
> +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> +				imb_span = sd->span_weight;

On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
with NPS=1
top = DIE.
top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.

with NPS=2
top = NODE.
top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.

with NPS=4
top = NODE.
top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.

On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
with NPS=1
top = DIE.
top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.

with NPS=2
top = NODE.
top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.

with NPS=4
top = NODE.
top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.

> +
> +				sd->imb_numa_nr = imb;
> +			} else {
> +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +			}

On Zen3,
with NPS=1
sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.

with NPS=2
sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8

with NPS=4
sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16

For Zen2, since the imb_span and imb values are the same as the
corresponding NPS=x values on Zen3, the imb_numa_nr values are the
same as well since the corresponding sd->span_weight is the same.

If we look at the highest NUMA domain, there are two groups in all the
NPS configurations. There are the same number of LLCs in each of these
groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
on Zen2). However, the imb_numa_nr at this domain varies with the NPS
value, since we compute the imb_numa_nr value relative to the number
of "top" domains that can fit within this NUMA domain. This is
because the size of the "top" domain varies with the NPS value. This
shows up in the benchmark results.

The numbers with stream, tbench and YCSB + Mongodb are as follows:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  113716.62 (0.00 pct)   218961.59 (92.55 pct)   217130.07 (90.93 pct)
Scale:  110996.89 (0.00 pct)   216674.73 (95.20 pct)   220765.94 (98.89 pct)
  Add:  124504.19 (0.00 pct)   253461.32 (103.57 pct)  260273.88 (109.04 pct)
Triad:  122890.43 (0.00 pct)   247552.00 (101.44 pct)  252615.62 (105.56 pct)

NPS=2
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  58217.00 (0.00 pct)    204630.34 (251.49 pct)  191312.73 (228.62 pct)
Scale:  55004.76 (0.00 pct)    212142.88 (285.68 pct)  175499.15 (219.06 pct)
  Add:  63269.04 (0.00 pct)    254752.56 (302.64 pct)  203571.50 (221.75 pct)
Triad:  62178.25 (0.00 pct)    247290.80 (297.71 pct)  198988.70 (220.02 pct)

NPS=4
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  37986.66 (0.00 pct)    254183.87 (569.13 pct)  48748.87 (28.33 pct)
Scale:  35471.22 (0.00 pct)    237804.76 (570.41 pct)  48317.82 (36.21 pct)
  Add:  39303.25 (0.00 pct)    292285.20 (643.66 pct)  54259.59 (38.05 pct)
Triad:  39319.85 (0.00 pct)    285284.30 (625.54 pct)  54503.98 (38.61 pct)

We can see that with the v4 patch, for NPS=2 and NPS=4, the gains start
diminishing since the thresholds are higher than for NPS=1.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  137362.66 (0.00 pct)   236661.65 (72.28 pct)   241148.65 (75.55 pct)
Scale:  126742.24 (0.00 pct)   214568.17 (69.29 pct)   226416.41 (78.64 pct)
  Add:  148236.33 (0.00 pct)   257114.42 (73.44 pct)   272030.50 (83.51 pct)
Triad:  146913.25 (0.00 pct)   241880.88 (64.64 pct)   259873.61 (76.88 pct)

NPS=2
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  107143.94 (0.00 pct)   244922.66 (128.59 pct)  198299.91 (85.07 pct)
Scale:  102004.90 (0.00 pct)   218738.55 (114.43 pct)  177890.23 (74.39 pct)
  Add:  117760.23 (0.00 pct)   270516.24 (129.71 pct)  211458.30 (79.56 pct)
Triad:  115927.92 (0.00 pct)   255985.20 (120.81 pct)  197812.60 (70.63 pct)

NPS=4
Test:   tip/sched/core         mel-v3                  mel-v4
 Copy:  111653.17 (0.00 pct)   253912.17 (127.41 pct)  48898.34 (-56.20 pct)
Scale:  105289.35 (0.00 pct)   223710.85 (112.47 pct)  48426.03 (-54.00 pct)
  Add:  120927.64 (0.00 pct)   277701.20 (129.64 pct)  54425.48 (-54.99 pct)
Triad:  117659.97 (0.00 pct)   259473.84 (120.52 pct)  54622.82 (-53.57 pct)

With -DNTIMES=100, each of the Copy, Scale, Add, Triad kernels runs for
a longer duration, so the test takes longer (6-10 seconds), giving the
load balancer sufficient time to place and balance the tasks. In this
configuration we see that v4 shows some degradation at NPS=4. This is
due to the imb_numa_nr being higher compared to v3.

While Stream benefits from spreading, it is fair to understand the
gains that we make with benchmarks that would prefer the tasks
co-located instead of spread out. I chose tbench and YCSB+Mongodb as
representatives of these. The numbers are as follows:

~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets
~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Clients:  tip/sched/core        mel-v3                 mel-v4
    1     633.25 (0.00 pct)     619.18 (-2.22 pct)     632.96 (-0.04 pct)
    2     1152.54 (0.00 pct)    1189.91 (3.24 pct)     1184.84 (2.80 pct)
    4     1946.53 (0.00 pct)    2177.45 (11.86 pct)    1979.62 (1.69 pct)
    8     3554.65 (0.00 pct)    3565.16 (0.29 pct)     3678.13 (3.47 pct)
   16     6222.00 (0.00 pct)    6484.89 (4.22 pct)     6256.02 (0.54 pct)
   32     11707.57 (0.00 pct)   12185.93 (4.08 pct)    12006.63 (2.55 pct)
   64     18433.50 (0.00 pct)   19537.03 (5.98 pct)    19088.57 (3.55 pct)
  128     27400.07 (0.00 pct)   31771.53 (15.95 pct)   27265.00 (-0.49 pct)
  256     33195.27 (0.00 pct)   24478.67 (-26.25 pct)  34065.60 (2.62 pct)
  512     41633.10 (0.00 pct)   54833.20 (31.70 pct)   46724.00 (12.22 pct)
 1024     53877.23 (0.00 pct)   56363.37 (4.61 pct)    44813.10 (-16.82 pct)

NPS=2
Clients:  tip/sched/core        mel-v3                 mel-v4
    1     629.76 (0.00 pct)     620.94 (-1.40 pct)     629.22 (-0.08 pct)
    2     1177.01 (0.00 pct)    1203.27 (2.23 pct)     1169.12 (-0.66 pct)
    4     1990.97 (0.00 pct)    2228.18 (11.91 pct)    1888.39 (-5.15 pct)
    8     3535.45 (0.00 pct)    3620.76 (2.41 pct)     3662.72 (3.59 pct)
   16     6309.02 (0.00 pct)    6548.66 (3.79 pct)     6508.67 (3.16 pct)
   32     12038.73 (0.00 pct)   12145.97 (0.89 pct)    11411.50 (-5.21 pct)
   64     18599.67 (0.00 pct)   19448.87 (4.56 pct)    17146.07 (-7.81 pct)
  128     27861.57 (0.00 pct)   30630.53 (9.93 pct)    28217.30 (1.27 pct)
  256     28215.80 (0.00 pct)   26864.67 (-4.78 pct)   29330.47 (3.95 pct)
  512     44239.67 (0.00 pct)   52822.47 (19.40 pct)   42652.63 (-3.58 pct)
 1024     54403.53 (0.00 pct)   53905.57 (-0.91 pct)   48490.30 (-10.86 pct)

NPS=4
Clients:  tip/sched/core        mel-v3                 mel-v4
    1     622.68 (0.00 pct)     617.87 (-0.77 pct)     667.38 (7.17 pct)
    2     1160.74 (0.00 pct)    1182.40 (1.86 pct)     1294.12 (11.49 pct)
    4     1961.29 (0.00 pct)    2172.41 (10.76 pct)    2477.76 (26.33 pct)
    8     3664.25 (0.00 pct)    3450.80 (-5.82 pct)    4067.42 (11.00 pct)
   16     6495.53 (0.00 pct)    5873.41 (-9.57 pct)    6931.66 (6.71 pct)
   32     11833.27 (0.00 pct)   12010.43 (1.49 pct)    12710.60 (7.41 pct)
   64     17723.50 (0.00 pct)   18416.23 (3.90 pct)    18793.47 (6.03 pct)
  128     27724.83 (0.00 pct)   27894.50 (0.61 pct)    27948.60 (0.80 pct)
  256     31351.70 (0.00 pct)   23944.43 (-23.62 pct)  35430.17 (13.00 pct)
  512     43383.43 (0.00 pct)   49830.63 (14.86 pct)   43877.83 (1.13 pct)
 1024     46974.27 (0.00 pct)   53583.83 (14.07 pct)   50563.23 (7.64 pct)

With NPS=4, with v4, we see no regressions with tbench compared to
tip/sched/core and there is a considerable improvement in most cases.
So, the higher imb_numa_nr helps pack the tasks, which is beneficial
to tbench.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
YCSB + Mongodb. 4 client instances, 256 threads per client instance.
These threads have a very low utilization. The overall system
utilization was in the range of 16-20%.
YCSB workload type: A
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
             tip/sched/core   mel-v3          mel-v4
Throughput   351611.0         314981.33       329026.33
                              (-10.42 pct)    (-6.42 pct)

NPS=4
             tip/sched/core   mel-v3          mel-v4
Throughput   315808.0         316600.67       331093.67
                              (0.25 pct)      (4.84 pct)

Since at NPS=4 the imb_numa_nr is 8 and 16 respectively at the lower
and higher NUMA domains, the task spreading happens more reluctantly
compared to v3, where the imb_numa_nr was 1 in both the domains.

--
Thanks and Regards
gautham.

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: Mel Gorman <mgorman@techsingularity.net>
Date: 2021-12-13 13:01 UTC
To: Gautham R. Shenoy
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
    Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > On a Zen3 machine running STREAM parallelised with OMP to have one
> > instance per LLC and without binding, the results are
> >
> >                               5.16.0-rc1             5.16.0-rc1
> >                                  vanilla       sched-numaimb-v4
> > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
>
> Could you please share the size of the stream array? These numbers
> are higher than what I am observing.

512MB

> > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	}
> >  }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -				int dst_running, int dst_weight)
> > +				int dst_running, int dst_weight,
> > +				int imb_numa_nr)
> >  {
> >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> >  		return imbalance;
>
> If (4 * dst_running >= dst_weight) we return imbalance here. The
> dst_weight here corresponds to the span of the domain, while
> dst_running is the nr_running in busiest.

Yes, once dst_running is high enough, no imbalance is allowed. In
previous versions I changed this but that was a mistake and in this
version, the threshold where imbalance is not allowed remains the same.

> On Zen3, at the topmost NUMA domain, the dst_weight = 256 across all
> the configurations of Nodes Per Socket (NPS) = 1/2/4. There are two
> groups, where each group is a socket. So, unless there are at least 64
> tasks running in one of the sockets, we would not return imbalance
> here and go to the next step.

Yes

> > -	/*
> > -	 * Allow a small imbalance based on a simple pair of communicating
> > -	 * tasks that remain local when the destination is lightly loaded.
> > -	 */
> > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > +	if (imbalance <= imb_numa_nr)
>
> In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have
> PREFER_SIBLING, we would be balancing the number of idle CPUs. We will
> end up allowing the imbalance as long as the difference between the
> idle CPUs is at least 8.
>
> In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> will not rebalance unless the difference between the idle CPUs is 16.
>
> In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> threshold is now bumped up to 32.
>
> >  		return 0;
> >
> >  	return imbalance;
> > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  	/* Consider allowing a small imbalance between NUMA groups */
> >  	if (env->sd->flags & SD_NUMA) {
> >  		env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -			busiest->sum_nr_running, env->sd->span_weight);
> > +			busiest->sum_nr_running, env->sd->span_weight,
> > +			env->sd->imb_numa_nr);
> >  	}
> >
> >  	return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..bacec575ade2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> >
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top = sd;
>
> We don't seem to be using top anywhere where sd may not be used, since
> we already have the variables imb and imb_span to record the
> top->imb_numa_nr and top->span_weight.

Top could have been removed but we might still need it.

> > +				unsigned int llc_sq;
> > +
> > +				/*
> > +				 * nr_llcs = (top->span_weight / llc_weight);
> > +				 * imb = (child_weight / nr_llcs) >> 2
>
> child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2?

That would be clearer.

> > +				 *
> > +				 * is equivalent to
> > +				 *
> > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 *
> > +				 */
> > +				llc_sq = child->span_weight * child->span_weight;
> > +
> > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > +				imb_span = sd->span_weight;
>
> On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
> with NPS=1
> top = DIE.
> top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
>
> with NPS=2
> top = NODE.
> top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
>
> with NPS=4
> top = NODE.
> top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
>
> On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
> with NPS=1
> top = DIE.
> top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
>
> with NPS=2
> top = NODE.
> top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
>
> with NPS=4
> top = NODE.
> top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
>
> > +
> > +				sd->imb_numa_nr = imb;
> > +			} else {
> > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +			}
>
> On Zen3,
> with NPS=1
> sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
>
> with NPS=2
> sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
>
> with NPS=4
> sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
>
> For Zen2, since the imb_span and imb values are the same as the
> corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> same as well since the corresponding sd->span_weight is the same.
>
> If we look at the highest NUMA domain, there are two groups in all the
> NPS configurations. There are the same number of LLCs in each of these
> groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> on Zen2). However, the imb_numa_nr at this domain varies with the NPS
> value, since we compute the imb_numa_nr value relative to the number
> of "top" domains that can fit within this NUMA domain. This is
> because the size of the "top" domain varies with the NPS value. This
> shows up in the benchmark results.

This was intentional to have some scaling but based on your results, the
scaling might be at the wrong level.

> The numbers with stream, tbench and YCSB + Mongodb are as follows:
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> Zen3, 64C128T per socket, 2 sockets
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> NPS=1
> Test:   tip/sched/core         mel-v3                  mel-v4
>  Copy:  113716.62 (0.00 pct)   218961.59 (92.55 pct)   217130.07 (90.93 pct)
> Scale:  110996.89 (0.00 pct)   216674.73 (95.20 pct)   220765.94 (98.89 pct)
>   Add:  124504.19 (0.00 pct)   253461.32 (103.57 pct)  260273.88 (109.04 pct)
> Triad:  122890.43 (0.00 pct)   247552.00 (101.44 pct)  252615.62 (105.56 pct)
>
> NPS=2
> Test:   tip/sched/core         mel-v3                  mel-v4
>  Copy:  58217.00 (0.00 pct)    204630.34 (251.49 pct)  191312.73 (228.62 pct)
> Scale:  55004.76 (0.00 pct)    212142.88 (285.68 pct)  175499.15 (219.06 pct)
>   Add:  63269.04 (0.00 pct)    254752.56 (302.64 pct)  203571.50 (221.75 pct)
> Triad:  62178.25 (0.00 pct)    247290.80 (297.71 pct)  198988.70 (220.02 pct)
>
> NPS=4
> Test:   tip/sched/core         mel-v3                  mel-v4
>  Copy:  37986.66 (0.00 pct)    254183.87 (569.13 pct)  48748.87 (28.33 pct)
> Scale:  35471.22 (0.00 pct)    237804.76 (570.41 pct)  48317.82 (36.21 pct)
>   Add:  39303.25 (0.00 pct)    292285.20 (643.66 pct)  54259.59 (38.05 pct)
> Triad:  39319.85 (0.00 pct)    285284.30 (625.54 pct)  54503.98 (38.61 pct)

At minimum, v3 is a failure because a single pair of communicating tasks
was getting split across NUMA domains and the allowed numa imbalance
was cut off too early because of the change to allow_numa_imbalance.
So while it's a valid comparison, it's definitely not the fix.

Given how you describe NPS, maybe the scaling should only start at the
point where tasks are no longer balanced between sibling domains. Can
you try this? I've only boot tested it at this point. It should work for
STREAM at least but probably not great for tbench.

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bacec575ade2..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
 			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
-				struct sched_domain *top = sd;
+				struct sched_domain *top, *top_p;
 				unsigned int llc_sq;
 
 				/*
-				 * nr_llcs = (top->span_weight / llc_weight);
-				 * imb = (child_weight / nr_llcs) >> 2
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
 				 *
 				 * is equivalent to
 				 *
-				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
 				 *
 				 */
 				llc_sq = child->span_weight * child->span_weight;
 
-				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
-				imb_span = sd->span_weight;
-
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
 				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
 			} else {
-				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
 			}
 		}
 	}

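[Assuming the same hypothetical Zen3-like weights as before, a quick
userspace sketch of the revised calculation, with imb_span now taken from
the first parent without SD_PREFER_SIBLING, reproduces the NPS=2 values
derived later in the thread (NODE = 2, NUMA-128 = 2, NUMA-256 = 4):]

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	/* Hypothetical Zen3-like NPS=2 topology: 16 CPUs per LLC, a
	 * 64-CPU NODE domain, then NUMA domains spanning 128 and 256
	 * CPUs. NUMA domains do not set SD_PREFER_SIBLING, so the
	 * parent walk stops at the first NUMA level (128 CPUs). */
	unsigned int llc_weight = 16, node_weight = 64;
	unsigned int numa_weights[] = { 128, 256 };

	/* imb = (llc_weight^2 / sd->span_weight) >> 2, floored at 2 */
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = max_u(2U, (llc_sq / node_weight) >> 2);
	unsigned int imb_span = numa_weights[0];

	printf("NODE      imb_numa_nr = %u\n", imb); /* 2 */
	for (int i = 0; i < 2; i++) {
		unsigned int factor = max_u(1U, numa_weights[i] / imb_span);
		printf("NUMA-%-4u imb_numa_nr = %u\n",
		       numa_weights[i], imb * factor); /* 2, then 4 */
	}
	return 0;
}
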
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-13 13:01 ` Mel Gorman @ 2021-12-13 14:47 ` Gautham R. Shenoy 2021-12-15 11:52 ` Gautham R. Shenoy 0 siblings, 1 reply; 22+ messages in thread From: Gautham R. Shenoy @ 2021-12-13 14:47 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Mon, Dec 13, 2021 at 01:01:31PM +0000, Mel Gorman wrote: > On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote: > > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > > > per LLC the results and without binding, the results are > > > > > > 5.16.0-rc1 5.16.0-rc1 > > > vanilla sched-numaimb-v4 > > > MB/sec copy-16 166712.18 ( 0.00%) 651540.22 ( 290.82%) > > > MB/sec scale-16 140109.66 ( 0.00%) 382254.74 ( 172.83%) > > > MB/sec add-16 160791.18 ( 0.00%) 623073.98 ( 287.51%) > > > MB/sec triad-16 160043.84 ( 0.00%) 633964.52 ( 296.12%) > > > > > > Could you please share the size of the stream array ? These numbers > > are higher than what I am observing. > > > > 512MB Thanks, I will try with this one. > > > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int dst_weight, > > > + int imb_numa_nr) > > > { > > > if (!allow_numa_imbalance(dst_running, dst_weight)) > > > return imbalance; > > > > > > > if (4 * dst_running >= dst_weight) we return imbalance here. The > > dst_weight here corresponds to the span of the domain, while > > dst_running is the nr_running in busiest. > > > > Yes, once dst_running is high enough, no imbalance is allowed. In > previous versions I changed this but that was a mistake and in this > version, the threshold where imbalance is not allowed remains the same. > > > On Zen3, at the top most NUMA domain, the dst_weight = 256 across in > > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are > > two groups, where each group is a socket. So, unless there are at > > least 64 tasks running in one of the sockets, we would not return > > imbalance here and go to the next step. > > > > Yes > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains > > don't have PREFER_SIBLING, we would be balancing the number of idle > > CPUs. We will end up doing the imbalance, as long as the difference > > between the idle CPUs is at least 8. > > > > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we > > will not rebalance unless the difference between the idle CPUs is 16. > > > > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the > > threshold is now bumped up to 32. 
> > > > > return 0; > > > > > > > > > > > > return imbalance; > > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->span_weight, > > > + env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..bacec575ade2 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > + struct sched_domain *top = sd; > > > > > > We don't seem to be using top anywhere where sd may not be used since > > we already have variables imb and imb_span to record the > > top->imb_numa_nr and top->span_weight. > > > > Top could have been removed but we might still need it. > > > > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (top->span_weight / llc_weight); > > > + * imb = (child_weight / nr_llcs) >> 2 > > > > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2. > > > > That is be clearer. > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / top->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > > > + imb_span = sd->span_weight; > > > > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256. > > with NPS=1 > > top = DIE. > > top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128. > > > > with NPS=2 > > top = NODE. > > top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64. > > > > with NPS=4 > > top = NODE. > > top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32. > > > > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64. > > with NPS=1 > > top = DIE. > > top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128. > > > > with NPS=2 > > top = NODE. > > top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64. > > > > with NPS=4 > > top = NODE. > > top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32. > > > > > > > + > > > + sd->imb_numa_nr = imb; > > > + } else { > > > + sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > > + } > > > > On Zen3, > > with NPS=1 > > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4. > > > > with NPS=2 > > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4 > > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8 > > > > with NPS=4 > > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8 > > sd=NUMA, sd->span_weight = 256. 
sd->imb_numa_nr = 2 * (256/32) = 16 > > > > > > For Zen2, since the imb_span and imb values are the same as the > > corresponding NPS=x values on Zen3, the imb_numa_nr values are the > > same as well since the corresponding sd->span_weight is the same. > > > > If we look at the highest NUMA domain, there are two groups in all the > > NPS configurations. There are the same number of LLCs in each of these > > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16 > > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS > > value, since we compute the imb_numa_nr value relative to the number > > of "top" domains that can be fit within this NUMA domain. This is > > because the size of the "top" domain varies with the NPS value. This > > shows up in the benchmark results. > > > > This was intentional to have some scaling but based on your results, the > scaling might be at the wrong level. Ok. > > > > > > > The numbers with stream, tbench and YCSB + > > Mongodb are as follows: > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Stream with 16 threads. > > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 > > Zen3, 64C128T per socket, 2 sockets, > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > NPS=1 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 113716.62 (0.00 pct) 218961.59 (92.55 pct) 217130.07 (90.93 pct) > > Scale: 110996.89 (0.00 pct) 216674.73 (95.20 pct) 220765.94 (98.89 pct) > > Add: 124504.19 (0.00 pct) 253461.32 (103.57 pct 260273.88 (109.04 pct) > > Triad: 122890.43 (0.00 pct) 247552.00 (101.44 pct 252615.62 (105.56 pct) > > > > > > NPS=2 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 58217.00 (0.00 pct) 204630.34 (251.49 pct) 191312.73 (228.62 pct) > > Scale: 55004.76 (0.00 pct) 212142.88 (285.68 pct) 175499.15 (219.06 pct) > > Add: 63269.04 (0.00 pct) 254752.56 (302.64 pct) 203571.50 (221.75 pct) > > Triad: 62178.25 (0.00 pct) 247290.80 (297.71 pct) 198988.70 (220.02 pct) > > > > NPS=4 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 37986.66 (0.00 pct) 254183.87 (569.13 pct) 48748.87 (28.33 pct) > > Scale: 35471.22 (0.00 pct) 237804.76 (570.41 pct) 48317.82 (36.21 pct) > > Add: 39303.25 (0.00 pct) 292285.20 (643.66 pct) 54259.59 (38.05 pct) > > Triad: 39319.85 (0.00 pct) 285284.30 (625.54 pct) 54503.98 (38.61 pct) > > > > At minimum, v3 is a failure because a single pair of communicating tasks > were getting split across NUMA domains and the allowed numa imbalance > gets cut off too early because of the change to allow_numa_imbalance. > So while it's a valid comparison, it's definitely not the fix. v3 is definitely not a fix. I wasn't hinting at that. It was just to point out the opportunity that we have. > > Given how you describe NPS, maybe the scaling should only start at the > point where tasks are no longer balanced between sibling domains. Can > you try this? I've only boot tested it at this point. It should work for > STREAM at least but probably not great for tbench. Thanks for the patch. I will queue this one for tonight. 
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index bacec575ade2..1fa3e977521d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > (child->flags & SD_SHARE_PKG_RESOURCES)) { > - struct sched_domain *top = sd; > + struct sched_domain *top, *top_p; > unsigned int llc_sq; > > /* > - * nr_llcs = (top->span_weight / llc_weight); > - * imb = (child_weight / nr_llcs) >> 2 > + * nr_llcs = (sd->span_weight / llc_weight); > + * imb = (llc_weight / nr_llcs) >> 2 > * > * is equivalent to > * > - * imb = (llc_weight^2 / top->span_weight) >> 2 > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > * > */ > llc_sq = child->span_weight * child->span_weight; > > - imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > - imb_span = sd->span_weight; > - > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > sd->imb_numa_nr = imb; > + > + /* > + * Set span based on top domain that places > + * tasks in sibling domains. > + */ > + top = sd; > + top_p = top->parent; > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > } else { > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > } > } > } -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 22+ messages in thread
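To make the arithmetic in the v4.1 patch above easier to follow, the calculation can be modelled outside the kernel. The following is a minimal standalone sketch, not kernel code: the topology constants are assumptions for a Zen3-like two-socket machine in NPS=2 mode (16 CPUs per LLC, a 64-CPU NODE domain, NUMA levels spanning 128 and 256 CPUs), and max() is open-coded.

#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	/* Assumed Zen3-like NPS=2 topology */
	unsigned int llc_weight = 16;               /* CPUs per LLC (MC) */
	unsigned int node_weight = 64;              /* first non-LLC domain */
	unsigned int numa_weight[2] = { 128, 256 }; /* SD_NUMA levels */
	unsigned int imb_span = 128;                /* from the top_p walk */
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb, factor, i;

	/* imb = (llc_weight^2 / sd->span_weight) >> 2, floored at 2 */
	imb = max(2U, (llc_sq / node_weight) >> 2);
	printf("NODE: imb_numa_nr = %u\n", imb);

	/* Higher domains scale imb by the number of imb_spans they hold */
	for (i = 0; i < 2; i++) {
		factor = max(1U, numa_weight[i] / imb_span);
		printf("NUMA(%u): imb_numa_nr = %u\n",
		       numa_weight[i], imb * factor);
	}
	return 0;
}

Run as-is this prints imb_numa_nr = 2 for the NODE domain and 2 and 4 for the two NUMA levels, matching the values derived in the reply below.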
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-13 14:47 ` Gautham R. Shenoy @ 2021-12-15 11:52 ` Gautham R. Shenoy 2021-12-15 12:25 ` Mel Gorman 0 siblings, 1 reply; 22+ messages in thread From: Gautham R. Shenoy @ 2021-12-15 11:52 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote: > > Thanks for the patch. I will queue this one for tonight. > Getting the numbers took a bit longer than I expected. > > > > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index bacec575ade2..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > (child->flags & SD_SHARE_PKG_RESOURCES)) { > > - struct sched_domain *top = sd; > > + struct sched_domain *top, *top_p; > > unsigned int llc_sq; > > > > /* > > - * nr_llcs = (top->span_weight / llc_weight); > > - * imb = (child_weight / nr_llcs) >> 2 > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > * > > * is equivalent to > > * > > - * imb = (llc_weight^2 / top->span_weight) >> 2 > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > * > > */ > > llc_sq = child->span_weight * child->span_weight; > > > > - imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > > - imb_span = sd->span_weight; > > - > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > + top = top->parent; > > + top_p = top->parent; > > + } > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > } else { > > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > + int factor = max(1U, (sd->span_weight / imb_span)); > > + So for the first NUMA domain, the sd->imb_numa_nr will be imb, which turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings. On a 2 Socket Zen3: NPS=1 child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 top_p = NUMA, imb_span = 256. NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 NPS=2 child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 top_p = NUMA, imb_span = 128. NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 NPS=4: child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 top_p = NUMA, imb_span = 128. NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 Again, we will be more aggressively load balancing across the two sockets in NPS=1 mode compared to NPS=2/4. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. 
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 113666.77 214885.89 212162.63 226065.79 (0.00%) (89.04%) (86.65%) (98.88%) Scale: 110962.00 215475.05 220002.56 227610.64 (0.00%) (94.18%) (98.26%) (105.12%) Add: 124440.44 250338.11 258569.94 271368.51 (0.00%) (101.17%) (107.78%) (118.07%) Triad: 122836.42 244993.16 251075.19 265465.57 (0.00%) (99.44%) (104.39%) (116.11%) NPS=2 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 58193.29 203242.78 191171.22 212785.96 (0.00%) (249.25%) (228.51%) (265.65%) Scale: 54980.48 209056.28 175497.10 223452.99 (0.00%) (280.23%) (219.19%) (306.42%) Add: 63244.24 249610.19 203569.35 270487.22 (0.00%) (294.67%) (221.87%) (327.68%) Triad: 62166.77 242688.72 198971.26 262757.16 (0.00%) (290.38%) (220.06%) (322.66%) NPS=4 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 37762.14 253412.82 48748.60 164765.83 (0.00%) (571.07%) (29.09%) (336.32%) Scale: 33879.66 237736.93 48317.66 159641.67 (0.00%) (601.70%) (42.61%) (371.20%) Add: 38398.90 292186.56 54259.56 186387.91 (0.00%) (660.92%) (41.30%) (380.70%) Triad: 37942.38 285124.76 54503.74 181250.80 (0.00%) (651.46%) (43.64%) (377.70%) So while in NPS=1 and NPS=2 we are able to recover the performance that was obtained with v3, on NPS=4, we are not able to recover as much. However it is not only due to the fact that in v3, the imb_numa_nr thresholds for NPS=4 were (1,1) for the two NUMA domains, while in v4.1 the imb_numa_nr was (2,4), but also due to the fact that in v3, we used the imb_numa_nr thresholds in allow_numa_imbalance() while in v4 and v4.1 we are using those in adjust_numa_imbalance(). The distinction is that in v3, we will trigger load balance as soon as the number of tasks in the busiest group in the NUMA domain is greater than or equal to imb_numa_nr at that domain. In v4 and v4.1, we will trigger the load-balance if - the number of tasks in the busiest group is greater than 1/4th the CPUs in the NUMA domain. OR - the difference between the idle CPUs in the busiest and this group is greater than imb_numa_nr. If we retain the (2,4) thresholds from v4.1 but use them in allow_numa_imbalance() as in v3 we get NPS=4 Test: mel-v4.2 Copy: 225860.12 (498.11%) Scale: 227869.07 (572.58%) Add: 278365.58 (624.93%) Triad: 264315.44 (596.62%) It shouldn't have made so much of a difference considering the fact that with 16 stream tasks, we should have hit "imbalance > imb_numa_nr" in adjust_numa_imbalance() eventually. But these are the numbers! The trend is similar with -DNTIMES=100. Even in this case with v4.2 we can recover the stream performance in the NPS=4 case. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 137281.89 235886.13 240588.80 249005.99 (0.00 %) (71.82%) (75.25%) (81.38%) Scale: 126559.48 213667.75 226411.39 228523.78 (0.00 %) (68.82%) (78.89%) (80.56%) Add: 148144.23 255780.73 272027.93 272872.49 (0.00 %) (72.65%) (83.62%) (84.19%) Triad: 146829.78 240400.29 259854.84 265640.70 (0.00 %) (63.72%) (76.97%) (80.91%) NPS=2 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 105873.29 243513.40 198299.43 258266.99 (0.00%) (130.00%) (87.29%) (143.93%) Scale: 100510.54 217361.59 177890.00 231509.69 (0.00%) (116.25%) (76.98%) (130.33%) Add: 115932.61 268488.03 211436.57 288396.07 (0.00%) (131.58%) (82.37%) (148.76%) Triad: 113349.09 253865.68 197810.83 272219.89 (0.00%) (123.96%) (74.51%) (140.16%) NPS=4 Test: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 Copy: 106798.80 251503.78 48898.24 171769.82 266788.03 (0.00%) (135.49%) (-54.21%) (60.83%) (149.80%) Scale: 101377.71 221135.30 48425.90 160748.21 232480.83 (0.00%) (118.13%) (-52.23%) (58.56%) (129.32%) Add: 116131.56 275008.74 54425.29 186387.91 290787.31 (0.00%) (136.80%) (-53.13%) (60.49%) (150.39%) Triad: 113443.70 256027.20 54622.68 180936.47 277456.83 (0.00%) (125.68%) (-51.85%) (59.49%) (144.57%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tbench Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 ====== Clients: tip-core mel-v3 mel-v4 mel-v4.1 1 633.19 619.16 632.94 619.27 (0.00%) (-2.21%) (-0.03%) (-2.19%) 2 1152.48 1189.88 1184.82 1189.19 (0.00%) (3.24%) (2.80%) (3.18%) 4 1946.46 2177.40 1979.56 2196.09 (0.00%) (11.86%) (1.70%) (12.82%) 8 3553.29 3564.50 3678.07 3668.77 (0.00%) (0.31%) (3.51%) (3.24%) 16 6217.03 6484.58 6249.29 6534.73 (0.00%) (4.30%) (0.51%) (5.11%) 32 11702.59 12185.77 12005.99 11917.57 (0.00%) (4.12%) (2.59%) (1.83%) 64 18394.56 19535.11 19080.19 19500.55 (0.00%) (6.20%) (3.72%) (6.01%) 128 27231.02 31759.92 27200.52 30358.99 (0.00%) (16.63%) (-0.11%) (11.48%) 256 33166.10 24474.30 31639.98 24788.12 (0.00%) (-26.20%) (-4.60%) (-25.26%) 512 41605.44 54823.57 46684.48 54559.02 (0.00%) (31.77%) (12.20%) (31.13%) 1024 53650.54 56329.39 44422.99 56320.66 (0.00%) (4.99%) (-17.19%) (4.97%) We see that the v4.1 performs better than v4 in most cases except when the number of clients=256 where the spread strategy seems to be hurting as we see degradation in both v3 and v4.1. This is true even for NPS=2 and NPS=4 cases (see below). NPS=2 ===== Clients: tip-core mel-v3 mel-v4 mel-v4.1 1 629.76 620.91 629.11 631.95 (0.00%) (-1.40%) (-0.10%) (0.34%) 2 1176.96 1203.12 1169.09 1186.74 (0.00%) (2.22%) (-0.66%) (0.83%) 4 1990.97 2228.04 1888.19 1995.21 (0.00%) (11.90%) (-5.16%) (0.21%) 8 3534.57 3617.16 3660.30 3548.09 (0.00%) (2.33%) (3.55%) (0.38%) 16 6294.71 6547.80 6504.13 6470.34 (0.00%) (4.02%) (3.32%) (2.79%) 32 12035.73 12143.03 11396.26 11860.91 (0.00%) (0.89%) (-5.31%) (-1.45%) 64 18583.39 19439.12 17126.47 18799.54 (0.00%) (4.60%) (-7.83%) (1.16%) 128 27811.89 30562.84 28090.29 27468.94 (0.00%) (9.89%) (1.00%) (-1.23%) 256 28148.95 26488.57 29117.13 23628.29 (0.00%) (-5.89%) (3.43%) (-16.05%) 512 43934.15 52796.38 42603.49 41725.75 (0.00%) (20.17%) (-3.02%) (-5.02%) 1024 54391.65 53891.83 48419.09 43913.40 (0.00%) (-0.91%) (-10.98%) (-19.26%) In this case, v4.1 performs as good as v4 upto 64 clients. But after that we see degradation. The degradation is significant in 1024 clients case. 
NPS=4 ===== Clients: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 1 622.65 617.83 667.34 644.76 617.58 (0.00%) (-0.77%) (7.17%) (3.55%) (-0.81%) 2 1160.62 1182.30 1294.08 1193.88 1182.55 (0.00%) (1.86%) (11.49%) (2.86%) (1.88%) 4 1961.14 2171.91 2477.71 1929.56 2116.01 (0.00%) (10.74%) (26.34%) (-1.61%) (7.89%) 8 3662.94 3447.98 4067.40 3627.43 3580.32 (0.00%) (-5.86%) (11.04%) (-0.96%) (-2.25%) 16 6490.92 5871.93 6924.32 6660.13 6413.34 (0.00%) (-9.53%) (6.67%) (2.60%) (-1.19%) 32 11831.81 12004.30 12709.06 12187.78 11767.46 (0.00%) (1.45%) (7.41%) (3.00%) (-0.54%) 64 17717.36 18406.79 18785.41 18820.33 18197.86 (0.00%) (3.89%) (6.02%) (6.22%) (2.71%) 128 27723.35 27777.34 27939.63 27399.64 24310.93 (0.00%) (0.19%) (0.78%) (-1.16%) (-12.30%) 256 30919.69 23937.03 35412.26 26780.37 24642.24 (0.00%) (-22.58%) (14.52%) (-13.38%) (-20.30%) 512 43366.03 49570.65 43830.84 43654.42 41031.90 (0.00%) (14.30%) (1.07%) (0.66%) (-5.38%) 1024 46960.83 53576.16 50557.19 43743.07 40884.98 (0.00%) (14.08%) (7.65%) (-6.85%) (-12.93%) In the NPS=4 case, clearly v4 provides the best results. v4.1 does better than v4.2 since it is able to hold off spreading for a longer period. -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 22+ messages in thread
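The distinction drawn above between using the thresholds in allow_numa_imbalance() (v3) and in adjust_numa_imbalance() (v4/v4.1) can be restated as two predicates. This is a simplified sketch of the behaviour as described in the mail, not the exact kernel code:

/* v3 style: spread once the busiest group holds imb_numa_nr tasks */
static int should_spread_v3(int busiest_nr_running, int imb_numa_nr)
{
	return busiest_nr_running >= imb_numa_nr;
}

/*
 * v4/v4.1 style: spread once the group runs more than a quarter of
 * the CPUs in the domain, or once the idle-CPU gap between the
 * busiest group and this one exceeds imb_numa_nr.
 */
static int should_spread_v4(int nr_running, int domain_weight,
			    int idle_cpu_delta, int imb_numa_nr)
{
	return nr_running >= (domain_weight >> 2) ||
	       idle_cpu_delta > imb_numa_nr;
}

For 16 stream tasks on a 256-CPU NUMA domain the v3 rule fires almost immediately, while the v4 rule waits for 64 running tasks or a sufficiently large idle gap; that difference is what separates the mel-v4.2 and mel-v4.1 stream columns above.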
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-15 11:52 ` Gautham R. Shenoy @ 2021-12-15 12:25 ` Mel Gorman 2021-12-16 18:33 ` Gautham R. Shenoy 0 siblings, 1 reply; 22+ messages in thread From: Mel Gorman @ 2021-12-15 12:25 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > Hello Mel, > > > On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote: > > > > > Thanks for the patch. I will queue this one for tonight. > > > > Getting the numbers took a bit longer than I expected. > No worries. > > > <SNIP> > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. > > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > + top = top->parent; > > > + top_p = top->parent; > > > + } > > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > > } else { > > > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > > + int factor = max(1U, (sd->span_weight / imb_span)); > > > + > > > So for the first NUMA domain, the sd->imb_numa_nr will be imb, which > turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings. > > On a 2 Socket Zen3: > > NPS=1 > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > top_p = NUMA, imb_span = 256. > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > NPS=2 > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > top_p = NUMA, imb_span = 128. > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > NPS=4: > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > top_p = NUMA, imb_span = 128. > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > Again, we will be more aggressively load balancing across the two > sockets in NPS=1 mode compared to NPS=2/4. > Yes, but I felt it was reasonable behaviour because we have to strike some sort of balance between allowing a NUMA imbalance up to a point to prevent communicating tasks being pulled apart and v3 broke that completely. There will always be a tradeoff between tasks that want to remain local to each other and others that prefer to spread as wide as possible as quickly as possible. > <SNIP> > If we retain the (2,4) thresholds from v4.1 but use them in > allow_numa_imbalance() as in v3 we get > > NPS=4 > Test: mel-v4.2 > Copy: 225860.12 (498.11%) > Scale: 227869.07 (572.58%) > Add: 278365.58 (624.93%) > Triad: 264315.44 (596.62%) > The potential problem with this is that it probably will work for netperf when it's a single communicating pair but may not work as well when there are multiple communicating pairs or a number of communicating tasks that exceed numa_imb_nr. 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > NPS=1 > ====== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > 1 633.19 619.16 632.94 619.27 > (0.00%) (-2.21%) (-0.03%) (-2.19%) > > 2 1152.48 1189.88 1184.82 1189.19 > (0.00%) (3.24%) (2.80%) (3.18%) > > 4 1946.46 2177.40 1979.56 2196.09 > (0.00%) (11.86%) (1.70%) (12.82%) > > 8 3553.29 3564.50 3678.07 3668.77 > (0.00%) (0.31%) (3.51%) (3.24%) > > 16 6217.03 6484.58 6249.29 6534.73 > (0.00%) (4.30%) (0.51%) (5.11%) > > 32 11702.59 12185.77 12005.99 11917.57 > (0.00%) (4.12%) (2.59%) (1.83%) > > 64 18394.56 19535.11 19080.19 19500.55 > (0.00%) (6.20%) (3.72%) (6.01%) > > 128 27231.02 31759.92 27200.52 30358.99 > (0.00%) (16.63%) (-0.11%) (11.48%) > > 256 33166.10 24474.30 31639.98 24788.12 > (0.00%) (-26.20%) (-4.60%) (-25.26%) > > 512 41605.44 54823.57 46684.48 54559.02 > (0.00%) (31.77%) (12.20%) (31.13%) > > 1024 53650.54 56329.39 44422.99 56320.66 > (0.00%) (4.99%) (-17.19%) (4.97%) > > > We see that the v4.1 performs better than v4 in most cases except when > the number of clients=256 where the spread strategy seems to be > hurting as we see degradation in both v3 and v4.1. This is true even > for NPS=2 and NPS=4 cases (see below). > The 256 client case is a bit of a crapshoot. At that point, the NUMA imbalancing is disabled and the machine is overloaded. > NPS=2 > ===== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > 1 629.76 620.91 629.11 631.95 > (0.00%) (-1.40%) (-0.10%) (0.34%) > > 2 1176.96 1203.12 1169.09 1186.74 > (0.00%) (2.22%) (-0.66%) (0.83%) > > 4 1990.97 2228.04 1888.19 1995.21 > (0.00%) (11.90%) (-5.16%) (0.21%) > > 8 3534.57 3617.16 3660.30 3548.09 > (0.00%) (2.33%) (3.55%) (0.38%) > > 16 6294.71 6547.80 6504.13 6470.34 > (0.00%) (4.02%) (3.32%) (2.79%) > > 32 12035.73 12143.03 11396.26 11860.91 > (0.00%) (0.89%) (-5.31%) (-1.45%) > > 64 18583.39 19439.12 17126.47 18799.54 > (0.00%) (4.60%) (-7.83%) (1.16%) > > 128 27811.89 30562.84 28090.29 27468.94 > (0.00%) (9.89%) (1.00%) (-1.23%) > > 256 28148.95 26488.57 29117.13 23628.29 > (0.00%) (-5.89%) (3.43%) (-16.05%) > > 512 43934.15 52796.38 42603.49 41725.75 > (0.00%) (20.17%) (-3.02%) (-5.02%) > > 1024 54391.65 53891.83 48419.09 43913.40 > (0.00%) (-0.91%) (-10.98%) (-19.26%) > > In this case, v4.1 performs as good as v4 upto 64 clients. But after > that we see degradation. The degradation is significant in 1024 > clients case. > Kinda the same, it's more likely to be run-to-run variance because the machine is overloaded. 
> NPS=4 > ===== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 > 1 622.65 617.83 667.34 644.76 617.58 > (0.00%) (-0.77%) (7.17%) (3.55%) (-0.81%) > > 2 1160.62 1182.30 1294.08 1193.88 1182.55 > (0.00%) (1.86%) (11.49%) (2.86%) (1.88%) > > 4 1961.14 2171.91 2477.71 1929.56 2116.01 > (0.00%) (10.74%) (26.34%) (-1.61%) (7.89%) > > 8 3662.94 3447.98 4067.40 3627.43 3580.32 > (0.00%) (-5.86%) (11.04%) (-0.96%) (-2.25%) > > 16 6490.92 5871.93 6924.32 6660.13 6413.34 > (0.00%) (-9.53%) (6.67%) (2.60%) (-1.19%) > > 32 11831.81 12004.30 12709.06 12187.78 11767.46 > (0.00%) (1.45%) (7.41%) (3.00%) (-0.54%) > > 64 17717.36 18406.79 18785.41 18820.33 18197.86 > (0.00%) (3.89%) (6.02%) (6.22%) (2.71%) > > 128 27723.35 27777.34 27939.63 27399.64 24310.93 > (0.00%) (0.19%) (0.78%) (-1.16%) (-12.30%) > > 256 30919.69 23937.03 35412.26 26780.37 24642.24 > (0.00%) (-22.58%) (14.52%) (-13.38%) (-20.30%) > > 512 43366.03 49570.65 43830.84 43654.42 41031.90 > (0.00%) (14.30%) (1.07%) (0.66%) (-5.38%) > > 1024 46960.83 53576.16 50557.19 43743.07 40884.98 > (0.00%) (14.08%) (7.65%) (-6.85%) (-12.93%) > > > In the NPS=4 case, clearly v4 provides the best results. > > v4.1 does better than v4.2 since it is able to hold off spreading for a > longer period. > Most likely because v4.2 is disabling the allowed NUMA imbalance too soon. This is the trade-off between favouring communicating tasks over embarrassingly parallel problems. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-15 12:25 ` Mel Gorman @ 2021-12-16 18:33 ` Gautham R. Shenoy 2021-12-20 11:12 ` Mel Gorman 0 siblings, 1 reply; 22+ messages in thread From: Gautham R. Shenoy @ 2021-12-16 18:33 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: [..SNIP..] > > On a 2 Socket Zen3: > > > > NPS=1 > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > top_p = NUMA, imb_span = 256. > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > NPS=2 > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > top_p = NUMA, imb_span = 128. > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > NPS=4: > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > top_p = NUMA, imb_span = 128. > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > Again, we will be more aggressively load balancing across the two > > sockets in NPS=1 mode compared to NPS=2/4. > > > > Yes, but I felt it was reasonable behaviour because we have to strike > some sort of balance between allowing a NUMA imbalance up to a point > to prevent communicating tasks being pulled apart and v3 broke that > completely. There will always be a tradeoff between tasks that want to > remain local to each other and others that prefer to spread as wide as > possible as quickly as possible. I agree with this argument that we want to be conservative while pulling tasks across NUMA domains. My point was that the threshold at the NUMA domain that spans the 2 sockets is lower for NPS=1 (imb_numa_nr = 2) when compared to the threshold for the same NUMA domain when NPS=2/4 (imb_numa_nr = 4). Irrespective of what NPS mode we are operating in, the NUMA distance between the two sockets is 32 on Zen3 systems. Hence shouldn't the thresholds be the same for that level of NUMA? Would something like the following work ? if (sd->flags & SD_NUMA) { /* We are using the child as a proxy for the group. */ group_span = sd->child->span_weight; sd_distance = /* NUMA distance at this sd level */ /* By default we set the threshold to 1/4th the sched-group span. */ imb_numa_shift = 2; /* * We can be a little aggressive if the cost of migrating tasks * across groups of this NUMA level is not high. * Assuming */ if (sd_distance < REMOTE_DISTANCE) imb_numa_shift++; /* * Compute the number of LLCs in each group. * More the LLCs, more aggressively we migrate across * the groups at this NUMA sd. 
*/ nr_llcs = group_span/llc_size; sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); } With this, on Intel platforms, we will get sd->imb_numa_nr = (span of socket)/4 On Zen3, NPS=1, Inter-socket NUMA : sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 NPS=2, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (64/4) >> (2+1)) = 2 Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 NPS=4, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (32/2) >> (2+1)) = 2 Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 > > > <SNIP> > > If we retain the (2,4) thresholds from v4.1 but use them in > > allow_numa_imbalance() as in v3 we get > > > > NPS=4 > > Test: mel-v4.2 > > Copy: 225860.12 (498.11%) > > Scale: 227869.07 (572.58%) > > Add: 278365.58 (624.93%) > > Triad: 264315.44 (596.62%) > > > > The potential problem with this is that it probably will work for > netperf when it's a single communicating pair but may not work as well > when there are multiple communicating pairs or a number of communicating > tasks that exceed numa_imb_nr. Yes that's true. I think what you are doing in v4 is the right thing. In case of stream in NPS=4, it just manages to hit the corner case for this heuristic which results in a suboptimal behaviour. Description follows: On NPS=4, if we run 8 stream tasks bound to a socket with v4.1, we get the following initial placement based on data obtained via the sched:sched_wakeup_new tracepoint. This behaviour is consistently reproducible. ------------------------------------------------------- | NUMA | | ----------------------- ------------------------ | | | NODE0 | | NODE1 | | | | ------------- | | ------------- | | | | | 0 tasks | MC0 | | | 1 tasks | MC2 | | | | ------------- | | ------------- | | | | ------------- | | ------------- | | | | | 1 tasks | MC1 | | | 1 tasks | MC3 | | | | ------------- | | ------------- | | | | | | | | | ----------------------- ------------------------ | | ----------------------- ------------------------ | | | NODE2 | | NODE3 | | | | ------------- | | ------------- | | | | | 1 tasks | MC4 | | | 1 tasks | MC6 | | | | ------------- | | ------------- | | | | ------------- | | ------------- | | | | | 2 tasks | MC5 | | | 1 tasks | MC7 | | | | ------------- | | ------------- | | | | | | | | | ----------------------- ------------------------ | | | ------------------------------------------------------- From the trace data obtained for sched:sched_wakeup_new and sched:sched_migrate_task, we see PID 106089 : timestamp 35607.831040 : was running in MC5 PID 106090 : timestamp 35607.831040 : first placed in MC4 PID 106091 : timestamp 35607.831081 : first placed in MC5 PID 106092 : timestamp 35607.831155 : first placed in MC7 PID 106093 : timestamp 35607.831209 : first placed in MC3 PID 106094 : timestamp 35607.831254 : first placed in MC1 PID 106095 : timestamp 35607.831300 : first placed in MC6 PID 106096 : timestamp 35607.831344 : first placed in MC2 Subsequently we do not see any migrations for stream tasks (via the sched:sched_migrate_task tracepoint), even though they run for nearly 10 seconds. The reasons: - No load-balancing is possible at any of the NODE sched-domains since the groups are more or less balanced within each NODE. - At NUMA sched-domain, busiest group would be NODE2. When any CPU in NODE0 performs load-balancing at NUMA level, it can pull tasks only if the imbalance between NODE0 and NODE2 is greater than imb_numa_nr = 2, which isn't the case here. 
Hence, with v4.1, we get the following numbers which are better than the current upstream, but are still not the best. Copy: 78182.7 Scale: 76344.1 Add: 87638.7 Triad: 86388.9 However, if I run an "mpstat 1 10 > /tmp/mpstat.log&" just before kickstarting stream-8, the performance significantly improves (again, consistently reproducible). Copy: 122804.6 Scale: 115192.9 Add: 137191.6 Triad: 133338.5 In this case, from the trace data for stream, we see: PID 105174 : timestamp 35547.526816 : was running in MC4 PID 105174 : timestamp 35547.577635 : moved to MC5 PID 105175 : timestamp 35547.526816 : first placed in MC4 PID 105176 : timestamp 35547.526846 : first placed in MC3 PID 105177 : timestamp 35547.526893 : first placed in MC7 PID 105178 : timestamp 35547.526928 : first placed in MC1 PID 105179 : timestamp 35547.526961 : first placed in MC2 PID 105180 : timestamp 35547.527001 : first placed in MC6 PID 105181 : timestamp 35547.527032 : first placed in MC0 In this case, at the time of the initial placement (find_idlest_group() ?), we are able to spread out farther away. The subsequent load-balance at the NODE2 domain is able to balance the tasks between MC4 and MC5. > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > NPS=1 > > ====== > > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > > 1 633.19 619.16 632.94 619.27 > > (0.00%) (-2.21%) (-0.03%) (-2.19%) > > > > 2 1152.48 1189.88 1184.82 1189.19 > > (0.00%) (3.24%) (2.80%) (3.18%) > > > > 4 1946.46 2177.40 1979.56 2196.09 > > (0.00%) (11.86%) (1.70%) (12.82%) > > > > 8 3553.29 3564.50 3678.07 3668.77 > > (0.00%) (0.31%) (3.51%) (3.24%) > > > > 16 6217.03 6484.58 6249.29 6534.73 > > (0.00%) (4.30%) (0.51%) (5.11%) > > > > 32 11702.59 12185.77 12005.99 11917.57 > > (0.00%) (4.12%) (2.59%) (1.83%) > > > > 64 18394.56 19535.11 19080.19 19500.55 > > (0.00%) (6.20%) (3.72%) (6.01%) > > > > 128 27231.02 31759.92 27200.52 30358.99 > > (0.00%) (16.63%) (-0.11%) (11.48%) > > > > 256 33166.10 24474.30 31639.98 24788.12 > > (0.00%) (-26.20%) (-4.60%) (-25.26%) > > > > 512 41605.44 54823.57 46684.48 54559.02 > > (0.00%) (31.77%) (12.20%) (31.13%) > > > > 1024 53650.54 56329.39 44422.99 56320.66 > > (0.00%) (4.99%) (-17.19%) (4.97%) > > > > > > We see that the v4.1 performs better than v4 in most cases except when > > the number of clients=256 where the spread strategy seems to be > > hurting as we see degradation in both v3 and v4.1. This is true even > > for NPS=2 and NPS=4 cases (see below). > > > > The 256 client case is a bit of a crapshoot. At that point, the NUMA > imbalancing is disabled and the machine is overloaded. Yup. [..snip..] > Most likely because v4.2 is disabling the allowed NUMA imbalance too > soon. This is the trade-off between favouring communicating tasks over > embarrassingly parallel problems. v4.1 does allow the NUMA imbalance for a longer duration. But since the thresholds are small enough, I guess it should be ok for most workloads. -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 22+ messages in thread
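The distance-aware proposal in this thread is only pseudocode; a compilable version of the same idea, with the group span, LLC size and node distance passed in as plain parameters (stand-ins for values the kernel would have to supply), might look like the sketch below. REMOTE_DISTANCE is 30 as in the kernel; the distances used in main() are illustrative Zen3 NPS=2 figures.

#include <stdio.h>

#define REMOTE_DISTANCE	30
#define max(a, b)	((a) > (b) ? (a) : (b))

/*
 * Allow an imbalance of a quarter of an LLC by default, an eighth of
 * an LLC when the groups at this level are close enough that task
 * migration is cheap. (group_span / nr_llcs works out to llc_size.)
 */
static unsigned int imb_numa_nr(unsigned int group_span,
				unsigned int llc_size, int sd_distance)
{
	unsigned int imb_numa_shift = 2;
	unsigned int nr_llcs = group_span / llc_size;

	if (sd_distance < REMOTE_DISTANCE)
		imb_numa_shift++;

	return max(2U, (group_span / nr_llcs) >> imb_numa_shift);
}

int main(void)
{
	printf("intra-socket NUMA: %u\n", imb_numa_nr(64, 16, 12));
	printf("inter-socket NUMA: %u\n", imb_numa_nr(128, 16, 32));
	return 0;
}

This reproduces the NPS=2 values worked out in the mail: 2 for the intra-socket NUMA level and 4 for the inter-socket one.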
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-16 18:33 ` Gautham R. Shenoy @ 2021-12-20 11:12 ` Mel Gorman 2021-12-21 15:03 ` Gautham R. Shenoy 2021-12-21 17:13 ` Vincent Guittot 0 siblings, 2 replies; 22+ messages in thread From: Mel Gorman @ 2021-12-20 11:12 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML (sorry for the delay, was offline for a few days) On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > Hello Mel, > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > [..SNIP..] > > > > On a 2 Socket Zen3: > > > > > > NPS=1 > > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > > top_p = NUMA, imb_span = 256. > > > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > > > NPS=2 > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > > top_p = NUMA, imb_span = 128. > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > NPS=4: > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > > top_p = NUMA, imb_span = 128. > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > Again, we will be more aggressively load balancing across the two > > > sockets in NPS=1 mode compared to NPS=2/4. > > > > > > > Yes, but I felt it was reasonable behaviour because we have to strike > > some sort of balance between allowing a NUMA imbalance up to a point > > to prevent communicating tasks being pulled apart and v3 broke that > > completely. There will always be a tradeoff between tasks that want to > > remain local to each other and others that prefer to spread as wide as > > possible as quickly as possible. > > I agree with this argument that we want to be conservative while > pulling tasks across NUMA domains. My point was that the threshold at > the NUMA domain that spans the 2 sockets is lower for NPS=1 > (imb_numa_nr = 2) when compared to the threshold for the same NUMA > domain when NPS=2/4 (imb_numa_nr = 4). > Is that a problem though? On an Intel machine with sub-numa clustering, the distances are 11 and 21 for a "node" that is the split cache and the remote node respectively. > Irrespective of what NPS mode we are operating in, the NUMA distance > between the two sockets is 32 on Zen3 systems. Hence shouldn't the > thresholds be the same for that level of NUMA? > Maybe, but then it is not a property of the sched_domain and instead needs to account for distance when balancing between two nodes that may be varying distances from each other. > Would something like the following work ? > > if (sd->flags & SD_NUMA) { > > /* We are using the child as a proxy for the group. */ > group_span = sd->child->span_weight; > sd_distance = /* NUMA distance at this sd level */ > NUMA distance relative to what? On Zen, the distance to a remote node may be fixed but on topologies with multiple nodes that are not fully linked to every other node by one hop, the same is not true. > /* By default we set the threshold to 1/4th the sched-group span. 
*/ > imb_numa_shift = 2; > > /* > * We can be a little aggressive if the cost of migrating tasks > * across groups of this NUMA level is not high. > * Assuming > */ > > if (sd_distance < REMOTE_DISTANCE) > imb_numa_shift++; > The distance would have to be accounted for elsewhere because here we are considering one node in isolation, not relative to other nodes. > /* > * Compute the number of LLCs in each group. > * More the LLCs, more aggressively we migrate across > * the groups at this NUMA sd. > */ > nr_llcs = group_span/llc_size; > > sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); > } > Same, any adjustment would have to happen during load balancing taking into account the relatively NUMA distances. I'm not necessarily opposed but it would be a separate patch. > > > <SNIP> > > > If we retain the (2,4) thresholds from v4.1 but use them in > > > allow_numa_imbalance() as in v3 we get > > > > > > NPS=4 > > > Test: mel-v4.2 > > > Copy: 225860.12 (498.11%) > > > Scale: 227869.07 (572.58%) > > > Add: 278365.58 (624.93%) > > > Triad: 264315.44 (596.62%) > > > > > > > The potential problem with this is that it probably will work for > > netperf when it's a single communicating pair but may not work as well > > when there are multiple communicating pairs or a number of communicating > > tasks that exceed numa_imb_nr. > > > Yes that's true. I think what you are doing in v4 is the right thing. > > In case of stream in NPS=4, it just manages to hit the corner case for > this heuristic which results in a suboptimal behaviour. Description > follows: > To avoid the corner case, we'd need to explicitly favour spreading early and assume wakeup will pull communicating tasks together and NUMA balancing migrate the data after some time which looks like diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c07bfa2d80f2..54f5207154d3 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a969affca76..df0e84462e62 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_runnable(struct rq *rq); static unsigned long cpu_util(int cpu); static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight); + int dst_running, + int imb_numa_nr); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + 
(sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, * This is an approximation as the number of running tasks may not be * related to the number of busy CPUs due to sched_setaffinity. */ -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) { - return (dst_running < (dst_weight >> 2)); + return dst_running < imb_numa_nr; } /* @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) return idlest; #endif /* - * Otherwise, keep the task on this node to stay close - * its wakeup source and improve locality. If there is - * a real need of migration, periodic load balance will - * take care of it. + * Otherwise, keep the task on this node to stay local + * to its wakeup source if the number of running tasks + * are below the allowed imbalance. If there is a real + * need of migration, periodic load balance will take + * care of it. */ - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) return NULL; } @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } } -#define NUMA_IMBALANCE_MIN 2 - static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; - /* - * Allow a small imbalance based on a simple pair of communicating - * tasks that remain local when the destination is lightly loaded. - */ - if (imbalance <= NUMA_IMBALANCE_MIN) + if (imbalance <= imb_numa_nr) return 0; return imbalance; @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, env->sd->span_weight); + busiest->sum_nr_running, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..1fa3e977521d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* + * Calculate an allowed NUMA imbalance such that LLCs do not get + * imbalanced. + */ + for_each_cpu(i, cpu_map) { + unsigned int imb = 0; + unsigned int imb_span = 1; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *top, *top_p; + unsigned int llc_sq; + + /* + * nr_llcs = (sd->span_weight / llc_weight); + * imb = (llc_weight / nr_llcs) >> 2 + * + * is equivalent to + * + * imb = (llc_weight^2 / sd->span_weight) >> 2 + * + */ + llc_sq = child->span_weight * child->span_weight; + + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); + sd->imb_numa_nr = imb; + + /* + * Set span based on top domain that places + * tasks in sibling domains. + */ + top = sd; + top_p = top->parent; + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { + top = top->parent; + top_p = top->parent; + } + imb_span = top_p ? 
top_p->span_weight : sd->span_weight; + } else { + int factor = max(1U, (sd->span_weight / imb_span)); + + sd->imb_numa_nr = imb * factor; + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) ^ permalink raw reply related [flat|nested] 22+ messages in thread
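The new piece relative to v4 is the walk that picks imb_span. A condensed model of just that loop, with the struct trimmed to the fields the walk touches (illustrative, not a drop-in replacement):

/* Minimal stand-in for the few sched_domain fields the walk uses */
struct dom {
	struct dom *parent;
	unsigned int span_weight;
	int prefer_sibling;	/* SD_PREFER_SIBLING set? */
};

/*
 * Climb from the first domain above the LLC past every level that
 * still prefers siblings; the span of the first level above that
 * becomes imb_span, so the scaling of imb_numa_nr only starts once
 * tasks are no longer balanced between sibling domains.
 */
static unsigned int pick_imb_span(struct dom *sd)
{
	struct dom *top = sd, *top_p = sd->parent;

	while (top_p && top_p->prefer_sibling) {
		top = top->parent;
		top_p = top->parent;
	}
	return top_p ? top_p->span_weight : sd->span_weight;
}

On the Zen3 NPS=2 layout discussed earlier this stops at the first SD_NUMA level, so imb_span becomes 128 instead of the 64-CPU NODE span, halving the imb_numa_nr at the top NUMA level from 8 to 4.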
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-20 11:12 ` Mel Gorman @ 2021-12-21 15:03 ` Gautham R. Shenoy 2021-12-21 17:13 ` Vincent Guittot 1 sibling, 0 replies; 22+ messages in thread From: Gautham R. Shenoy @ 2021-12-21 15:03 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote: > (sorry for the delay, was offline for a few days) > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > Hello Mel, > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > [..SNIP..] > > > > > > On a 2 Socket Zen3: > > > > > > > > NPS=1 > > > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > > > top_p = NUMA, imb_span = 256. > > > > > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > > > > > NPS=2 > > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > > > top_p = NUMA, imb_span = 128. > > > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > > > NPS=4: > > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > > > top_p = NUMA, imb_span = 128. > > > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > > > Again, we will be more aggressively load balancing across the two > > > > sockets in NPS=1 mode compared to NPS=2/4. > > > > > > > > > > Yes, but I felt it was reasonable behaviour because we have to strike > > > some sort of balance between allowing a NUMA imbalance up to a point > > > to prevent communicating tasks being pulled apart and v3 broke that > > > completely. There will always be a tradeoff between tasks that want to > > > remain local to each other and others that prefer to spread as wide as > > > possible as quickly as possible. > > > > I agree with this argument that we want to be conservative while > > pulling tasks across NUMA domains. My point was that the threshold at > > the NUMA domain that spans the 2 sockets is lower for NPS=1 > > (imb_numa_nr = 2) when compared to the threshold for the same NUMA > > domain when NPS=2/4 (imb_numa_nr = 4). > > > > Is that a problem though? On an Intel machine with sub-numa clustering, > the distances are 11 and 21 for a "node" that is the split cache and the > remote node respectively. So, my question was, on an Intel machine, with sub-numa clustering enabled vs disabled, is the value of imb_numa_nr for the NUMA domain which spans the remote nodes (distance=21) the same or different. And if it is different, what is the rationale behind that. I am totally on-board with the idea that for the different NUMA levels, the corresponding imb_numa_nr should be different. Just in case, I was not making myself clear earlier, on Zen3, the NUMA-A sched-domain, in the figures below, has groups where each group spans a socket in all the NPS configurations. However, only on NPS=1 we have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4, the value of sd->imb_numa_nr=4 for NUMA-A domain. 
Thus if we had 4 tasks sharing data, on NPS=2/4, they would reside on the same socket, while on NPS=1, we will have 2 tasks on one socket, and the other 2 will migrated to the other socket. That said, I have not been able to observe any significiant difference with a real-world workload like Mongodb run on NPS=1 with imb_numa_nr set to 2 vs 4. Zen3, NPS=1 ------------------------------------------------------------------ | | | NUMA-A : sd->imb_numa_nr = 2 | | ------------------------ ------------------------ | | |DIE | |DIE | | | | | | | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | | | | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | | | | | | ------------------------ ------------------------ | | | ------------------------------------------------------------------ Zen3, NPS=2 ------------------------------------------------------------------ | | | NUMA-A: sd->imb_numa_nr = 4 | | --------------------------- --------------------------- | | |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | | | | | | | | | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | ----------- ----------- | | ----------- ----------- | | | | | | | | | --------------------------- --------------------------- | | | ------------------------------------------------------------------ Zen3, NPS=4 ------------------------------------------------------------------ | | | NUMA-A: sd->imb_numa_nr = 4 | | --------------------------- --------------------------- | | |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | ----------- ----------- | | ----------- ----------- | | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | 
----------- ----------- | | ----------- ----------- | | | | | | | | | --------------------------- --------------------------- | | | ------------------------------------------------------------------ > > Irrespective of what NPS mode we are operating in, the NUMA distance > > between the two sockets is 32 on Zen3 systems. Hence shouldn't the > > thresholds be the same for that level of NUMA? > > > > Maybe, but then it is not a property of the sched_domain and instead > needs to account for distance when balancing between two nodes that may > be varying distances from each other. > > > Would something like the following work ? > > > > if (sd->flags & SD_NUMA) { > > > > /* We are using the child as a proxy for the group. */ > > group_span = sd->child->span_weight; > > sd_distance = /* NUMA distance at this sd level */ > > > > NUMA distance relative to what? On Zen, the distance to a remote node may > be fixed but on topologies with multiple nodes that are not fully linked > to every other node by one hop, the same is not true. Fair enough. The "sd_distance" I was referring to the node_distance() between the CPUs of any two groups in this NUMA domain. However, this was assuming that the node_distance() between the CPUs of any two groups would be the same, which is not the case for certain platforms. So this wouldn't work. > > > /* By default we set the threshold to 1/4th the sched-group span. */ > > imb_numa_shift = 2; > > > > /* > > * We can be a little aggressive if the cost of migrating tasks > > * across groups of this NUMA level is not high. > > * Assuming > > */ > > > > if (sd_distance < REMOTE_DISTANCE) > > imb_numa_shift++; > > > > The distance would have to be accounted for elsewhere because here we > are considering one node in isolation, not relative to other nodes. > > > /* > > * Compute the number of LLCs in each group. > > * More the LLCs, more aggressively we migrate across > > * the groups at this NUMA sd. > > */ > > nr_llcs = group_span/llc_size; > > > > sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); > > } > > > > Same, any adjustment would have to happen during load balancing taking > into account the relatively NUMA distances. I'm not necessarily opposed > but it would be a separate patch. Sure, we can look into this separately. > > > > > <SNIP> > > > > If we retain the (2,4) thresholds from v4.1 but use them in > > > > allow_numa_imbalance() as in v3 we get > > > > > > > > NPS=4 > > > > Test: mel-v4.2 > > > > Copy: 225860.12 (498.11%) > > > > Scale: 227869.07 (572.58%) > > > > Add: 278365.58 (624.93%) > > > > Triad: 264315.44 (596.62%) > > > > > > > > > > The potential problem with this is that it probably will work for > > > netperf when it's a single communicating pair but may not work as well > > > when there are multiple communicating pairs or a number of communicating > > > tasks that exceed numa_imb_nr. > > > > > > Yes that's true. I think what you are doing in v4 is the right thing. > > > > In case of stream in NPS=4, it just manages to hit the corner case for > > this heuristic which results in a suboptimal behaviour. Description > > follows: > > > > To avoid the corner case, we'd need to explicitly favour spreading early > and assume wakeup will pull communicating tasks together and NUMA > balancing migrate the data after some time which looks like Actually I was able to root-cause the reason behind the drop in the performance of stream on NPS-4. 
I have already responded earlier in another thread : https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/ Appending the patch here: --- kernel/sched/fair.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ec354bf88b0d..c1b2a422a877 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) return idlest; #endif /* - * Otherwise, keep the task on this node to stay local - * to its wakeup source if the number of running tasks - * are below the allowed imbalance. If there is a real - * need of migration, periodic load balance will take - * care of it. + * Otherwise, keep the task on this node to + * stay local to its wakeup source if the + * number of running tasks (including the new + * one) are below the allowed imbalance. If + * there is a real need of migration, periodic + * load balance will take care of it. */ - if (local_sgs.sum_nr_running <= sd->imb_numa_nr) + if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr) return NULL; } -- With this fix on top of your fix to compute the imb_numa_nr at the relevant level (v4.1: https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/), the stream regression for NPS4 is no longer there. Test: tip-core v4 v4.1 v4.1-find_idlest_group_fix Copy: 37762.14 (0.00%) 48748.60 (29.09%) 164765.83 (336.32%) 205963.99 (445.42%) Scale: 33879.66 (0.00%) 48317.66 (42.61%) 159641.67 (371.20%) 218136.57 (543.85%) Add: 38398.90 (0.00%) 54259.56 (41.30%) 184583.70 (380.70%) 257857.90 (571.52%) Triad: 37942.38 (0.00%) 54503.74 (43.64%) 181250.80 (377.70%) 251410.28 (562.61%) -- Thanks and Regards gautham. ^ permalink raw reply related [flat|nested] 22+ messages in thread
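The fix above is nearly a one-character change, but it moves the boundary at which a waking task is kept local. A toy comparison of the two conditions, assuming imb_numa_nr = 2 as computed for the lower NUMA level on Zen3:

#include <stdio.h>

/* Old condition: the task being placed is not counted */
static int stays_local_old(int sum_nr_running, int imb_numa_nr)
{
	return sum_nr_running <= imb_numa_nr;
}

/* Fixed condition: count the task about to be placed as well */
static int stays_local_new(int sum_nr_running, int imb_numa_nr)
{
	return sum_nr_running + 1 <= imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 2, nr;

	for (nr = 0; nr <= 3; nr++)
		printf("running=%d old=%d new=%d\n", nr,
		       stays_local_old(nr, imb_numa_nr),
		       stays_local_new(nr, imb_numa_nr));
	return 0;
}

With two tasks already running in the local group, the old check still keeps a third stream task on the node; the fixed check lets find_idlest_group() place it elsewhere, which is where the recovered NPS=4 stream numbers come from.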
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-20 11:12 ` Mel Gorman 2021-12-21 15:03 ` Gautham R. Shenoy @ 2021-12-21 17:13 ` Vincent Guittot 2021-12-22 8:52 ` Jirka Hladky 2022-01-05 10:42 ` Mel Gorman 1 sibling, 2 replies; 22+ messages in thread From: Vincent Guittot @ 2021-12-21 17:13 UTC (permalink / raw) To: Mel Gorman Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > (sorry for the delay, was offline for a few days) > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > Hello Mel, > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > [..SNIP..] > > [snip] > > To avoid the corner case, we'd need to explicitly favour spreading early > and assume wakeup will pull communicating tasks together and NUMA > balancing migrate the data after some time which looks like > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index c07bfa2d80f2..54f5207154d3 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ So now you compute an allowed imbalance level instead of using 25% of sd->span_weight or 25% of busiest->group_weight And you adjust this new imb_numa_nr according to the topology. That makes sense. > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0a969affca76..df0e84462e62 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > static unsigned long cpu_runnable(struct rq *rq); > static unsigned long cpu_util(int cpu); > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight); > + int dst_running, > + int imb_numa_nr); > > static inline enum > numa_type numa_classify(unsigned int imbalance_pct, > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > * This is an approximation as the number of running tasks may not be > * related to the number of busy CPUs due to sched_setaffinity. 
> */ > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > { > - return (dst_running < (dst_weight >> 2)); > + return dst_running < imb_numa_nr; > } > > /* > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > return idlest; > #endif > /* > - * Otherwise, keep the task on this node to stay close > - * its wakeup source and improve locality. If there is > - * a real need of migration, periodic load balance will > - * take care of it. > + * Otherwise, keep the task on this node to stay local > + * to its wakeup source if the number of running tasks > + * are below the allowed imbalance. If there is a real > + * need of migration, periodic load balance will take > + * care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > return NULL; > } > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > } > } > > -#define NUMA_IMBALANCE_MIN 2 > - > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int imb_numa_nr) > { > - if (!allow_numa_imbalance(dst_running, dst_weight)) > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > return imbalance; > > - /* > - * Allow a small imbalance based on a simple pair of communicating > - * tasks that remain local when the destination is lightly loaded. > - */ > - if (imbalance <= NUMA_IMBALANCE_MIN) > + if (imbalance <= imb_numa_nr) Isn't this always true ? imbalance is "always" < dst_running as imbalance is usually the number of these tasks that we would like to migrate > return 0; > > return imbalance; > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - busiest->sum_nr_running, env->sd->span_weight); > + busiest->sum_nr_running, env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..1fa3e977521d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { sched_domains have not been degenerated yet so you found here the DIE domain > + struct sched_domain *top, *top_p; > + unsigned int llc_sq; > + > + /* > + * nr_llcs = (sd->span_weight / llc_weight); > + * imb = (llc_weight / nr_llcs) >> 2 it would be good to add a comment to explain why 25% of LLC weight / number of LLC in a node is the right value. For example, why is it better than just 25% of the LLC weight ? Do you want to allow the same imbalance at node level whatever the number of LLC in the node ? 
> + * > + * is equivalent to > + * > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > + * > + */ > + llc_sq = child->span_weight * child->span_weight; > + > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > + sd->imb_numa_nr = imb; > + > + /* > + * Set span based on top domain that places > + * tasks in sibling domains. > + */ > + top = sd; > + top_p = top->parent; > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure that you want to take this case into account, only numa node don't have SD_PREFER_SIBLING > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > + } > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) ^ permalink raw reply [flat|nested] 22+ messages in thread
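As a quick check of the arithmetic in the hunk above, the imb_numa_nr calculation can be modelled in a minimal standalone C sketch. This is userspace code, not kernel code: llc_weight stands in for child->span_weight and node_weight for sd->span_weight, and the 128-CPU node is an assumed example topology.

#include <stdio.h>

/*
 * Sketch of the proposed calculation:
 * imb = max(2U, ((llc_weight * llc_weight / node_weight) >> 2))
 */
static unsigned int imb_numa_nr(unsigned int llc_weight,
				unsigned int node_weight)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = (llc_sq / node_weight) >> 2;

	return imb < 2 ? 2 : imb;	/* the max(2U, ...) clamp */
}

int main(void)
{
	unsigned int node_weight = 128;	/* CPUs spanned by the node */
	unsigned int nr_llcs[] = { 1, 2, 4, 8 };

	for (unsigned int i = 0; i < 4; i++) {
		unsigned int llc_weight = node_weight / nr_llcs[i];

		printf("%u LLC(s) of %3u CPUs -> imb_numa_nr = %u\n",
		       nr_llcs[i], llc_weight,
		       imb_numa_nr(llc_weight, node_weight));
	}
	return 0;
}

For a 128-CPU node this prints 32, 8, 2 and 2 for 1, 2, 4 and 8 LLCs respectively, which is the scaling the rest of the thread debates.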
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-21 17:13 ` Vincent Guittot @ 2021-12-22 8:52 ` Jirka Hladky 2022-01-04 19:52 ` Jirka Hladky 2022-01-05 10:42 ` Mel Gorman 1 sibling, 1 reply; 22+ messages in thread From: Jirka Hladky @ 2021-12-22 8:52 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Philip Auld Hi Mel, we have tested the performance impact of this patch and we see performance drops up to 60% with the OpenMP implementation of the NAS parallel benchmark. Example of results: 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

                       5.16.0-rc5    5.16.0-rc5          Difference
                       (vanilla)     + v4 of the patch   in %
                       Mop/s total   Mop/s total
   sp_C_x,  8 threads     36316.1       12939.6          -64%
   sp_C_x, 16 threads     64790.2       23968.0          -63%
   sp_C_x, 32 threads     67205.5       48891.4          -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. It seems like the allowed imbalance is too large and the negative impact on workloads that prefer to span wide across all NUMA nodes is too large. Thanks Jirka On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > (sorry for the delay, was offline for a few days) > > > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > > Hello Mel, > > > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > > > [..SNIP..] > > [snip] > > To avoid the corner case, we'd need to explicitly favour spreading early > > and assume wakeup will pull communicating tasks together and NUMA > > balancing migrate the data after some time which looks like > > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > > index c07bfa2d80f2..54f5207154d3 100644 > > --- a/include/linux/sched/topology.h > > +++ b/include/linux/sched/topology.h > > @@ -93,6 +93,7 @@ struct sched_domain { > > unsigned int busy_factor; /* less balancing by factor if busy */ > > unsigned int imbalance_pct; /* No balance until over watermark */ > > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > So now you compute an allowed imbalance level instead of using > 25% of sd->span_weight > or > 25% of busiest->group_weight > > And you adjust this new imb_numa_nr according to the topology. > > That makes sense.
> > > > > int nohz_idle; /* NOHZ IDLE status */ > > int flags; /* See SD_* */ > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 0a969affca76..df0e84462e62 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > > > int src_cpu, src_nid; > > int dst_cpu, dst_nid; > > + int imb_numa_nr; > > > > struct numa_stats src_stats, dst_stats; > > > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > > static unsigned long cpu_runnable(struct rq *rq); > > static unsigned long cpu_util(int cpu); > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight); > > + int dst_running, > > + int imb_numa_nr); > > > > static inline enum > > numa_type numa_classify(unsigned int imbalance_pct, > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > > dst_running = env->dst_stats.nr_running + 1; > > imbalance = max(0, dst_running - src_running); > > imbalance = adjust_numa_imbalance(imbalance, dst_running, > > - env->dst_stats.weight); > > + env->imb_numa_nr); > > > > /* Use idle CPU if there is no imbalance */ > > if (!imbalance) { > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > > */ > > rcu_read_lock(); > > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > > - if (sd) > > + if (sd) { > > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > > + env.imb_numa_nr = sd->imb_numa_nr; > > + } > > rcu_read_unlock(); > > > > /* > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > * This is an approximation as the number of running tasks may not be > > * related to the number of busy CPUs due to sched_setaffinity. > > */ > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > { > > - return (dst_running < (dst_weight >> 2)); > > + return dst_running < imb_numa_nr; > > } > > > > /* > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > > return idlest; > > #endif > > /* > > - * Otherwise, keep the task on this node to stay close > > - * its wakeup source and improve locality. If there is > > - * a real need of migration, periodic load balance will > > - * take care of it. > > + * Otherwise, keep the task on this node to stay local > > + * to its wakeup source if the number of running tasks > > + * are below the allowed imbalance. If there is a real > > + * need of migration, periodic load balance will take > > + * care of it. > > */ > > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > > return NULL; > > } > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > } > > } > > > > -#define NUMA_IMBALANCE_MIN 2 > > - > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight) > > + int dst_running, int imb_numa_nr) > > { > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > return imbalance; > > > > - /* > > - * Allow a small imbalance based on a simple pair of communicating > > - * tasks that remain local when the destination is lightly loaded. > > - */ > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > + if (imbalance <= imb_numa_nr) > > Isn't this always true ? 
> > imbalance is "always" < dst_running as imbalance is usually the number > of these tasks that we would like to migrate > > > > return 0; > > > > return imbalance; > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* Consider allowing a small imbalance between NUMA groups */ > > if (env->sd->flags & SD_NUMA) { > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > - busiest->sum_nr_running, env->sd->span_weight); > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > } > > > > return; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. > > + */ > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > sched_domains have not been degenerated yet so you found here the DIE domain > > > + struct sched_domain *top, *top_p; > > + unsigned int llc_sq; > > + > > + /* > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > it would be good to add a comment to explain why 25% of LLC weight / > number of LLC in a node is the right value. > For example, why is it better than just 25% of the LLC weight ? > Do you want to allow the same imbalance at node level whatever the > number of LLC in the node ? > > > + * > > + * is equivalent to > > + * > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > + * > > + */ > > + llc_sq = child->span_weight * child->span_weight; > > + > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > + sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? > Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure > that you want to take this case into account, only numa node don't > have SD_PREFER_SIBLING > > > + top = top->parent; > > + top_p = top->parent; > > + } > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > + } else { > > + int factor = max(1U, (sd->span_weight / imb_span)); > > + > > + sd->imb_numa_nr = imb * factor; > > + } > > + } > > + } > > + > > /* Calculate CPU capacity for physical packages and nodes */ > > for (i = nr_cpumask_bits-1; i >= 0; i--) { > > if (!cpumask_test_cpu(i, cpu_map)) > -- -Jirka ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-22 8:52 ` Jirka Hladky @ 2022-01-04 19:52 ` Jirka Hladky 0 siblings, 0 replies; 22+ messages in thread From: Jirka Hladky @ 2022-01-04 19:52 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Philip Auld Hi Mel, the table with results is badly formatted. Let me send it again: We have tested the performance impact of this patch and we see performance drops up to 60% with the OpenMP implementation of the NAS parallel benchmark. Example of results: 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

   Kernel               5.16.0-rc5    5.16.0-rc5+patch_v4
                        Mop/s tot     Mop/s tot             Diff
   sp_C_x,  8 threads     36316.1       12939.6             -64%
   sp_C_x, 16 threads     64790.2       23968.0             -63%
   sp_C_x, 32 threads     67205.5       48891.4             -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. It seems like the allowed imbalance is too large and the negative impact on workloads that prefer to span wide across all NUMA nodes is too large. Thanks Jirka On Wed, Dec 22, 2021 at 9:52 AM Jirka Hladky <jhladky@redhat.com> wrote: > > Hi Mel, > > we have tested the performance impact of this patch and we see > performance drops up to 60% with the OpenMP implementation of the NAS > parallel benchmark. > > Example of results: > 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs) > > 5.16.0-rc5 > 5.16.0-rc5 Difference > (vanilla) + v4 of > the patch in % > Mop/s total Mop/s total > sp_C_x, 8 threads 36316.1 12939.6 > -64% > sp_C_x, 16 threads 64790.2 23968.0 > -63% > sp_C_x, 32 threads 67205.5 48891.4 > -27% > > Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. > > It seems like the allowed imbalance is too large and the negative > impact on workloads that prefer to span wide across all NUMA nodes is > too large. > > Thanks > Jirka > > > On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > (sorry for the delay, was offline for a few days) > > > > > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > > > Hello Mel, > > > > > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > > > > > [..SNIP..] > > > > > > > > [snip] > > > > > > > > To avoid the corner case, we'd need to explicitly favour spreading early > > > and assume wakeup will pull communicating tasks together and NUMA > > > balancing migrate the data after some time which looks like > > > > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > > > index c07bfa2d80f2..54f5207154d3 100644 > > > --- a/include/linux/sched/topology.h > > > +++ b/include/linux/sched/topology.h > > > @@ -93,6 +93,7 @@ struct sched_domain { > > > unsigned int busy_factor; /* less balancing by factor if busy */ > > > unsigned int imbalance_pct; /* No balance until over watermark */ > > > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > > > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > > > So now you compute an allowed imbalance level instead of using > > 25% of sd->span_weight > > or > > 25% of busiest->group_weight > > > > And you adjust this new imb_numa_nr according to the topology. > > > > That makes sense.
> > > > > > > > int nohz_idle; /* NOHZ IDLE status */ > > > int flags; /* See SD_* */ > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 0a969affca76..df0e84462e62 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > > > > > int src_cpu, src_nid; > > > int dst_cpu, dst_nid; > > > + int imb_numa_nr; > > > > > > struct numa_stats src_stats, dst_stats; > > > > > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > > > static unsigned long cpu_runnable(struct rq *rq); > > > static unsigned long cpu_util(int cpu); > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight); > > > + int dst_running, > > > + int imb_numa_nr); > > > > > > static inline enum > > > numa_type numa_classify(unsigned int imbalance_pct, > > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > > > dst_running = env->dst_stats.nr_running + 1; > > > imbalance = max(0, dst_running - src_running); > > > imbalance = adjust_numa_imbalance(imbalance, dst_running, > > > - env->dst_stats.weight); > > > + env->imb_numa_nr); > > > > > > /* Use idle CPU if there is no imbalance */ > > > if (!imbalance) { > > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > > > */ > > > rcu_read_lock(); > > > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > > > - if (sd) > > > + if (sd) { > > > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > > > + env.imb_numa_nr = sd->imb_numa_nr; > > > + } > > > rcu_read_unlock(); > > > > > > /* > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > > > return idlest; > > > #endif > > > /* > > > - * Otherwise, keep the task on this node to stay close > > > - * its wakeup source and improve locality. If there is > > > - * a real need of migration, periodic load balance will > > > - * take care of it. > > > + * Otherwise, keep the task on this node to stay local > > > + * to its wakeup source if the number of running tasks > > > + * are below the allowed imbalance. If there is a real > > > + * need of migration, periodic load balance will take > > > + * care of it. 
> > > */ > > > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > > > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > > > return NULL; > > > } > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > > > > return 0; > > > > > > return imbalance; > > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..1fa3e977521d 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > > sched_domains have not been degenerated yet so you found here the DIE domain > > > > > + struct sched_domain *top, *top_p; > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (sd->span_weight / llc_weight); > > > + * imb = (llc_weight / nr_llcs) >> 2 > > > > it would be good to add a comment to explain why 25% of LLC weight / > > number of LLC in a node is the right value. > > For example, why is it better than just 25% of the LLC weight ? > > Do you want to allow the same imbalance at node level whatever the > > number of LLC in the node ? > > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > > + sd->imb_numa_nr = imb; > > > + > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. > > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? 
> > Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure > > that you want to take this case into account, only numa node don't > > have SD_PREFER_SIBLING > > > > > + top = top->parent; > > > + top_p = top->parent; > > > + } > > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > > + } else { > > > + int factor = max(1U, (sd->span_weight / imb_span)); > > > + > > > + sd->imb_numa_nr = imb * factor; > > > + } > > > + } > > > + } > > > + > > > /* Calculate CPU capacity for physical packages and nodes */ > > > for (i = nr_cpumask_bits-1; i >= 0; i--) { > > > if (!cpumask_test_cpu(i, cpu_map)) > > > > > -- > -Jirka -- -Jirka ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-21 17:13 ` Vincent Guittot 2021-12-22 8:52 ` Jirka Hladky @ 2022-01-05 10:42 ` Mel Gorman 2022-01-05 10:49 ` Mel Gorman 2022-01-10 15:53 ` Vincent Guittot 1 sibling, 2 replies; 22+ messages in thread From: Mel Gorman @ 2022-01-05 10:42 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > <SNIP> > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > * This is an approximation as the number of running tasks may not be > > * related to the number of busy CPUs due to sched_setaffinity. > > */ > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > { > > - return (dst_running < (dst_weight >> 2)); > > + return dst_running < imb_numa_nr; > > } > > > > /* > > > > <SNIP> > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > } > > } > > > > -#define NUMA_IMBALANCE_MIN 2 > > - > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight) > > + int dst_running, int imb_numa_nr) > > { > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > return imbalance; > > > > - /* > > - * Allow a small imbalance based on a simple pair of communicating > > - * tasks that remain local when the destination is lightly loaded. > > - */ > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > + if (imbalance <= imb_numa_nr) > > Isn't this always true ? > > imbalance is "always" < dst_running as imbalance is usually the number > of these tasks that we would like to migrate > It's not necessarily true. allow_numa_imbalanced is checking if dst_running < imb_numa_nr and adjust_numa_imbalance is checking the imbalance. imb_numa_nr = 4 dst_running = 2 imbalance = 1 In that case, imbalance of 1 is ok, but 2 is not. > > > return 0; > > > > return imbalance; > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* Consider allowing a small imbalance between NUMA groups */ > > if (env->sd->flags & SD_NUMA) { > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > - busiest->sum_nr_running, env->sd->span_weight); > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > } > > > > return; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. 
> > + */ > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > sched_domains have not been degenerated yet so you found here the DIE domain > Yes > > + struct sched_domain *top, *top_p; > > + unsigned int llc_sq; > > + > > + /* > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > it would be good to add a comment to explain why 25% of LLC weight / > number of LLC in a node is the right value. This? * The 25% imbalance is an arbitrary cutoff * based on SMT-2 to balance between memory * bandwidth and avoiding premature sharing * of HT resources and SMT-4 or SMT-8 *may* * benefit from a different cutoff. nr_llcs * are accounted for to mitigate premature * cache eviction due to multiple tasks * using one cache while a sibling cache * remains relatively idle. > For example, why is it better than just 25% of the LLC weight ? Because lets say there are 2 LLCs then an imbalance based on just the LLC weight might allow 2 tasks to share one cache while another is idle. This is the original problem whereby the vanilla imbalance allowed multiple LLCs on the same node to be overloaded which hurt workloads that prefer to spread wide. > Do you want to allow the same imbalance at node level whatever the > number of LLC in the node ? > At this point, it's less clear how the larger domains should be balanced and the initial scaling is as good an option as any. > > + * > > + * is equivalent to > > + * > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > + * > > + */ > > + llc_sq = child->span_weight * child->span_weight; > > + > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > + sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only SD_PREFER_SIBLING gave the current results. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-05 10:42 ` Mel Gorman @ 2022-01-05 10:49 ` Mel Gorman 2022-01-10 15:53 ` Vincent Guittot 1 sibling, 0 replies; 22+ messages in thread From: Mel Gorman @ 2022-01-05 10:49 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, Jan 05, 2022 at 10:42:07AM +0000, Mel Gorman wrote: > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > <SNIP> > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > > > > <SNIP> > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > It's not necessarily true. allow_numa_imbalanced is checking if > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > imbalance. > > imb_numa_nr = 4 > dst_running = 2 > imbalance = 1 > > In that case, imbalance of 1 is ok, but 2 is not. > My bad, this is based on v5 which I just queued for testing. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
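To make the exchange above concrete, here is a small userspace sketch of allow_numa_imbalance() and adjust_numa_imbalance() exactly as they appear in the posted diff, fed with the example values quoted by Mel; only the sample numbers in main() are assumptions added for illustration.

#include <stdio.h>

/* The two helpers as posted in the v4 diff above. */
static int allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
	return dst_running < imb_numa_nr;
}

static long adjust_numa_imbalance(int imbalance, int dst_running,
				  int imb_numa_nr)
{
	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
		return imbalance;

	if (imbalance <= imb_numa_nr)
		return 0;

	return imbalance;
}

int main(void)
{
	int imb_numa_nr = 4, dst_running = 2;	/* Mel's example values */

	for (int imbalance = 1; imbalance <= 5; imbalance++)
		printf("imbalance %d -> %ld\n", imbalance,
		       adjust_numa_imbalance(imbalance, dst_running,
					     imb_numa_nr));
	return 0;
}

With imb_numa_nr = 4 and dst_running = 2, imbalances of 1 through 4 all collapse to 0, so an imbalance of 2 is tolerated just as 1 is. This is the behaviour Vincent describes, and that Mel attributes to the difference between the v4 code as posted and the v5 he had queued.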
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-05 10:42 ` Mel Gorman 2022-01-05 10:49 ` Mel Gorman @ 2022-01-10 15:53 ` Vincent Guittot 2022-01-12 10:24 ` Mel Gorman 1 sibling, 1 reply; 22+ messages in thread From: Vincent Guittot @ 2022-01-10 15:53 UTC (permalink / raw) To: Mel Gorman Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > <SNIP> > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > > > > <SNIP> > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > It's not necessarily true. allow_numa_imbalanced is checking if > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > imbalance. > > imb_numa_nr = 4 > dst_running = 2 > imbalance = 1 > > In that case, imbalance of 1 is ok, but 2 is not. I don't catch your example. Why is imbalance = 2 not ok in your example above ? allow_numa_imbalance still returns true (dst-running < imb_numa_nr) and we still have imbalance <= imb_numa_nr Also the name dst_running is quite confusing; In the case of calculate_imbalance, busiest->nr_running is passed as dst_running argument. But the busiest group is the src not the dst of the balance Then, imbalance < busiest->nr_running in load_balance because we try to even the number of task running in each groups without emptying it and allow_numa_imbalance checks that dst_running < imb_numa_nr. 
So we have imbalance < dst_running < imb_numa_nr > > > > > > return 0; > > > > > > return imbalance; > > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..1fa3e977521d 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > > sched_domains have not been degenerated yet so you found here the DIE domain > > > > Yes > > > > + struct sched_domain *top, *top_p; > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (sd->span_weight / llc_weight); > > > + * imb = (llc_weight / nr_llcs) >> 2 > > > > it would be good to add a comment to explain why 25% of LLC weight / > > number of LLC in a node is the right value. > > This? > > * The 25% imbalance is an arbitrary cutoff > * based on SMT-2 to balance between memory > * bandwidth and avoiding premature sharing > * of HT resources and SMT-4 or SMT-8 *may* > * benefit from a different cutoff. nr_llcs > * are accounted for to mitigate premature > * cache eviction due to multiple tasks > * using one cache while a sibling cache > * remains relatively idle. > > > For example, why is it better than just 25% of the LLC weight ? > > Because lets say there are 2 LLCs then an imbalance based on just the LLC > weight might allow 2 tasks to share one cache while another is idle. This > is the original problem whereby the vanilla imbalance allowed multiple > LLCs on the same node to be overloaded which hurt workloads that prefer > to spread wide. In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each llc up to 25% ? instead of dividing by nr_llcs As an example, you have 1 node with 1 LLC with 128 CPUs will get an imb_numa_nr = 32 1 node with 2 LLC with 64 CPUs each will get an imb_numa_nr = 8 1 node with 4 LLC with 32 CPUs each will get an imb_numa_nr = 2 sd->imb_numa_nr is used at NUMA level so the more LLC you have the lower imbalance is allowed > > > Do you want to allow the same imbalance at node level whatever the > > number of LLC in the node ? > > > > At this point, it's less clear how the larger domains should be > balanced and the initial scaling is as good an option as any. > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > > + sd->imb_numa_nr = imb; > > > + > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. 
> > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? > > Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to > degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only > SD_PREFER_SIBLING gave the current results. SD_PREFER_SIBLING is not mandatory in the children of a NUMA node (look at heterogeneous systems as an example), so relying on this flag seems quite fragile. The fact that you see inconsistency with SD_NUMA depending on the NPS-1, NPS-2 or NPS-4 topology probably means that you are not looking at the right domain level, or that you are compensating for a side effect of the formula above. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
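A side-by-side of the two formulas makes the difference visible. The sketch below assumes the same single 128-CPU node used earlier in the thread; "suggested" is one reading of Vincent's (llc_weight >> 2) * nr_llcs proposal, and the max(2U, ...) clamp is omitted because it does not trigger for these values.

#include <stdio.h>

int main(void)
{
	unsigned int node_weight = 128;
	unsigned int nr_llcs[] = { 1, 2, 4 };

	for (unsigned int i = 0; i < 3; i++) {
		unsigned int llc_weight = node_weight / nr_llcs[i];
		/* posted: (llc_weight^2 / node_weight) >> 2 */
		unsigned int posted =
			(llc_weight * llc_weight / node_weight) >> 2;
		/* suggested: fill each LLC up to 25% */
		unsigned int suggested = (llc_weight >> 2) * nr_llcs[i];

		printf("%u LLCs: posted imb_numa_nr = %2u, suggested = %2u\n",
		       nr_llcs[i], posted, suggested);
	}
	return 0;
}

The posted formula yields 32, 8 and 2, while the per-LLC 25% variant stays at 32 regardless of the LLC count; Mel defends the shrinking allowance below as deliberate.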
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-10 15:53 ` Vincent Guittot @ 2022-01-12 10:24 ` Mel Gorman 0 siblings, 0 replies; 22+ messages in thread From: Mel Gorman @ 2022-01-12 10:24 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, Jan 10, 2022 at 04:53:26PM +0100, Vincent Guittot wrote: > On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > > <SNIP> > > > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > > * This is an approximation as the number of running tasks may not be > > > > * related to the number of busy CPUs due to sched_setaffinity. > > > > */ > > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > > { > > > > - return (dst_running < (dst_weight >> 2)); > > > > + return dst_running < imb_numa_nr; > > > > } > > > > > > > > /* > > > > > > > > <SNIP> > > > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > > } > > > > } > > > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > > - > > > > static inline long adjust_numa_imbalance(int imbalance, > > > > - int dst_running, int dst_weight) > > > > + int dst_running, int imb_numa_nr) > > > > { > > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > > return imbalance; > > > > > > > > - /* > > > > - * Allow a small imbalance based on a simple pair of communicating > > > > - * tasks that remain local when the destination is lightly loaded. > > > > - */ > > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > > + if (imbalance <= imb_numa_nr) > > > > > > Isn't this always true ? > > > > > > imbalance is "always" < dst_running as imbalance is usually the number > > > of these tasks that we would like to migrate > > > > > > > It's not necessarily true. allow_numa_imbalanced is checking if > > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > > imbalance. > > > > imb_numa_nr = 4 > > dst_running = 2 > > imbalance = 1 > > > > In that case, imbalance of 1 is ok, but 2 is not. > > I don't catch your example. Why is imbalance = 2 not ok in your > example above ? allow_numa_imbalance still returns true (dst-running < > imb_numa_nr) and we still have imbalance <= imb_numa_nr > At the time I wrote it, the comparison looked like < instead of <=. > Also the name dst_running is quite confusing; In the case of > calculate_imbalance, busiest->nr_running is passed as dst_running > argument. But the busiest group is the src not the dst of the balance > > Then, imbalance < busiest->nr_running in load_balance because we try > to even the number of task running in each groups without emptying it > and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we > have imbalance < dst_running < imb_numa_nr > But either way, you have a valid point. The patch as-is is too complex and doing too much and is failing to make progress as a result. I'm going to go back to the drawing board and come up with a simpler version that adjusts the cut-off depending on topology but only allows an imbalance of NUMA_IMBALANCE_MIN and tidy up the inconsistencies. 
> > This? > > > > * The 25% imbalance is an arbitrary cutoff > > * based on SMT-2 to balance between memory > > * bandwidth and avoiding premature sharing > > * of HT resources and SMT-4 or SMT-8 *may* > > * benefit from a different cutoff. nr_llcs > > * are accounted for to mitigate premature > > * cache eviction due to multiple tasks > > * using one cache while a sibling cache > > * remains relatively idle. > > > > > For example, why is it better than just 25% of the LLC weight ? > > > > Because lets say there are 2 LLCs then an imbalance based on just the LLC > > weight might allow 2 tasks to share one cache while another is idle. This > > is the original problem whereby the vanilla imbalance allowed multiple > > LLCs on the same node to be overloaded which hurt workloads that prefer > > to spread wide. > > In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each > llc up to 25% ? instead of dividing by nr_llcs > > As an example, you have > 1 node with 1 LLC with 128 CPUs will get an imb_numa_nr = 32 > 1 node with 2 LLC with 64 CPUs each will get an imb_numa_nr = 8 > 1 node with 4 LLC with 32 CPUs each will get an imb_numa_nr = 2 > > sd->imb_numa_nr is used at NUMA level so the more LLC you have the > lower imbalance is allowed > Lowering the threshold at which an imbalance is allowed as the number of LLCs grows is deliberate, given that the motivating problem was that embarrassingly parallel workloads on AMD suffer when some LLCs are overloaded while others remain idle. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman 2021-12-13 8:28 ` Gautham R. Shenoy @ 2021-12-17 19:54 ` Gautham R. Shenoy 1 sibling, 0 replies; 22+ messages in thread From: Gautham R. Shenoy @ 2021-12-17 19:54 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote: [..snip..]

> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  	return idlest;
>  #endif
>  	/*
> -	 * Otherwise, keep the task on this node to stay close
> -	 * its wakeup source and improve locality. If there is
> -	 * a real need of migration, periodic load balance will
> -	 * take care of it.
> +	 * Otherwise, keep the task on this node to stay local
> +	 * to its wakeup source if the number of running tasks
> +	 * are below the allowed imbalance. If there is a real
> +	 * need of migration, periodic load balance will take
> +	 * care of it.
>  	 */
> -	if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +	if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  		return NULL;

Since we want to check whether the imb_numa_nr threshold would be crossed if we let the new task stay local, this should be

	if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
		return NULL;

Without this change, on a Zen3 configured with Nodes Per Socket (NPS)=4, the lower NUMA domain with sd->imb_numa_nr = 2 has 4 groups (each group corresponds to a NODE sched-domain). When we run stream with 8 threads, we see 3 of them initially placed in the local group and the remaining 5 distributed across the other 3 groups. None of these 3 tasks will ever get migrated to any of the other 3 groups, because those others have at least one task. Eg:

PID 157811 : timestamp 108921.267293 : first placed in NODE 1
PID 157812 : timestamp 108921.269877 : first placed in NODE 1
PID 157813 : timestamp 108921.269921 : first placed in NODE 1
PID 157814 : timestamp 108921.270007 : first placed in NODE 2
PID 157815 : timestamp 108921.270065 : first placed in NODE 3
PID 157816 : timestamp 108921.270118 : first placed in NODE 0
PID 157817 : timestamp 108921.270168 : first placed in NODE 2
PID 157818 : timestamp 108921.270216 : first placed in NODE 3

With the fix mentioned above, we see the 8 threads uniformly distributed across the 4 groups:

PID 7500 : timestamp 436.156429 : first placed in NODE 1
PID 7501 : timestamp 436.159058 : first placed in NODE 1
PID 7502 : timestamp 436.159106 : first placed in NODE 2
PID 7503 : timestamp 436.159173 : first placed in NODE 3
PID 7504 : timestamp 436.159219 : first placed in NODE 0
PID 7505 : timestamp 436.159263 : first placed in NODE 2
PID 7506 : timestamp 436.159305 : first placed in NODE 3
PID 7507 : timestamp 436.159348 : first placed in NODE 0

-- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 22+ messages in thread
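Gautham's off-by-one is easy to reproduce with a toy placement loop. The sketch below is only a model of the decision he quotes: place(), NR_NODES and the choice of node 0 as the wakeup source are invented for illustration, with 8 tasks, 4 nodes and imb_numa_nr = 2 assumed to match his NPS-4 description.

#include <stdio.h>
#include <string.h>

#define NR_NODES 4

/*
 * Toy model of the find_idlest_group() decision quoted above: a new
 * task stays on the local node while the check passes, otherwise it
 * is placed on the emptiest node. plus_one selects between the posted
 * check (sum_nr_running <= imb_numa_nr) and Gautham's proposed
 * (sum_nr_running + 1 <= imb_numa_nr).
 */
static void place(int ntasks, int imb, int plus_one)
{
	int nr_running[NR_NODES];
	int local = 0;	/* assume node 0 is the wakeup source */

	memset(nr_running, 0, sizeof(nr_running));
	for (int t = 0; t < ntasks; t++) {
		int dst = local;

		if (nr_running[local] + plus_one > imb) {
			/* spread: pick the emptiest node instead */
			for (int n = 0; n < NR_NODES; n++)
				if (nr_running[n] < nr_running[dst])
					dst = n;
		}
		nr_running[dst]++;
	}

	printf("check 'running%s <= imb': tasks per node =",
	       plus_one ? " + 1" : "");
	for (int n = 0; n < NR_NODES; n++)
		printf(" %d", nr_running[n]);
	printf("\n");
}

int main(void)
{
	place(8, 2, 0);	/* condition as posted */
	place(8, 2, 1);	/* Gautham's proposal  */
	return 0;
}

The first call reproduces the 3/2/2/1 spread of his first trace; the + 1 variant gives the even 2/2/2/2 spread of the second.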