* [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs @ 2021-12-10 9:33 Mel Gorman 2021-12-10 9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman 0 siblings, 2 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-10 9:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML, Mel Gorman Changelog since V3 o Calculate imb_numa_nr for multiple SD_NUMA domains o Restore behaviour where communicating pairs remain on the same node Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. The series addresses two problems -- inconsistent use of scheduler domain weights and sub-optimal performance when there are many LLCs per NUMA node. include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 36 ++++++++++++++++--------------- kernel/sched/topology.c | 39 ++++++++++++++++++++++++++++++++++ 3 files changed, 59 insertions(+), 17 deletions(-) -- 2.31.1 Mel Gorman (2): sched/fair: Use weight of SD_NUMA domain in find_busiest_group sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 36 +++++++++++++++++---------------- kernel/sched/topology.c | 37 ++++++++++++++++++++++++++++++++++ 3 files changed, 57 insertions(+), 17 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group 2021-12-10 9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman @ 2021-12-10 9:33 ` Mel Gorman 2021-12-21 10:53 ` Vincent Guittot 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman 1 sibling, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-10 9:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML, Mel Gorman find_busiest_group uses the child domain's group weight instead of the sched_domain's weight that has SD_NUMA set when calculating the allowed imbalance between NUMA nodes. This is wrong and inconsistent with find_idlest_group. This patch uses the SD_NUMA weight in both. Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e476f6d9435..0a969affca76 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, busiest->group_weight); + busiest->sum_nr_running, env->sd->span_weight); } return; -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group 2021-12-10 9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman @ 2021-12-21 10:53 ` Vincent Guittot 2021-12-21 11:32 ` Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Vincent Guittot @ 2021-12-21 10:53 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote: > > find_busiest_group uses the child domain's group weight instead of > the sched_domain's weight that has SD_NUMA set when calculating the > allowed imbalance between NUMA nodes. This is wrong and inconsistent > with find_idlest_group. I agree that find_busiest_group and find_idlest_group should be consistent and use the same parameters but I wonder if sched_domain's weight is the right one to use instead of the target group's weight. IIRC, the goal of adjust_numa_imbalance is to keep some threads on the same node as long as we consider that there is no performance impact because of sharing resources as they can even take advantage of locality if they interact. So we consider that tasks will not be impacted by sharing resources if they use less than 25% of the CPUs of a node. If we use the sd->span_weight instead, we consider that we can pack threads in the same node as long as it uses less than 25% of the CPUs in all nodes. > > This patch uses the SD_NUMA weight in both. > > Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > kernel/sched/fair.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 6e476f6d9435..0a969affca76 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - busiest->sum_nr_running, busiest->group_weight); > + busiest->sum_nr_running, env->sd->span_weight); > } > > return; > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group 2021-12-21 10:53 ` Vincent Guittot @ 2021-12-21 11:32 ` Mel Gorman 2021-12-21 13:05 ` Vincent Guittot 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-21 11:32 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote: > On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > find_busiest_group uses the child domain's group weight instead of > > the sched_domain's weight that has SD_NUMA set when calculating the > > allowed imbalance between NUMA nodes. This is wrong and inconsistent > > with find_idlest_group. > > I agree that find_busiest_group and find_idlest_group should be > consistent and use the same parameters but I wonder if sched_domain's > weight is the right one to use instead of the target group's weight. > Ok > IIRC, the goal of adjust_numa_imbalance is to keep some threads on the > same node as long as we consider that there is no performance impact > because of sharing resources as they can even take advantage of > locality if they interact. Yes. > So we consider that tasks will not be > impacted by sharing resources if they use less than 25% of the CPUs of > a node. If we use the sd->span_weight instead, we consider that we can > pack threads in the same node as long as it uses less than 25% of the > CPUs in all nodes. > I assume you mean the target group weight instead of the node. The primary resource we are concerned with is memory bandwidth and it's a guess because we do not know for sure where memory channels are or how they are configured in this context and it may or may not be correlated with groups. I think using the group instead would deserve a series on its own after settling on an imbalance number when there are multiple LLCs per node. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group 2021-12-21 11:32 ` Mel Gorman @ 2021-12-21 13:05 ` Vincent Guittot 0 siblings, 0 replies; 48+ messages in thread From: Vincent Guittot @ 2021-12-21 13:05 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML On Tue, 21 Dec 2021 at 12:32, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote: > > On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > find_busiest_group uses the child domain's group weight instead of > > > the sched_domain's weight that has SD_NUMA set when calculating the > > > allowed imbalance between NUMA nodes. This is wrong and inconsistent > > > with find_idlest_group. > > > > I agree that find_busiest_group and find_idlest_group should be > > consistent and use the same parameters but I wonder if sched_domain's > > weight is the right one to use instead of the target group's weight. > > > > Ok > > > IIRC, the goal of adjust_numa_imbalance is to keep some threads on the > > same node as long as we consider that there is no performance impact > > because of sharing resources as they can even take advantage of > > locality if they interact. > > Yes. > > > So we consider that tasks will not be > > impacted by sharing resources if they use less than 25% of the CPUs of > > a node. If we use the sd->span_weight instead, we consider that we can > > pack threads in the same node as long as it uses less than 25% of the > > CPUs in all nodes. > > > > I assume you mean the target group weight instead of the node. The I wanted to say that with this patch, we consider the imbalance acceptable if the number of threads in a node is less than 25% of all CPUs of all nodes (for this numa level), but 25% of all CPUs of all nodes can be more than the number of CPUs in the group. So I would have changed find_idlest_group instead of changing find_busiest_group. > primary resource we are concerned with is memory bandwidth and it's a > guess because we do not know for sure where memory channels are or how > they are configured in this context and it may or may not be correlated > with groups. I think using the group instead would deserve a series on > its own after settling on an imbalance number when there are multiple > LLCs per node. I haven't looked yet at patch 2 for multiple LLCs per node. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
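For reference, the 25% cut-off being debated above can be made concrete with a small standalone sketch. It assumes the behaviour described later in the thread (an imbalance is allowed while 4 * dst_running < dst_weight) and Zen3-like figures of 128 CPUs per node and 256 CPUs in the top SD_NUMA span; the example values and the user-space framing are illustrative only, not part of the patch.

/*
 * Illustrative sketch, not kernel code: shows how the choice of
 * dst_weight changes the point at which adjust_numa_imbalance() stops
 * allowing tasks to pack, using the rule described in this thread
 * (imbalance allowed while 4 * dst_running < dst_weight).
 */
#include <stdbool.h>
#include <stdio.h>

static bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return 4 * dst_running < dst_weight;
}

int main(void)
{
	int node_weight = 128;	/* one node of an assumed 2-socket Zen3 */
	int numa_span   = 256;	/* span of the top SD_NUMA domain */
	int dst_running = 40;	/* tasks already running on the node */

	/* Against the node (group) weight: packing is no longer allowed. */
	printf("node weight:  allow=%d\n", allow_numa_imbalance(dst_running, node_weight));
	/* Against the SD_NUMA span, as patch 1/2 uses: still allowed. */
	printf("SD_NUMA span: allow=%d\n", allow_numa_imbalance(dst_running, numa_span));
	return 0;
}

This is the distinction Vincent raises: measured against the span weight, up to a quarter of the whole domain's CPUs may be packed onto one node, which can exceed a quarter of that node's own CPUs.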
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-10 9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman 2021-12-10 9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman @ 2021-12-10 9:33 ` Mel Gorman 2021-12-13 8:28 ` Gautham R. Shenoy 2021-12-17 19:54 ` Gautham R. Shenoy 1 sibling, 2 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-10 9:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch adjusts the imbalance on multi-LLC machines to allow an imbalance up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have one instance per LLC and without binding, the results are 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v4 MB/sec copy-16 166712.18 ( 0.00%) 651540.22 ( 290.82%) MB/sec scale-16 140109.66 ( 0.00%) 382254.74 ( 172.83%) MB/sec add-16 160791.18 ( 0.00%) 623073.98 ( 287.51%) MB/sec triad-16 160043.84 ( 0.00%) 633964.52 ( 296.12%) STREAM can use directives to force the spread if the OpenMP implementation is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per instance, the vanilla kernel allows threads to contend on cache.
With the patch; 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v4r24 Min Score-16 367816.09 ( 0.00%) 384015.36 ( 4.40%) Hmean Score-16 389627.78 ( 0.00%) 431907.14 * 10.85%* Max Score-16 416178.96 ( 0.00%) 480120.03 ( 15.36%) Stddev Score-16 17361.82 ( 0.00%) 32505.34 ( -87.22%) CoeffVar Score-16 4.45 ( 0.00%) 7.49 ( -68.30%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v4 Hmean tput-1 73743.05 ( 0.00%) 70258.27 * -4.73%* Hmean tput-8 563036.51 ( 0.00%) 591187.39 ( 5.00%) Hmean tput-16 1016590.61 ( 0.00%) 1032311.78 ( 1.55%) Hmean tput-24 1418558.41 ( 0.00%) 1424005.80 ( 0.38%) Hmean tput-32 1608794.22 ( 0.00%) 1907855.80 * 18.59%* Hmean tput-40 1761338.13 ( 0.00%) 2108162.23 * 19.69%* Hmean tput-48 2290646.54 ( 0.00%) 2214383.47 ( -3.33%) Hmean tput-56 2463345.12 ( 0.00%) 2780216.58 * 12.86%* Hmean tput-64 2650213.53 ( 0.00%) 2598196.66 ( -1.96%) Hmean tput-72 2497253.28 ( 0.00%) 2998882.47 * 20.09%* Hmean tput-80 2820786.72 ( 0.00%) 2951655.27 ( 4.64%) Hmean tput-88 2813541.68 ( 0.00%) 3045450.86 * 8.24%* Hmean tput-96 2604158.67 ( 0.00%) 3035311.91 * 16.56%* Hmean tput-104 2713810.62 ( 0.00%) 2984270.04 ( 9.97%) Hmean tput-112 2558425.37 ( 0.00%) 2894737.46 * 13.15%* Hmean tput-120 2611434.93 ( 0.00%) 2781661.01 ( 6.52%) Hmean tput-128 2706103.22 ( 0.00%) 2811447.85 ( 3.89%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 36 +++++++++++++++++---------------- kernel/sched/topology.c | 37 ++++++++++++++++++++++++++++++++++ 3 files changed, 57 insertions(+), 17 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c07bfa2d80f2..54f5207154d3 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a969affca76..972ba586b113 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_runnable(struct rq *rq); static unsigned long cpu_util(int cpu); static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight); + int dst_running, int dst_weight, + int imb_numa_nr); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->dst_stats.weight, + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + 
(sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) return idlest; #endif /* - * Otherwise, keep the task on this node to stay close - * its wakeup source and improve locality. If there is - * a real need of migration, periodic load balance will - * take care of it. + * Otherwise, keep the task on this node to stay local + * to its wakeup source if the number of running tasks + * are below the allowed imbalance. If there is a real + * need of migration, periodic load balance will take + * care of it. */ - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) + if (local_sgs.sum_nr_running <= sd->imb_numa_nr) return NULL; } @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } } -#define NUMA_IMBALANCE_MIN 2 - static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int dst_weight, + int imb_numa_nr) { if (!allow_numa_imbalance(dst_running, dst_weight)) return imbalance; - /* - * Allow a small imbalance based on a simple pair of communicating - * tasks that remain local when the destination is lightly loaded. - */ - if (imbalance <= NUMA_IMBALANCE_MIN) + if (imbalance <= imb_numa_nr) return 0; return imbalance; @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, env->sd->span_weight); + busiest->sum_nr_running, env->sd->span_weight, + env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..bacec575ade2 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* + * Calculate an allowed NUMA imbalance such that LLCs do not get + * imbalanced. + */ + for_each_cpu(i, cpu_map) { + unsigned int imb = 0; + unsigned int imb_span = 1; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *top = sd; + unsigned int llc_sq; + + /* + * nr_llcs = (top->span_weight / llc_weight); + * imb = (child_weight / nr_llcs) >> 2 + * + * is equivalent to + * + * imb = (llc_weight^2 / top->span_weight) >> 2 + * + */ + llc_sq = child->span_weight * child->span_weight; + + imb = max(2U, ((llc_sq / top->span_weight) >> 2)); + imb_span = sd->span_weight; + + sd->imb_numa_nr = imb; + } else { + sd->imb_numa_nr = imb * (sd->span_weight / imb_span); + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
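As a quick sanity check of the arithmetic in the build_sched_domains() hunk above, the standalone sketch below replays the formula for an assumed Zen3-like topology in NPS=1 mode (16 CPUs per LLC, 128 per socket, 256 in the machine). The topology figures and variable names are illustrative; the kernel code derives them by walking the sched_domain hierarchy.

/*
 * Replays the imb_numa_nr calculation from the patch above for an
 * assumed topology: llc_weight = 16, first non-LLC-sharing domain
 * spans 128, top SD_NUMA domain spans 256 (Zen3, NPS=1).
 */
#include <stdio.h>

int main(void)
{
	unsigned int llc_weight  = 16;
	unsigned int top_weight  = 128;	/* first !SD_SHARE_PKG_RESOURCES domain */
	unsigned int numa_weight = 256;	/* top SD_NUMA domain */

	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = (llc_sq / top_weight) >> 2;

	if (imb < 2)
		imb = 2;		/* max(2U, ...) in the patch */

	/* Parent NUMA domains scale by how many imb_span units they cover. */
	unsigned int imb_numa_nr = imb * (numa_weight / top_weight);

	printf("imb = %u, top NUMA imb_numa_nr = %u\n", imb, imb_numa_nr);
	return 0;
}

With these figures the first non-LLC-sharing domain gets imb_numa_nr = 2 and the top NUMA domain gets 4, which matches the NPS=1 numbers worked through by hand in the reply below.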
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman @ 2021-12-13 8:28 ` Gautham R. Shenoy 2021-12-13 13:01 ` Mel Gorman 2021-12-17 19:54 ` Gautham R. Shenoy 1 sibling, 1 reply; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-13 8:28 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote: > Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA > nodes") allowed an imbalance between NUMA nodes such that communicating > tasks would not be pulled apart by the load balancer. This works fine when > there is a 1:1 relationship between LLC and node but can be suboptimal > for multiple LLCs if independent tasks prematurely use CPUs sharing cache. > > Zen* has multiple LLCs per node with local memory channels and due to > the allowed imbalance, it's far harder to tune some workloads to run > optimally than it is on hardware that has 1 LLC per node. This patch > adjusts the imbalance on multi-LLC machines to allow an imbalance up to > the point where LLCs should be balanced between nodes. > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v4 > MB/sec copy-16 166712.18 ( 0.00%) 651540.22 ( 290.82%) > MB/sec scale-16 140109.66 ( 0.00%) 382254.74 ( 172.83%) > MB/sec add-16 160791.18 ( 0.00%) 623073.98 ( 287.51%) > MB/sec triad-16 160043.84 ( 0.00%) 633964.52 ( 296.12%) Could you please share the size of the stream array ? These numbers are higher than what I am observing. > > STREAM can use directives to force the spread if the OpenMP is new > enough but that doesn't help if an application uses threads and > it's not known in advance how many threads will be created. > > Coremark is a CPU and cache intensive benchmark parallelised with > threads. When running with 1 thread per instance, the vanilla kernel > allows threads to contend on cache. 
With the patch; > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v4r24 > Min Score-16 367816.09 ( 0.00%) 384015.36 ( 4.40%) > Hmean Score-16 389627.78 ( 0.00%) 431907.14 * 10.85%* > Max Score-16 416178.96 ( 0.00%) 480120.03 ( 15.36%) > Stddev Score-16 17361.82 ( 0.00%) 32505.34 ( -87.22%) > CoeffVar Score-16 4.45 ( 0.00%) 7.49 ( -68.30%) > > It can also make a big difference for semi-realistic workloads > like specjbb which can execute arbitrary numbers of threads without > advance knowledge of how they should be placed > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v4 > Hmean tput-1 73743.05 ( 0.00%) 70258.27 * -4.73%* > Hmean tput-8 563036.51 ( 0.00%) 591187.39 ( 5.00%) > Hmean tput-16 1016590.61 ( 0.00%) 1032311.78 ( 1.55%) > Hmean tput-24 1418558.41 ( 0.00%) 1424005.80 ( 0.38%) > Hmean tput-32 1608794.22 ( 0.00%) 1907855.80 * 18.59%* > Hmean tput-40 1761338.13 ( 0.00%) 2108162.23 * 19.69%* > Hmean tput-48 2290646.54 ( 0.00%) 2214383.47 ( -3.33%) > Hmean tput-56 2463345.12 ( 0.00%) 2780216.58 * 12.86%* > Hmean tput-64 2650213.53 ( 0.00%) 2598196.66 ( -1.96%) > Hmean tput-72 2497253.28 ( 0.00%) 2998882.47 * 20.09%* > Hmean tput-80 2820786.72 ( 0.00%) 2951655.27 ( 4.64%) > Hmean tput-88 2813541.68 ( 0.00%) 3045450.86 * 8.24%* > Hmean tput-96 2604158.67 ( 0.00%) 3035311.91 * 16.56%* > Hmean tput-104 2713810.62 ( 0.00%) 2984270.04 ( 9.97%) > Hmean tput-112 2558425.37 ( 0.00%) 2894737.46 * 13.15%* > Hmean tput-120 2611434.93 ( 0.00%) 2781661.01 ( 6.52%) > Hmean tput-128 2706103.22 ( 0.00%) 2811447.85 ( 3.89%) > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > include/linux/sched/topology.h | 1 + > kernel/sched/fair.c | 36 +++++++++++++++++---------------- > kernel/sched/topology.c | 37 ++++++++++++++++++++++++++++++++++ > 3 files changed, 57 insertions(+), 17 deletions(-) > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index c07bfa2d80f2..54f5207154d3 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0a969affca76..972ba586b113 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > static unsigned long cpu_runnable(struct rq *rq); > static unsigned long cpu_util(int cpu); > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight); > + int dst_running, int dst_weight, > + int imb_numa_nr); > > static inline enum > numa_type numa_classify(unsigned int imbalance_pct, > @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->dst_stats.weight, > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1950,8 +1953,10 @@ static 
int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > return idlest; > #endif > /* > - * Otherwise, keep the task on this node to stay close > - * its wakeup source and improve locality. If there is > - * a real need of migration, periodic load balance will > - * take care of it. > + * Otherwise, keep the task on this node to stay local > + * to its wakeup source if the number of running tasks > + * are below the allowed imbalance. If there is a real > + * need of migration, periodic load balance will take > + * care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > + if (local_sgs.sum_nr_running <= sd->imb_numa_nr) > return NULL; > } > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > } > } > > -#define NUMA_IMBALANCE_MIN 2 > - > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int dst_weight, > + int imb_numa_nr) > { > if (!allow_numa_imbalance(dst_running, dst_weight)) > return imbalance; > if (4 * dst_running >= dst_weight) we return imbalance here. The dst_weight here corresponds to the span of the domain, while dst_running is the nr_running in busiest. On Zen3, at the top most NUMA domain, the dst_weight = 256 across in all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are two groups, where each group is a socket. So, unless there are at least 64 tasks running in one of the sockets, we would not return imbalance here and go to the next step. > - /* > - * Allow a small imbalance based on a simple pair of communicating > - * tasks that remain local when the destination is lightly loaded. > - */ > - if (imbalance <= NUMA_IMBALANCE_MIN) > + if (imbalance <= imb_numa_nr) imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have PREFER_SIBLING, we would be balancing the number of idle CPUs. We will end up doing the imbalance, as long as the difference between the idle CPUs is at least 8. In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we will not rebalance unless the difference between the idle CPUs is 16. In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the threshold is now bumped up to 32. > return 0; > > return imbalance; > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - busiest->sum_nr_running, env->sd->span_weight); > + busiest->sum_nr_running, env->sd->span_weight, > + env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..bacec575ade2 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. 
> + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + struct sched_domain *top = sd; We don't seem to be using top anywhere where sd may not be used since we already have variables imb and imb_span to record the top->imb_numa_nr and top->span_weight. > + unsigned int llc_sq; > + > + /* > + * nr_llcs = (top->span_weight / llc_weight); > + * imb = (child_weight / nr_llcs) >> 2 child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2. > + * > + * is equivalent to > + * > + * imb = (llc_weight^2 / top->span_weight) >> 2 > + * > + */ > + llc_sq = child->span_weight * child->span_weight; > + > + imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > + imb_span = sd->span_weight; On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256. with NPS=1 top = DIE. top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128. with NPS=2 top = NODE. top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64. with NPS=4 top = NODE. top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32. On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64. with NPS=1 top = DIE. top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128. with NPS=2 top = NODE. top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64. with NPS=4 top = NODE. top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32. > + > + sd->imb_numa_nr = imb; > + } else { > + sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > + } On Zen3, with NPS=1 sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4. with NPS=2 sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4 sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8 with NPS=4 sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8 sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16 For Zen2, since the imb_span and imb values are the same as the corresponding NPS=x values on Zen3, the imb_numa_nr values are the same as well since the corresponding sd->span_weight is the same. If we look at the highest NUMA domain, there are two groups in all the NPS configurations. There are the same number of LLCs in each of these groups across the different NPS configurations (nr_llcs=8 on Zen3, 16 on Zen2) . However, the imb_numa_nr at this domain varies with the NPS value, since we compute the imb_numa_nr value relative to the number of "top" domains that can be fit within this NUMA domain. This is because the size of the "top" domain varies with the NPS value. This shows up in the benchmark results. The numbers with stream, tbench and YCSB + Mongodb are as follows: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. 
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip/sched/core mel-v3 mel-v4 Copy: 113716.62 (0.00 pct) 218961.59 (92.55 pct) 217130.07 (90.93 pct) Scale: 110996.89 (0.00 pct) 216674.73 (95.20 pct) 220765.94 (98.89 pct) Add: 124504.19 (0.00 pct) 253461.32 (103.57 pct 260273.88 (109.04 pct) Triad: 122890.43 (0.00 pct) 247552.00 (101.44 pct 252615.62 (105.56 pct) NPS=2 Test: tip/sched/core mel-v3 mel-v4 Copy: 58217.00 (0.00 pct) 204630.34 (251.49 pct) 191312.73 (228.62 pct) Scale: 55004.76 (0.00 pct) 212142.88 (285.68 pct) 175499.15 (219.06 pct) Add: 63269.04 (0.00 pct) 254752.56 (302.64 pct) 203571.50 (221.75 pct) Triad: 62178.25 (0.00 pct) 247290.80 (297.71 pct) 198988.70 (220.02 pct) NPS=4 Test: tip/sched/core mel-v3 mel-v4 Copy: 37986.66 (0.00 pct) 254183.87 (569.13 pct) 48748.87 (28.33 pct) Scale: 35471.22 (0.00 pct) 237804.76 (570.41 pct) 48317.82 (36.21 pct) Add: 39303.25 (0.00 pct) 292285.20 (643.66 pct) 54259.59 (38.05 pct) Triad: 39319.85 (0.00 pct) 285284.30 (625.54 pct) 54503.98 (38.61 pct) We can see that with the v4 patch, for NPS=2 and NPS=4, the gains start diminishing since the thresholds are higher than NPS=1. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip/sched/core mel-v3 mel-v4 Copy: 137362.66 (0.00 pct) 236661.65 (72.28 pct) 241148.65 (75.55 pct) Scale: 126742.24 (0.00 pct) 214568.17 (69.29 pct) 226416.41 (78.64 pct) Add: 148236.33 (0.00 pct) 257114.42 (73.44 pct) 272030.50 (83.51 pct) Triad: 146913.25 (0.00 pct) 241880.88 (64.64 pct) 259873.61 (76.88 pct) NPS=2 Test: tip/sched/core mel-v3 mel-v4 Copy: 107143.94 (0.00 pct) 244922.66 (128.59 pct) 198299.91 (85.07 pct) Scale: 102004.90 (0.00 pct) 218738.55 (114.43 pct) 177890.23 (74.39 pct) Add: 117760.23 (0.00 pct) 270516.24 (129.71 pct) 211458.30 (79.56 pct) Triad: 115927.92 (0.00 pct) 255985.20 (120.81 pct) 197812.60 (70.63 pct) NPS=4 Test: tip/sched/core mel-v3 mel-v4 Copy: 111653.17 (0.00 pct) 253912.17 (127.41 pct) 48898.34 (-56.20 pct) Scale: 105289.35 (0.00 pct) 223710.85 (112.47 pct) 48426.03 (-54.00 pct) Add: 120927.64 (0.00 pct) 277701.20 (129.64 pct) 54425.48 (-54.99 pct) Triad: 117659.97 (0.00 pct) 259473.84 (120.52 pct) 54622.82 (-53.57 pct) with -DNTIMES=100, each of the Copy,Scale,Add,Triad kernels runs for a longer duration. So the test takes longer time (6-10 seconds) giving the load-balancer sufficient time to place the tasks and balance them. In this configuration we see that the v4 shows some degradation on NPS=4. This is due to the imb_numa_nr being higher compared to v3. While Stream benefits from spreading, it is fair to understand the gains that we make with benchmarks that would prefer the tasks co-located instead of spread out. Chose tbench and YCSB+Mongodb as representatives of these. 
The numbers are as follows: ~~~~~~~~~~~~~~~~~~~~~~~~ tbench Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Clients: tip/sched/core mel-v3 mel-v4 1 633.25 (0.00 pct) 619.18 (-2.22 pct) 632.96 (-0.04 pct) 2 1152.54 (0.00 pct) 1189.91 (3.24 pct) 1184.84 (2.80 pct) 4 1946.53 (0.00 pct) 2177.45 (11.86 pct) 1979.62 (1.69 pct) 8 3554.65 (0.00 pct) 3565.16 (0.29 pct) 3678.13 (3.47 pct) 16 6222.00 (0.00 pct) 6484.89 (4.22 pct) 6256.02 (0.54 pct) 32 11707.57 (0.00 pct) 12185.93 (4.08 pct) 12006.63 (2.55 pct) 64 18433.50 (0.00 pct) 19537.03 (5.98 pct) 19088.57 (3.55 pct) 128 27400.07 (0.00 pct) 31771.53 (15.95 pct) 27265.00 (-0.49 pct) 256 33195.27 (0.00 pct) 24478.67 (-26.25 pct) 34065.60 (2.62 pct) 512 41633.10 (0.00 pct) 54833.20 (31.70 pct) 46724.00 (12.22 pct) 1024 53877.23 (0.00 pct) 56363.37 (4.61 pct) 44813.10 (-16.82 pct) NPS=2 Clients: tip/sched/core mel-v3 mel-v4 1 629.76 (0.00 pct) 620.94 (-1.40 pct) 629.22 (-0.08 pct) 2 1177.01 (0.00 pct) 1203.27 (2.23 pct) 1169.12 (-0.66 pct) 4 1990.97 (0.00 pct) 2228.18 (11.91 pct) 1888.39 (-5.15 pct) 8 3535.45 (0.00 pct) 3620.76 (2.41 pct) 3662.72 (3.59 pct) 16 6309.02 (0.00 pct) 6548.66 (3.79 pct) 6508.67 (3.16 pct) 32 12038.73 (0.00 pct) 12145.97 (0.89 pct) 11411.50 (-5.21 pct) 64 18599.67 (0.00 pct) 19448.87 (4.56 pct) 17146.07 (-7.81 pct) 128 27861.57 (0.00 pct) 30630.53 (9.93 pct) 28217.30 (1.27 pct) 256 28215.80 (0.00 pct) 26864.67 (-4.78 pct) 29330.47 (3.95 pct) 512 44239.67 (0.00 pct) 52822.47 (19.40 pct) 42652.63 (-3.58 pct) 1024 54403.53 (0.00 pct) 53905.57 (-0.91 pct) 48490.30 (-10.86 pct) NPS=4 Clients: tip/sched/core mel-v3 mel-v4 1 622.68 (0.00 pct) 617.87 (-0.77 pct) 667.38 (7.17 pct) 2 1160.74 (0.00 pct) 1182.40 (1.86 pct) 1294.12 (11.49 pct) 4 1961.29 (0.00 pct) 2172.41 (10.76 pct) 2477.76 (26.33 pct) 8 3664.25 (0.00 pct) 3450.80 (-5.82 pct) 4067.42 (11.00 pct) 16 6495.53 (0.00 pct) 5873.41 (-9.57 pct) 6931.66 (6.71 pct) 32 11833.27 (0.00 pct) 12010.43 (1.49 pct) 12710.60 (7.41 pct) 64 17723.50 (0.00 pct) 18416.23 (3.90 pct) 18793.47 (6.03 pct) 128 27724.83 (0.00 pct) 27894.50 (0.61 pct) 27948.60 (0.80 pct) 256 31351.70 (0.00 pct) 23944.43 (-23.62 pct) 35430.17 (13.00 pct) 512 43383.43 (0.00 pct) 49830.63 (14.86 pct) 43877.83 (1.13 pct) 1024 46974.27 (0.00 pct) 53583.83 (14.07 pct) 50563.23 (7.64 pct) With NPS=4, with v4, we see no regressions with tbench compared to tip/sched/core and there is a considerable improvement in most cases. So, the higher imb_numa_nr helps pack the tasks which beneficial to tbench. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ YCSB + Mongodb. 4 client instances, 256 threads per client instance. These threads have a very low utilization. The overall system utilization was in the range of 16-20%. YCSB workload type : A ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 tip/sched/core mel-v3 mel-v4 Throughput 351611.0 314981.33 329026.33 (-10.42 pct) (-6.42 pct) NPS=4 tip/sched/core mel-v3 mel-v4 Throughput 315808.0 316600.67 331093.67 (0.25 pct) (4.84 pct) Since at NPS=4, the imb_numa_nr=8 and 16 respectively at the lower and higher NUMA domains, the task spreading happens more reluctantly compared to v3 where the imb_numa_nr was 1 in both the domains. -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-13 8:28 ` Gautham R. Shenoy @ 2021-12-13 13:01 ` Mel Gorman 2021-12-13 14:47 ` Gautham R. Shenoy 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-13 13:01 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote: > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > > per LLC the results and without binding, the results are > > > > 5.16.0-rc1 5.16.0-rc1 > > vanilla sched-numaimb-v4 > > MB/sec copy-16 166712.18 ( 0.00%) 651540.22 ( 290.82%) > > MB/sec scale-16 140109.66 ( 0.00%) 382254.74 ( 172.83%) > > MB/sec add-16 160791.18 ( 0.00%) 623073.98 ( 287.51%) > > MB/sec triad-16 160043.84 ( 0.00%) 633964.52 ( 296.12%) > > > Could you please share the size of the stream array ? These numbers > are higher than what I am observing. > 512MB > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > } > > } > > > > -#define NUMA_IMBALANCE_MIN 2 > > - > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight) > > + int dst_running, int dst_weight, > > + int imb_numa_nr) > > { > > if (!allow_numa_imbalance(dst_running, dst_weight)) > > return imbalance; > > > > if (4 * dst_running >= dst_weight) we return imbalance here. The > dst_weight here corresponds to the span of the domain, while > dst_running is the nr_running in busiest. > Yes, once dst_running is high enough, no imbalance is allowed. In previous versions I changed this but that was a mistake and in this version, the threshold where imbalance is not allowed remains the same. > On Zen3, at the top most NUMA domain, the dst_weight = 256 across in > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are > two groups, where each group is a socket. So, unless there are at > least 64 tasks running in one of the sockets, we would not return > imbalance here and go to the next step. > Yes > > > - /* > > - * Allow a small imbalance based on a simple pair of communicating > > - * tasks that remain local when the destination is lightly loaded. > > - */ > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > + if (imbalance <= imb_numa_nr) > > imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains > don't have PREFER_SIBLING, we would be balancing the number of idle > CPUs. We will end up doing the imbalance, as long as the difference > between the idle CPUs is at least 8. > > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we > will not rebalance unless the difference between the idle CPUs is 16. > > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the > threshold is now bumped up to 32. 
> > > return 0; > > > > > > > return imbalance; > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* Consider allowing a small imbalance between NUMA groups */ > > if (env->sd->flags & SD_NUMA) { > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > - busiest->sum_nr_running, env->sd->span_weight); > > + busiest->sum_nr_running, env->sd->span_weight, > > + env->sd->imb_numa_nr); > > } > > > > return; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..bacec575ade2 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. > > + */ > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > + struct sched_domain *top = sd; > > > We don't seem to be using top anywhere where sd may not be used since > we already have variables imb and imb_span to record the > top->imb_numa_nr and top->span_weight. > Top could have been removed but we might still need it. > > > + unsigned int llc_sq; > > + > > + /* > > + * nr_llcs = (top->span_weight / llc_weight); > > + * imb = (child_weight / nr_llcs) >> 2 > > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2. > That is be clearer. > > + * > > + * is equivalent to > > + * > > + * imb = (llc_weight^2 / top->span_weight) >> 2 > > + * > > + */ > > + llc_sq = child->span_weight * child->span_weight; > > + > > + imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > > + imb_span = sd->span_weight; > > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256. > with NPS=1 > top = DIE. > top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128. > > with NPS=2 > top = NODE. > top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64. > > with NPS=4 > top = NODE. > top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32. > > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64. > with NPS=1 > top = DIE. > top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128. > > with NPS=2 > top = NODE. > top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64. > > with NPS=4 > top = NODE. > top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32. > > > > + > > + sd->imb_numa_nr = imb; > > + } else { > > + sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > + } > > On Zen3, > with NPS=1 > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4. > > with NPS=2 > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4 > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8 > > with NPS=4 > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8 > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16 > > > For Zen2, since the imb_span and imb values are the same as the > corresponding NPS=x values on Zen3, the imb_numa_nr values are the > same as well since the corresponding sd->span_weight is the same. > > If we look at the highest NUMA domain, there are two groups in all the > NPS configurations. 
There are the same number of LLCs in each of these > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16 > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS > value, since we compute the imb_numa_nr value relative to the number > of "top" domains that can be fit within this NUMA domain. This is > because the size of the "top" domain varies with the NPS value. This > shows up in the benchmark results. > This was intentional to have some scaling but based on your results, the scaling might be at the wrong level. > > > The numbers with stream, tbench and YCSB + > Mongodb are as follows: > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Stream with 16 threads. > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 > Zen3, 64C128T per socket, 2 sockets, > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > NPS=1 > Test: tip/sched/core mel-v3 mel-v4 > Copy: 113716.62 (0.00 pct) 218961.59 (92.55 pct) 217130.07 (90.93 pct) > Scale: 110996.89 (0.00 pct) 216674.73 (95.20 pct) 220765.94 (98.89 pct) > Add: 124504.19 (0.00 pct) 253461.32 (103.57 pct 260273.88 (109.04 pct) > Triad: 122890.43 (0.00 pct) 247552.00 (101.44 pct 252615.62 (105.56 pct) > > > NPS=2 > Test: tip/sched/core mel-v3 mel-v4 > Copy: 58217.00 (0.00 pct) 204630.34 (251.49 pct) 191312.73 (228.62 pct) > Scale: 55004.76 (0.00 pct) 212142.88 (285.68 pct) 175499.15 (219.06 pct) > Add: 63269.04 (0.00 pct) 254752.56 (302.64 pct) 203571.50 (221.75 pct) > Triad: 62178.25 (0.00 pct) 247290.80 (297.71 pct) 198988.70 (220.02 pct) > > NPS=4 > Test: tip/sched/core mel-v3 mel-v4 > Copy: 37986.66 (0.00 pct) 254183.87 (569.13 pct) 48748.87 (28.33 pct) > Scale: 35471.22 (0.00 pct) 237804.76 (570.41 pct) 48317.82 (36.21 pct) > Add: 39303.25 (0.00 pct) 292285.20 (643.66 pct) 54259.59 (38.05 pct) > Triad: 39319.85 (0.00 pct) 285284.30 (625.54 pct) 54503.98 (38.61 pct) > At minimum, v3 is a failure because a single pair of communicating tasks were getting split across NUMA domains and the allowed numa imbalance gets cut off too early because of the change to allow_numa_imbalance. So while it's a valid comparison, it's definitely not the fix. Given how you describe NPS, maybe the scaling should only start at the point where tasks are no longer balanced between sibling domains. Can you try this? I've only boot tested it at this point. It should work for STREAM at least but probably not great for tbench. diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index bacec575ade2..1fa3e977521d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && (child->flags & SD_SHARE_PKG_RESOURCES)) { - struct sched_domain *top = sd; + struct sched_domain *top, *top_p; unsigned int llc_sq; /* - * nr_llcs = (top->span_weight / llc_weight); - * imb = (child_weight / nr_llcs) >> 2 + * nr_llcs = (sd->span_weight / llc_weight); + * imb = (llc_weight / nr_llcs) >> 2 * * is equivalent to * - * imb = (llc_weight^2 / top->span_weight) >> 2 + * imb = (llc_weight^2 / sd->span_weight) >> 2 * */ llc_sq = child->span_weight * child->span_weight; - imb = max(2U, ((llc_sq / top->span_weight) >> 2)); - imb_span = sd->span_weight; - + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); sd->imb_numa_nr = imb; + + /* + * Set span based on top domain that places + * tasks in sibling domains. 
+ */ + top = sd; + top_p = top->parent; + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { + top = top->parent; + top_p = top->parent; + } + imb_span = top_p ? top_p->span_weight : sd->span_weight; } else { - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); + int factor = max(1U, (sd->span_weight / imb_span)); + + sd->imb_numa_nr = imb * factor; } } } ^ permalink raw reply related [flat|nested] 48+ messages in thread
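To see what the revised imb_span selection above would compute, here is a standalone sketch for assumed Zen3-like layouts (16 CPUs per LLC). It collapses the SD_PREFER_SIBLING walk on the assumption, noted earlier in the thread, that NUMA domains do not set that flag, so imb_span simply becomes the span of the first NUMA parent; the array-based topology and helper names are purely illustrative.

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* spans[0] is the first !SD_SHARE_PKG_RESOURCES domain; spans[1..]
 * are its NUMA parents, assumed not to prefer sibling placement. */
static void imb_for(const char *name, unsigned int llc_weight,
		    const unsigned int *spans, int nr)
{
	unsigned int imb = max_u(2U, (llc_weight * llc_weight / spans[0]) >> 2);
	unsigned int imb_span = nr > 1 ? spans[1] : spans[0];
	int i;

	printf("%s: imb=%u", name, imb);
	for (i = 1; i < nr; i++)
		printf(" imb_numa_nr[%u]=%u", spans[i],
		       imb * max_u(1U, spans[i] / imb_span));
	printf("\n");
}

int main(void)
{
	const unsigned int nps1[] = { 128, 256 };	/* DIE, NUMA */
	const unsigned int nps4[] = { 32, 128, 256 };	/* NODE, NUMA, NUMA */

	imb_for("NPS=1", 16, nps1, 2);	/* imb=2, NUMA(256) -> 2 */
	imb_for("NPS=4", 16, nps4, 3);	/* imb=2, NUMA(128) -> 2, NUMA(256) -> 4 */
	return 0;
}

These are the same figures arrived at by hand for the revised patch further down the thread (imb_numa_nr of 2 at the top NUMA level for NPS=1, and 2 then 4 at the two NUMA levels for NPS=4), so the sketch is only a convenience for following that arithmetic.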
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-13 13:01 ` Mel Gorman @ 2021-12-13 14:47 ` Gautham R. Shenoy 2021-12-15 11:52 ` Gautham R. Shenoy 0 siblings, 1 reply; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-13 14:47 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Mon, Dec 13, 2021 at 01:01:31PM +0000, Mel Gorman wrote: > On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote: > > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > > > per LLC the results and without binding, the results are > > > > > > 5.16.0-rc1 5.16.0-rc1 > > > vanilla sched-numaimb-v4 > > > MB/sec copy-16 166712.18 ( 0.00%) 651540.22 ( 290.82%) > > > MB/sec scale-16 140109.66 ( 0.00%) 382254.74 ( 172.83%) > > > MB/sec add-16 160791.18 ( 0.00%) 623073.98 ( 287.51%) > > > MB/sec triad-16 160043.84 ( 0.00%) 633964.52 ( 296.12%) > > > > > > Could you please share the size of the stream array ? These numbers > > are higher than what I am observing. > > > > 512MB Thanks, I will try with this one. > > > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int dst_weight, > > > + int imb_numa_nr) > > > { > > > if (!allow_numa_imbalance(dst_running, dst_weight)) > > > return imbalance; > > > > > > > if (4 * dst_running >= dst_weight) we return imbalance here. The > > dst_weight here corresponds to the span of the domain, while > > dst_running is the nr_running in busiest. > > > > Yes, once dst_running is high enough, no imbalance is allowed. In > previous versions I changed this but that was a mistake and in this > version, the threshold where imbalance is not allowed remains the same. > > > On Zen3, at the top most NUMA domain, the dst_weight = 256 across in > > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are > > two groups, where each group is a socket. So, unless there are at > > least 64 tasks running in one of the sockets, we would not return > > imbalance here and go to the next step. > > > > Yes > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains > > don't have PREFER_SIBLING, we would be balancing the number of idle > > CPUs. We will end up doing the imbalance, as long as the difference > > between the idle CPUs is at least 8. > > > > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we > > will not rebalance unless the difference between the idle CPUs is 16. > > > > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the > > threshold is now bumped up to 32. 
> > > > > return 0; > > > > > > > > > > > > return imbalance; > > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->span_weight, > > > + env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..bacec575ade2 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > + struct sched_domain *top = sd; > > > > > > We don't seem to be using top anywhere where sd may not be used since > > we already have variables imb and imb_span to record the > > top->imb_numa_nr and top->span_weight. > > > > Top could have been removed but we might still need it. > > > > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (top->span_weight / llc_weight); > > > + * imb = (child_weight / nr_llcs) >> 2 > > > > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2. > > > > That is be clearer. > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / top->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > > > + imb_span = sd->span_weight; > > > > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256. > > with NPS=1 > > top = DIE. > > top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128. > > > > with NPS=2 > > top = NODE. > > top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64. > > > > with NPS=4 > > top = NODE. > > top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32. > > > > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64. > > with NPS=1 > > top = DIE. > > top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128. > > > > with NPS=2 > > top = NODE. > > top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64. > > > > with NPS=4 > > top = NODE. > > top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32. > > > > > > > + > > > + sd->imb_numa_nr = imb; > > > + } else { > > > + sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > > + } > > > > On Zen3, > > with NPS=1 > > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4. > > > > with NPS=2 > > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4 > > sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8 > > > > with NPS=4 > > sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8 > > sd=NUMA, sd->span_weight = 256. 
sd->imb_numa_nr = 2 * (256/32) = 16 > > > > > > For Zen2, since the imb_span and imb values are the same as the > > corresponding NPS=x values on Zen3, the imb_numa_nr values are the > > same as well since the corresponding sd->span_weight is the same. > > > > If we look at the highest NUMA domain, there are two groups in all the > > NPS configurations. There are the same number of LLCs in each of these > > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16 > > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS > > value, since we compute the imb_numa_nr value relative to the number > > of "top" domains that can be fit within this NUMA domain. This is > > because the size of the "top" domain varies with the NPS value. This > > shows up in the benchmark results. > > > > This was intentional to have some scaling but based on your results, the > scaling might be at the wrong level. Ok. > > > > > > > The numbers with stream, tbench and YCSB + > > Mongodb are as follows: > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > Stream with 16 threads. > > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 > > Zen3, 64C128T per socket, 2 sockets, > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > > NPS=1 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 113716.62 (0.00 pct) 218961.59 (92.55 pct) 217130.07 (90.93 pct) > > Scale: 110996.89 (0.00 pct) 216674.73 (95.20 pct) 220765.94 (98.89 pct) > > Add: 124504.19 (0.00 pct) 253461.32 (103.57 pct 260273.88 (109.04 pct) > > Triad: 122890.43 (0.00 pct) 247552.00 (101.44 pct 252615.62 (105.56 pct) > > > > > > NPS=2 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 58217.00 (0.00 pct) 204630.34 (251.49 pct) 191312.73 (228.62 pct) > > Scale: 55004.76 (0.00 pct) 212142.88 (285.68 pct) 175499.15 (219.06 pct) > > Add: 63269.04 (0.00 pct) 254752.56 (302.64 pct) 203571.50 (221.75 pct) > > Triad: 62178.25 (0.00 pct) 247290.80 (297.71 pct) 198988.70 (220.02 pct) > > > > NPS=4 > > Test: tip/sched/core mel-v3 mel-v4 > > Copy: 37986.66 (0.00 pct) 254183.87 (569.13 pct) 48748.87 (28.33 pct) > > Scale: 35471.22 (0.00 pct) 237804.76 (570.41 pct) 48317.82 (36.21 pct) > > Add: 39303.25 (0.00 pct) 292285.20 (643.66 pct) 54259.59 (38.05 pct) > > Triad: 39319.85 (0.00 pct) 285284.30 (625.54 pct) 54503.98 (38.61 pct) > > > > At minimum, v3 is a failure because a single pair of communicating tasks > were getting split across NUMA domains and the allowed numa imbalance > gets cut off too early because of the change to allow_numa_imbalance. > So while it's a valid comparison, it's definitely not the fix. v3 is definitely not a fix. I wasn't hinting at that. It was just to point out the opportunity that we have. > > Given how you describe NPS, maybe the scaling should only start at the > point where tasks are no longer balanced between sibling domains. Can > you try this? I've only boot tested it at this point. It should work for > STREAM at least but probably not great for tbench. Thanks for the patch. I will queue this one for tonight. 
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index bacec575ade2..1fa3e977521d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > (child->flags & SD_SHARE_PKG_RESOURCES)) { > - struct sched_domain *top = sd; > + struct sched_domain *top, *top_p; > unsigned int llc_sq; > > /* > - * nr_llcs = (top->span_weight / llc_weight); > - * imb = (child_weight / nr_llcs) >> 2 > + * nr_llcs = (sd->span_weight / llc_weight); > + * imb = (llc_weight / nr_llcs) >> 2 > * > * is equivalent to > * > - * imb = (llc_weight^2 / top->span_weight) >> 2 > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > * > */ > llc_sq = child->span_weight * child->span_weight; > > - imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > - imb_span = sd->span_weight; > - > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > sd->imb_numa_nr = imb; > + > + /* > + * Set span based on top domain that places > + * tasks in sibling domains. > + */ > + top = sd; > + top_p = top->parent; > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > } else { > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > } > } > } -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
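For reference, the imb_numa_nr arithmetic in the patch above can be worked through in a stand-alone sketch. The topology weights below (LLC 16, NODE 64, NUMA 128/256) are the Zen3 NPS=2 values discussed in this thread; the program is illustrative user-space code, not part of the patch, and reproduces the hand-computed values of 2, 2 and 4.

#include <stdio.h>

int main(void)
{
	unsigned int llc_weight = 16;		/* child (MC/LLC) span */
	unsigned int node_weight = 64;		/* first level above the LLC in NPS=2 */
	unsigned int numa_weight[] = { 128, 256 };
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb, imb_span, factor, i;

	/* first !SD_SHARE_PKG_RESOURCES level: imb = max(2, (llc^2 / span) >> 2) */
	imb = (llc_sq / node_weight) >> 2;
	if (imb < 2)
		imb = 2;

	/*
	 * imb_span comes from the first parent without SD_PREFER_SIBLING,
	 * i.e. the 128-CPU NUMA level in this topology.
	 */
	imb_span = numa_weight[0];
	printf("NODE(%u):  imb_numa_nr = %u\n", node_weight, imb);

	/* higher levels scale the base value by span / imb_span */
	for (i = 0; i < 2; i++) {
		factor = numa_weight[i] / imb_span;
		if (factor < 1)
			factor = 1;
		printf("NUMA(%u): imb_numa_nr = %u\n", numa_weight[i], imb * factor);
	}
	return 0;
}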
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-13 14:47 ` Gautham R. Shenoy @ 2021-12-15 11:52 ` Gautham R. Shenoy 2021-12-15 12:25 ` Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-15 11:52 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote: > > Thanks for the patch. I will queue this one for tonight. > Getting the numbers took a bit longer than I expected. > > > > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index bacec575ade2..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > (child->flags & SD_SHARE_PKG_RESOURCES)) { > > - struct sched_domain *top = sd; > > + struct sched_domain *top, *top_p; > > unsigned int llc_sq; > > > > /* > > - * nr_llcs = (top->span_weight / llc_weight); > > - * imb = (child_weight / nr_llcs) >> 2 > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > * > > * is equivalent to > > * > > - * imb = (llc_weight^2 / top->span_weight) >> 2 > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > * > > */ > > llc_sq = child->span_weight * child->span_weight; > > > > - imb = max(2U, ((llc_sq / top->span_weight) >> 2)); > > - imb_span = sd->span_weight; > > - > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > + top = top->parent; > > + top_p = top->parent; > > + } > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > } else { > > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > + int factor = max(1U, (sd->span_weight / imb_span)); > > + So for the first NUMA domain, the sd->imb_numa_nr will be imb, which turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings. On a 2 Socket Zen3: NPS=1 child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 top_p = NUMA, imb_span = 256. NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 NPS=2 child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 top_p = NUMA, imb_span = 128. NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 NPS=4: child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 top_p = NUMA, imb_span = 128. NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 Again, we will be more aggressively load balancing across the two sockets in NPS=1 mode compared to NPS=2/4. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. 
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 113666.77 214885.89 212162.63 226065.79 (0.00%) (89.04%) (86.65%) (98.88%) Scale: 110962.00 215475.05 220002.56 227610.64 (0.00%) (94.18%) (98.26%) (105.12%) Add: 124440.44 250338.11 258569.94 271368.51 (0.00%) (101.17%) (107.78%) (118.07%) Triad: 122836.42 244993.16 251075.19 265465.57 (0.00%) (99.44%) (104.39%) (116.11%) NPS=2 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 58193.29 203242.78 191171.22 212785.96 (0.00%) (249.25%) (228.51%) (265.65%) Scale: 54980.48 209056.28 175497.10 223452.99 (0.00%) (280.23%) (219.19%) (306.42%) Add: 63244.24 249610.19 203569.35 270487.22 (0.00%) (294.67%) (221.87%) (327.68%) Triad: 62166.77 242688.72 198971.26 262757.16 (0.00%) (290.38%) (220.06%) (322.66%) NPS=4 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 37762.14 253412.82 48748.60 164765.83 (0.00%) (571.07%) (29.09%) (336.32%) Scale: 33879.66 237736.93 48317.66 159641.67 (0.00%) (601.70%) (42.61%) (371.20%) Add: 38398.90 292186.56 54259.56 184583.70 (0.00%) (660.92%) (41.30%) (380.70%) Triad: 37942.38 285124.76 54503.74 181250.80 (0.00%) (651.46%) (43.64%) (377.70%) So while in NPS=1 and NPS=2 we are able to recover the performance that was obtained with v3, on v4, we are not able to recover as much. However it is not only due to the fact that in, the imb_numa_nr thresholds for NPS=4 were (1,1) for the two NUMA domains, while in v4.1 the imb_numa_nr was (2,4), but also due to the fact that in v3, we used the imb_numa_nr thresholds in allow_numa_imbalance() while in v4 and v4.1 we are using those in adjust_numa_imbalance(). The distinction is that in v3, we will trigger load balance as soon as the number of tasks in the busiest group in the NUMA domain is greater than or equal to imb_numa_nr at that domain. In v4 and v4.1, we will trigger the load-balance if - the number of tasks in the busiest group is greater than 1/4th the CPUs in the NUMA domain. OR - the difference between the idle CPUs in the busiest and this group is greater than imb_numa_nr. If we retain the (2,4) thresholds from v4.1 but use them in allow_numa_imbalance() as in v3 we get NPS=4 Test: mel-v4.2 Copy: 225860.12 (498.11%) Scale: 227869.07 (572.58%) Add: 278365.58 (624.93%) Triad: 264315.44 (596.62%) It shouldn't have made so much of a difference considering the fact that with 16 stream tasks, we should have hit "imbalance > imb_numa_nr" in adjust_numa_imbalance() eventually. But these are the numbers! The trend is similar with -DNTIMES=100. Even in this case with v4.2 we can recover the stream performance in NPS4 case. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. 
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 137281.89 235886.13 240588.80 249005.99 (0.00 %) (71.82%) (75.25%) (81.38%) Scale: 126559.48 213667.75 226411.39 228523.78 (0.00 %) (68.82%) (78.89%) (80.56%) Add: 148144.23 255780.73 272027.93 272872.49 (0.00 %) (72.65%) (83.62%) (84.19%) Triad: 146829.78 240400.29 259854.84 265640.70 (0.00 %) (63.72%) (76.97%) (80.91%) NPS=2 Test: tip-core mel-v3 mel-v4 mel-v4.1 Copy: 105873.29 243513.40 198299.43 258266.99 (0.00%) (130.00%) (87.29%) (143.93%) Scale: 100510.54 217361.59 177890.00 231509.69 (0.00%) (116.25%) (76.98%) (130.33%) Add: 115932.61 268488.03 211436.57 288396.07 (0.00%) (131.58%) (82.37%) (148.76%) Triad: 113349.09 253865.68 197810.83 272219.89 (0.00%) (123.96%) (74.51%) (140.16%) NPS=4 Test: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 Copy: 106798.80 251503.78 48898.24 171769.82 266788.03 (0.00%) (135.49%) (-54.21%) (60.83%) (149.80%) Scale: 101377.71 221135.30 48425.90 160748.21 232480.83 (0.00%) (118.13%) (-52.23%) (58.56%) (129.32%) Add: 116131.56 275008.74 54425.29 186387.91 290787.31 (0.00%) (136.80%) (-53.13%) (60.49%) (150.39%) Triad: 113443.70 256027.20 54622.68 180936.47 277456.83 (0.00%) (125.68%) (-51.85%) (59.49%) (144.57%) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ tbench Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS=1 ====== Clients: tip-core mel-v3 mel-v4 mel-v4.1 1 633.19 619.16 632.94 619.27 (0.00%) (-2.21%) (-0.03%) (-2.19%) 2 1152.48 1189.88 1184.82 1189.19 (0.00%) (3.24%) (2.80%) (3.18%) 4 1946.46 2177.40 1979.56 2196.09 (0.00%) (11.86%) (1.70%) (12.82%) 8 3553.29 3564.50 3678.07 3668.77 (0.00%) (0.31%) (3.51%) (3.24%) 16 6217.03 6484.58 6249.29 6534.73 (0.00%) (4.30%) (0.51%) (5.11%) 32 11702.59 12185.77 12005.99 11917.57 (0.00%) (4.12%) (2.59%) (1.83%) 64 18394.56 19535.11 19080.19 19500.55 (0.00%) (6.20%) (3.72%) (6.01%) 128 27231.02 31759.92 27200.52 30358.99 (0.00%) (16.63%) (-0.11%) (11.48%) 256 33166.10 24474.30 31639.98 24788.12 (0.00%) (-26.20%) (-4.60%) (-25.26%) 512 41605.44 54823.57 46684.48 54559.02 (0.00%) (31.77%) (12.20%) (31.13%) 1024 53650.54 56329.39 44422.99 56320.66 (0.00%) (4.99%) (-17.19%) (4.97%) We see that the v4.1 performs better than v4 in most cases except when the number of clients=256 where the spread strategy seems to be hurting as we see degradation in both v3 and v4.1. This is true even for NPS=2 and NPS=4 cases (see below). NPS=2 ===== Clients: tip-core mel-v3 mel-v4 mel-v4.1 1 629.76 620.91 629.11 631.95 (0.00%) (-1.40%) (-0.10%) (0.34%) 2 1176.96 1203.12 1169.09 1186.74 (0.00%) (2.22%) (-0.66%) (0.83%) 4 1990.97 2228.04 1888.19 1995.21 (0.00%) (11.90%) (-5.16%) (0.21%) 8 3534.57 3617.16 3660.30 3548.09 (0.00%) (2.33%) (3.55%) (0.38%) 16 6294.71 6547.80 6504.13 6470.34 (0.00%) (4.02%) (3.32%) (2.79%) 32 12035.73 12143.03 11396.26 11860.91 (0.00%) (0.89%) (-5.31%) (-1.45%) 64 18583.39 19439.12 17126.47 18799.54 (0.00%) (4.60%) (-7.83%) (1.16%) 128 27811.89 30562.84 28090.29 27468.94 (0.00%) (9.89%) (1.00%) (-1.23%) 256 28148.95 26488.57 29117.13 23628.29 (0.00%) (-5.89%) (3.43%) (-16.05%) 512 43934.15 52796.38 42603.49 41725.75 (0.00%) (20.17%) (-3.02%) (-5.02%) 1024 54391.65 53891.83 48419.09 43913.40 (0.00%) (-0.91%) (-10.98%) (-19.26%) In this case, v4.1 performs as good as v4 upto 64 clients. But after that we see degradation. The degradation is significant in 1024 clients case. 
NPS=4 ===== Clients: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 1 622.65 617.83 667.34 644.76 617.58 (0.00%) (-0.77%) (7.17%) (3.55%) (-0.81%) 2 1160.62 1182.30 1294.08 1193.88 1182.55 (0.00%) (1.86%) (11.49%) (2.86%) (1.88%) 4 1961.14 2171.91 2477.71 1929.56 2116.01 (0.00%) (10.74%) (26.34%) (-1.61%) (7.89%) 8 3662.94 3447.98 4067.40 3627.43 3580.32 (0.00%) (-5.86%) (11.04%) (-0.96%) (-2.25%) 16 6490.92 5871.93 6924.32 6660.13 6413.34 (0.00%) (-9.53%) (6.67%) (2.60%) (-1.19%) 32 11831.81 12004.30 12709.06 12187.78 11767.46 (0.00%) (1.45%) (7.41%) (3.00%) (-0.54%) 64 17717.36 18406.79 18785.41 18820.33 18197.86 (0.00%) (3.89%) (6.02%) (6.22%) (2.71%) 128 27723.35 27777.34 27939.63 27399.64 24310.93 (0.00%) (0.19%) (0.78%) (-1.16%) (-12.30%) 256 30919.69 23937.03 35412.26 26780.37 24642.24 (0.00%) (-22.58%) (14.52%) (-13.38%) (-20.30%) 512 43366.03 49570.65 43830.84 43654.42 41031.90 (0.00%) (14.30%) (1.07%) (0.66%) (-5.38%) 1024 46960.83 53576.16 50557.19 43743.07 40884.98 (0.00%) (14.08%) (7.65%) (-6.85%) (-12.93%) In the NPS=4 case, clearly v4 provides the best results. v4.1 does better v4.2 since it is able to hold off spreading for a longer period compared to v4.2. -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
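The v3-versus-v4 distinction described earlier in this message (v3 cuts the allowed imbalance off once the busiest group reaches imb_numa_nr tasks, while v4/v4.1 keep tolerating it until a quarter of the domain's CPUs are busy or the imbalance itself exceeds the threshold) can be modelled with two small predicates. This is a simplified sketch for intuition only, not the exact kernel code paths; the example numbers are assumptions.

#include <stdbool.h>
#include <stdio.h>

/* v3: stop tolerating the imbalance once the busiest group reaches imb_numa_nr tasks */
static bool v3_tolerates(int busiest_running, int imb_numa_nr)
{
	return busiest_running < imb_numa_nr;
}

/* v4/v4.1: tolerate small imbalances until a quarter of the domain is busy */
static bool v4_tolerates(int busiest_running, int domain_cpus,
			 int imbalance, int imb_numa_nr)
{
	return busiest_running < (domain_cpus >> 2) && imbalance <= imb_numa_nr;
}

int main(void)
{
	/* 8 stream tasks on one node of a 128-CPU NUMA domain, imb_numa_nr = 2 */
	printf("v3 tolerates: %d\n", v3_tolerates(8, 2));		/* 0: spread */
	printf("v4 tolerates: %d\n", v4_tolerates(8, 128, 2, 2));	/* 1: stay  */
	return 0;
}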
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-15 11:52 ` Gautham R. Shenoy @ 2021-12-15 12:25 ` Mel Gorman 2021-12-16 18:33 ` Gautham R. Shenoy 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-15 12:25 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > Hello Mel, > > > On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote: > > > > > Thanks for the patch. I will queue this one for tonight. > > > > Getting the numbers took a bit longer than I expected. > No worries. > > > <SNIP> > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. > > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > + top = top->parent; > > > + top_p = top->parent; > > > + } > > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > > } else { > > > - sd->imb_numa_nr = imb * (sd->span_weight / imb_span); > > > + int factor = max(1U, (sd->span_weight / imb_span)); > > > + > > > So for the first NUMA domain, the sd->imb_numa_nr will be imb, which > turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings. > > On a 2 Socket Zen3: > > NPS=1 > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > top_p = NUMA, imb_span = 256. > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > NPS=2 > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > top_p = NUMA, imb_span = 128. > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > NPS=4: > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > top_p = NUMA, imb_span = 128. > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > Again, we will be more aggressively load balancing across the two > sockets in NPS=1 mode compared to NPS=2/4. > Yes, but I felt it was reasonable behaviour because we have to strike some sort of balance between allowing a NUMA imbalance up to a point to prevent communicating tasks being pulled apart and v3 broke that completely. There will always be a tradeoff between tasks that want to remain local to each other and others that prefer to spread as wide as possible as quickly as possible. > <SNIP> > If we retain the (2,4) thresholds from v4.1 but use them in > allow_numa_imbalance() as in v3 we get > > NPS=4 > Test: mel-v4.2 > Copy: 225860.12 (498.11%) > Scale: 227869.07 (572.58%) > Add: 278365.58 (624.93%) > Triad: 264315.44 (596.62%) > The potential problem with this is that it probably will work for netperf when it's a single communicating pair but may not work as well when there are multiple communicating pairs or a number of communicating tasks that exceed numa_imb_nr. 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > NPS=1 > ====== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > 1 633.19 619.16 632.94 619.27 > (0.00%) (-2.21%) (-0.03%) (-2.19%) > > 2 1152.48 1189.88 1184.82 1189.19 > (0.00%) (3.24%) (2.80%) (3.18%) > > 4 1946.46 2177.40 1979.56 2196.09 > (0.00%) (11.86%) (1.70%) (12.82%) > > 8 3553.29 3564.50 3678.07 3668.77 > (0.00%) (0.31%) (3.51%) (3.24%) > > 16 6217.03 6484.58 6249.29 6534.73 > (0.00%) (4.30%) (0.51%) (5.11%) > > 32 11702.59 12185.77 12005.99 11917.57 > (0.00%) (4.12%) (2.59%) (1.83%) > > 64 18394.56 19535.11 19080.19 19500.55 > (0.00%) (6.20%) (3.72%) (6.01%) > > 128 27231.02 31759.92 27200.52 30358.99 > (0.00%) (16.63%) (-0.11%) (11.48%) > > 256 33166.10 24474.30 31639.98 24788.12 > (0.00%) (-26.20%) (-4.60%) (-25.26%) > > 512 41605.44 54823.57 46684.48 54559.02 > (0.00%) (31.77%) (12.20%) (31.13%) > > 1024 53650.54 56329.39 44422.99 56320.66 > (0.00%) (4.99%) (-17.19%) (4.97%) > > > We see that the v4.1 performs better than v4 in most cases except when > the number of clients=256 where the spread strategy seems to be > hurting as we see degradation in both v3 and v4.1. This is true even > for NPS=2 and NPS=4 cases (see below). > The 256 client case is a bit of a crapshoot. At that point, the NUMA imbalancing is disabled and the machine is overloaded. > NPS=2 > ===== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > 1 629.76 620.91 629.11 631.95 > (0.00%) (-1.40%) (-0.10%) (0.34%) > > 2 1176.96 1203.12 1169.09 1186.74 > (0.00%) (2.22%) (-0.66%) (0.83%) > > 4 1990.97 2228.04 1888.19 1995.21 > (0.00%) (11.90%) (-5.16%) (0.21%) > > 8 3534.57 3617.16 3660.30 3548.09 > (0.00%) (2.33%) (3.55%) (0.38%) > > 16 6294.71 6547.80 6504.13 6470.34 > (0.00%) (4.02%) (3.32%) (2.79%) > > 32 12035.73 12143.03 11396.26 11860.91 > (0.00%) (0.89%) (-5.31%) (-1.45%) > > 64 18583.39 19439.12 17126.47 18799.54 > (0.00%) (4.60%) (-7.83%) (1.16%) > > 128 27811.89 30562.84 28090.29 27468.94 > (0.00%) (9.89%) (1.00%) (-1.23%) > > 256 28148.95 26488.57 29117.13 23628.29 > (0.00%) (-5.89%) (3.43%) (-16.05%) > > 512 43934.15 52796.38 42603.49 41725.75 > (0.00%) (20.17%) (-3.02%) (-5.02%) > > 1024 54391.65 53891.83 48419.09 43913.40 > (0.00%) (-0.91%) (-10.98%) (-19.26%) > > In this case, v4.1 performs as good as v4 upto 64 clients. But after > that we see degradation. The degradation is significant in 1024 > clients case. > Kinda the same, it's more likely to be run-to-run variance because the machine is overloaded. 
> NPS=4 > ===== > Clients: tip-core mel-v3 mel-v4 mel-v4.1 mel-v4.2 > 1 622.65 617.83 667.34 644.76 617.58 > (0.00%) (-0.77%) (7.17%) (3.55%) (-0.81%) > > 2 1160.62 1182.30 1294.08 1193.88 1182.55 > (0.00%) (1.86%) (11.49%) (2.86%) (1.88%) > > 4 1961.14 2171.91 2477.71 1929.56 2116.01 > (0.00%) (10.74%) (26.34%) (-1.61%) (7.89%) > > 8 3662.94 3447.98 4067.40 3627.43 3580.32 > (0.00%) (-5.86%) (11.04%) (-0.96%) (-2.25%) > > 16 6490.92 5871.93 6924.32 6660.13 6413.34 > (0.00%) (-9.53%) (6.67%) (2.60%) (-1.19%) > > 32 11831.81 12004.30 12709.06 12187.78 11767.46 > (0.00%) (1.45%) (7.41%) (3.00%) (-0.54%) > > 64 17717.36 18406.79 18785.41 18820.33 18197.86 > (0.00%) (3.89%) (6.02%) (6.22%) (2.71%) > > 128 27723.35 27777.34 27939.63 27399.64 24310.93 > (0.00%) (0.19%) (0.78%) (-1.16%) (-12.30%) > > 256 30919.69 23937.03 35412.26 26780.37 24642.24 > (0.00%) (-22.58%) (14.52%) (-13.38%) (-20.30%) > > 512 43366.03 49570.65 43830.84 43654.42 41031.90 > (0.00%) (14.30%) (1.07%) (0.66%) (-5.38%) > > 1024 46960.83 53576.16 50557.19 43743.07 40884.98 > (0.00%) (14.08%) (7.65%) (-6.85%) (-12.93%) > > > In the NPS=4 case, clearly v4 provides the best results. > > v4.1 does better v4.2 since it is able to hold off spreading for a > longer period compared to v4.2. > Most likely because v4.2 is disabling the allowed NUMA imbalance too soon. This is the trade-off between favouring communicating tasks over embararassingly parallel problems. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-15 12:25 ` Mel Gorman @ 2021-12-16 18:33 ` Gautham R. Shenoy 2021-12-20 11:12 ` Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-16 18:33 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML Hello Mel, On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: [..SNIP..] > > On a 2 Socket Zen3: > > > > NPS=1 > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > top_p = NUMA, imb_span = 256. > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > NPS=2 > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > top_p = NUMA, imb_span = 128. > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > NPS=4: > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > top_p = NUMA, imb_span = 128. > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > Again, we will be more aggressively load balancing across the two > > sockets in NPS=1 mode compared to NPS=2/4. > > > > Yes, but I felt it was reasonable behaviour because we have to strike > some sort of balance between allowing a NUMA imbalance up to a point > to prevent communicating tasks being pulled apart and v3 broke that > completely. There will always be a tradeoff between tasks that want to > remain local to each other and others that prefer to spread as wide as > possible as quickly as possible. I agree with this argument that we want to be conservative while pulling tasks across NUMA domains. My point was that the threshold at the NUMA domain that spans the 2 sockets is lower for NPS=1 (imb_numa_nr = 2) when compared to the threshold for the same NUMA domain when NPS=2/4 (imb_numa_nr = 4). Irrespective of what NPS mode we are operating in, the NUMA distance between the two sockets is 32 on Zen3 systems. Hence shouldn't the thresholds be the same for that level of NUMA? Would something like the following work ? if (sd->flags & SD_NUMA) { /* We are using the child as a proxy for the group. */ group_span = sd->child->span_weight; sd_distance = /* NUMA distance at this sd level */ /* By default we set the threshold to 1/4th the sched-group span. */ imb_numa_shift = 2; /* * We can be a little aggressive if the cost of migrating tasks * across groups of this NUMA level is not high. * Assuming */ if (sd_distance < REMOTE_DISTANCE) imb_numa_shift++; /* * Compute the number of LLCs in each group. * More the LLCs, more aggressively we migrate across * the groups at this NUMA sd. 
*/ nr_llcs = group_span/llc_size; sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); } With this, on Intel platforms, we will get sd->imb_numa_nr = (span of socket)/4 On Zen3, NPS=1, Inter-socket NUMA : sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 NPS=2, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (64/4) >> (2+1)) = 2 Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 NPS=4, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (32/2) >> (2+1)) = 2 Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4 > > > <SNIP> > > If we retain the (2,4) thresholds from v4.1 but use them in > > allow_numa_imbalance() as in v3 we get > > > > NPS=4 > > Test: mel-v4.2 > > Copy: 225860.12 (498.11%) > > Scale: 227869.07 (572.58%) > > Add: 278365.58 (624.93%) > > Triad: 264315.44 (596.62%) > > > > The potential problem with this is that it probably will work for > netperf when it's a single communicating pair but may not work as well > when there are multiple communicating pairs or a number of communicating > tasks that exceed numa_imb_nr. Yes that's true. I think what you are doing in v4 is the right thing. In case of stream in NPS=4, it just manages to hit the corner case for this heuristic which results in a suboptimal behaviour. Description follows: On NPS=4, if we run 8 stream tasks bound to a socket with v4.1, we get the following initial placement based on data obtained via the sched:sched_wakeup_new tracepoint. This behaviour is consistently reproducible. ------------------------------------------------------- | NUMA | | ----------------------- ------------------------ | | | NODE0 | | NODE1 | | | | ------------- | | ------------- | | | | | 0 tasks | MC0 | | | 1 tasks | MC2 | | | | ------------- | | ------------- | | | | ------------- | | ------------- | | | | | 1 tasks | MC1 | | | 1 tasks | MC3 | | | | ------------- | | ------------- | | | | | | | | | ----------------------- ------------------------ | | ----------------------- ------------------------ | | | NODE2 | | NODE3 | | | | ------------- | | ------------- | | | | | 1 tasks | MC4 | | | 1 tasks | MC6 | | | | ------------- | | ------------- | | | | ------------- | | ------------- | | | | | 2 tasks | MC5 | | | 1 tasks | MC7 | | | | ------------- | | ------------- | | | | | | | | | ----------------------- ------------------------ | | | ------------------------------------------------------- From the trace data obtained for sched:sched_wakeup_new and sched:sched_migrate_task, we see PID 106089 : timestamp 35607.831040 : was running in MC5 PID 106090 : timestamp 35607.831040 : first placed in MC4 PID 106091 : timestamp 35607.831081 : first placed in MC5 PID 106092 : timestamp 35607.831155 : first placed in MC7 PID 106093 : timestamp 35607.831209 : first placed in MC3 PID 106094 : timestamp 35607.831254 : first placed in MC1 PID 106095 : timestamp 35607.831300 : first placed in MC6 PID 106096 : timestamp 35607.831344 : first placed in MC2 Subsequently we do not see any migrations for stream tasks (via the sched:sched_migrate_task tracepoint), even though they run for nearly 10 seconds. The reasons: - No load-balancing is possible at any of the NODE sched-domains since the groups are more or less balanced within each NODE. - At NUMA sched-domain, busiest group would be NODE2. When any CPU in NODE0 performs load-balancing at NUMA level, it can pull tasks only if the imbalance between NODE0 and NODE2 is greater than imb_numa_nr = 2, which isn't the case here. 
Hence, with v4.1, we get the following numbers which are better than the current upstream, but are still not the best. Copy: 78182.7 Scale: 76344.1 Add: 87638.7 Triad: 86388.9 However, if I run an "mpstat 1 10 > /tmp/mpstat.log&" just before kickstarting stream-8, the performance significantly improves (again, consistently reproducible). Copy: 122804.6 Scale: 115192.9 Add: 137191.6 Triad: 133338.5 In this case, from the trace data for stream, we see: PID 105174 : timestamp 35547.526816 : was running in MC4 PID 105174 : timestamp 35547.577635 : moved to MC5 PID 105175 : timestamp 35547.526816 : first placed in MC4 PID 105176 : timestamp 35547.526846 : first placed in MC3 PID 105177 : timestamp 35547.526893 : first placed in MC7 PID 105178 : timestamp 35547.526928 : first placed in MC1 PID 105179 : timestamp 35547.526961 : first placed in MC2 PID 105180 : timestamp 35547.527001 : first placed in MC6 PID 105181 : timestamp 35547.527032 : first placed in MC0 In this case, at the time of the initial placement (find_idlest_group() ?), we are able to spread out farther away. The subsequent load-balance at the NODE2 domain is able to balance the tasks between MC4 and MC5. > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > NPS=1 > > ====== > > Clients: tip-core mel-v3 mel-v4 mel-v4.1 > > 1 633.19 619.16 632.94 619.27 > > (0.00%) (-2.21%) (-0.03%) (-2.19%) > > > > 2 1152.48 1189.88 1184.82 1189.19 > > (0.00%) (3.24%) (2.80%) (3.18%) > > > > 4 1946.46 2177.40 1979.56 2196.09 > > (0.00%) (11.86%) (1.70%) (12.82%) > > > > 8 3553.29 3564.50 3678.07 3668.77 > > (0.00%) (0.31%) (3.51%) (3.24%) > > > > 16 6217.03 6484.58 6249.29 6534.73 > > (0.00%) (4.30%) (0.51%) (5.11%) > > > > 32 11702.59 12185.77 12005.99 11917.57 > > (0.00%) (4.12%) (2.59%) (1.83%) > > > > 64 18394.56 19535.11 19080.19 19500.55 > > (0.00%) (6.20%) (3.72%) (6.01%) > > > > 128 27231.02 31759.92 27200.52 30358.99 > > (0.00%) (16.63%) (-0.11%) (11.48%) > > > > 256 33166.10 24474.30 31639.98 24788.12 > > (0.00%) (-26.20%) (-4.60%) (-25.26%) > > > > 512 41605.44 54823.57 46684.48 54559.02 > > (0.00%) (31.77%) (12.20%) (31.13%) > > > > 1024 53650.54 56329.39 44422.99 56320.66 > > (0.00%) (4.99%) (-17.19%) (4.97%) > > > > > > We see that the v4.1 performs better than v4 in most cases except when > > the number of clients=256 where the spread strategy seems to be > > hurting as we see degradation in both v3 and v4.1. This is true even > > for NPS=2 and NPS=4 cases (see below). > > > > The 256 client case is a bit of a crapshoot. At that point, the NUMA > imbalancing is disabled and the machine is overloaded. Yup. [..snip..] > Most likely because v4.2 is disabling the allowed NUMA imbalance too > soon. This is the trade-off between favouring communicating tasks over > embararassingly parallel problems. v4.1 does allow the NUMA imbalance for a longer duration. But since the thresholds are small enough, I guess it should be a ok for most workloads. -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
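The distance-aware calculation proposed earlier in this message can be written out as a stand-alone sketch. The NUMA distances (12 within a socket, 32 across sockets) and group spans are assumed Zen3 NPS=2 values; REMOTE_DISTANCE is the kernel's constant of 30. A single per-level distance does not exist on every topology, so this only models the symmetric case.

#include <stdio.h>

#define REMOTE_DISTANCE	30	/* kernel default for a remote node */

static unsigned int imb_numa_nr(unsigned int group_span,
				unsigned int llc_size,
				int sd_distance)
{
	unsigned int imb_numa_shift = 2;	/* 1/4 of an LLC by default */
	unsigned int nr_llcs = group_span / llc_size;
	unsigned int imb;

	/* cheaper migrations within a socket: balance a little more aggressively */
	if (sd_distance < REMOTE_DISTANCE)
		imb_numa_shift++;

	imb = (group_span / nr_llcs) >> imb_numa_shift;
	return imb > 2 ? imb : 2;
}

int main(void)
{
	/* assumed Zen3 NPS=2 distances: 12 within the socket, 32 across sockets */
	printf("intra-socket NUMA: %u\n", imb_numa_nr(64, 16, 12));	/* 2 */
	printf("inter-socket NUMA: %u\n", imb_numa_nr(128, 16, 32));	/* 4 */
	return 0;
}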
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-16 18:33 ` Gautham R. Shenoy @ 2021-12-20 11:12 ` Mel Gorman 2021-12-21 15:03 ` Gautham R. Shenoy 2021-12-21 17:13 ` Vincent Guittot 0 siblings, 2 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-20 11:12 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML (sorry for the delay, was offline for a few days) On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > Hello Mel, > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > [..SNIP..] > > > > On a 2 Socket Zen3: > > > > > > NPS=1 > > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > > top_p = NUMA, imb_span = 256. > > > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > > > NPS=2 > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > > top_p = NUMA, imb_span = 128. > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > NPS=4: > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > > top_p = NUMA, imb_span = 128. > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > Again, we will be more aggressively load balancing across the two > > > sockets in NPS=1 mode compared to NPS=2/4. > > > > > > > Yes, but I felt it was reasonable behaviour because we have to strike > > some sort of balance between allowing a NUMA imbalance up to a point > > to prevent communicating tasks being pulled apart and v3 broke that > > completely. There will always be a tradeoff between tasks that want to > > remain local to each other and others that prefer to spread as wide as > > possible as quickly as possible. > > I agree with this argument that we want to be conservative while > pulling tasks across NUMA domains. My point was that the threshold at > the NUMA domain that spans the 2 sockets is lower for NPS=1 > (imb_numa_nr = 2) when compared to the threshold for the same NUMA > domain when NPS=2/4 (imb_numa_nr = 4). > Is that a problem though? On an Intel machine with sub-numa clustering, the distances are 11 and 21 for a "node" that is the split cache and the remote node respectively. > Irrespective of what NPS mode we are operating in, the NUMA distance > between the two sockets is 32 on Zen3 systems. Hence shouldn't the > thresholds be the same for that level of NUMA? > Maybe, but then it is not a property of the sched_domain and instead needs to account for distance when balancing between two nodes that may be varying distances from each other. > Would something like the following work ? > > if (sd->flags & SD_NUMA) { > > /* We are using the child as a proxy for the group. */ > group_span = sd->child->span_weight; > sd_distance = /* NUMA distance at this sd level */ > NUMA distance relative to what? On Zen, the distance to a remote node may be fixed but on topologies with multiple nodes that are not fully linked to every other node by one hop, the same is not true. > /* By default we set the threshold to 1/4th the sched-group span. 
*/ > imb_numa_shift = 2; > > /* > * We can be a little aggressive if the cost of migrating tasks > * across groups of this NUMA level is not high. > * Assuming > */ > > if (sd_distance < REMOTE_DISTANCE) > imb_numa_shift++; > The distance would have to be accounted for elsewhere because here we are considering one node in isolation, not relative to other nodes. > /* > * Compute the number of LLCs in each group. > * More the LLCs, more aggressively we migrate across > * the groups at this NUMA sd. > */ > nr_llcs = group_span/llc_size; > > sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); > } > Same, any adjustment would have to happen during load balancing taking into account the relatively NUMA distances. I'm not necessarily opposed but it would be a separate patch. > > > <SNIP> > > > If we retain the (2,4) thresholds from v4.1 but use them in > > > allow_numa_imbalance() as in v3 we get > > > > > > NPS=4 > > > Test: mel-v4.2 > > > Copy: 225860.12 (498.11%) > > > Scale: 227869.07 (572.58%) > > > Add: 278365.58 (624.93%) > > > Triad: 264315.44 (596.62%) > > > > > > > The potential problem with this is that it probably will work for > > netperf when it's a single communicating pair but may not work as well > > when there are multiple communicating pairs or a number of communicating > > tasks that exceed numa_imb_nr. > > > Yes that's true. I think what you are doing in v4 is the right thing. > > In case of stream in NPS=4, it just manages to hit the corner case for > this heuristic which results in a suboptimal behaviour. Description > follows: > To avoid the corner case, we'd need to explicitly favour spreading early and assume wakeup will pull communicating tasks together and NUMA balancing migrate the data after some time which looks like diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c07bfa2d80f2..54f5207154d3 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a969affca76..df0e84462e62 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_runnable(struct rq *rq); static unsigned long cpu_util(int cpu); static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight); + int dst_running, + int imb_numa_nr); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + 
(sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, * This is an approximation as the number of running tasks may not be * related to the number of busy CPUs due to sched_setaffinity. */ -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) { - return (dst_running < (dst_weight >> 2)); + return dst_running < imb_numa_nr; } /* @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) return idlest; #endif /* - * Otherwise, keep the task on this node to stay close - * its wakeup source and improve locality. If there is - * a real need of migration, periodic load balance will - * take care of it. + * Otherwise, keep the task on this node to stay local + * to its wakeup source if the number of running tasks + * are below the allowed imbalance. If there is a real + * need of migration, periodic load balance will take + * care of it. */ - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) return NULL; } @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } } -#define NUMA_IMBALANCE_MIN 2 - static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; - /* - * Allow a small imbalance based on a simple pair of communicating - * tasks that remain local when the destination is lightly loaded. - */ - if (imbalance <= NUMA_IMBALANCE_MIN) + if (imbalance <= imb_numa_nr) return 0; return imbalance; @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, env->sd->span_weight); + busiest->sum_nr_running, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..1fa3e977521d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* + * Calculate an allowed NUMA imbalance such that LLCs do not get + * imbalanced. + */ + for_each_cpu(i, cpu_map) { + unsigned int imb = 0; + unsigned int imb_span = 1; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *top, *top_p; + unsigned int llc_sq; + + /* + * nr_llcs = (sd->span_weight / llc_weight); + * imb = (llc_weight / nr_llcs) >> 2 + * + * is equivalent to + * + * imb = (llc_weight^2 / sd->span_weight) >> 2 + * + */ + llc_sq = child->span_weight * child->span_weight; + + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); + sd->imb_numa_nr = imb; + + /* + * Set span based on top domain that places + * tasks in sibling domains. + */ + top = sd; + top_p = top->parent; + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { + top = top->parent; + top_p = top->parent; + } + imb_span = top_p ? 
top_p->span_weight : sd->span_weight; + } else { + int factor = max(1U, (sd->span_weight / imb_span)); + + sd->imb_numa_nr = imb * factor; + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) ^ permalink raw reply related [flat|nested] 48+ messages in thread
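Taking the two helpers from the patch above in user-space form makes the intended behaviour easy to check. The example imb_numa_nr of 4 corresponds to the inter-socket NUMA domain in the Zen3 NPS=2 walk earlier in the thread; the calls below are illustrative only.

#include <stdio.h>

static int allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
	return dst_running < imb_numa_nr;
}

static long adjust_numa_imbalance(long imbalance, int dst_running, int imb_numa_nr)
{
	/* destination already busy enough: report the real imbalance */
	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
		return imbalance;

	/* small imbalance on a lightly loaded node: treat it as balanced */
	if (imbalance <= imb_numa_nr)
		return 0;

	return imbalance;
}

int main(void)
{
	/* communicating pair, destination nearly idle: imbalance is ignored */
	printf("%ld\n", adjust_numa_imbalance(2, 2, 4));	/* prints 0 */

	/* node already running 8 tasks: imbalance reported, balancer may spread */
	printf("%ld\n", adjust_numa_imbalance(6, 8, 4));	/* prints 6 */
	return 0;
}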
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-20 11:12 ` Mel Gorman @ 2021-12-21 15:03 ` Gautham R. Shenoy 2021-12-21 17:13 ` Vincent Guittot 1 sibling, 0 replies; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-21 15:03 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote: > (sorry for the delay, was offline for a few days) > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > Hello Mel, > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > [..SNIP..] > > > > > > On a 2 Socket Zen3: > > > > > > > > NPS=1 > > > > child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2 > > > > top_p = NUMA, imb_span = 256. > > > > > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2 > > > > > > > > NPS=2 > > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2 > > > > top_p = NUMA, imb_span = 128. > > > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > > > NPS=4: > > > > child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2 > > > > top_p = NUMA, imb_span = 128. > > > > > > > > NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2 > > > > NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4 > > > > > > > > Again, we will be more aggressively load balancing across the two > > > > sockets in NPS=1 mode compared to NPS=2/4. > > > > > > > > > > Yes, but I felt it was reasonable behaviour because we have to strike > > > some sort of balance between allowing a NUMA imbalance up to a point > > > to prevent communicating tasks being pulled apart and v3 broke that > > > completely. There will always be a tradeoff between tasks that want to > > > remain local to each other and others that prefer to spread as wide as > > > possible as quickly as possible. > > > > I agree with this argument that we want to be conservative while > > pulling tasks across NUMA domains. My point was that the threshold at > > the NUMA domain that spans the 2 sockets is lower for NPS=1 > > (imb_numa_nr = 2) when compared to the threshold for the same NUMA > > domain when NPS=2/4 (imb_numa_nr = 4). > > > > Is that a problem though? On an Intel machine with sub-numa clustering, > the distances are 11 and 21 for a "node" that is the split cache and the > remote node respectively. So, my question was, on an Intel machine, with sub-numa clustering enabled vs disabled, is the value of imb_numa_nr for the NUMA domain which spans the remote nodes (distance=21) the same or different. And if it is different, what is the rationale behind that. I am totally on-board with the idea that for the different NUMA levels, the corresponding imb_numa_nr should be different. Just in case, I was not making myself clear earlier, on Zen3, the NUMA-A sched-domain, in the figures below, has groups where each group spans a socket in all the NPS configurations. However, only on NPS=1 we have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4, the value of sd->imb_numa_nr=4 for NUMA-A domain. 
Thus if we had 4 tasks sharing data, on NPS=2/4, they would reside on the same socket, while on NPS=1, we will have 2 tasks on one socket, and the other 2 will migrated to the other socket. That said, I have not been able to observe any significiant difference with a real-world workload like Mongodb run on NPS=1 with imb_numa_nr set to 2 vs 4. Zen3, NPS=1 ------------------------------------------------------------------ | | | NUMA-A : sd->imb_numa_nr = 2 | | ------------------------ ------------------------ | | |DIE | |DIE | | | | | | | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | | | | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | ------ ------ | | ------ ------ | | | | |MC | |MC | | | |MC | |MC | | | | | ------ ------ | | ------ ------ | | | | | | | | | ------------------------ ------------------------ | | | ------------------------------------------------------------------ Zen3, NPS=2 ------------------------------------------------------------------ | | | NUMA-A: sd->imb_numa_nr = 4 | | --------------------------- --------------------------- | | |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | | | | | | | | | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | ----------- ----------- | | ----------- ----------- | | | | | | | | | --------------------------- --------------------------- | | | ------------------------------------------------------------------ Zen3, NPS=4 ------------------------------------------------------------------ | | | NUMA-A: sd->imb_numa_nr = 4 | | --------------------------- --------------------------- | | |NUMA-B :sd->imb_numa_nr=2| |NUMA-B :sd->imb_numa_nr=2| | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | ----------- ----------- | | ----------- ----------- | | | | ----------- ----------- | | ----------- ----------- | | | | |NODE | |NODE | | | |NODE | |NODE | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | | |MC | | | |MC | | | | | |MC | | | |MC | | | | | | | ------ | | ------ | | | | ------ | | ------ | | | | | 
----------- ----------- | | ----------- ----------- | | | | | | | | | --------------------------- --------------------------- | | | ------------------------------------------------------------------ > > Irrespective of what NPS mode we are operating in, the NUMA distance > > between the two sockets is 32 on Zen3 systems. Hence shouldn't the > > thresholds be the same for that level of NUMA? > > > > Maybe, but then it is not a property of the sched_domain and instead > needs to account for distance when balancing between two nodes that may > be varying distances from each other. > > > Would something like the following work ? > > > > if (sd->flags & SD_NUMA) { > > > > /* We are using the child as a proxy for the group. */ > > group_span = sd->child->span_weight; > > sd_distance = /* NUMA distance at this sd level */ > > > > NUMA distance relative to what? On Zen, the distance to a remote node may > be fixed but on topologies with multiple nodes that are not fully linked > to every other node by one hop, the same is not true. Fair enough. The "sd_distance" I was referring to the node_distance() between the CPUs of any two groups in this NUMA domain. However, this was assuming that the node_distance() between the CPUs of any two groups would be the same, which is not the case for certain platforms. So this wouldn't work. > > > /* By default we set the threshold to 1/4th the sched-group span. */ > > imb_numa_shift = 2; > > > > /* > > * We can be a little aggressive if the cost of migrating tasks > > * across groups of this NUMA level is not high. > > * Assuming > > */ > > > > if (sd_distance < REMOTE_DISTANCE) > > imb_numa_shift++; > > > > The distance would have to be accounted for elsewhere because here we > are considering one node in isolation, not relative to other nodes. > > > /* > > * Compute the number of LLCs in each group. > > * More the LLCs, more aggressively we migrate across > > * the groups at this NUMA sd. > > */ > > nr_llcs = group_span/llc_size; > > > > sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift); > > } > > > > Same, any adjustment would have to happen during load balancing taking > into account the relatively NUMA distances. I'm not necessarily opposed > but it would be a separate patch. Sure, we can look into this separately. > > > > > <SNIP> > > > > If we retain the (2,4) thresholds from v4.1 but use them in > > > > allow_numa_imbalance() as in v3 we get > > > > > > > > NPS=4 > > > > Test: mel-v4.2 > > > > Copy: 225860.12 (498.11%) > > > > Scale: 227869.07 (572.58%) > > > > Add: 278365.58 (624.93%) > > > > Triad: 264315.44 (596.62%) > > > > > > > > > > The potential problem with this is that it probably will work for > > > netperf when it's a single communicating pair but may not work as well > > > when there are multiple communicating pairs or a number of communicating > > > tasks that exceed numa_imb_nr. > > > > > > Yes that's true. I think what you are doing in v4 is the right thing. > > > > In case of stream in NPS=4, it just manages to hit the corner case for > > this heuristic which results in a suboptimal behaviour. Description > > follows: > > > > To avoid the corner case, we'd need to explicitly favour spreading early > and assume wakeup will pull communicating tasks together and NUMA > balancing migrate the data after some time which looks like Actually I was able to root-cause the reason behind the drop in the performance of stream on NPS-4. 
I have already responded earlier in another thread : https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/ Appending the patch here: --- kernel/sched/fair.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ec354bf88b0d..c1b2a422a877 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) return idlest; #endif /* - * Otherwise, keep the task on this node to stay local - * to its wakeup source if the number of running tasks - * are below the allowed imbalance. If there is a real - * need of migration, periodic load balance will take - * care of it. + * Otherwise, keep the task on this node to + * stay local to its wakeup source if the + * number of running tasks (including the new + * one) are below the allowed imbalance. If + * there is a real need of migration, periodic + * load balance will take care of it. */ - if (local_sgs.sum_nr_running <= sd->imb_numa_nr) + if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr) return NULL; } -- With this fix on top of your fix to compute the imb_numa_nr at the relevant level (v4.1: https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/), the stream regression for NPS4 is no longer there. Test: tip-core v4 v4.1 v4.1-find_idlest_group_fix Copy: 37762.14 (0.00%) 48748.60 (29.09%) 164765.83 (336.32%) 205963.99 (445.42%) Scale: 33879.66 (0.00%) 48317.66 (42.61%) 159641.67 (371.20%) 218136.57 (543.85%) Add: 38398.90 (0.00%) 54259.56 (41.30%) 184583.70 (380.70%) 257857.90 (571.52%) Triad: 37942.38 (0.00%) 54503.74 (43.64%) 181250.80 (377.70%) 251410.28 (562.61%) -- Thanks and Regards gautham. ^ permalink raw reply related [flat|nested] 48+ messages in thread
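The off-by-one addressed by this fix can be seen with a tiny table. Using imb_numa_nr = 2 (one of the values computed for the Zen3 NUMA domains above, used here as an assumption), the original check still keeps a newly woken task local when two tasks already run there, while the corrected check lets find_idlest_group() pick another group. The snippet is purely illustrative, not kernel code.

#include <stdio.h>

static int stays_local_old(int local_running, int imb_numa_nr)
{
	return local_running <= imb_numa_nr;
}

static int stays_local_fixed(int local_running, int imb_numa_nr)
{
	/* also count the task that is about to be placed */
	return local_running + 1 <= imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 2;
	int running;

	for (running = 0; running <= 3; running++)
		printf("local running=%d  old=%d  fixed=%d\n", running,
		       stays_local_old(running, imb_numa_nr),
		       stays_local_fixed(running, imb_numa_nr));
	return 0;
}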
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-20 11:12 ` Mel Gorman 2021-12-21 15:03 ` Gautham R. Shenoy @ 2021-12-21 17:13 ` Vincent Guittot 2021-12-22 8:52 ` Jirka Hladky 2022-01-05 10:42 ` Mel Gorman 1 sibling, 2 replies; 48+ messages in thread From: Vincent Guittot @ 2021-12-21 17:13 UTC (permalink / raw) To: Mel Gorman Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > (sorry for the delay, was offline for a few days) > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > Hello Mel, > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > [..SNIP..] > > [snip] > > To avoid the corner case, we'd need to explicitly favour spreading early > and assume wakeup will pull communicating tasks together and NUMA > balancing migrate the data after some time which looks like > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index c07bfa2d80f2..54f5207154d3 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ So now you compute an allowed imbalance level instead of using 25% of sd->span_weight or 25% of busiest->group_weight And you adjust this new imb_numa_nr according to the topology. That makes sense. > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0a969affca76..df0e84462e62 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > static unsigned long cpu_runnable(struct rq *rq); > static unsigned long cpu_util(int cpu); > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight); > + int dst_running, > + int imb_numa_nr); > > static inline enum > numa_type numa_classify(unsigned int imbalance_pct, > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > * This is an approximation as the number of running tasks may not be > * related to the number of busy CPUs due to sched_setaffinity. 
> */ > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > { > - return (dst_running < (dst_weight >> 2)); > + return dst_running < imb_numa_nr; > } > > /* > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > return idlest; > #endif > /* > - * Otherwise, keep the task on this node to stay close > - * its wakeup source and improve locality. If there is > - * a real need of migration, periodic load balance will > - * take care of it. > + * Otherwise, keep the task on this node to stay local > + * to its wakeup source if the number of running tasks > + * are below the allowed imbalance. If there is a real > + * need of migration, periodic load balance will take > + * care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > return NULL; > } > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > } > } > > -#define NUMA_IMBALANCE_MIN 2 > - > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int imb_numa_nr) > { > - if (!allow_numa_imbalance(dst_running, dst_weight)) > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > return imbalance; > > - /* > - * Allow a small imbalance based on a simple pair of communicating > - * tasks that remain local when the destination is lightly loaded. > - */ > - if (imbalance <= NUMA_IMBALANCE_MIN) > + if (imbalance <= imb_numa_nr) Isn't this always true ? imbalance is "always" < dst_running as imbalance is usually the number of these tasks that we would like to migrate > return 0; > > return imbalance; > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - busiest->sum_nr_running, env->sd->span_weight); > + busiest->sum_nr_running, env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..1fa3e977521d 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { sched_domains have not been degenerated yet so you found here the DIE domain > + struct sched_domain *top, *top_p; > + unsigned int llc_sq; > + > + /* > + * nr_llcs = (sd->span_weight / llc_weight); > + * imb = (llc_weight / nr_llcs) >> 2 it would be good to add a comment to explain why 25% of LLC weight / number of LLC in a node is the right value. For example, why is it better than just 25% of the LLC weight ? Do you want to allow the same imbalance at node level whatever the number of LLC in the node ? 
> + * > + * is equivalent to > + * > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > + * > + */ > + llc_sq = child->span_weight * child->span_weight; > + > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > + sd->imb_numa_nr = imb; > + > + /* > + * Set span based on top domain that places > + * tasks in sibling domains. > + */ > + top = sd; > + top_p = top->parent; > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure that you want to take this case into account, only numa node don't have SD_PREFER_SIBLING > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > + } > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) ^ permalink raw reply [flat|nested] 48+ messages in thread
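As a quick sanity check of the equivalence stated in the comment above, the following standalone userspace program (example topology sizes only, not taken from any particular machine, and only mirroring the arithmetic rather than the sched_domain walk) evaluates both forms of the calculation together with the max(2U, ...) floor:

#include <stdio.h>

int main(void)
{
        unsigned int span_weight = 128;                 /* CPUs in the node */
        unsigned int llc_weights[] = { 128, 64, 32, 16 };
        unsigned int i;

        for (i = 0; i < sizeof(llc_weights) / sizeof(llc_weights[0]); i++) {
                unsigned int llc_weight = llc_weights[i];
                unsigned int nr_llcs = span_weight / llc_weight;
                unsigned int imb_a = (llc_weight / nr_llcs) >> 2;
                unsigned int imb_b = (llc_weight * llc_weight / span_weight) >> 2;
                unsigned int imb = imb_b < 2 ? 2 : imb_b;       /* max(2U, ...) */

                printf("%u LLCs: (llc/nr_llcs)>>2=%2u (llc^2/span)>>2=%2u imb_numa_nr=%2u\n",
                       nr_llcs, imb_a, imb_b, imb);
        }
        return 0;
}

For a 128-CPU node this prints 32, 8, 2 and then the floor of 2 as the number of LLCs grows from 1 to 8, which is the scaling being questioned in the review comments above.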
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-21 17:13 ` Vincent Guittot @ 2021-12-22 8:52 ` Jirka Hladky 2022-01-04 19:52 ` Jirka Hladky 2022-01-05 10:42 ` Mel Gorman 1 sibling, 1 reply; 48+ messages in thread From: Jirka Hladky @ 2021-12-22 8:52 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Philip Auld Hi Mel, we have tested the performance impact of this patch and we see performance drops up to 60% with the OpenMP implementation of the NAS parallel benchmark. Example of results: 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs) 5.16.0-rc5 5.16.0-rc5 Difference (vanilla) + v4 of the patch in % Mop/s total Mop/s total sp_C_x, 8 threads 36316.1 12939.6 -64% sp_C_x, 16 threads 64790.2 23968.0 -63% sp_C_x, 32 threads 67205.5 48891.4 -27% Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. It seems like the allowed imbalance is too large and the negative impact on workloads that prefer to span wide across all NUMA nodes is too large. Thanks Jirka On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot <vincent.guittot@linaro.org> wrote: > > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > (sorry for the delay, was offline for a few days) > > > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > > Hello Mel, > > > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > > > [..SNIP..] > > > > > [snip] > > > > > To avoid the corner case, we'd need to explicitly favour spreading early > > and assume wakeup will pull communicating tasks together and NUMA > > balancing migrate the data after some time which looks like > > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > > index c07bfa2d80f2..54f5207154d3 100644 > > --- a/include/linux/sched/topology.h > > +++ b/include/linux/sched/topology.h > > @@ -93,6 +93,7 @@ struct sched_domain { > > unsigned int busy_factor; /* less balancing by factor if busy */ > > unsigned int imbalance_pct; /* No balance until over watermark */ > > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > So now you compute an allowed imbalance level instead of using > 25% of sd->span_weight > or > 25% of busiest->group_weight > > And you adjust this new imb_numa_nr according to the topology. > > That makes sense. 
> > > > > int nohz_idle; /* NOHZ IDLE status */ > > int flags; /* See SD_* */ > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 0a969affca76..df0e84462e62 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > > > int src_cpu, src_nid; > > int dst_cpu, dst_nid; > > + int imb_numa_nr; > > > > struct numa_stats src_stats, dst_stats; > > > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > > static unsigned long cpu_runnable(struct rq *rq); > > static unsigned long cpu_util(int cpu); > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight); > > + int dst_running, > > + int imb_numa_nr); > > > > static inline enum > > numa_type numa_classify(unsigned int imbalance_pct, > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > > dst_running = env->dst_stats.nr_running + 1; > > imbalance = max(0, dst_running - src_running); > > imbalance = adjust_numa_imbalance(imbalance, dst_running, > > - env->dst_stats.weight); > > + env->imb_numa_nr); > > > > /* Use idle CPU if there is no imbalance */ > > if (!imbalance) { > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > > */ > > rcu_read_lock(); > > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > > - if (sd) > > + if (sd) { > > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > > + env.imb_numa_nr = sd->imb_numa_nr; > > + } > > rcu_read_unlock(); > > > > /* > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > * This is an approximation as the number of running tasks may not be > > * related to the number of busy CPUs due to sched_setaffinity. > > */ > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > { > > - return (dst_running < (dst_weight >> 2)); > > + return dst_running < imb_numa_nr; > > } > > > > /* > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > > return idlest; > > #endif > > /* > > - * Otherwise, keep the task on this node to stay close > > - * its wakeup source and improve locality. If there is > > - * a real need of migration, periodic load balance will > > - * take care of it. > > + * Otherwise, keep the task on this node to stay local > > + * to its wakeup source if the number of running tasks > > + * are below the allowed imbalance. If there is a real > > + * need of migration, periodic load balance will take > > + * care of it. > > */ > > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > > return NULL; > > } > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > } > > } > > > > -#define NUMA_IMBALANCE_MIN 2 > > - > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight) > > + int dst_running, int imb_numa_nr) > > { > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > return imbalance; > > > > - /* > > - * Allow a small imbalance based on a simple pair of communicating > > - * tasks that remain local when the destination is lightly loaded. > > - */ > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > + if (imbalance <= imb_numa_nr) > > Isn't this always true ? 
> > imbalance is "always" < dst_running as imbalance is usually the number > of these tasks that we would like to migrate > > > > return 0; > > > > return imbalance; > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* Consider allowing a small imbalance between NUMA groups */ > > if (env->sd->flags & SD_NUMA) { > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > - busiest->sum_nr_running, env->sd->span_weight); > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > } > > > > return; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. > > + */ > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > sched_domains have not been degenerated yet so you found here the DIE domain > > > + struct sched_domain *top, *top_p; > > + unsigned int llc_sq; > > + > > + /* > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > it would be good to add a comment to explain why 25% of LLC weight / > number of LLC in a node is the right value. > For example, why is it better than just 25% of the LLC weight ? > Do you want to allow the same imbalance at node level whatever the > number of LLC in the node ? > > > + * > > + * is equivalent to > > + * > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > + * > > + */ > > + llc_sq = child->span_weight * child->span_weight; > > + > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > + sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? > Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure > that you want to take this case into account, only numa node don't > have SD_PREFER_SIBLING > > > + top = top->parent; > > + top_p = top->parent; > > + } > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > + } else { > > + int factor = max(1U, (sd->span_weight / imb_span)); > > + > > + sd->imb_numa_nr = imb * factor; > > + } > > + } > > + } > > + > > /* Calculate CPU capacity for physical packages and nodes */ > > for (i = nr_cpumask_bits-1; i >= 0; i--) { > > if (!cpumask_test_cpu(i, cpu_map)) > -- -Jirka ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-22 8:52 ` Jirka Hladky @ 2022-01-04 19:52 ` Jirka Hladky 0 siblings, 0 replies; 48+ messages in thread From: Jirka Hladky @ 2022-01-04 19:52 UTC (permalink / raw) To: Vincent Guittot Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Philip Auld Hi Mel, the table with results is badly formatted. Let me send it again: We have tested the performance impact of this patch and we see performance drops up to 60% with the OpenMP implementation of the NAS parallel benchmark. Example of results: 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs) Kernel 5.16.0-rc5 5.16.0-rc5+patch_v4 Mop/s tot Mop/s tot Diff sp_C_x, 8 threads 36316.1 12939.6 -64% sp_C_x, 16 threads 64790.2 23968.0 -63% sp_C_x, 32 threads 67205.5 48891.4 -27% Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. It seems like the allowed imbalance is too large and the negative impact on workloads that prefer to span wide across all NUMA nodes is too large. Thanks Jirka On Wed, Dec 22, 2021 at 9:52 AM Jirka Hladky <jhladky@redhat.com> wrote: > > Hi Mel, > > we have tested the performance impact of this patch and we see > performance drops up to 60% with the OpenMP implementation of the NAS > parallel benchmark. > > Example of results: > 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs) > > 5.16.0-rc5 > 5.16.0-rc5 Difference > (vanilla) + v4 of > the patch in % > Mop/s total Mop/s total > sp_C_x, 8 threads 36316.1 12939.6 > -64% > sp_C_x, 16 threads 64790.2 23968.0 > -63% > sp_C_x, 32 threads 67205.5 48891.4 > -27% > > Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results. > > It seems like the allowed imbalance is too large and the negative > impact on workloads that prefer to span wide across all NUMA nodes is > too large. > > Thanks > Jirka > > > On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot > <vincent.guittot@linaro.org> wrote: > > > > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > (sorry for the delay, was offline for a few days) > > > > > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote: > > > > Hello Mel, > > > > > > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote: > > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote: > > > > > > > > [..SNIP..] > > > > > > > > [snip] > > > > > > > > To avoid the corner case, we'd need to explicitly favour spreading early > > > and assume wakeup will pull communicating tasks together and NUMA > > > balancing migrate the data after some time which looks like > > > > > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > > > index c07bfa2d80f2..54f5207154d3 100644 > > > --- a/include/linux/sched/topology.h > > > +++ b/include/linux/sched/topology.h > > > @@ -93,6 +93,7 @@ struct sched_domain { > > > unsigned int busy_factor; /* less balancing by factor if busy */ > > > unsigned int imbalance_pct; /* No balance until over watermark */ > > > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > > > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > > > So now you compute an allowed imbalance level instead of using > > 25% of sd->span_weight > > or > > 25% of busiest->group_weight > > > > And you adjust this new imb_numa_nr according to the topology. > > > > That makes sense. 
> > > > > > > > int nohz_idle; /* NOHZ IDLE status */ > > > int flags; /* See SD_* */ > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 0a969affca76..df0e84462e62 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > > > > > int src_cpu, src_nid; > > > int dst_cpu, dst_nid; > > > + int imb_numa_nr; > > > > > > struct numa_stats src_stats, dst_stats; > > > > > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq); > > > static unsigned long cpu_runnable(struct rq *rq); > > > static unsigned long cpu_util(int cpu); > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight); > > > + int dst_running, > > > + int imb_numa_nr); > > > > > > static inline enum > > > numa_type numa_classify(unsigned int imbalance_pct, > > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > > > dst_running = env->dst_stats.nr_running + 1; > > > imbalance = max(0, dst_running - src_running); > > > imbalance = adjust_numa_imbalance(imbalance, dst_running, > > > - env->dst_stats.weight); > > > + env->imb_numa_nr); > > > > > > /* Use idle CPU if there is no imbalance */ > > > if (!imbalance) { > > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p) > > > */ > > > rcu_read_lock(); > > > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > > > - if (sd) > > > + if (sd) { > > > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > > > + env.imb_numa_nr = sd->imb_numa_nr; > > > + } > > > rcu_read_unlock(); > > > > > > /* > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > > > return idlest; > > > #endif > > > /* > > > - * Otherwise, keep the task on this node to stay close > > > - * its wakeup source and improve locality. If there is > > > - * a real need of migration, periodic load balance will > > > - * take care of it. > > > + * Otherwise, keep the task on this node to stay local > > > + * to its wakeup source if the number of running tasks > > > + * are below the allowed imbalance. If there is a real > > > + * need of migration, periodic load balance will take > > > + * care of it. 
> > > */ > > > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > > > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > > > return NULL; > > > } > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > > > > return 0; > > > > > > return imbalance; > > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..1fa3e977521d 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > > sched_domains have not been degenerated yet so you found here the DIE domain > > > > > + struct sched_domain *top, *top_p; > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (sd->span_weight / llc_weight); > > > + * imb = (llc_weight / nr_llcs) >> 2 > > > > it would be good to add a comment to explain why 25% of LLC weight / > > number of LLC in a node is the right value. > > For example, why is it better than just 25% of the LLC weight ? > > Do you want to allow the same imbalance at node level whatever the > > number of LLC in the node ? > > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > > + sd->imb_numa_nr = imb; > > > + > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. > > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? 
> > Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure > > that you want to take this case into account, only numa node don't > > have SD_PREFER_SIBLING > > > > > + top = top->parent; > > > + top_p = top->parent; > > > + } > > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > > + } else { > > > + int factor = max(1U, (sd->span_weight / imb_span)); > > > + > > > + sd->imb_numa_nr = imb * factor; > > > + } > > > + } > > > + } > > > + > > > /* Calculate CPU capacity for physical packages and nodes */ > > > for (i = nr_cpumask_bits-1; i >= 0; i--) { > > > if (!cpumask_test_cpu(i, cpu_map)) > > > > > -- > -Jirka -- -Jirka ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-21 17:13 ` Vincent Guittot 2021-12-22 8:52 ` Jirka Hladky @ 2022-01-05 10:42 ` Mel Gorman 2022-01-05 10:49 ` Mel Gorman 2022-01-10 15:53 ` Vincent Guittot 1 sibling, 2 replies; 48+ messages in thread From: Mel Gorman @ 2022-01-05 10:42 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > <SNIP> > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > * This is an approximation as the number of running tasks may not be > > * related to the number of busy CPUs due to sched_setaffinity. > > */ > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > { > > - return (dst_running < (dst_weight >> 2)); > > + return dst_running < imb_numa_nr; > > } > > > > /* > > > > <SNIP> > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > } > > } > > > > -#define NUMA_IMBALANCE_MIN 2 > > - > > static inline long adjust_numa_imbalance(int imbalance, > > - int dst_running, int dst_weight) > > + int dst_running, int imb_numa_nr) > > { > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > return imbalance; > > > > - /* > > - * Allow a small imbalance based on a simple pair of communicating > > - * tasks that remain local when the destination is lightly loaded. > > - */ > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > + if (imbalance <= imb_numa_nr) > > Isn't this always true ? > > imbalance is "always" < dst_running as imbalance is usually the number > of these tasks that we would like to migrate > It's not necessarily true. allow_numa_imbalanced is checking if dst_running < imb_numa_nr and adjust_numa_imbalance is checking the imbalance. imb_numa_nr = 4 dst_running = 2 imbalance = 1 In that case, imbalance of 1 is ok, but 2 is not. > > > return 0; > > > > return imbalance; > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* Consider allowing a small imbalance between NUMA groups */ > > if (env->sd->flags & SD_NUMA) { > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > - busiest->sum_nr_running, env->sd->span_weight); > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > } > > > > return; > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..1fa3e977521d 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. 
> > + */ > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > sched_domains have not been degenerated yet so you found here the DIE domain > Yes > > + struct sched_domain *top, *top_p; > > + unsigned int llc_sq; > > + > > + /* > > + * nr_llcs = (sd->span_weight / llc_weight); > > + * imb = (llc_weight / nr_llcs) >> 2 > > it would be good to add a comment to explain why 25% of LLC weight / > number of LLC in a node is the right value. This? * The 25% imbalance is an arbitrary cutoff * based on SMT-2 to balance between memory * bandwidth and avoiding premature sharing * of HT resources and SMT-4 or SMT-8 *may* * benefit from a different cutoff. nr_llcs * are accounted for to mitigate premature * cache eviction due to multiple tasks * using one cache while a sibling cache * remains relatively idle. > For example, why is it better than just 25% of the LLC weight ? Because lets say there are 2 LLCs then an imbalance based on just the LLC weight might allow 2 tasks to share one cache while another is idle. This is the original problem whereby the vanilla imbalance allowed multiple LLCs on the same node to be overloaded which hurt workloads that prefer to spread wide. > Do you want to allow the same imbalance at node level whatever the > number of LLC in the node ? > At this point, it's less clear how the larger domains should be balanced and the initial scaling is as good an option as any. > > + * > > + * is equivalent to > > + * > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > + * > > + */ > > + llc_sq = child->span_weight * child->span_weight; > > + > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > + sd->imb_numa_nr = imb; > > + > > + /* > > + * Set span based on top domain that places > > + * tasks in sibling domains. > > + */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only SD_PREFER_SIBLING gave the current results. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-05 10:42 ` Mel Gorman @ 2022-01-05 10:49 ` Mel Gorman 2022-01-10 15:53 ` Vincent Guittot 1 sibling, 0 replies; 48+ messages in thread From: Mel Gorman @ 2022-01-05 10:49 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, Jan 05, 2022 at 10:42:07AM +0000, Mel Gorman wrote: > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > <SNIP> > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > > > > <SNIP> > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > It's not necessarily true. allow_numa_imbalanced is checking if > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > imbalance. > > imb_numa_nr = 4 > dst_running = 2 > imbalance = 1 > > In that case, imbalance of 1 is ok, but 2 is not. > My bad, this is based on v5 which I just queued for testing. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
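To make the numbers in this exchange concrete, here is a minimal userspace sketch (not the kernel code) of allow_numa_imbalance() and adjust_numa_imbalance() as they appear in the diff discussed above, fed with imb_numa_nr = 4 and dst_running = 2 from the mails:

#include <stdio.h>

static int allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
        return dst_running < imb_numa_nr;
}

static long adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
{
        if (!allow_numa_imbalance(dst_running, imb_numa_nr))
                return imbalance;

        if (imbalance <= imb_numa_nr)
                return 0;

        return imbalance;
}

int main(void)
{
        int imb_numa_nr = 4, dst_running = 2, imbalance;

        for (imbalance = 1; imbalance <= 2; imbalance++)
                printf("imbalance=%d -> adjusted=%ld\n", imbalance,
                       adjust_numa_imbalance(imbalance, dst_running, imb_numa_nr));
        return 0;
}

Both imbalance = 1 and imbalance = 2 are reduced to 0 here, which is the behaviour Vincent is pointing at; the stricter outcome Mel describes corresponds to the reworked v5 check he mentions above.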
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-05 10:42 ` Mel Gorman 2022-01-05 10:49 ` Mel Gorman @ 2022-01-10 15:53 ` Vincent Guittot 2022-01-12 10:24 ` Mel Gorman 1 sibling, 1 reply; 48+ messages in thread From: Vincent Guittot @ 2022-01-10 15:53 UTC (permalink / raw) To: Mel Gorman Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > <SNIP> > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > * This is an approximation as the number of running tasks may not be > > > * related to the number of busy CPUs due to sched_setaffinity. > > > */ > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > { > > > - return (dst_running < (dst_weight >> 2)); > > > + return dst_running < imb_numa_nr; > > > } > > > > > > /* > > > > > > <SNIP> > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > } > > > } > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > - > > > static inline long adjust_numa_imbalance(int imbalance, > > > - int dst_running, int dst_weight) > > > + int dst_running, int imb_numa_nr) > > > { > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > return imbalance; > > > > > > - /* > > > - * Allow a small imbalance based on a simple pair of communicating > > > - * tasks that remain local when the destination is lightly loaded. > > > - */ > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > + if (imbalance <= imb_numa_nr) > > > > Isn't this always true ? > > > > imbalance is "always" < dst_running as imbalance is usually the number > > of these tasks that we would like to migrate > > > > It's not necessarily true. allow_numa_imbalanced is checking if > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > imbalance. > > imb_numa_nr = 4 > dst_running = 2 > imbalance = 1 > > In that case, imbalance of 1 is ok, but 2 is not. I don't catch your example. Why is imbalance = 2 not ok in your example above ? allow_numa_imbalance still returns true (dst-running < imb_numa_nr) and we still have imbalance <= imb_numa_nr Also the name dst_running is quite confusing; In the case of calculate_imbalance, busiest->nr_running is passed as dst_running argument. But the busiest group is the src not the dst of the balance Then, imbalance < busiest->nr_running in load_balance because we try to even the number of task running in each groups without emptying it and allow_numa_imbalance checks that dst_running < imb_numa_nr. 
So we have imbalance < dst_running < imb_numa_nr > > > > > > return 0; > > > > > > return imbalance; > > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > /* Consider allowing a small imbalance between NUMA groups */ > > > if (env->sd->flags & SD_NUMA) { > > > env->imbalance = adjust_numa_imbalance(env->imbalance, > > > - busiest->sum_nr_running, env->sd->span_weight); > > > + busiest->sum_nr_running, env->sd->imb_numa_nr); > > > } > > > > > > return; > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..1fa3e977521d 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* > > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > > + * imbalanced. > > > + */ > > > + for_each_cpu(i, cpu_map) { > > > + unsigned int imb = 0; > > > + unsigned int imb_span = 1; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > > sched_domains have not been degenerated yet so you found here the DIE domain > > > > Yes > > > > + struct sched_domain *top, *top_p; > > > + unsigned int llc_sq; > > > + > > > + /* > > > + * nr_llcs = (sd->span_weight / llc_weight); > > > + * imb = (llc_weight / nr_llcs) >> 2 > > > > it would be good to add a comment to explain why 25% of LLC weight / > > number of LLC in a node is the right value. > > This? > > * The 25% imbalance is an arbitrary cutoff > * based on SMT-2 to balance between memory > * bandwidth and avoiding premature sharing > * of HT resources and SMT-4 or SMT-8 *may* > * benefit from a different cutoff. nr_llcs > * are accounted for to mitigate premature > * cache eviction due to multiple tasks > * using one cache while a sibling cache > * remains relatively idle. > > > For example, why is it better than just 25% of the LLC weight ? > > Because lets say there are 2 LLCs then an imbalance based on just the LLC > weight might allow 2 tasks to share one cache while another is idle. This > is the original problem whereby the vanilla imbalance allowed multiple > LLCs on the same node to be overloaded which hurt workloads that prefer > to spread wide. In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each llc up to 25% ? instead of dividing by nr_llcs As an example, you have 1 node with 1 LLC with 128 CPUs will get an imb_numa_nr = 32 1 node with 2 LLC with 64 CPUs each will get an imb_numa_nr = 8 1 node with 4 LLC with 32 CPUs each will get an imb_numa_nr = 2 sd->imb_numa_nr is used at NUMA level so the more LLC you have the lower imbalance is allowed > > > Do you want to allow the same imbalance at node level whatever the > > number of LLC in the node ? > > > > At this point, it's less clear how the larger domains should be > balanced and the initial scaling is as good an option as any. > > > > + * > > > + * is equivalent to > > > + * > > > + * imb = (llc_weight^2 / sd->span_weight) >> 2 > > > + * > > > + */ > > > + llc_sq = child->span_weight * child->span_weight; > > > + > > > + imb = max(2U, ((llc_sq / sd->span_weight) >> 2)); > > > + sd->imb_numa_nr = imb; > > > + > > > + /* > > > + * Set span based on top domain that places > > > + * tasks in sibling domains. 
> > > + */ > > > + top = sd; > > > + top_p = top->parent; > > > + while (top_p && (top_p->flags & SD_PREFER_SIBLING)) { > > > > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA ? > > Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to > degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only > SD_PREFER_SIBLING gave the current results. SD_PREFER_SIBLING is not mandatory in childs of NUMA node (look at heterogenous system as an example) so relying of this flag seems quite fragile The fact that you see inconsistency with SD_NUMA depending of NPS-1, NPS-2 or NPS-4 topology probably means that you don't looks for the right domain level or you try to compensate side effect of your formula above > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
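The difference between the posted calculation and the alternative raised above, (llc_weight >> 2) * nr_llcs, is easiest to see with the same three example topologies from this mail. A small userspace sketch, assuming a 128-CPU node:

#include <stdio.h>

int main(void)
{
        unsigned int node_weight = 128;
        unsigned int llc_weights[] = { 128, 64, 32 };
        unsigned int i;

        for (i = 0; i < sizeof(llc_weights) / sizeof(llc_weights[0]); i++) {
                unsigned int llc_weight = llc_weights[i];
                unsigned int nr_llcs = node_weight / llc_weight;
                unsigned int posted, alt;

                /* posted formula: 25% of (llc_weight / nr_llcs) */
                posted = (llc_weight * llc_weight / node_weight) >> 2;
                if (posted < 2)
                        posted = 2;
                /* suggested alternative: fill each LLC up to 25% */
                alt = (llc_weight >> 2) * nr_llcs;

                printf("%u LLC(s) of %3u CPUs: posted imb=%2u, (llc>>2)*nr_llcs=%2u\n",
                       nr_llcs, llc_weight, posted, alt);
        }
        return 0;
}

The posted formula shrinks the allowed imbalance as the LLC count grows (32, 8, 2 for these topologies), while the suggested alternative would stay at 32 in all three cases; that difference is the design question being debated here.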
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-01-10 15:53 ` Vincent Guittot @ 2022-01-12 10:24 ` Mel Gorman 0 siblings, 0 replies; 48+ messages in thread From: Mel Gorman @ 2022-01-12 10:24 UTC (permalink / raw) To: Vincent Guittot Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Mon, Jan 10, 2022 at 04:53:26PM +0100, Vincent Guittot wrote: > On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote: > > > > <SNIP> > > > > > > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > > > * This is an approximation as the number of running tasks may not be > > > > * related to the number of busy CPUs due to sched_setaffinity. > > > > */ > > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > > > > { > > > > - return (dst_running < (dst_weight >> 2)); > > > > + return dst_running < imb_numa_nr; > > > > } > > > > > > > > /* > > > > > > > > <SNIP> > > > > > > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > > } > > > > } > > > > > > > > -#define NUMA_IMBALANCE_MIN 2 > > > > - > > > > static inline long adjust_numa_imbalance(int imbalance, > > > > - int dst_running, int dst_weight) > > > > + int dst_running, int imb_numa_nr) > > > > { > > > > - if (!allow_numa_imbalance(dst_running, dst_weight)) > > > > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > > > > return imbalance; > > > > > > > > - /* > > > > - * Allow a small imbalance based on a simple pair of communicating > > > > - * tasks that remain local when the destination is lightly loaded. > > > > - */ > > > > - if (imbalance <= NUMA_IMBALANCE_MIN) > > > > + if (imbalance <= imb_numa_nr) > > > > > > Isn't this always true ? > > > > > > imbalance is "always" < dst_running as imbalance is usually the number > > > of these tasks that we would like to migrate > > > > > > > It's not necessarily true. allow_numa_imbalanced is checking if > > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the > > imbalance. > > > > imb_numa_nr = 4 > > dst_running = 2 > > imbalance = 1 > > > > In that case, imbalance of 1 is ok, but 2 is not. > > I don't catch your example. Why is imbalance = 2 not ok in your > example above ? allow_numa_imbalance still returns true (dst-running < > imb_numa_nr) and we still have imbalance <= imb_numa_nr > At the time I wrote it, the comparison looked like < instead of <=. > Also the name dst_running is quite confusing; In the case of > calculate_imbalance, busiest->nr_running is passed as dst_running > argument. But the busiest group is the src not the dst of the balance > > Then, imbalance < busiest->nr_running in load_balance because we try > to even the number of task running in each groups without emptying it > and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we > have imbalance < dst_running < imb_numa_nr > But either way, you have a valid point. The patch as-is is too complex and doing too much and is failing to make progress as a result. I'm going to go back to the drawing board and come up with a simpler version that adjusts the cut-off depending on topology but only allows an imbalance of NUMA_IMBALANCE_MIN and tidy up the inconsistencies. 
> > This? > > > > * The 25% imbalance is an arbitrary cutoff > > * based on SMT-2 to balance between memory > > * bandwidth and avoiding premature sharing > > * of HT resources and SMT-4 or SMT-8 *may* > > * benefit from a different cutoff. nr_llcs > > * are accounted for to mitigate premature > > * cache eviction due to multiple tasks > > * using one cache while a sibling cache > > * remains relatively idle. > > > > > For example, why is it better than just 25% of the LLC weight ? > > > > Because lets say there are 2 LLCs then an imbalance based on just the LLC > > weight might allow 2 tasks to share one cache while another is idle. This > > is the original problem whereby the vanilla imbalance allowed multiple > > LLCs on the same node to be overloaded which hurt workloads that prefer > > to spread wide. > > In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each > llc up to 25% ? instead of dividing by nr_llcs > > As an example, you have > 1 node with 1 LLC with 128 CPUs will get an imb_numa_nr = 32 > 1 node with 2 LLC with 64 CPUs each will get an imb_numa_nr = 8 > 1 node with 4 LLC with 32 CPUs each will get an imb_numa_nr = 2 > > sd->imb_numa_nr is used at NUMA level so the more LLC you have the > lower imbalance is allowed > Lowering the threshold at which an imbalance is allowed as the number of LLCs grows is deliberate, given that the motivating problem was that embarrassingly parallel problems on AMD suffer due to overloading some LLCs while others remain idle. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman 2021-12-13 8:28 ` Gautham R. Shenoy @ 2021-12-17 19:54 ` Gautham R. Shenoy 1 sibling, 0 replies; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-17 19:54 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote: [..snip..] > @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > return idlest; > #endif > /* > - * Otherwise, keep the task on this node to stay close > - * its wakeup source and improve locality. If there is > - * a real need of migration, periodic load balance will > - * take care of it. > + * Otherwise, keep the task on this node to stay local > + * to its wakeup source if the number of running tasks > + * are below the allowed imbalance. If there is a real > + * need of migration, periodic load balance will take > + * care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > + if (local_sgs.sum_nr_running <= sd->imb_numa_nr) > return NULL; Considering the fact that we want to check whether or not the imb_numa_nr threshold is going to be crossed if we let the new task stay local, this should be if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr) return NULL; Without this change, on a Zen3 configured with Nodes Per Socket (NPS)=4, the lower NUMA domain with sd->imb_numa_nr = 2, has 4 groups (each group corresponds to a NODE sched-domain), when we run stream with 8 threads, we see 3 of them being initially placed in the local group and the remaining 5 distributed across the other 4 groups. None of these 3 tasks will never get migrated to any of the other 3 groups, because those others have at least one task. Eg: PID 157811 : timestamp 108921.267293 : first placed in NODE 1 PID 157812 : timestamp 108921.269877 : first placed in NODE 1 PID 157813 : timestamp 108921.269921 : first placed in NODE 1 PID 157814 : timestamp 108921.270007 : first placed in NODE 2 PID 157815 : timestamp 108921.270065 : first placed in NODE 3 PID 157816 : timestamp 108921.270118 : first placed in NODE 0 PID 157817 : timestamp 108921.270168 : first placed in NODE 2 PID 157818 : timestamp 108921.270216 : first placed in NODE 3 With the fix mentioned above, we see the 8 threads uniformly distributed across the 4 groups. PID 7500 : timestamp 436.156429 : first placed in NODE 1 PID 7501 : timestamp 436.159058 : first placed in NODE 1 PID 7502 : timestamp 436.159106 : first placed in NODE 2 PID 7503 : timestamp 436.159173 : first placed in NODE 3 PID 7504 : timestamp 436.159219 : first placed in NODE 0 PID 7505 : timestamp 436.159263 : first placed in NODE 2 PID 7506 : timestamp 436.159305 : first placed in NODE 3 PID 7507 : timestamp 436.159348 : first placed in NODE 0 -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
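The off-by-one effect described above can be reproduced with a toy placement model. The sketch below is not the scheduler; it only mimics the decision "stay local unless the threshold is crossed, otherwise pick the group with the fewest running tasks" for 8 threads, 4 NODE groups and imb_numa_nr = 2, matching the NPS=4 example:

#include <stdio.h>

#define NR_GROUPS       4
#define NR_TASKS        8
#define IMB_NUMA_NR     2

static void place(int fixed)
{
        int nr_running[NR_GROUPS] = { 0 };
        int t, g, idlest;

        for (t = 0; t < NR_TASKS; t++) {
                int local = nr_running[0] + (fixed ? 1 : 0);

                if (local <= IMB_NUMA_NR) {
                        nr_running[0]++;        /* stay close to the wakeup source */
                        continue;
                }

                /* otherwise place the task in the group with the fewest tasks */
                idlest = 0;
                for (g = 1; g < NR_GROUPS; g++)
                        if (nr_running[g] < nr_running[idlest])
                                idlest = g;
                nr_running[idlest]++;
        }

        printf("%-34s:", fixed ? "sum_nr_running + 1 <= imb_numa_nr" :
                                 "sum_nr_running <= imb_numa_nr");
        for (g = 0; g < NR_GROUPS; g++)
                printf(" node%d=%d", g, nr_running[g]);
        printf("\n");
}

int main(void)
{
        place(0);       /* local group keeps 3 tasks, a 3/2/2/1 style split */
        place(1);       /* uniform 2/2/2/2 spread */
        return 0;
}

With the unfixed check the local group ends up with three tasks, matching the uneven placement reported above, while the corrected check spreads the eight threads uniformly across the four groups.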
* [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs @ 2022-02-08 9:43 Mel Gorman 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2022-02-08 9:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, K Prateek Nayak, LKML, Mel Gorman Changelog since v5 o Fix off-by-one error Changelog since V4 o Scale imbalance based on the top domain that prefers siblings o Keep allowed imbalance as 2 up to the point where LLCs can be overloaded Changelog since V3 o Calculate imb_numa_nr for multiple SD_NUMA domains o Restore behaviour where communicating pairs remain on the same node Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. The series addresses two problems -- inconsistent logic when allowing a NUMA imbalance and sub-optimal performance when there are many LLCs per NUMA node. include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 30 ++++++++++--------- kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ 3 files changed, 71 insertions(+), 13 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-08 9:43 [PATCH v6 0/2] Adjust NUMA imbalance for " Mel Gorman @ 2022-02-08 9:43 ` Mel Gorman 2022-02-08 16:19 ` Gautham R. Shenoy ` (3 more replies) 0 siblings, 4 replies; 48+ messages in thread From: Mel Gorman @ 2022-02-08 9:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, K Prateek Nayak, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch allows an imbalance to exist up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have on instance per LLC the results and without binding, the results are 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%) MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%) MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%) MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per core, the vanilla kernel allows threads to contend on cache. With the patch; 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%) Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%* Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%) Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%) CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed. Even in cases where the average performance is neutral, the results are more stable. 
5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%) Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%) Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%) Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%) Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%) Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%* Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%) Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%) Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%) Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%) Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%) Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%) Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%) Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%) Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%) Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%) Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%) Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%) Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%) Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%) Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%) Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%) Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%) Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%) Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%) Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%) Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%) Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%) Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%) Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%) Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%) Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%) Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%) Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%) Similarly, for embarassingly parallel problems like NPB-ep, there are improvements due to better spreading across LLC when the machine is not fully utilised. 
vanilla sched-numaimb-v6 Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%) Amean ep.D 31.86 ( 0.00%) 26.17 * 17.86%* Stddev ep.D 0.07 ( 0.00%) 0.05 ( 24.41%) CoeffVar ep.D 0.22 ( 0.00%) 0.20 ( 7.97%) Max ep.D 31.93 ( 0.00%) 26.21 ( 17.91%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 22 +++++++------- kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+), 10 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 8054641c0a7b..56cffe42abbc 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4592ccf82c34..538756bd8e7f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1503,7 +1504,7 @@ struct task_numa_env { static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_runnable(struct rq *rq); static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight); + int dst_running, int imb_numa_nr); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest, * This is an approximation as the number of running tasks may not be * related to the number of busy CPUs due to sched_setaffinity. */ -static inline bool -allow_numa_imbalance(unsigned int running, unsigned int weight) +static inline bool allow_numa_imbalance(int running, int imb_numa_nr) { - return (running < (weight >> 2)); + return running <= imb_numa_nr; } /* @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) * allowed. If there is a real need of migration, * periodic load balance will take care of it. 
*/ - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) return NULL; } @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd #define NUMA_IMBALANCE_MIN 2 static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; /* @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - local->sum_nr_running + 1, local->group_weight); + local->sum_nr_running + 1, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..e6cd55951304 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* + * Calculate an allowed NUMA imbalance such that LLCs do not get + * imbalanced. + */ + for_each_cpu(i, cpu_map) { + unsigned int imb = 0; + unsigned int imb_span = 1; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *top, *top_p; + unsigned int nr_llcs; + + /* + * For a single LLC per node, allow an + * imbalance up to 25% of the node. This is an + * arbitrary cutoff based on SMT-2 to balance + * between memory bandwidth and avoiding + * premature sharing of HT resources and SMT-4 + * or SMT-8 *may* benefit from a different + * cutoff. + * + * For multiple LLCs, allow an imbalance + * until multiple tasks would share an LLC + * on one node while LLCs on another node + * remain idle. + */ + nr_llcs = sd->span_weight / child->span_weight; + if (nr_llcs == 1) + imb = sd->span_weight >> 2; + else + imb = nr_llcs; + sd->imb_numa_nr = imb; + + /* Set span based on the first NUMA domain. */ + top = sd; + top_p = top->parent; + while (top_p && !(top_p->flags & SD_NUMA)) { + top = top->parent; + top_p = top->parent; + } + imb_span = top_p ? top_p->span_weight : sd->span_weight; + } else { + int factor = max(1U, (sd->span_weight / imb_span)); + + sd->imb_numa_nr = imb * factor; + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
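The imb_numa_nr values produced by the topology.c hunk above can be checked with a small userspace sketch. It only mirrors the arithmetic (nr_llcs, the 25% cutoff for a single LLC, and the imb_span scaling at higher NUMA levels); the topologies are hypothetical examples rather than a description of any specific machine:

#include <stdio.h>

static void calc(const char *name, unsigned int llc_weight,
                 unsigned int node_weight,
                 const unsigned int *numa_spans, int nr_numa)
{
        unsigned int nr_llcs = node_weight / llc_weight;
        unsigned int imb = (nr_llcs == 1) ? node_weight >> 2 : nr_llcs;
        unsigned int imb_span = numa_spans[0];  /* first SD_NUMA parent */
        int i;

        printf("%s: imb_numa_nr=%u at the LLC parent\n", name, imb);
        for (i = 0; i < nr_numa; i++) {
                unsigned int factor = numa_spans[i] / imb_span;

                if (factor < 1)
                        factor = 1;
                printf("  NUMA level spanning %3u CPUs: imb_numa_nr=%u\n",
                       numa_spans[i], imb * factor);
        }
}

int main(void)
{
        /* hypothetical 2-socket machine with 128 CPUs per socket */
        unsigned int one_numa_level[] = { 256 };        /* 1 node per socket */
        unsigned int two_numa_levels[] = { 128, 256 };  /* 2 nodes per socket */

        calc("single 128-CPU LLC per node", 128, 128, one_numa_level, 1);
        calc("8 x 16-CPU LLCs per node", 16, 128, one_numa_level, 1);
        calc("4 x 16-CPU LLCs per 64-CPU node", 16, 64, two_numa_levels, 2);
        return 0;
}

When a node contains several LLCs the allowed imbalance equals the number of LLCs, and each higher NUMA level multiplies it by how many imb_span units it covers, which is what produces the per-level values worked through in the review below.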
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman @ 2022-02-08 16:19 ` Gautham R. Shenoy 2022-02-09 5:10 ` K Prateek Nayak ` (2 subsequent siblings) 3 siblings, 0 replies; 48+ messages in thread From: Gautham R. Shenoy @ 2022-02-08 16:19 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, K Prateek Nayak, LKML On Tue, Feb 08, 2022 at 09:43:34AM +0000, Mel Gorman wrote: [..snip..] > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..e6cd55951304 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + struct sched_domain *top, *top_p; > + unsigned int nr_llcs; > + > + /* > + * For a single LLC per node, allow an > + * imbalance up to 25% of the node. This is an > + * arbitrary cutoff based on SMT-2 to balance > + * between memory bandwidth and avoiding > + * premature sharing of HT resources and SMT-4 > + * or SMT-8 *may* benefit from a different > + * cutoff. > + * > + * For multiple LLCs, allow an imbalance > + * until multiple tasks would share an LLC > + * on one node while LLCs on another node > + * remain idle. > + */ > + nr_llcs = sd->span_weight / child->span_weight; > + if (nr_llcs == 1) > + imb = sd->span_weight >> 2; > + else > + imb = nr_llcs; > + sd->imb_numa_nr = imb; > + > + /* Set span based on the first NUMA domain. */ > + top = sd; > + top_p = top->parent; > + while (top_p && !(top_p->flags & SD_NUMA)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > + } > + } > + } On a 2 Socket Zen3 servers with 64 cores per socket, the imb_numa_nr works out to be as follows for different Node Per Socket (NPS) modes NPS = 1: ====== SMT(span = 2) -- > MC (span = 16) --> DIE (span = 128) --> NUMA (span = 256) Parent of LLC is DIE. nr_llcs = 128/16 = 8. imb = 8. top_p = NUMA. imb_span = 256. for NUMA doman, factor = max(1U, 256/256) = 1. Thus sd->imb_numa_nr = 8. NPS = 2 ======== SMT(span=2)--> MC(span=16)--> NODE(span=64)--> NUMA1(span=128)--> NUMA2(span=256) Parent of LLC = NODE. nr_llcs = 64/16 = 4. imb = 4. top_p = NUMA1. imb_span = 128. For NUMA1 domain, factor = 1. sd->imb_numa_nr = 4. For NUMA2 domain, factor = 2. sd->imb_numa_nr = 8 NPS = 4 ======== SMT(span=2)--> MC(span=16)--> NODE(span=32)--> NUMA1(span=128)--> NUMA2(span=256) Parent of LLC = NODE. nr_llcs = 32/16 = 2. imb = 2. top_p = NUMA1. imb_span = 128. For NUMA1 domain, factor = 1. sd->imb_numa_nr = 2. For NUMA2 domain, factor = 2. sd->imb_numa_nr = 4 The imb_numa_nr looks good for all the NPS modes. 
Furthermore, running STREAM with 16 threads (equal to the number of LLCs in the system) yields good results in all the NPS modes with this imb_numa_nr. Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-08 16:19 ` Gautham R. Shenoy @ 2022-02-09 5:10 ` K Prateek Nayak 2022-02-09 10:33 ` Mel Gorman 2022-02-14 10:27 ` Srikar Dronamraju 2022-02-14 11:03 ` Vincent Guittot 3 siblings, 1 reply; 48+ messages in thread From: K Prateek Nayak @ 2022-02-09 5:10 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML Hello Mel, On 2/8/2022 3:13 PM, Mel Gorman wrote: [..snip..] > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v6 > MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%) > MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%) > MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%) > MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%) I was able to test STREAM without binding on different NPS configurations of two socket Zen3 machine. The results look good: sched-tip - 5.17.0-rc1 tip sched/core mel-v6 - 5.17.0-rc1 tip sched/core + this patch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS1 Test: sched-tip mel-v6 Copy: 114470.18 (0.00 pct) 152806.94 (33.49 pct) Scale: 111575.12 (0.00 pct) 189784.57 (70.09 pct) Add: 125436.15 (0.00 pct) 213371.05 (70.10 pct) Triad: 123068.86 (0.00 pct) 209809.11 (70.48 pct) NPS2 Test: sched-tip mel-v6 Copy: 57936.28 (0.00 pct) 155038.70 (167.60 pct) Scale: 55599.30 (0.00 pct) 192601.59 (246.41 pct) Add: 63096.96 (0.00 pct) 211462.58 (235.13 pct) Triad: 61983.39 (0.00 pct) 208909.34 (237.04 pct) NPS4 Test: sched-tip mel-v6 Copy: 43946.42 (0.00 pct) 119583.69 (172.11 pct) Scale: 33750.96 (0.00 pct) 180130.83 (433.70 pct) Add: 39109.72 (0.00 pct) 170296.68 (335.43 pct) Triad: 36598.88 (0.00 pct) 169953.47 (364.36 pct) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stream with 16 threads. built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100 Zen3, 64C128T per socket, 2 sockets, ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NPS1 Test: sched-tip mel-v6 Copy: 132402.79 (0.00 pct) 225587.85 (70.37 pct) Scale: 126923.02 (0.00 pct) 214363.58 (68.89 pct) Add: 145596.55 (0.00 pct) 260901.92 (79.19 pct) Triad: 143092.91 (0.00 pct) 249081.79 (74.06 pct) NPS 2 Test: sched-tip mel-v6 Copy: 107386.27 (0.00 pct) 227623.31 (111.96 pct) Scale: 100941.44 (0.00 pct) 218116.63 (116.08 pct) Add: 115854.52 (0.00 pct) 272756.95 (135.43 pct) Triad: 113369.96 (0.00 pct) 260235.32 (129.54 pct) NPS4 Test: sched-tip mel-v6 Copy: 91083.07 (0.00 pct) 247163.90 (171.36 pct) Scale: 90352.54 (0.00 pct) 223914.31 (147.82 pct) Add: 101973.98 (0.00 pct) 272842.42 (167.56 pct) Triad: 99773.65 (0.00 pct) 258904.54 (159.49 pct) There is a significant improvement throughout the board with v6 outperforming tip/sched/core in every case! Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> -- Thanks and Regards Prateek ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-09 5:10 ` K Prateek Nayak @ 2022-02-09 10:33 ` Mel Gorman 2022-02-11 19:02 ` Jirka Hladky 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2022-02-09 10:33 UTC (permalink / raw) To: K Prateek Nayak Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote: > There is a significant improvement throughout the board > with v6 outperforming tip/sched/core in every case! > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > Thanks very much Prateek and Gautham for reviewing and testing! -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-09 10:33 ` Mel Gorman @ 2022-02-11 19:02 ` Jirka Hladky 0 siblings, 0 replies; 48+ messages in thread From: Jirka Hladky @ 2022-02-11 19:02 UTC (permalink / raw) To: Mel Gorman Cc: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML Hi Mel, we have tested 5.17.0-rc2 + v6 of your patch series and the results are very good! We see performance gains of up to 25% for hackbench (32 processes). Even more importantly, we don't see any of the performance losses that we experienced with previous versions of the patch series on the NAS and SPECjbb2005 workloads. Tested-by: Jirka Hladky <jhladky@redhat.com> Thanks a lot Jirka On Wed, Feb 9, 2022 at 12:08 PM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote: > > There is a significant improvement throughout the board > > with v6 outperforming tip/sched/core in every case! > > > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> > > > > Thanks very much Prateek and Gautham for reviewing and testing! > > -- > Mel Gorman > SUSE Labs > -- -Jirka ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-08 16:19 ` Gautham R. Shenoy 2022-02-09 5:10 ` K Prateek Nayak @ 2022-02-14 10:27 ` Srikar Dronamraju 2022-02-14 11:03 ` Vincent Guittot 3 siblings, 0 replies; 48+ messages in thread From: Srikar Dronamraju @ 2022-02-14 10:27 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, K Prateek Nayak, LKML * Mel Gorman <mgorman@techsingularity.net> [2022-02-08 09:43:34]: > Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA > nodes") allowed an imbalance between NUMA nodes such that communicating > tasks would not be pulled apart by the load balancer. This works fine when > there is a 1:1 relationship between LLC and node but can be suboptimal > for multiple LLCs if independent tasks prematurely use CPUs sharing cache. > > Zen* has multiple LLCs per node with local memory channels and due to > the allowed imbalance, it's far harder to tune some workloads to run > optimally than it is on hardware that has 1 LLC per node. This patch > allows an imbalance to exist up to the point where LLCs should be balanced > between nodes. > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.17.0-rc0 5.17.0-rc0 > Looks good to me. Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman ` (2 preceding siblings ...) 2022-02-14 10:27 ` Srikar Dronamraju @ 2022-02-14 11:03 ` Vincent Guittot 3 siblings, 0 replies; 48+ messages in thread From: Vincent Guittot @ 2022-02-14 11:03 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, K Prateek Nayak, LKML On Tue, 8 Feb 2022 at 10:44, Mel Gorman <mgorman@techsingularity.net> wrote: > > Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA > nodes") allowed an imbalance between NUMA nodes such that communicating > tasks would not be pulled apart by the load balancer. This works fine when > there is a 1:1 relationship between LLC and node but can be suboptimal > for multiple LLCs if independent tasks prematurely use CPUs sharing cache. > > Zen* has multiple LLCs per node with local memory channels and due to > the allowed imbalance, it's far harder to tune some workloads to run > optimally than it is on hardware that has 1 LLC per node. This patch > allows an imbalance to exist up to the point where LLCs should be balanced > between nodes. > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v6 > MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%) > MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%) > MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%) > MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%) > > STREAM can use directives to force the spread if the OpenMP is new > enough but that doesn't help if an application uses threads and > it's not known in advance how many threads will be created. > > Coremark is a CPU and cache intensive benchmark parallelised with > threads. When running with 1 thread per core, the vanilla kernel > allows threads to contend on cache. With the patch; > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v5 > Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%) > Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%* > Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%) > Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%) > CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%) > > It can also make a big difference for semi-realistic workloads > like specjbb which can execute arbitrary numbers of threads without > advance knowledge of how they should be placed. Even in cases where > the average performance is neutral, the results are more stable. 
> > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v6 > Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%) > Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%) > Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%) > Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%) > Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%) > Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%* > Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%) > Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%) > Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%) > Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%) > Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%) > Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%) > Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%) > Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%) > Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%) > Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%) > Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%) > Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%) > Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%) > Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%) > Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%) > Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%) > Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%) > Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%) > Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%) > Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%) > Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%) > Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%) > Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%) > Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%) > Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%) > Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%) > Stddev tput-120 186384.86 ( 0.00%) 45812.03 ( 75.42%) > Stddev tput-128 78761.09 ( 0.00%) 57418.48 ( 27.10%) > > Similarly, for embarassingly parallel problems like NPB-ep, there are > improvements due to better spreading across LLC when the machine is not > fully utilised. > > vanilla sched-numaimb-v6 > Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%) > Amean ep.D 31.86 ( 0.00%) 26.17 * 17.86%* > Stddev ep.D 0.07 ( 0.00%) 0.05 ( 24.41%) > CoeffVar ep.D 0.22 ( 0.00%) 0.20 ( 7.97%) > Max ep.D 31.93 ( 0.00%) 26.21 ( 17.91%) > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Good to see that you have been able to move on SD_NUMA instead of SD_PREFER_SIBLING. 
The allowed imbalance looks also more consistent whatever the number of LLC Reviewed-by: Vincent Guittot <vincent.guitto@linaro.org> > --- > include/linux/sched/topology.h | 1 + > kernel/sched/fair.c | 22 +++++++------- > kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ > 3 files changed, 66 insertions(+), 10 deletions(-) > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index 8054641c0a7b..56cffe42abbc 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */ > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 4592ccf82c34..538756bd8e7f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1503,7 +1504,7 @@ struct task_numa_env { > static unsigned long cpu_load(struct rq *rq); > static unsigned long cpu_runnable(struct rq *rq); > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight); > + int dst_running, int imb_numa_nr); > > static inline enum > numa_type numa_classify(unsigned int imbalance_pct, > @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > * This is an approximation as the number of running tasks may not be > * related to the number of busy CPUs due to sched_setaffinity. > */ > -static inline bool > -allow_numa_imbalance(unsigned int running, unsigned int weight) > +static inline bool allow_numa_imbalance(int running, int imb_numa_nr) > { > - return (running < (weight >> 2)); > + return running <= imb_numa_nr; > } > > /* > @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > * allowed. If there is a real need of migration, > * periodic load balance will take care of it. 
> */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight)) > + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) > return NULL; > } > > @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > #define NUMA_IMBALANCE_MIN 2 > > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int imb_numa_nr) > { > - if (!allow_numa_imbalance(dst_running, dst_weight)) > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > return imbalance; > > /* > @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - local->sum_nr_running + 1, local->group_weight); > + local->sum_nr_running + 1, env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..e6cd55951304 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + struct sched_domain *top, *top_p; > + unsigned int nr_llcs; > + > + /* > + * For a single LLC per node, allow an > + * imbalance up to 25% of the node. This is an > + * arbitrary cutoff based on SMT-2 to balance > + * between memory bandwidth and avoiding > + * premature sharing of HT resources and SMT-4 > + * or SMT-8 *may* benefit from a different > + * cutoff. > + * > + * For multiple LLCs, allow an imbalance > + * until multiple tasks would share an LLC > + * on one node while LLCs on another node > + * remain idle. > + */ > + nr_llcs = sd->span_weight / child->span_weight; > + if (nr_llcs == 1) > + imb = sd->span_weight >> 2; > + else > + imb = nr_llcs; > + sd->imb_numa_nr = imb; > + > + /* Set span based on the first NUMA domain. */ > + top = sd; > + top_p = top->parent; > + while (top_p && !(top_p->flags & SD_NUMA)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > + } > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v5 0/2] Adjust NUMA imbalance for multiple LLCs @ 2022-02-03 14:46 Mel Gorman 2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML, Mel Gorman Changelog since V4 o Scale imbalance based on the top domain that prefers siblings o Keep allowed imbalance as 2 up to the point where LLCs can be overloaded Changelog since V3 o Calculate imb_numa_nr for multiple SD_NUMA domains o Restore behaviour where communicating pairs remain on the same node Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. The series addresses two problems -- inconsistent logic when allowing a NUMA imbalance and sub-optimal performance when there are many LLCs per NUMA node. include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 30 ++++++++++--------- kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ 3 files changed, 71 insertions(+), 13 deletions(-) -- 2.31.1 Mel Gorman (2): sched/fair: Improve consistency of allowed NUMA balance calculations sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 30 ++++++++++--------- kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ 3 files changed, 71 insertions(+), 13 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for " Mel Gorman @ 2022-02-03 14:46 ` Mel Gorman 2022-02-04 1:30 ` kernel test robot ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch allows an imbalance to exist up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have on instance per LLC the results and without binding, the results are 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 MB/sec copy-16 162596.94 ( 0.00%) 501967.12 ( 208.72%) MB/sec scale-16 136901.28 ( 0.00%) 376531.50 ( 175.04%) MB/sec add-16 157300.70 ( 0.00%) 569997.42 ( 262.36%) MB/sec triad-16 151446.88 ( 0.00%) 553204.54 ( 265.28%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per core, the vanilla kernel allows threads to contend on cache. With the patch; 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 Min Score-16 368239.36 ( 0.00%) 400876.92 ( 8.86%) Hmean Score-16 388607.33 ( 0.00%) 441447.30 * 13.60%* Max Score-16 408945.69 ( 0.00%) 478826.87 ( 17.09%) Stddev Score-16 15247.04 ( 0.00%) 34061.76 (-123.40%) CoeffVar Score-16 3.92 ( 0.00%) 7.67 ( -95.82%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 Hmean tput-1 71631.55 ( 0.00%) 70383.46 ( -1.74%) Hmean tput-8 582758.78 ( 0.00%) 607290.89 * 4.21%* Hmean tput-16 1020372.75 ( 0.00%) 1031257.25 ( 1.07%) Hmean tput-24 1416430.67 ( 0.00%) 1587576.33 * 12.08%* Hmean tput-32 1687702.72 ( 0.00%) 1724207.51 ( 2.16%) Hmean tput-40 1798094.90 ( 0.00%) 1983053.56 * 10.29%* Hmean tput-48 1972731.77 ( 0.00%) 2157461.70 ( 9.36%) Hmean tput-56 2386872.38 ( 0.00%) 2193237.42 ( -8.11%) Hmean tput-64 2536954.17 ( 0.00%) 2588741.08 ( 2.04%) Hmean tput-72 2585071.36 ( 0.00%) 2654776.36 ( 2.70%) Hmean tput-80 2960523.94 ( 0.00%) 2894657.12 ( -2.22%) Hmean tput-88 3061408.57 ( 0.00%) 2903167.72 ( -5.17%) Hmean tput-96 3052394.82 ( 0.00%) 2872605.46 ( -5.89%) Hmean tput-104 2997814.76 ( 0.00%) 3013660.26 ( 0.53%) Hmean tput-112 2955353.29 ( 0.00%) 3029122.16 ( 2.50%) Hmean tput-120 2889770.71 ( 0.00%) 2957739.88 ( 2.35%) Hmean tput-128 2871713.84 ( 0.00%) 2912410.18 ( 1.42%) In general, the standard deviation figures also are a lot more stable. 
Similarly, for embarassingly parallel problems like NPB-ep, there are improvements due to better spreading across LLC when the machine is not fully utilised. vanilla sched-numaimb-v5r12 Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%) Amean ep.D 31.86 ( 0.00%) 26.26 * 17.58%* Stddev ep.D 0.07 ( 0.00%) 0.18 (-157.54%) CoeffVar ep.D 0.22 ( 0.00%) 0.69 (-212.46%) Max ep.D 31.93 ( 0.00%) 26.46 ( 17.13%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 22 +++++++------- kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ 3 files changed, 66 insertions(+), 10 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 8054641c0a7b..56cffe42abbc 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4592ccf82c34..86abf97a8df6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1503,7 +1504,7 @@ struct task_numa_env { static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_runnable(struct rq *rq); static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight); + int dst_running, int imb_numa_nr); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest, * This is an approximation as the number of running tasks may not be * related to the number of busy CPUs due to sched_setaffinity. */ -static inline bool -allow_numa_imbalance(unsigned int running, unsigned int weight) +static inline bool allow_numa_imbalance(int running, int imb_numa_nr) { - return (running < (weight >> 2)); + return running < imb_numa_nr; } /* @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) * allowed. If there is a real need of migration, * periodic load balance will take care of it. 
*/ - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) return NULL; } @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd #define NUMA_IMBALANCE_MIN 2 static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; /* @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - local->sum_nr_running + 1, local->group_weight); + local->sum_nr_running + 1, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..e6cd55951304 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* + * Calculate an allowed NUMA imbalance such that LLCs do not get + * imbalanced. + */ + for_each_cpu(i, cpu_map) { + unsigned int imb = 0; + unsigned int imb_span = 1; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *top, *top_p; + unsigned int nr_llcs; + + /* + * For a single LLC per node, allow an + * imbalance up to 25% of the node. This is an + * arbitrary cutoff based on SMT-2 to balance + * between memory bandwidth and avoiding + * premature sharing of HT resources and SMT-4 + * or SMT-8 *may* benefit from a different + * cutoff. + * + * For multiple LLCs, allow an imbalance + * until multiple tasks would share an LLC + * on one node while LLCs on another node + * remain idle. + */ + nr_llcs = sd->span_weight / child->span_weight; + if (nr_llcs == 1) + imb = sd->span_weight >> 2; + else + imb = nr_llcs; + sd->imb_numa_nr = imb; + + /* Set span based on the first NUMA domain. */ + top = sd; + top_p = top->parent; + while (top_p && !(top_p->flags & SD_NUMA)) { + top = top->parent; + top_p = top->parent; + } + imb_span = top_p ? top_p->span_weight : sd->span_weight; + } else { + int factor = max(1U, (sd->span_weight / imb_span)); + + sd->imb_numa_nr = imb * factor; + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman @ 2022-02-04 1:30 ` kernel test robot 2022-02-04 7:06 ` Srikar Dronamraju 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek) 2 siblings, 0 replies; 48+ messages in thread From: kernel test robot @ 2022-02-04 1:30 UTC (permalink / raw) To: kbuild-all [-- Attachment #1: Type: text/plain, Size: 18268 bytes --] Hi Mel, I love your patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on tip/master linux/master linus/master v5.17-rc2 next-20220203] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ec2444530612a886b406e2830d7f314d1a07d4bb config: parisc-randconfig-s031-20220131 (https://download.01.org/0day-ci/archive/20220204/202202040638.UP73q8Dl-lkp(a)intel.com/config) compiler: hppa-linux-gcc (GCC) 11.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # apt-get install sparse # sparse version: v0.6.4-dirty # https://github.com/0day-ci/linux/commit/81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129 git checkout 81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811 # save the config file to linux build tree mkdir build_dir COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc SHELL=/bin/bash kernel/sched/ If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> sparse warnings: (new ones prefixed by >>) kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *pd @@ got struct perf_domain [noderef] __rcu *pd @@ kernel/sched/topology.c:461:19: sparse: expected struct perf_domain *pd kernel/sched/topology.c:461:19: sparse: got struct perf_domain [noderef] __rcu *pd kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:623:49: sparse: expected struct sched_domain *parent kernel/sched/topology.c:623:49: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:694:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:694:50: sparse: expected struct sched_domain *parent kernel/sched/topology.c:694:50: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:701:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@ kernel/sched/topology.c:701:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child 
kernel/sched/topology.c:701:55: sparse: got struct sched_domain *[assigned] tmp kernel/sched/topology.c:711:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:711:29: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:711:29: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:716:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:716:20: sparse: expected struct sched_domain *sd kernel/sched/topology.c:716:20: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:737:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@ kernel/sched/topology.c:737:13: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:737:13: sparse: got struct sched_domain [noderef] __rcu *sd kernel/sched/topology.c:899:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:899:70: sparse: expected struct sched_domain *sd kernel/sched/topology.c:899:70: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:928:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:928:59: sparse: expected struct sched_domain *sd kernel/sched/topology.c:928:59: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:974:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:974:57: sparse: expected struct sched_domain *sd kernel/sched/topology.c:974:57: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:976:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:976:25: sparse: expected struct sched_domain *sibling kernel/sched/topology.c:976:25: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:984:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:984:55: sparse: expected struct sched_domain *sd kernel/sched/topology.c:984:55: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:986:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:986:25: sparse: expected struct sched_domain *sibling kernel/sched/topology.c:986:25: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1056:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1056:62: 
sparse: expected struct sched_domain *sd kernel/sched/topology.c:1056:62: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1160:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1160:40: sparse: expected struct sched_domain *child kernel/sched/topology.c:1160:40: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1571:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@ kernel/sched/topology.c:1571:43: sparse: expected struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1571:43: sparse: got struct sched_domain *child kernel/sched/topology.c:2130:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@ kernel/sched/topology.c:2130:31: sparse: expected struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2130:31: sparse: got struct sched_domain *sd kernel/sched/topology.c:2233:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2233:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2233:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2254:56: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:2254:56: sparse: expected struct sched_domain *child kernel/sched/topology.c:2254:56: sparse: got struct sched_domain [noderef] __rcu *child >> kernel/sched/topology.c:2284:39: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *top_p @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2284:39: sparse: expected struct sched_domain *top_p kernel/sched/topology.c:2284:39: sparse: got struct sched_domain [noderef] __rcu *parent >> kernel/sched/topology.c:2286:45: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] top @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2286:45: sparse: expected struct sched_domain *[assigned] top kernel/sched/topology.c:2286:45: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2287:47: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *top_p @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2287:47: sparse: expected struct sched_domain *top_p kernel/sched/topology.c:2287:47: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2253:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2253:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2253:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2303:57: sparse: sparse: incorrect type in assignment (different address 
spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2303:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2303:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c: note: in included file: kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:929:31: sparse: sparse: dereference of noderef expression kernel/sched/topology.c:1592:19: sparse: sparse: dereference of noderef expression vim +2284 kernel/sched/topology.c 2186 2187 /* 2188 * Build sched domains for a given set of CPUs and attach the sched domains 2189 * to the individual CPUs 2190 */ 2191 static int 2192 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr) 2193 { 2194 enum s_alloc alloc_state = sa_none; 2195 struct sched_domain *sd; 2196 struct s_data d; 2197 struct rq *rq = NULL; 2198 int i, ret = -ENOMEM; 2199 bool has_asym = false; 2200 2201 if (WARN_ON(cpumask_empty(cpu_map))) 2202 goto error; 2203 2204 alloc_state = __visit_domain_allocation_hell(&d, cpu_map); 2205 if (alloc_state != sa_rootdomain) 2206 goto error; 2207 2208 /* Set up domains for CPUs specified by the cpu_map: */ 2209 for_each_cpu(i, cpu_map) { 2210 struct sched_domain_topology_level *tl; 2211 2212 sd = NULL; 2213 for_each_sd_topology(tl) { 2214 2215 if (WARN_ON(!topology_span_sane(tl, cpu_map, i))) 2216 goto error; 2217 2218 sd = build_sched_domain(tl, cpu_map, attr, sd, i); 2219 2220 has_asym |= sd->flags & SD_ASYM_CPUCAPACITY; 2221 2222 if (tl == sched_domain_topology) 2223 *per_cpu_ptr(d.sd, i) = sd; 2224 if (tl->flags & SDTL_OVERLAP) 2225 sd->flags |= SD_OVERLAP; 2226 if (cpumask_equal(cpu_map, sched_domain_span(sd))) 2227 break; 2228 } 2229 } 2230 2231 /* Build the groups for the domains */ 2232 for_each_cpu(i, cpu_map) { 2233 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2234 sd->span_weight = cpumask_weight(sched_domain_span(sd)); 2235 if (sd->flags & SD_OVERLAP) { 2236 if (build_overlap_sched_groups(sd, i)) 2237 goto error; 2238 } else { 2239 if 
(build_sched_groups(sd, i)) 2240 goto error; 2241 } 2242 } 2243 } 2244 2245 /* 2246 * Calculate an allowed NUMA imbalance such that LLCs do not get 2247 * imbalanced. 2248 */ 2249 for_each_cpu(i, cpu_map) { 2250 unsigned int imb = 0; 2251 unsigned int imb_span = 1; 2252 2253 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2254 struct sched_domain *child = sd->child; 2255 2256 if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && 2257 (child->flags & SD_SHARE_PKG_RESOURCES)) { 2258 struct sched_domain *top, *top_p; 2259 unsigned int nr_llcs; 2260 2261 /* 2262 * For a single LLC per node, allow an 2263 * imbalance up to 25% of the node. This is an 2264 * arbitrary cutoff based on SMT-2 to balance 2265 * between memory bandwidth and avoiding 2266 * premature sharing of HT resources and SMT-4 2267 * or SMT-8 *may* benefit from a different 2268 * cutoff. 2269 * 2270 * For multiple LLCs, allow an imbalance 2271 * until multiple tasks would share an LLC 2272 * on one node while LLCs on another node 2273 * remain idle. 2274 */ 2275 nr_llcs = sd->span_weight / child->span_weight; 2276 if (nr_llcs == 1) 2277 imb = sd->span_weight >> 2; 2278 else 2279 imb = nr_llcs; 2280 sd->imb_numa_nr = imb; 2281 2282 /* Set span based on the first NUMA domain. */ 2283 top = sd; > 2284 top_p = top->parent; 2285 while (top_p && !(top_p->flags & SD_NUMA)) { > 2286 top = top->parent; 2287 top_p = top->parent; 2288 } 2289 imb_span = top_p ? top_p->span_weight : sd->span_weight; 2290 } else { 2291 int factor = max(1U, (sd->span_weight / imb_span)); 2292 2293 sd->imb_numa_nr = imb * factor; 2294 } 2295 } 2296 } 2297 2298 /* Calculate CPU capacity for physical packages and nodes */ 2299 for (i = nr_cpumask_bits-1; i >= 0; i--) { 2300 if (!cpumask_test_cpu(i, cpu_map)) 2301 continue; 2302 2303 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2304 claim_allocations(i, sd); 2305 init_sched_groups_capacity(i, sd); 2306 } 2307 } 2308 2309 /* Attach the domains */ 2310 rcu_read_lock(); 2311 for_each_cpu(i, cpu_map) { 2312 rq = cpu_rq(i); 2313 sd = *per_cpu_ptr(d.sd, i); 2314 2315 /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */ 2316 if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity)) 2317 WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig); 2318 2319 cpu_attach_domain(sd, d.rd, i); 2320 } 2321 rcu_read_unlock(); 2322 2323 if (has_asym) 2324 static_branch_inc_cpuslocked(&sched_asym_cpucapacity); 2325 2326 if (rq && sched_debug_verbose) { 2327 pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n", 2328 cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity); 2329 } 2330 2331 ret = 0; 2332 error: 2333 __free_domain_allocs(&d, alloc_state, cpu_map); 2334 2335 return ret; 2336 } 2337 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-04 1:30 ` kernel test robot @ 2022-02-04 7:06 ` Srikar Dronamraju 2022-02-04 9:04 ` Mel Gorman 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek) 2 siblings, 1 reply; 48+ messages in thread From: Srikar Dronamraju @ 2022-02-04 7:06 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML * Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]: > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..e6cd55951304 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ We seem to adding this hunk before the sched_domains may be degenerated. Wondering if we really want to do it before degeneration. Let say we have 3 sched domains and we calculated the sd->imb_numa_nr for all the 3 domains, then lets say the middle sched_domain gets degenerated. Would the sd->imb_numa_nr's still be relevant? > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + struct sched_domain *top, *top_p; > + unsigned int nr_llcs; > + > + /* > + * For a single LLC per node, allow an > + * imbalance up to 25% of the node. This is an > + * arbitrary cutoff based on SMT-2 to balance > + * between memory bandwidth and avoiding > + * premature sharing of HT resources and SMT-4 > + * or SMT-8 *may* benefit from a different > + * cutoff. > + * > + * For multiple LLCs, allow an imbalance > + * until multiple tasks would share an LLC > + * on one node while LLCs on another node > + * remain idle. > + */ > + nr_llcs = sd->span_weight / child->span_weight; > + if (nr_llcs == 1) > + imb = sd->span_weight >> 2; > + else > + imb = nr_llcs; > + sd->imb_numa_nr = imb; > + > + /* Set span based on the first NUMA domain. */ > + top = sd; > + top_p = top->parent; > + while (top_p && !(top_p->flags & SD_NUMA)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; I am getting confused by imb_span. Let say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with SMT and MC domains having SD_SHARE_PKG_RESOURCES flag set. We come here only for DIE domain. imb_span set here is being used for both the subsequent sched domains most likely they will be NUMA domains. Right? > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; For SMT, (or any sched domains below the llcs) factor would be sd->span_weight but imb_numa_nr and imb would be 0. For NUMA (or any sched domain just above DIE), factor would be sd->imb_numa_nr would be nr_llcs. For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of nr_llcs. Right? 
> + } > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) > -- > 2.31.1 > -- Thanks and Regards Srikar Dronamraju ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-04 7:06 ` Srikar Dronamraju @ 2022-02-04 9:04 ` Mel Gorman 0 siblings, 0 replies; 48+ messages in thread From: Mel Gorman @ 2022-02-04 9:04 UTC (permalink / raw) To: Srikar Dronamraju Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML On Fri, Feb 04, 2022 at 12:36:54PM +0530, Srikar Dronamraju wrote: > * Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]: > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..e6cd55951304 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* > > + * Calculate an allowed NUMA imbalance such that LLCs do not get > > + * imbalanced. > > + */ > > We seem to adding this hunk before the sched_domains may be degenerated. > Wondering if we really want to do it before degeneration. > There was no obvious advantage versus doing it at the same time characteristics like groups were being determined. > Let say we have 3 sched domains and we calculated the sd->imb_numa_nr for > all the 3 domains, then lets say the middle sched_domain gets degenerated. > Would the sd->imb_numa_nr's still be relevant? > It's expected that it is still relevant as the ratios with respect to SD_SHARE_PKG_RESOURCES should still be consistent. > > > + for_each_cpu(i, cpu_map) { > > + unsigned int imb = 0; > > + unsigned int imb_span = 1; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > + struct sched_domain *top, *top_p; > > + unsigned int nr_llcs; > > + > > + /* > > + * For a single LLC per node, allow an > > + * imbalance up to 25% of the node. This is an > > + * arbitrary cutoff based on SMT-2 to balance > > + * between memory bandwidth and avoiding > > + * premature sharing of HT resources and SMT-4 > > + * or SMT-8 *may* benefit from a different > > + * cutoff. > > + * > > + * For multiple LLCs, allow an imbalance > > + * until multiple tasks would share an LLC > > + * on one node while LLCs on another node > > + * remain idle. > > + */ > > + nr_llcs = sd->span_weight / child->span_weight; > > + if (nr_llcs == 1) > > + imb = sd->span_weight >> 2; > > + else > > + imb = nr_llcs; > > + sd->imb_numa_nr = imb; > > + > > + /* Set span based on the first NUMA domain. */ > > + top = sd; > > + top_p = top->parent; > > + while (top_p && !(top_p->flags & SD_NUMA)) { > > + top = top->parent; > > + top_p = top->parent; > > + } > > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > > I am getting confused by imb_span. > Let say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with SMT and > MC domains having SD_SHARE_PKG_RESOURCES flag set. > We come here only for DIE domain. > > imb_span set here is being used for both the subsequent sched domains > most likely they will be NUMA domains. Right? > Right. > > + } else { > > + int factor = max(1U, (sd->span_weight / imb_span)); > > + > > + sd->imb_numa_nr = imb * factor; > > For SMT, (or any sched domains below the llcs) factor would be > sd->span_weight but imb_numa_nr and imb would be 0. Yes. 
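To spell the whole loop out with made-up numbers (purely illustrative, not taken from a real machine): assume the SMT -> MC -> DIE -> NUMA1 -> NUMA2 topology above has span_weights 2, 8, 32, 64 and 128, with MC being the LLC. SMT and MC are visited first and take the else branch while imb is still 0, so they end up with imb_numa_nr = 0. DIE is the first domain without SD_SHARE_PKG_RESOURCES whose child has it: nr_llcs = 32 / 8 = 4, so imb = imb_numa_nr = 4 and imb_span = 64, the span of the first SD_NUMA parent. NUMA1 then gets factor = max(1, 64 / 64) = 1 and imb_numa_nr = 4, while NUMA2 gets factor = max(1, 128 / 64) = 2 and imb_numa_nr = 8.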
> For NUMA (or any sched domain just above DIE), factor would be > sd->imb_numa_nr would be nr_llcs. > For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of > nr_llcs. Right? > Right. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-04 1:30 ` kernel test robot 2022-02-04 7:06 ` Srikar Dronamraju @ 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek) 2022-02-04 16:45 ` Mel Gorman 2 siblings, 1 reply; 48+ messages in thread From: Nayak, KPrateek (K Prateek) @ 2022-02-04 15:07 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML Hello Mel, On 2/3/2022 8:16 PM, Mel Gorman wrote: > Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA > nodes") allowed an imbalance between NUMA nodes such that communicating > tasks would not be pulled apart by the load balancer. This works fine when > there is a 1:1 relationship between LLC and node but can be suboptimal > for multiple LLCs if independent tasks prematurely use CPUs sharing cache. > > Zen* has multiple LLCs per node with local memory channels and due to > the allowed imbalance, it's far harder to tune some workloads to run > optimally than it is on hardware that has 1 LLC per node. This patch > allows an imbalance to exist up to the point where LLCs should be balanced > between nodes. > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v5 > MB/sec copy-16 162596.94 ( 0.00%) 501967.12 ( 208.72%) > MB/sec scale-16 136901.28 ( 0.00%) 376531.50 ( 175.04%) > MB/sec add-16 157300.70 ( 0.00%) 569997.42 ( 262.36%) > MB/sec triad-16 151446.88 ( 0.00%) 553204.54 ( 265.28%) > > STREAM can use directives to force the spread if the OpenMP is new > enough but that doesn't help if an application uses threads and > it's not known in advance how many threads will be created. > > Coremark is a CPU and cache intensive benchmark parallelised with > threads. When running with 1 thread per core, the vanilla kernel > allows threads to contend on cache. 
With the patch; > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v5 > Min Score-16 368239.36 ( 0.00%) 400876.92 ( 8.86%) > Hmean Score-16 388607.33 ( 0.00%) 441447.30 * 13.60%* > Max Score-16 408945.69 ( 0.00%) 478826.87 ( 17.09%) > Stddev Score-16 15247.04 ( 0.00%) 34061.76 (-123.40%) > CoeffVar Score-16 3.92 ( 0.00%) 7.67 ( -95.82%) > > It can also make a big difference for semi-realistic workloads > like specjbb which can execute arbitrary numbers of threads without > advance knowledge of how they should be placed > > 5.17.0-rc0 5.17.0-rc0 > vanilla sched-numaimb-v5 > Hmean tput-1 71631.55 ( 0.00%) 70383.46 ( -1.74%) > Hmean tput-8 582758.78 ( 0.00%) 607290.89 * 4.21%* > Hmean tput-16 1020372.75 ( 0.00%) 1031257.25 ( 1.07%) > Hmean tput-24 1416430.67 ( 0.00%) 1587576.33 * 12.08%* > Hmean tput-32 1687702.72 ( 0.00%) 1724207.51 ( 2.16%) > Hmean tput-40 1798094.90 ( 0.00%) 1983053.56 * 10.29%* > Hmean tput-48 1972731.77 ( 0.00%) 2157461.70 ( 9.36%) > Hmean tput-56 2386872.38 ( 0.00%) 2193237.42 ( -8.11%) > Hmean tput-64 2536954.17 ( 0.00%) 2588741.08 ( 2.04%) > Hmean tput-72 2585071.36 ( 0.00%) 2654776.36 ( 2.70%) > Hmean tput-80 2960523.94 ( 0.00%) 2894657.12 ( -2.22%) > Hmean tput-88 3061408.57 ( 0.00%) 2903167.72 ( -5.17%) > Hmean tput-96 3052394.82 ( 0.00%) 2872605.46 ( -5.89%) > Hmean tput-104 2997814.76 ( 0.00%) 3013660.26 ( 0.53%) > Hmean tput-112 2955353.29 ( 0.00%) 3029122.16 ( 2.50%) > Hmean tput-120 2889770.71 ( 0.00%) 2957739.88 ( 2.35%) > Hmean tput-128 2871713.84 ( 0.00%) 2912410.18 ( 1.42%) > > In general, the standard deviation figures also are a lot more > stable. > > Similarly, for embarassingly parallel problems like NPB-ep, there are > improvements due to better spreading across LLC when the machine is not > fully utilised. > > vanilla sched-numaimb-v5r12 > Min ep.D 31.79 ( 0.00%) 26.11 ( 17.87%) > Amean ep.D 31.86 ( 0.00%) 26.26 * 17.58%* > Stddev ep.D 0.07 ( 0.00%) 0.18 (-157.54%) > CoeffVar ep.D 0.22 ( 0.00%) 0.69 (-212.46%) > Max ep.D 31.93 ( 0.00%) 26.46 ( 17.13%) > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > include/linux/sched/topology.h | 1 + > kernel/sched/fair.c | 22 +++++++------- > kernel/sched/topology.c | 53 ++++++++++++++++++++++++++++++++++ > 3 files changed, 66 insertions(+), 10 deletions(-) > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index 8054641c0a7b..56cffe42abbc 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr running tasks that allows a NUMA imbalance */ > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 4592ccf82c34..86abf97a8df6 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1503,7 +1504,7 @@ struct task_numa_env { > static unsigned long cpu_load(struct rq *rq); > static unsigned long cpu_runnable(struct rq *rq); > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight); > + int dst_running, int imb_numa_nr); > > static inline enum > numa_type numa_classify(unsigned 
int imbalance_pct, > @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > * This is an approximation as the number of running tasks may not be > * related to the number of busy CPUs due to sched_setaffinity. > */ > -static inline bool > -allow_numa_imbalance(unsigned int running, unsigned int weight) > +static inline bool allow_numa_imbalance(int running, int imb_numa_nr) > { > - return (running < (weight >> 2)); > + return running < imb_numa_nr; > } > > /* > @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > * allowed. If there is a real need of migration, > * periodic load balance will take care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight)) > + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) Could you please clarify why are we adding 1 to local_sgs.sum_nr_running while allowing imbalance? allow_numa_imbalance allows the imbalance based on the following inequality: running < imb_numa_nr Consider on a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain. Assume the group is running 7 task and we are finding the idlest group for the 8th task: sd->imb_numa_nr = 8 local_sgs.sum_nr_running = 7 In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr and if we allow NUMA imbalance and place the task in the same group, each task can be given one LLC. However, allow_numa_imbalance returns 0 for the above case and can lead to task being placed on a different NUMA group. In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/), the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/) used the inequality "<=" to allow NUMA imbalance where we needed to consider the additional load CPU had to bear. However that doesn't seem to be the case here. 
> return NULL; > } > > @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > #define NUMA_IMBALANCE_MIN 2 > > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int imb_numa_nr) > { > - if (!allow_numa_imbalance(dst_running, dst_weight)) > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > return imbalance; > > /* > @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - local->sum_nr_running + 1, local->group_weight); > + local->sum_nr_running + 1, env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..e6cd55951304 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* > + * Calculate an allowed NUMA imbalance such that LLCs do not get > + * imbalanced. > + */ > + for_each_cpu(i, cpu_map) { > + unsigned int imb = 0; > + unsigned int imb_span = 1; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + struct sched_domain *top, *top_p; > + unsigned int nr_llcs; > + > + /* > + * For a single LLC per node, allow an > + * imbalance up to 25% of the node. This is an > + * arbitrary cutoff based on SMT-2 to balance > + * between memory bandwidth and avoiding > + * premature sharing of HT resources and SMT-4 > + * or SMT-8 *may* benefit from a different > + * cutoff. > + * > + * For multiple LLCs, allow an imbalance > + * until multiple tasks would share an LLC > + * on one node while LLCs on another node > + * remain idle. > + */ To add to my point above, the comment here says - "allow an imbalance until multiple tasks would share an LLC on one node" Whereas, in the case I highlighted above, we see balancing kick in with possibly one LLC being unaccounted for. > + nr_llcs = sd->span_weight / child->span_weight; > + if (nr_llcs == 1) > + imb = sd->span_weight >> 2; > + else > + imb = nr_llcs; > + sd->imb_numa_nr = imb; > + > + /* Set span based on the first NUMA domain. */ > + top = sd; > + top_p = top->parent; > + while (top_p && !(top_p->flags & SD_NUMA)) { > + top = top->parent; > + top_p = top->parent; > + } > + imb_span = top_p ? top_p->span_weight : sd->span_weight; > + } else { > + int factor = max(1U, (sd->span_weight / imb_span)); > + > + sd->imb_numa_nr = imb * factor; > + } > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) Please correct me if I'm wrong. Thanks and Regards Prateek ^ permalink raw reply [flat|nested] 48+ messages in thread
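As a minimal standalone sketch of the boundary case described above (the helper name mirrors the kernel one, but the figures -- 8 LLCs per NUMA group, 7 tasks already running -- are taken from the example in this mail rather than from a real topology):

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the patched helper: the imbalance is only allowed while the
 * running count stays strictly below imb_numa_nr. */
static bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running < imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 8;	/* one per LLC in the group */
	int sum_nr_running = 7;	/* tasks already in the local group */

	/* find_idlest_group() accounts for the waking task with +1 */
	int running = sum_nr_running + 1;

	/* Prints 0: the imbalance is refused even though each of the 8
	 * tasks could still have an LLC to itself; "<=" would allow it. */
	printf("%d\n", allow_numa_imbalance(running, imb_numa_nr));
	return 0;
}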
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek) @ 2022-02-04 16:45 ` Mel Gorman 0 siblings, 0 replies; 48+ messages in thread From: Mel Gorman @ 2022-02-04 16:45 UTC (permalink / raw) To: Nayak, KPrateek (K Prateek) Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy, LKML On Fri, Feb 04, 2022 at 08:37:53PM +0530, Nayak, KPrateek (K Prateek) wrote: > On 2/3/2022 8:16 PM, Mel Gorman wrote: > > @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest, > > * This is an approximation as the number of running tasks may not be > > * related to the number of busy CPUs due to sched_setaffinity. > > */ > > -static inline bool > > -allow_numa_imbalance(unsigned int running, unsigned int weight) > > +static inline bool allow_numa_imbalance(int running, int imb_numa_nr) > > { > > - return (running < (weight >> 2)); > > + return running < imb_numa_nr; > > } > > > > /* > > @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > > * allowed. If there is a real need of migration, > > * periodic load balance will take care of it. > > */ > > - if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight)) > > + if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr)) > > Could you please clarify why are we adding 1 to local_sgs.sum_nr_running while allowing imbalance? To account for the new task, similar to what task_numa_find_cpu does before calling adjust_numa_imbalance. > allow_numa_imbalance allows the imbalance based on the following inequality: > > running < imb_numa_nr > > Consider on a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain. > Assume the group is running 7 task and we are finding the idlest group for the 8th task: > > sd->imb_numa_nr = 8 > local_sgs.sum_nr_running = 7 > > In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr and if we allow NUMA imbalance > and place the task in the same group, each task can be given one LLC. > However, allow_numa_imbalance returns 0 for the above case and can lead to task being placed on a different > NUMA group. > > In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/), > the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/) > used the inequality "<=" to allow NUMA imbalance where we needed to consider the additional load CPU had > to bear. However that doesn't seem to be the case here. > I failed to change < to <= in allow_numa_imbalance, I'll fix and retest. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
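For reference, the one-character change described above would presumably end up looking something like the sketch below; this is only an illustration of the intended boundary behaviour, not the actual follow-up patch:

static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	/*
	 * With "<=", a group whose running count (including the waking
	 * task) exactly equals imb_numa_nr can still absorb the task,
	 * so every task can get its own LLC before spilling to the
	 * other node.
	 */
	return running <= imb_numa_nr;
}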
* [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs @ 2021-12-01 15:18 Mel Gorman 2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. The series addresses two problems -- inconsistent use of scheduler domain weights and sub-optimal performance when there are many LLCs per NUMA node. -- 2.31.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for " Mel Gorman @ 2021-12-01 15:18 ` Mel Gorman 2021-12-03 8:15 ` Barry Song 2021-12-04 10:40 ` Peter Zijlstra 0 siblings, 2 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch adjusts the imbalance on multi-LLC machines to allow an imbalance up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have on instance per LLC the results and without binding, the results are 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v3r1 MB/sec copy-16 166712.18 ( 0.00%) 587662.60 ( 252.50%) MB/sec scale-16 140109.66 ( 0.00%) 393528.14 ( 180.87%) MB/sec add-16 160791.18 ( 0.00%) 618622.00 ( 284.74%) MB/sec triad-16 160043.84 ( 0.00%) 589188.40 ( 268.14%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per instance, the vanilla kernel allows threads to contend on cache. 
With the patch; 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v3r1 Min Score-16 367816.09 ( 0.00%) 403429.15 ( 9.68%) Hmean Score-16 389627.78 ( 0.00%) 451015.49 * 15.76%* Max Score-16 416178.96 ( 0.00%) 480012.00 ( 15.34%) Stddev Score-16 17361.82 ( 0.00%) 32378.08 ( -86.49%) CoeffVar Score-16 4.45 ( 0.00%) 7.14 ( -60.57%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed 5.16.0-rc1 5.16.0-rc1 vanilla sched-numaimb-v3r1 Hmean tput-1 73743.05 ( 0.00%) 70258.27 * -4.73%* Hmean tput-8 563036.51 ( 0.00%) 591187.39 ( 5.00%) Hmean tput-16 1016590.61 ( 0.00%) 1032311.78 ( 1.55%) Hmean tput-24 1418558.41 ( 0.00%) 1424005.80 ( 0.38%) Hmean tput-32 1608794.22 ( 0.00%) 1907855.80 * 18.59%* Hmean tput-40 1761338.13 ( 0.00%) 2108162.23 * 19.69%* Hmean tput-48 2290646.54 ( 0.00%) 2214383.47 ( -3.33%) Hmean tput-56 2463345.12 ( 0.00%) 2780216.58 * 12.86%* Hmean tput-64 2650213.53 ( 0.00%) 2598196.66 ( -1.96%) Hmean tput-72 2497253.28 ( 0.00%) 2998882.47 * 20.09%* Hmean tput-80 2820786.72 ( 0.00%) 2951655.27 ( 4.64%) Hmean tput-88 2813541.68 ( 0.00%) 3045450.86 * 8.24%* Hmean tput-96 2604158.67 ( 0.00%) 3035311.91 * 16.56%* Hmean tput-104 2713810.62 ( 0.00%) 2984270.04 ( 9.97%) Hmean tput-112 2558425.37 ( 0.00%) 2894737.46 * 13.15%* Hmean tput-120 2611434.93 ( 0.00%) 2781661.01 ( 6.52%) Hmean tput-128 2706103.22 ( 0.00%) 2811447.85 ( 3.89%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 26 +++++++++++++++----------- kernel/sched/topology.c | 20 ++++++++++++++++++++ 3 files changed, 36 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c07bfa2d80f2..54f5207154d3 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a969affca76..64f211879e43 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest, } /* - * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. - * This is an approximation as the number of running tasks may not be - * related to the number of busy CPUs due to sched_setaffinity. 
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed + * imbalance. This is an approximation as the number of running + * tasks may not be related to the number of busy CPUs due to + * sched_setaffinity. */ -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) { - return (dst_running < (dst_weight >> 2)); + return dst_running < imb_numa_nr; } /* @@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) * a real need of migration, periodic load balance will * take care of it. */ - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) return NULL; } @@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd #define NUMA_IMBALANCE_MIN 2 static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; /* @@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, env->sd->span_weight); + busiest->sum_nr_running, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..fee2930745ab 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* Calculate allowed NUMA imbalance */ + for_each_cpu(i, cpu_map) { + int imb_numa_nr = 0; + + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + int nr_groups; + + nr_groups = sd->span_weight / child->span_weight; + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / + (nr_groups * num_online_nodes())); + } + + sd->imb_numa_nr = imb_numa_nr; + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
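To make the cutoff above concrete, the heuristic can be evaluated outside the kernel; the helper below simply re-expresses the max(1U, ...) calculation from the topology.c hunk, and the spans passed to it are hypothetical rather than taken from a particular machine:

#include <stdio.h>

/* v3 heuristic for a domain sd_span CPUs wide whose child LLC domain
 * is llc_span CPUs wide. */
static unsigned int v3_imb_numa_nr(unsigned int sd_span, unsigned int llc_span,
				   unsigned int nr_online_nodes)
{
	unsigned int nr_groups = sd_span / llc_span;
	unsigned int imb = (llc_span >> 1) / (nr_groups * nr_online_nodes);

	return imb > 1 ? imb : 1;
}

int main(void)
{
	/* 2 nodes, one 20-CPU LLC per node, sd spanning both nodes:
	 * (20 >> 1) / (2 * 2) = 2 */
	printf("%u\n", v3_imb_numa_nr(40, 20, 2));

	/* 2 nodes, eight 8-CPU LLCs per node, sd spanning one node:
	 * (8 >> 1) / (8 * 2) = 0, clamped to 1 */
	printf("%u\n", v3_imb_numa_nr(64, 8, 2));
	return 0;
}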
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman @ 2021-12-03 8:15 ` Barry Song 2021-12-03 10:50 ` Mel Gorman 2021-12-04 10:40 ` Peter Zijlstra 1 sibling, 1 reply; 48+ messages in thread From: Barry Song @ 2021-12-03 8:15 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote: > > Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA > nodes") allowed an imbalance between NUMA nodes such that communicating > tasks would not be pulled apart by the load balancer. This works fine when > there is a 1:1 relationship between LLC and node but can be suboptimal > for multiple LLCs if independent tasks prematurely use CPUs sharing cache. > > Zen* has multiple LLCs per node with local memory channels and due to > the allowed imbalance, it's far harder to tune some workloads to run > optimally than it is on hardware that has 1 LLC per node. This patch > adjusts the imbalance on multi-LLC machines to allow an imbalance up to > the point where LLCs should be balanced between nodes. > > On a Zen3 machine running STREAM parallelised with OMP to have on instance > per LLC the results and without binding, the results are > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v3r1 > MB/sec copy-16 166712.18 ( 0.00%) 587662.60 ( 252.50%) > MB/sec scale-16 140109.66 ( 0.00%) 393528.14 ( 180.87%) > MB/sec add-16 160791.18 ( 0.00%) 618622.00 ( 284.74%) > MB/sec triad-16 160043.84 ( 0.00%) 589188.40 ( 268.14%) > > STREAM can use directives to force the spread if the OpenMP is new > enough but that doesn't help if an application uses threads and > it's not known in advance how many threads will be created. > > Coremark is a CPU and cache intensive benchmark parallelised with > threads. When running with 1 thread per instance, the vanilla kernel > allows threads to contend on cache. 
With the patch; > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v3r1 > Min Score-16 367816.09 ( 0.00%) 403429.15 ( 9.68%) > Hmean Score-16 389627.78 ( 0.00%) 451015.49 * 15.76%* > Max Score-16 416178.96 ( 0.00%) 480012.00 ( 15.34%) > Stddev Score-16 17361.82 ( 0.00%) 32378.08 ( -86.49%) > CoeffVar Score-16 4.45 ( 0.00%) 7.14 ( -60.57%) > > It can also make a big difference for semi-realistic workloads > like specjbb which can execute arbitrary numbers of threads without > advance knowledge of how they should be placed > > 5.16.0-rc1 5.16.0-rc1 > vanilla sched-numaimb-v3r1 > Hmean tput-1 73743.05 ( 0.00%) 70258.27 * -4.73%* > Hmean tput-8 563036.51 ( 0.00%) 591187.39 ( 5.00%) > Hmean tput-16 1016590.61 ( 0.00%) 1032311.78 ( 1.55%) > Hmean tput-24 1418558.41 ( 0.00%) 1424005.80 ( 0.38%) > Hmean tput-32 1608794.22 ( 0.00%) 1907855.80 * 18.59%* > Hmean tput-40 1761338.13 ( 0.00%) 2108162.23 * 19.69%* > Hmean tput-48 2290646.54 ( 0.00%) 2214383.47 ( -3.33%) > Hmean tput-56 2463345.12 ( 0.00%) 2780216.58 * 12.86%* > Hmean tput-64 2650213.53 ( 0.00%) 2598196.66 ( -1.96%) > Hmean tput-72 2497253.28 ( 0.00%) 2998882.47 * 20.09%* > Hmean tput-80 2820786.72 ( 0.00%) 2951655.27 ( 4.64%) > Hmean tput-88 2813541.68 ( 0.00%) 3045450.86 * 8.24%* > Hmean tput-96 2604158.67 ( 0.00%) 3035311.91 * 16.56%* > Hmean tput-104 2713810.62 ( 0.00%) 2984270.04 ( 9.97%) > Hmean tput-112 2558425.37 ( 0.00%) 2894737.46 * 13.15%* > Hmean tput-120 2611434.93 ( 0.00%) 2781661.01 ( 6.52%) > Hmean tput-128 2706103.22 ( 0.00%) 2811447.85 ( 3.89%) > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > --- > include/linux/sched/topology.h | 1 + > kernel/sched/fair.c | 26 +++++++++++++++----------- > kernel/sched/topology.c | 20 ++++++++++++++++++++ > 3 files changed, 36 insertions(+), 11 deletions(-) > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h > index c07bfa2d80f2..54f5207154d3 100644 > --- a/include/linux/sched/topology.h > +++ b/include/linux/sched/topology.h > @@ -93,6 +93,7 @@ struct sched_domain { > unsigned int busy_factor; /* less balancing by factor if busy */ > unsigned int imbalance_pct; /* No balance until over watermark */ > unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ > + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ > > int nohz_idle; /* NOHZ IDLE status */ > int flags; /* See SD_* */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0a969affca76..64f211879e43 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1489,6 +1489,7 @@ struct task_numa_env { > > int src_cpu, src_nid; > int dst_cpu, dst_nid; > + int imb_numa_nr; > > struct numa_stats src_stats, dst_stats; > > @@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, > dst_running = env->dst_stats.nr_running + 1; > imbalance = max(0, dst_running - src_running); > imbalance = adjust_numa_imbalance(imbalance, dst_running, > - env->dst_stats.weight); > + env->imb_numa_nr); > > /* Use idle CPU if there is no imbalance */ > if (!imbalance) { > @@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p) > */ > rcu_read_lock(); > sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); > - if (sd) > + if (sd) { > env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; > + env.imb_numa_nr = sd->imb_numa_nr; > + } > rcu_read_unlock(); > > /* > @@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest, > } > > /* > - * Allow a NUMA imbalance if busy CPUs is less than 25% of the 
domain. > - * This is an approximation as the number of running tasks may not be > - * related to the number of busy CPUs due to sched_setaffinity. > + * Allow a NUMA imbalance if busy CPUs is less than the allowed > + * imbalance. This is an approximation as the number of running > + * tasks may not be related to the number of busy CPUs due to > + * sched_setaffinity. > */ > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) > { > - return (dst_running < (dst_weight >> 2)); > + return dst_running < imb_numa_nr; > } > > /* > @@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) > * a real need of migration, periodic load balance will > * take care of it. > */ > - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) > + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) > return NULL; > } > > @@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > #define NUMA_IMBALANCE_MIN 2 > > static inline long adjust_numa_imbalance(int imbalance, > - int dst_running, int dst_weight) > + int dst_running, int imb_numa_nr) > { > - if (!allow_numa_imbalance(dst_running, dst_weight)) > + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) > return imbalance; > > /* > @@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > /* Consider allowing a small imbalance between NUMA groups */ > if (env->sd->flags & SD_NUMA) { > env->imbalance = adjust_numa_imbalance(env->imbalance, > - busiest->sum_nr_running, env->sd->span_weight); > + busiest->sum_nr_running, env->sd->imb_numa_nr); > } > > return; > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > index d201a7052a29..fee2930745ab 100644 > --- a/kernel/sched/topology.c > +++ b/kernel/sched/topology.c > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > } > } > > + /* Calculate allowed NUMA imbalance */ > + for_each_cpu(i, cpu_map) { > + int imb_numa_nr = 0; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + int nr_groups; > + > + nr_groups = sd->span_weight / child->span_weight; > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > + (nr_groups * num_online_nodes())); Hi Mel, you used to have 25% * numa_weight if node has only one LLC. for a system with 4 numa, In case sd has 2 nodes, child is 1 numa node, then nr_groups=2, num_online_nodes()=4, imb_numa_nr will be child->span_weight/2/2/4? Does this patch change the behaviour for machines whose numa equals LLC? > + } > + > + sd->imb_numa_nr = imb_numa_nr; > + } > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) > -- > 2.31.1 > Thanks Barry ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-03 8:15 ` Barry Song @ 2021-12-03 10:50 ` Mel Gorman 2021-12-03 11:14 ` Barry Song 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-03 10:50 UTC (permalink / raw) To: Barry Song Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote: > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > index d201a7052a29..fee2930745ab 100644 > > --- a/kernel/sched/topology.c > > +++ b/kernel/sched/topology.c > > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > } > > } > > > > + /* Calculate allowed NUMA imbalance */ > > + for_each_cpu(i, cpu_map) { > > + int imb_numa_nr = 0; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > + int nr_groups; > > + > > + nr_groups = sd->span_weight / child->span_weight; > > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > > + (nr_groups * num_online_nodes())); > > Hi Mel, you used to have 25% * numa_weight if node has only one LLC. > for a system with 4 numa, In case sd has 2 nodes, child is 1 numa node, > then nr_groups=2, num_online_nodes()=4, imb_numa_nr will be > child->span_weight/2/2/4? > > Does this patch change the behaviour for machines whose numa equals LLC? > Yes, it changes behaviour. Instead of a flat 25%, it takes into account the number of LLCs per node and the number of nodes overall. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
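The change in behaviour can be seen by comparing the two admission rules side by side; in the sketch below the span and the resulting imb_numa_nr are illustrative values only, the point being that the new cutoff is no longer a flat 25% of the domain span:

#include <stdbool.h>
#include <stdio.h>

/* Old rule: allow the imbalance while busy CPUs < 25% of the domain. */
static bool old_rule(int running, int span_weight)
{
	return running < (span_weight >> 2);
}

/* New rule: allow it while busy CPUs < the precomputed imb_numa_nr. */
static bool new_rule(int running, int imb_numa_nr)
{
	return running < imb_numa_nr;
}

int main(void)
{
	int span_weight = 40;	/* hypothetical 2-node, 1-LLC-per-node box */
	int imb_numa_nr = 2;	/* illustrative value from the new heuristic */

	for (int running = 1; running <= 10; running++)
		printf("running=%2d old=%d new=%d\n", running,
		       old_rule(running, span_weight),
		       new_rule(running, imb_numa_nr));
	return 0;
}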
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-03 10:50 ` Mel Gorman @ 2021-12-03 11:14 ` Barry Song 2021-12-03 13:27 ` Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Barry Song @ 2021-12-03 11:14 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Fri, Dec 3, 2021 at 11:50 PM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote: > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c > > > index d201a7052a29..fee2930745ab 100644 > > > --- a/kernel/sched/topology.c > > > +++ b/kernel/sched/topology.c > > > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att > > > } > > > } > > > > > > + /* Calculate allowed NUMA imbalance */ > > > + for_each_cpu(i, cpu_map) { > > > + int imb_numa_nr = 0; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > + int nr_groups; > > > + > > > + nr_groups = sd->span_weight / child->span_weight; > > > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > > > + (nr_groups * num_online_nodes())); > > > > Hi Mel, you used to have 25% * numa_weight if node has only one LLC. > > for a system with 4 numa, In case sd has 2 nodes, child is 1 numa node, > > then nr_groups=2, num_online_nodes()=4, imb_numa_nr will be > > child->span_weight/2/2/4? > > > > Does this patch change the behaviour for machines whose numa equals LLC? > > > > Yes, it changes behaviour. Instead of a flat 25%, it takes into account > the number of LLCs per node and the number of nodes overall. Considering the number of nodes overall seems to be quite weird to me. for example, for the below machines 1P * 2DIE = 2NUMA: node1 - node0 2P * 2DIE = 4NUMA: node1 - node0 ------ node2 - node3 4P * 2DIE = 8NUMA: node1 - node0 ------ node2 - node3 node5 - node4 ------ node6 - node7 if one service pins node1 and node0 in all above configurations, it seems in all different machines, the app will result in different behavior. the other example is: in a 2P machine, if one app pins the first two NUMAs, the other app pins the last two NUMAs, why would the num_online_nodes() matter to them? there is no balance requirement between the two P. > > -- > Mel Gorman > SUSE Labs Thanks Barry ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-03 11:14 ` Barry Song @ 2021-12-03 13:27 ` Mel Gorman 0 siblings, 0 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-03 13:27 UTC (permalink / raw) To: Barry Song Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Sat, Dec 04, 2021 at 12:14:33AM +1300, Barry Song wrote: > > > Hi Mel, you used to have 25% * numa_weight if node has only one LLC. > > > for a system with 4 numa, In case sd has 2 nodes, child is 1 numa node, > > > then nr_groups=2, num_online_nodes()=4, imb_numa_nr will be > > > child->span_weight/2/2/4? > > > > > > Does this patch change the behaviour for machines whose numa equals LLC? > > > > > > > Yes, it changes behaviour. Instead of a flat 25%, it takes into account > > the number of LLCs per node and the number of nodes overall. > > Considering the number of nodes overall seems to be quite weird to me. > for example, for the below machines > > 1P * 2DIE = 2NUMA: node1 - node0 > 2P * 2DIE = 4NUMA: node1 - node0 ------ node2 - node3 > 4P * 2DIE = 8NUMA: node1 - node0 ------ node2 - node3 > node5 - node4 ------ node6 - node7 > > if one service pins node1 and node0 in all above configurations, it seems in all > different machines, the app will result in different behavior. > The intent is to balance between LLCs across the whole machine, hence accounting for the number of online nodes. > the other example is: > in a 2P machine, if one app pins the first two NUMAs, the other app pins > the last two NUMAs, why would the num_online_nodes() matter to them? > there is no balance requirement between the two P. > The previous 25% imbalance also did not take pinning into account and the choice was somewhat arbitrary. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2021-12-03 8:15 ` Barry Song @ 2021-12-04 10:40 ` Peter Zijlstra 2021-12-06 8:48 ` Gautham R. Shenoy 2021-12-06 15:12 ` Mel Gorman 1 sibling, 2 replies; 48+ messages in thread From: Peter Zijlstra @ 2021-12-04 10:40 UTC (permalink / raw) To: Mel Gorman Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote: > + /* Calculate allowed NUMA imbalance */ > + for_each_cpu(i, cpu_map) { > + int imb_numa_nr = 0; > + > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > + struct sched_domain *child = sd->child; > + > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > + int nr_groups; > + > + nr_groups = sd->span_weight / child->span_weight; > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > + (nr_groups * num_online_nodes())); > + } > + > + sd->imb_numa_nr = imb_numa_nr; > + } OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have imb_numa_nr = 0, all domains above it will have the same value calculated here. So far so good I suppose :-) Then nr_groups is what it says on the tin; we could've equally well iterated sd->groups and gotten the same number, but this is simpler. Now, imb_numa_nr is where the magic happens, the way it's written doesn't help, but it's something like: (child->span_weight / 2) / (nr_groups * num_online_nodes()) With a minimum value of 1. So the larger the system is, or the smaller the LLCs, the smaller this number gets, right? So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2) / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5. But a Zen box that has only like 4 CPUs per LLC will have 1, regardless of how many nodes it has. Now, I'm thinking this assumes (fairly reasonable) that the level above LLC is a node, but I don't think we need to assume this, while also not assuming the balance domain spans the whole machine (yay paritions!). for (top = sd; top->parent; top = top->parent) ; nr_llcs = top->span_weight / child->span_weight; imb_numa_nr = max(1, child->span_weight / nr_llcs); which for my ivb-ep gets me: 20 / (40 / 20) = 10 and the Zen system will have: 4 / (huge number) = 1 Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we can also write the above as: (child->span_weight * child->span_weight) / top->span_weight; Hmm? > + } > + > /* Calculate CPU capacity for physical packages and nodes */ > for (i = nr_cpumask_bits-1; i >= 0; i--) { > if (!cpumask_test_cpu(i, cpu_map)) > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 48+ messages in thread
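The algebraic step above is easy to sanity-check; since the top-level span is a whole multiple of the LLC span for the topologies being discussed, the two forms agree even under integer division. A quick sketch with illustrative spans:

#include <stdio.h>

int main(void)
{
	/* { llc span (a), top-level span (b) }, b a multiple of a */
	unsigned int pairs[][2] = { { 20, 40 }, { 20, 80 }, { 16, 256 } };

	for (unsigned int i = 0; i < 3; i++) {
		unsigned int a = pairs[i][0], b = pairs[i][1];

		/* a / (b / a) versus a * a / b */
		printf("a=%3u b=%3u  a/(b/a)=%2u  a*a/b=%2u\n",
		       a, b, a / (b / a), a * a / b);
	}
	return 0;
}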
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-04 10:40 ` Peter Zijlstra @ 2021-12-06 8:48 ` Gautham R. Shenoy 2021-12-06 14:51 ` Peter Zijlstra 2021-12-06 15:12 ` Mel Gorman 1 sibling, 1 reply; 48+ messages in thread From: Gautham R. Shenoy @ 2021-12-06 8:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Sadagopan Srinivasan, Krupa Ramakrishnan Hello Peter, Mel, On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote: > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote: > > + /* Calculate allowed NUMA imbalance */ > > + for_each_cpu(i, cpu_map) { > > + int imb_numa_nr = 0; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > + int nr_groups; > > + > > + nr_groups = sd->span_weight / child->span_weight; > > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > > + (nr_groups * num_online_nodes())); > > + } > > + > > + sd->imb_numa_nr = imb_numa_nr; > > + } > > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have > imb_numa_nr = 0, all domains above it will have the same value > calculated here. > > So far so good I suppose :-) Well, we will still have the same imb_numa_nr set for different NUMA domains which have different distances! > > Then nr_groups is what it says on the tin; we could've equally well > iterated sd->groups and gotten the same number, but this is simpler. > > Now, imb_numa_nr is where the magic happens, the way it's written > doesn't help, but it's something like: > > (child->span_weight / 2) / (nr_groups * num_online_nodes()) > > With a minimum value of 1. So the larger the system is, or the smaller > the LLCs, the smaller this number gets, right? > > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2) > / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5. > > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless > of how many nodes it has. That's correct. On a Zen3 box with 2 sockets with 64 cores per sockets, we can configure it with either 1/2/4 Nodes Per Socket (NPS). The imb_numa_nr value for each of the NPS configurations is as follows: NPS1 : ~~~~~~~~ SMT [span_wt=2] --> MC [span_wt=16, LLC] --> DIE[span_wt=128] --> NUMA [span_wt=256, SD_NUMA] sd->span = 128, child->span = 16, nr_groups = 8, num_online_nodes() = 2 imb_numa_nr = max(1, (16 >> 1)/(8*2)) = max(1, 0.5) = 1. NPS2 : ~~~~~~~~ SMT [span_wt=2] --> MC [span_wt=16,LLC] --> NODE[span_wt=64] --> NUMA [span_wt=128, SD_NUMA] --> NUMA [span_wt=256, SD_NUMA] sd->span = 64, child->span = 16, nr_groups = 4, num_online_nodes() = 4 imb_numa_nr = max(1, (16 >> 1)/(4*4)) = max(1, 0.5) = 1. NPS 4: ~~~~~~~ SMT [span_wt=2] --> MC [span_wt=16, LLC] --> NODE [span_wt=32] --> NUMA [span_wt=128, SD_NUMA] --> NUMA [span_wt=256, SD_NUMA] sd->span = 32, child->span = 16, nr_groups = 2, num_online_nodes() = 8 imb_numa_nr = max(1, (16 >> 1)/(2*8)) = max(1, 0.5) = 1. While the imb_numa_nr = 1 is good for the NUMA domain within a socket (the lower NUMA domains in in NPS2 and NPS4 modes), it appears to be a little bit aggressive for the NUMA domain spanning the two sockets. 
If we have only a pair of communicating tasks in a socket, we will end up spreading them across the two sockets with this patch. > > Now, I'm thinking this assumes (fairly reasonable) that the level above > LLC is a node, but I don't think we need to assume this, while also not > assuming the balance domain spans the whole machine (yay paritions!). > > for (top = sd; top->parent; top = top->parent) > ; > > nr_llcs = top->span_weight / child->span_weight; > imb_numa_nr = max(1, child->span_weight / nr_llcs); > > which for my ivb-ep gets me: 20 / (40 / 20) = 10 > and the Zen system will have: 4 / (huge number) = 1 > > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we > can also write the above as: > > (child->span_weight * child->span_weight) / top->span_weight; Assuming that "child" here refers to the LLC domain, on Zen3 we would have (a) child->span_weight = 16. (b) top->span_weight = 256. So we get a^2/b = 1. > > Hmm? Last week, I tried a modification on top of Mel's current patch where we spread tasks between the LLCs of the groups within each NUMA domain and compute the value of imb_numa_nr per NUMA domain. The idea is to set sd->imb_numa_nr = min(1U, (Number of LLCs in each sd group / Number of sd groups)) This won't work for processors which have a single LLC in a socket, since the sd->imb_numa_nr will be 1 which is probably too low. FWIW, with this heuristic, the imb_numa_nr across the different NPS configurations of a Zen3 server is as follows NPS1: NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4 NPS2: 1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2. 2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4. NPS4: 1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1. 2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4. Thus, at the highest NUMA level (socket), we don't spread across the two sockets until there are 4 tasks within the socket. If there is only a pair of communicating tasks in the socket, they will be left alone within that socket. The stream numbers (average of 10 runs. The following are Triad numbers. The Copy, Scale and Add numbers have the same trend) are presented in the table below. We do see some degradation for the 4 thread case in NPS2 and NPS4 modes with the aforementioned approach, but there are gains as well for 16 and 32 thread case on NPS4 mode. 
NPS1: ==========+===========+================+================= | Nr | Mel v3 | tip/sched/core | Spread across| | Stream | | | LLCs of NUMA | | Threads | | | groups | ==========+===========+================+================= | 4 | 111106.14 | 94849.77 | 111820.02 | | | | (-14.63%) | (+0.64%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 8 | 175633.00 | 128268.22 | 189705.48 | | | | (-26.97%) | (+8.01%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 16 | 252812.87 | 136745.98 | 255577.34 | | | | (-45.91%) | (+1.09%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 32 | 248198.43 | 130120.30 | 253266.86 | | | | (-47.57%) | (+2.04%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 64 | 244202.33 | 133773.03 | 249449.53 | | | | (-45.22%) | (+2.15%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 128 | 248459.85 | 249450.61 | 250346.09 | | | | (+0.40%) | (+0.76%) | ~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ NPS2: ==========+===========+================+================= | Nr | Mel v3 | tip/sched/core | Spread across| | Stream | | | LLCs of NUMA | | Threads | | | groups | ==========+===========+================+================= | 4 | 110888.35 | 63067.26 | 104971.36 | | | | (-43.12%) | (-5.34%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 8 | 174983.85 | 96226.39 | 177558.65 | | | | (-45.01%) | (+1.47%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 16 | 252943.21 | 106474.3 | 260749.60 | | | | (-57.90%) | (+1.47%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 32 | 248540.52 | 113864.09 | 254141.33 | | | | (-54.19%) | (+2.25%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 64 | 248383.17 | 137101.85 | 255018.52 | | | | (-44.80%) | (+2.67%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 128 | 250123.31 | 257031.29 | 254457.13 | | | | (+2.76%) | (+1.73%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ NPS4: ==========+===========+================+================= | Nr | Mel v3 | tip/sched/core | Spread across| | Stream | | | LLCs of NUMA | | Threads | | | groups | ==========+===========+================+================= | 4 | 108580.91 | 31746.06 | 97585.53 | | | | (-70.76%) | (-10.12%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 8 | 150259.94 | 64841.89 | 154954.75 | | | | (-56.84%) | (+3.12%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 16 | 234137.41 | 106780.26 | 261005.27 | | | | (-54.39%) | (+11.48%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 32 | 241583.06 | 147572.50 | 257004.22 | | | | (-38.91%) | (+6.38%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 64 | 248511.64 | 166183.06 | 259599.32 | | | | (-33.12%) | (+4.46%) | ~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ | 128 | 252227.34 | 239270.85 | 259117.18 | | | | (-5.13%) | (2.73%) | ~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+ -- Thanks and Regards gautham. ^ permalink raw reply [flat|nested] 48+ messages in thread
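Expressed as code, the per-domain heuristic sketched above might look like the following; note that the worked NPS values in this mail imply max() rather than the min() written in the prose (as pointed out in the reply below), and the group/LLC counts fed in are just the NPS1/NPS2/NPS4 figures quoted above:

#include <stdio.h>

/* Spread across the LLCs of each group in the NUMA domain, never
 * dropping below 1. */
static unsigned int imb_numa_nr(unsigned int llcs_per_group,
				unsigned int nr_groups)
{
	unsigned int imb = llcs_per_group / nr_groups;

	return imb > 1 ? imb : 1;
}

int main(void)
{
	printf("NPS1 socket domain:   %u\n", imb_numa_nr(8, 2));	/* 4 */
	printf("NPS2 1st NUMA domain: %u\n", imb_numa_nr(4, 2));	/* 2 */
	printf("NPS2 2nd NUMA domain: %u\n", imb_numa_nr(8, 2));	/* 4 */
	printf("NPS4 1st NUMA domain: %u\n", imb_numa_nr(2, 4));	/* 1 */
	printf("NPS4 2nd NUMA domain: %u\n", imb_numa_nr(8, 2));	/* 4 */
	return 0;
}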
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-06 8:48 ` Gautham R. Shenoy @ 2021-12-06 14:51 ` Peter Zijlstra 0 siblings, 0 replies; 48+ messages in thread From: Peter Zijlstra @ 2021-12-06 14:51 UTC (permalink / raw) To: Gautham R. Shenoy Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Sadagopan Srinivasan, Krupa Ramakrishnan On Mon, Dec 06, 2021 at 02:18:21PM +0530, Gautham R. Shenoy wrote: > On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote: > > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote: > > > + /* Calculate allowed NUMA imbalance */ > > > + for_each_cpu(i, cpu_map) { > > > + int imb_numa_nr = 0; > > > + > > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > > + struct sched_domain *child = sd->child; > > > + > > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > > + int nr_groups; > > > + > > > + nr_groups = sd->span_weight / child->span_weight; > > > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > > > + (nr_groups * num_online_nodes())); > > > + } > > > + > > > + sd->imb_numa_nr = imb_numa_nr; > > > + } > > > > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have > > imb_numa_nr = 0, all domains above it will have the same value > > calculated here. > > > > So far so good I suppose :-) > > Well, we will still have the same imb_numa_nr set for different NUMA > domains which have different distances! Fair enough; that would need making the computation depends on more thing, but that shouldn't be too hard. > > Then nr_groups is what it says on the tin; we could've equally well > > iterated sd->groups and gotten the same number, but this is simpler. > > > > Now, imb_numa_nr is where the magic happens, the way it's written > > doesn't help, but it's something like: > > > > (child->span_weight / 2) / (nr_groups * num_online_nodes()) > > > > With a minimum value of 1. So the larger the system is, or the smaller > > the LLCs, the smaller this number gets, right? > > > > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2) > > / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5. > > > > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless > > of how many nodes it has. > > That's correct. On a Zen3 box with 2 sockets with 64 cores per > sockets, we can configure it with either 1/2/4 Nodes Per Socket > (NPS). The imb_numa_nr value for each of the NPS configurations is as > follows: Cute; that's similar to the whole Intel sub-numa-cluster stuff then; perhaps update the comment that goes with x86_has_numa_in_package ? Currently that only mentions AMD Magny-Cours which is a few generations ago. > NPS 4: > ~~~~~~~ > SMT [span_wt=2] > --> MC [span_wt=16, LLC] > --> NODE [span_wt=32] > --> NUMA [span_wt=128, SD_NUMA] > --> NUMA [span_wt=256, SD_NUMA] OK, so at max nodes you still have at least 2 LLCs per node. > While the imb_numa_nr = 1 is good for the NUMA domain within a socket > (the lower NUMA domains in in NPS2 and NPS4 modes), it appears to be a > little bit aggressive for the NUMA domain spanning the two sockets. If > we have only a pair of communicating tasks in a socket, we will end up > spreading them across the two sockets with this patch. 
> > > > > Now, I'm thinking this assumes (fairly reasonable) that the level above > > LLC is a node, but I don't think we need to assume this, while also not > > assuming the balance domain spans the whole machine (yay paritions!). > > > > for (top = sd; top->parent; top = top->parent) > > ; > > > > nr_llcs = top->span_weight / child->span_weight; > > imb_numa_nr = max(1, child->span_weight / nr_llcs); > > > > which for my ivb-ep gets me: 20 / (40 / 20) = 10 > > and the Zen system will have: 4 / (huge number) = 1 > > > > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we > > can also write the above as: > > > > (child->span_weight * child->span_weight) / top->span_weight; > > > Assuming that "child" here refers to the LLC domain, on Zen3 we would have > (a) child->span_weight = 16. (b) top->span_weight = 256. > > So we get a^2/b = 1. Yes, it would be in the same place as the current imb_numa_nr calculation, so child would be the largest domain having SHARE_PKG_RESOURCES, aka. LLC. > Last week, I tried a modification on top of Mel's current patch where > we spread tasks between the LLCs of the groups within each NUMA domain > and compute the value of imb_numa_nr per NUMA domain. The idea is to set > > sd->imb_numa_nr = min(1U, > (Number of LLCs in each sd group / Number of sd groups)) s/min/max/ Which is basically something like: for_each (sd in NUMA): llc_per_group = child->span / llc->span; nr_group = sd->span / child->span; imb = max(1, llc_per_group / nr_group); > This won't work for processors which have a single LLC in a socket, > since the sd->imb_numa_nr will be 1 which is probably too low. Right. > FWIW, > with this heuristic, the imb_numa_nr across the different NPS > configurations of a Zen3 server is as follows > > NPS1: > NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4 > > NPS2: > 1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2. > 2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4. > > NPS4: > 1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1. > 2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4. > > Thus, at the highest NUMA level (socket), we don't spread across the > two sockets until there are 4 tasks within the socket. If there is > only a pair of communicating tasks in the socket, they will be left > alone within that socket. Something that might work: imb = 0; imb_span = 1; for_each_sd(sd) { child = sd->child; if (!(sd->flags & SD_SPR) && child && (child->flags & SD_SPR)) { imb = /* initial magic */ imb_span = sd->span; sd->imb = imb; } else if (imb) { sd->imb = imb * (sd->span / imb_span); } } Where we calculate the initial imbalance for the LLC boundary, and then increase that for subsequent domains based on how often that boundary sd fits in it. That gives the same progression you have, but also works for NODE==LLC I think. ^ permalink raw reply [flat|nested] 48+ messages in thread
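A rough standalone model of the progression suggested above: the imbalance is fixed once at the LLC boundary and each larger domain then scales it by how many times that boundary domain fits inside it. The boundary value and the spans below are illustrative, not what any posted patch computes:

#include <stdio.h>

/* Scale the boundary imbalance by how often the boundary domain fits
 * into a larger domain, as in the else-branch of the pseudocode. */
static unsigned int scaled_imb(unsigned int imb, unsigned int imb_span,
			       unsigned int sd_span)
{
	return imb * (sd_span / imb_span);
}

int main(void)
{
	unsigned int imb = 1;		/* illustrative boundary value */
	unsigned int imb_span = 32;	/* span of the LLC-boundary domain */
	unsigned int spans[] = { 128, 256 };	/* higher NUMA domain spans */

	printf("boundary (span %u): imb=%u\n", imb_span, imb);
	for (int i = 0; i < 2; i++)
		printf("NUMA span %3u:      imb=%u\n", spans[i],
		       scaled_imb(imb, imb_span, spans[i]));
	return 0;
}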
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-04 10:40 ` Peter Zijlstra 2021-12-06 8:48 ` Gautham R. Shenoy @ 2021-12-06 15:12 ` Mel Gorman 2021-12-09 14:23 ` Valentin Schneider 1 sibling, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-12-06 15:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy, LKML On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote: > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote: > > + /* Calculate allowed NUMA imbalance */ > > + for_each_cpu(i, cpu_map) { > > + int imb_numa_nr = 0; > > + > > + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > + struct sched_domain *child = sd->child; > > + > > + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > + (child->flags & SD_SHARE_PKG_RESOURCES)) { > > + int nr_groups; > > + > > + nr_groups = sd->span_weight / child->span_weight; > > + imb_numa_nr = max(1U, ((child->span_weight) >> 1) / > > + (nr_groups * num_online_nodes())); > > + } > > + > > + sd->imb_numa_nr = imb_numa_nr; > > + } > > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have > imb_numa_nr = 0, all domains above it will have the same value > calculated here. > > So far so good I suppose :-) > Good start :) > Then nr_groups is what it says on the tin; we could've equally well > iterated sd->groups and gotten the same number, but this is simpler. > I also thought it would be clearer. > Now, imb_numa_nr is where the magic happens, the way it's written > doesn't help, but it's something like: > > (child->span_weight / 2) / (nr_groups * num_online_nodes()) > > With a minimum value of 1. So the larger the system is, or the smaller > the LLCs, the smaller this number gets, right? > Correct. > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2) > / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5. > > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless > of how many nodes it has. > The minimum of one was to allow a pair of communicating tasks to remain on one node even if it's imbalacnced. > Now, I'm thinking this assumes (fairly reasonable) that the level above > LLC is a node, but I don't think we need to assume this, while also not > assuming the balance domain spans the whole machine (yay paritions!). > > for (top = sd; top->parent; top = top->parent) > ; > > nr_llcs = top->span_weight / child->span_weight; > imb_numa_nr = max(1, child->span_weight / nr_llcs); > > which for my ivb-ep gets me: 20 / (40 / 20) = 10 > and the Zen system will have: 4 / (huge number) = 1 > > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we > can also write the above as: > > (child->span_weight * child->span_weight) / top->span_weight; > Gautham had similar reasoning to calculate the imbalance at each higher-level domain instead of using a static value throughout and it does make sense. For each level and splitting the imbalance between two domains, this works out as /* * Calculate an allowed NUMA imbalance such that LLCs do not get * imbalanced. 
*/ for_each_cpu(i, cpu_map) { for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { struct sched_domain *child = sd->child; if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && (child->flags & SD_SHARE_PKG_RESOURCES)) { struct sched_domain *top = sd; unsigned int llc_sq; /* * nr_llcs = (top->span_weight / llc_weight); * imb = (child_weight / nr_llcs) >> 1 * * is equivalent to * * imb = (llc_weight^2 / top->span_weight) >> 1 * */ llc_sq = child->span_weight * child->span_weight; while (top) { top->imb_numa_nr = max(1U, (llc_sq / top->span_weight) >> 1); top = top->parent; } break; } } } I'll test this and should have results tomorrow. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
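For what it's worth, the (llc_weight^2 / top->span_weight) >> 1 form above works out as follows for a few illustrative span combinations (chosen only to show the shape of the cutoff, not measured on any machine):

#include <stdio.h>

static unsigned int imb(unsigned int llc_weight, unsigned int top_weight)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int v = (llc_sq / top_weight) >> 1;

	return v > 1 ? v : 1;
}

int main(void)
{
	/* one 20-CPU LLC per node, 2 nodes: (400 / 40) >> 1 = 5 */
	printf("%u\n", imb(20, 40));
	/* 16-CPU LLCs under a 128-CPU domain: (256 / 128) >> 1 = 1 */
	printf("%u\n", imb(16, 128));
	/* 16-CPU LLCs under a 256-CPU domain: (256 / 256) >> 1 = 0 -> 1 */
	printf("%u\n", imb(16, 256));
	return 0;
}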
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-06 15:12 ` Mel Gorman @ 2021-12-09 14:23 ` Valentin Schneider 2021-12-09 15:43 ` Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Valentin Schneider @ 2021-12-09 14:23 UTC (permalink / raw) To: Mel Gorman, Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy, LKML On 06/12/21 15:12, Mel Gorman wrote: > Gautham had similar reasoning to calculate the imbalance at each > higher-level domain instead of using a static value throughout and > it does make sense. For each level and splitting the imbalance between > two domains, this works out as > > > /* > * Calculate an allowed NUMA imbalance such that LLCs do not get > * imbalanced. > */ > for_each_cpu(i, cpu_map) { > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > struct sched_domain *child = sd->child; > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > (child->flags & SD_SHARE_PKG_RESOURCES)) { > struct sched_domain *top = sd; > unsigned int llc_sq; > > /* > * nr_llcs = (top->span_weight / llc_weight); > * imb = (child_weight / nr_llcs) >> 1 > * > * is equivalent to > * > * imb = (llc_weight^2 / top->span_weight) >> 1 > * > */ > llc_sq = child->span_weight * child->span_weight; > while (top) { > top->imb_numa_nr = max(1U, > (llc_sq / top->span_weight) >> 1); > top = top->parent; > } > > break; > } > } > } > IIRC Peter suggested punting that logic to before domains get degenerated, but I don't see how that helps here. If you just want to grab the LLC domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare its span with that of its parents, that can happen after the degeneration, no? > I'll test this and should have results tomorrow. > > -- > Mel Gorman > SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-12-09 14:23 ` Valentin Schneider @ 2021-12-09 15:43 ` Mel Gorman 0 siblings, 0 replies; 48+ messages in thread From: Mel Gorman @ 2021-12-09 15:43 UTC (permalink / raw) To: Valentin Schneider Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy, LKML On Thu, Dec 09, 2021 at 02:23:40PM +0000, Valentin Schneider wrote: > On 06/12/21 15:12, Mel Gorman wrote: > > Gautham had similar reasoning to calculate the imbalance at each > > higher-level domain instead of using a static value throughout and > > it does make sense. For each level and splitting the imbalance between > > two domains, this works out as > > > > > > /* > > * Calculate an allowed NUMA imbalance such that LLCs do not get > > * imbalanced. > > */ > > for_each_cpu(i, cpu_map) { > > for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { > > struct sched_domain *child = sd->child; > > > > if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child && > > (child->flags & SD_SHARE_PKG_RESOURCES)) { > > struct sched_domain *top = sd; > > unsigned int llc_sq; > > > > /* > > * nr_llcs = (top->span_weight / llc_weight); > > * imb = (child_weight / nr_llcs) >> 1 > > * > > * is equivalent to > > * > > * imb = (llc_weight^2 / top->span_weight) >> 1 > > * > > */ > > llc_sq = child->span_weight * child->span_weight; > > while (top) { > > top->imb_numa_nr = max(1U, > > (llc_sq / top->span_weight) >> 1); > > top = top->parent; > > } > > > > break; > > } > > } > > } > > > > IIRC Peter suggested punting that logic to before domains get degenerated, > but I don't see how that helps here. If you just want to grab the LLC > domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare > its span with that of its parents, that can happen after the degeneration, > no? > I guess we could but I don't see any specific advantage to doing that. > > I'll test this and should have results tomorrow. > > The test results indicated that there was still a problem with communicating tasks being pulled apart so am testing a new version. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 48+ messages in thread
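For reference, a rough sketch of the alternative Valentin describes (untested and only one reading of the suggestion; it assumes the code runs after the domains have been attached, under rcu_read_lock(), and reuses the llc_weight^2 / span_weight formula from the earlier mail) could look like:

	struct sched_domain *llc = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);

	if (llc) {
		unsigned int llc_sq = llc->span_weight * llc->span_weight;
		struct sched_domain *sd;

		/* Size the imbalance against each surviving parent's span. */
		for (sd = llc->parent; sd; sd = sd->parent)
			sd->imb_numa_nr = max(1U, (llc_sq / sd->span_weight) >> 1);
	}

Whether that is worth doing depends on the degeneration question above; the sketch only shows what comparing the LLC span against its remaining parents might look like.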
* [PATCH 0/2] Adjust NUMA imbalance for multiple LLCs @ 2021-11-25 15:19 Mel Gorman 2021-11-25 15:19 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-11-25 15:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. The series addresses two problems -- inconsistent use of scheduler domain weights and sub-optimal performance when there are many LLCs per NUMA node. include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 26 +++++++++++++++----------- kernel/sched/topology.c | 24 ++++++++++++++++++++++++ 3 files changed, 40 insertions(+), 11 deletions(-) -- 2.31.1 ^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for " Mel Gorman @ 2021-11-25 15:19 ` Mel Gorman 2021-11-26 23:22 ` kernel test robot 0 siblings, 1 reply; 48+ messages in thread From: Mel Gorman @ 2021-11-25 15:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch adjusts the imbalance on multi-LLC machines to allow an imbalance up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have one instance per LLC, and without binding, the results are vanilla sched-numaimb-v2r4 MB/sec copy-16 164279.50 ( 0.00%) 702962.88 ( 327.91%) MB/sec scale-16 137487.08 ( 0.00%) 397132.98 ( 188.85%) MB/sec add-16 157561.68 ( 0.00%) 638006.32 ( 304.92%) MB/sec triad-16 154562.04 ( 0.00%) 641408.02 ( 314.98%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. vanilla sched-numaimb-v1r2 Min Score-16 366090.84 ( 0.00%) 401505.65 ( 9.67%) Hmean Score-16 391416.56 ( 0.00%) 452546.28 * 15.62%* Stddev Score-16 16452.12 ( 0.00%) 31480.31 ( -91.35%) CoeffVar Score-16 4.20 ( 0.00%) 6.92 ( -64.99%) Max Score-16 416666.67 ( 0.00%) 483529.77 ( 16.05%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed. vanilla sched-numaimb-v2r5 Hmean tput-1 73743.05 ( 0.00%) 72517.86 ( -1.66%) Hmean tput-8 563036.51 ( 0.00%) 619505.85 * 10.03%* Hmean tput-16 1016590.61 ( 0.00%) 1084022.36 ( 6.63%) Hmean tput-24 1418558.41 ( 0.00%) 1443296.06 ( 1.74%) Hmean tput-32 1608794.22 ( 0.00%) 1869822.05 * 16.23%* Hmean tput-40 1761338.13 ( 0.00%) 2154415.40 * 22.32%* Hmean tput-48 2290646.54 ( 0.00%) 2561031.20 * 11.80%* Hmean tput-56 2463345.12 ( 0.00%) 2731874.84 * 10.90%* Hmean tput-64 2650213.53 ( 0.00%) 2867054.47 ( 8.18%) Hmean tput-72 2497253.28 ( 0.00%) 3017637.28 * 20.84%* Hmean tput-80 2820786.72 ( 0.00%) 3018947.39 ( 7.03%) Hmean tput-88 2813541.68 ( 0.00%) 3008805.43 * 6.94%* Hmean tput-96 2604158.67 ( 0.00%) 2948056.40 * 13.21%* Hmean tput-104 2713810.62 ( 0.00%) 2952327.00 ( 8.79%) Hmean tput-112 2558425.37 ( 0.00%) 2909089.90 * 13.71%* Hmean tput-120 2611434.93 ( 0.00%) 2773024.11 * 6.19%* Hmean tput-128 2706103.22 ( 0.00%) 2765678.84 ( 2.20%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/sched/topology.h | 1 + kernel/sched/fair.c | 26 +++++++++++++++----------- kernel/sched/topology.c | 24 ++++++++++++++++++++++++ 3 files changed, 40 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index c07bfa2d80f2..54f5207154d3 100644 ---
a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -93,6 +93,7 @@ struct sched_domain { unsigned int busy_factor; /* less balancing by factor if busy */ unsigned int imbalance_pct; /* No balance until over watermark */ unsigned int cache_nice_tries; /* Leave cache hot tasks for # tries */ + unsigned int imb_numa_nr; /* Nr imbalanced tasks allowed between nodes */ int nohz_idle; /* NOHZ IDLE status */ int flags; /* See SD_* */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0a969affca76..64f211879e43 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1489,6 +1489,7 @@ struct task_numa_env { int src_cpu, src_nid; int dst_cpu, dst_nid; + int imb_numa_nr; struct numa_stats src_stats, dst_stats; @@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, dst_running = env->dst_stats.nr_running + 1; imbalance = max(0, dst_running - src_running); imbalance = adjust_numa_imbalance(imbalance, dst_running, - env->dst_stats.weight); + env->imb_numa_nr); /* Use idle CPU if there is no imbalance */ if (!imbalance) { @@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p) */ rcu_read_lock(); sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu)); - if (sd) + if (sd) { env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2; + env.imb_numa_nr = sd->imb_numa_nr; + } rcu_read_unlock(); /* @@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest, } /* - * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain. - * This is an approximation as the number of running tasks may not be - * related to the number of busy CPUs due to sched_setaffinity. + * Allow a NUMA imbalance if busy CPUs is less than the allowed + * imbalance. This is an approximation as the number of running + * tasks may not be related to the number of busy CPUs due to + * sched_setaffinity. */ -static inline bool allow_numa_imbalance(int dst_running, int dst_weight) +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr) { - return (dst_running < (dst_weight >> 2)); + return dst_running < imb_numa_nr; } /* @@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) * a real need of migration, periodic load balance will * take care of it. 
*/ - if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight)) + if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr)) return NULL; } @@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd #define NUMA_IMBALANCE_MIN 2 static inline long adjust_numa_imbalance(int imbalance, - int dst_running, int dst_weight) + int dst_running, int imb_numa_nr) { - if (!allow_numa_imbalance(dst_running, dst_weight)) + if (!allow_numa_imbalance(dst_running, imb_numa_nr)) return imbalance; /* @@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* Consider allowing a small imbalance between NUMA groups */ if (env->sd->flags & SD_NUMA) { env->imbalance = adjust_numa_imbalance(env->imbalance, - busiest->sum_nr_running, env->sd->span_weight); + busiest->sum_nr_running, env->sd->imb_numa_nr); } return; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index d201a7052a29..9adeaa89ccb4 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -2242,6 +2242,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } } + /* Calculate allowed NUMA imbalance */ + for_each_cpu(i, cpu_map) { + for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { + struct sched_domain *child = sd->child; + + if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && + (child->flags & SD_SHARE_PKG_RESOURCES)) { + struct sched_domain *sd_numa = sd; + int imb_numa_nr, nr_groups; + + nr_groups = sd->span_weight / child->span_weight; + imb_numa_nr = nr_groups / num_online_nodes(); + + while (sd_numa) { + if (sd_numa->flags & SD_NUMA) { + sd_numa->imb_numa_nr = imb_numa_nr; + break; + } + sd_numa = sd_numa->parent; + } + } + } + } + /* Calculate CPU capacity for physical packages and nodes */ for (i = nr_cpumask_bits-1; i >= 0; i--) { if (!cpumask_test_cpu(i, cpu_map)) -- 2.31.1 ^ permalink raw reply related [flat|nested] 48+ messages in thread
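As a quick sanity check of the arithmetic in this version of the patch, the userspace sketch below (illustrative only; the node layout is an assumed Zen3-like example, not something taken from the changelog) mirrors how the topology.c hunk derives imb_numa_nr and how allow_numa_imbalance() then consumes it:

	#include <stdio.h>
	#include <stdbool.h>

	/* Mirrors the topology.c hunk: LLCs per domain divided by online nodes. */
	static int calc_imb_numa_nr(int sd_weight, int llc_weight, int nr_online_nodes)
	{
		int nr_groups = sd_weight / llc_weight;

		return nr_groups / nr_online_nodes;
	}

	/* Mirrors allow_numa_imbalance(): tolerate imbalance while few tasks run. */
	static bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
	{
		return dst_running < imb_numa_nr;
	}

	int main(void)
	{
		/* Assumed Zen3-like layout: first domain above the LLC spans 64 CPUs,
		 * each LLC spans 8 CPUs, 2 nodes are online. */
		int imb = calc_imb_numa_nr(64, 8, 2);

		printf("imb_numa_nr = %d\n", imb);	/* (64 / 8) / 2 = 4 */
		for (int running = 2; running <= 5; running++)
			printf("dst_running=%d allow=%d\n", running,
			       allow_numa_imbalance(running, imb));
		return 0;
	}

Note that the integer division can round down to zero on some topologies (few LLCs per node but many nodes), in which case allow_numa_imbalance() never permits an imbalance.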
* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs 2021-11-25 15:19 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman @ 2021-11-26 23:22 ` kernel test robot 0 siblings, 0 replies; 48+ messages in thread From: kernel test robot @ 2021-11-26 23:22 UTC (permalink / raw) To: kbuild-all [-- Attachment #1: Type: text/plain, Size: 16359 bytes --] Hi Mel, I love your patch! Perhaps something to improve: [auto build test WARNING on tip/sched/core] [also build test WARNING on tip/master linux/master linus/master v5.16-rc2 next-20211126] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336 base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 8c92606ab81086db00cbb73347d124b4eb169b7e config: s390-randconfig-s032-20211126 (https://download.01.org/0day-ci/archive/20211127/202111270726.GViokiOt-lkp(a)intel.com/config) compiler: s390-linux-gcc (GCC) 11.2.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # apt-get install sparse # sparse version: v0.6.4-dirty # https://github.com/0day-ci/linux/commit/b4d95a034cffb1e4424874645549d3cac2de5c02 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20211125-232336 git checkout b4d95a034cffb1e4424874645549d3cac2de5c02 # save the config file to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=s390 SHELL=/bin/bash kernel/sched/ If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> sparse warnings: (new ones prefixed by >>) kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct perf_domain *pd @@ got struct perf_domain [noderef] __rcu *pd @@ kernel/sched/topology.c:461:19: sparse: expected struct perf_domain *pd kernel/sched/topology.c:461:19: sparse: got struct perf_domain [noderef] __rcu *pd kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:623:49: sparse: expected struct sched_domain *parent kernel/sched/topology.c:623:49: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:694:50: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *parent @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:694:50: sparse: expected struct sched_domain *parent kernel/sched/topology.c:694:50: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:701:55: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@ got struct sched_domain *[assigned] tmp @@ kernel/sched/topology.c:701:55: sparse: expected struct sched_domain [noderef] __rcu *[noderef] __rcu child kernel/sched/topology.c:701:55: sparse: got struct sched_domain *[assigned] tmp 
kernel/sched/topology.c:711:29: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:711:29: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:711:29: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:716:20: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:716:20: sparse: expected struct sched_domain *sd kernel/sched/topology.c:716:20: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:737:13: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] tmp @@ got struct sched_domain [noderef] __rcu *sd @@ kernel/sched/topology.c:737:13: sparse: expected struct sched_domain *[assigned] tmp kernel/sched/topology.c:737:13: sparse: got struct sched_domain [noderef] __rcu *sd kernel/sched/topology.c:899:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:899:70: sparse: expected struct sched_domain *sd kernel/sched/topology.c:899:70: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:928:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:928:59: sparse: expected struct sched_domain *sd kernel/sched/topology.c:928:59: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:974:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:974:57: sparse: expected struct sched_domain *sd kernel/sched/topology.c:974:57: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:976:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:976:25: sparse: expected struct sched_domain *sibling kernel/sched/topology.c:976:25: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:984:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:984:55: sparse: expected struct sched_domain *sd kernel/sched/topology.c:984:55: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:986:25: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sibling @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:986:25: sparse: expected struct sched_domain *sibling kernel/sched/topology.c:986:25: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1056:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct sched_domain *sd @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1056:62: sparse: expected struct sched_domain *sd kernel/sched/topology.c:1056:62: sparse: 
got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1160:40: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:1160:40: sparse: expected struct sched_domain *child kernel/sched/topology.c:1160:40: sparse: got struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1571:43: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain [noderef] __rcu *child @@ got struct sched_domain *child @@ kernel/sched/topology.c:1571:43: sparse: expected struct sched_domain [noderef] __rcu *child kernel/sched/topology.c:1571:43: sparse: got struct sched_domain *child kernel/sched/topology.c:2130:31: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain [noderef] __rcu *parent @@ got struct sched_domain *sd @@ kernel/sched/topology.c:2130:31: sparse: expected struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2130:31: sparse: got struct sched_domain *sd kernel/sched/topology.c:2233:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2233:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2233:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2248:56: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sched_domain *child @@ got struct sched_domain [noderef] __rcu *child @@ kernel/sched/topology.c:2248:56: sparse: expected struct sched_domain *child kernel/sched/topology.c:2248:56: sparse: got struct sched_domain [noderef] __rcu *child >> kernel/sched/topology.c:2263:49: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *sd_numa @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2263:49: sparse: expected struct sched_domain *sd_numa kernel/sched/topology.c:2263:49: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2247:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2247:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2247:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:2274:57: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/topology.c:2274:57: sparse: expected struct sched_domain *[assigned] sd kernel/sched/topology.c:2274:57: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c: note: in included file: kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain 
*[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1744:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1744:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@ kernel/sched/sched.h:1757:9: sparse: expected struct sched_domain *[assigned] sd kernel/sched/sched.h:1757:9: sparse: got struct sched_domain [noderef] __rcu *parent kernel/sched/topology.c:929:31: sparse: sparse: dereference of noderef expression kernel/sched/topology.c:1592:19: sparse: sparse: dereference of noderef expression vim +2263 kernel/sched/topology.c 2186 2187 /* 2188 * Build sched domains for a given set of CPUs and attach the sched domains 2189 * to the individual CPUs 2190 */ 2191 static int 2192 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr) 2193 { 2194 enum s_alloc alloc_state = sa_none; 2195 struct sched_domain *sd; 2196 struct s_data d; 2197 struct rq *rq = NULL; 2198 int i, ret = -ENOMEM; 2199 bool has_asym = false; 2200 2201 if (WARN_ON(cpumask_empty(cpu_map))) 2202 goto error; 2203 2204 alloc_state = __visit_domain_allocation_hell(&d, cpu_map); 2205 if (alloc_state != sa_rootdomain) 2206 goto error; 2207 2208 /* Set up domains for CPUs specified by the cpu_map: */ 2209 for_each_cpu(i, cpu_map) { 2210 struct sched_domain_topology_level *tl; 2211 2212 sd = NULL; 2213 for_each_sd_topology(tl) { 2214 2215 if (WARN_ON(!topology_span_sane(tl, cpu_map, i))) 2216 goto error; 2217 2218 sd = build_sched_domain(tl, cpu_map, attr, sd, i); 2219 2220 has_asym |= sd->flags & SD_ASYM_CPUCAPACITY; 2221 2222 if (tl == sched_domain_topology) 2223 *per_cpu_ptr(d.sd, i) = sd; 2224 if (tl->flags & SDTL_OVERLAP) 2225 sd->flags |= SD_OVERLAP; 2226 if (cpumask_equal(cpu_map, sched_domain_span(sd))) 2227 break; 2228 } 2229 } 2230 2231 /* Build the groups for the domains */ 2232 for_each_cpu(i, cpu_map) { 2233 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2234 sd->span_weight = cpumask_weight(sched_domain_span(sd)); 2235 if (sd->flags & SD_OVERLAP) { 2236 if (build_overlap_sched_groups(sd, i)) 2237 goto error; 2238 } else { 2239 if (build_sched_groups(sd, i)) 2240 goto error; 2241 } 2242 } 2243 } 2244 2245 /* Calculate allowed NUMA imbalance */ 2246 for_each_cpu(i, cpu_map) { 2247 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2248 struct sched_domain *child = sd->child; 2249 2250 if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && 2251 (child->flags & SD_SHARE_PKG_RESOURCES)) { 2252 struct sched_domain *sd_numa = sd; 2253 int imb_numa_nr, nr_groups; 2254 2255 nr_groups = sd->span_weight / child->span_weight; 2256 imb_numa_nr = nr_groups / num_online_nodes(); 2257 2258 while (sd_numa) { 2259 if (sd_numa->flags & SD_NUMA) { 2260 sd_numa->imb_numa_nr = imb_numa_nr; 2261 break; 2262 } > 2263 sd_numa = sd_numa->parent; 2264 } 2265 } 2266 } 2267 } 2268 2269 /* Calculate CPU capacity for physical packages and nodes */ 2270 for (i = nr_cpumask_bits-1; i >= 0; i--) 
{ 2271 if (!cpumask_test_cpu(i, cpu_map)) 2272 continue; 2273 2274 for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) { 2275 claim_allocations(i, sd); 2276 init_sched_groups_capacity(i, sd); 2277 } 2278 } 2279 2280 /* Attach the domains */ 2281 rcu_read_lock(); 2282 for_each_cpu(i, cpu_map) { 2283 rq = cpu_rq(i); 2284 sd = *per_cpu_ptr(d.sd, i); 2285 2286 /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */ 2287 if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity)) 2288 WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig); 2289 2290 cpu_attach_domain(sd, d.rd, i); 2291 } 2292 rcu_read_unlock(); 2293 2294 if (has_asym) 2295 static_branch_inc_cpuslocked(&sched_asym_cpucapacity); 2296 2297 if (rq && sched_debug_verbose) { 2298 pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n", 2299 cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity); 2300 } 2301 2302 ret = 0; 2303 error: 2304 __free_domain_allocs(&d, alloc_state, cpu_map); 2305 2306 return ret; 2307 } 2308 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org ^ permalink raw reply [flat|nested] 48+ messages in thread
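For anyone unfamiliar with the warning class above, the standalone fragment below (an illustration only, intended to be checked with sparse rather than built as part of the kernel) reproduces the same complaint: sched_domain::parent is annotated __rcu, which sparse treats as a separate address space, so assigning it to a plain pointer without rcu_dereference() or a similar accessor is flagged.

	#ifdef __CHECKER__
	#define __rcu	__attribute__((noderef, address_space(__rcu)))
	#else
	#define __rcu
	#endif

	struct sched_domain {
		struct sched_domain __rcu *parent;
		unsigned int imb_numa_nr;
	};

	void propagate_imb(struct sched_domain *sd, unsigned int imb)
	{
		struct sched_domain *sd_numa = sd;

		/* sparse: incorrect type in assignment (different address spaces) */
		sd_numa = sd_numa->parent;
		if (sd_numa)
			sd_numa->imb_numa_nr = imb;
	}

The pre-existing loops in topology.c generate the same class of warning (see the other entries in the log), so the new hunk at line 2263 inherits the pattern rather than introducing a new kind of problem.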
end of thread, other threads:[~2022-02-14 11:27 UTC | newest] Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-12-10 9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman 2021-12-10 9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman 2021-12-21 10:53 ` Vincent Guittot 2021-12-21 11:32 ` Mel Gorman 2021-12-21 13:05 ` Vincent Guittot 2021-12-10 9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman 2021-12-13 8:28 ` Gautham R. Shenoy 2021-12-13 13:01 ` Mel Gorman 2021-12-13 14:47 ` Gautham R. Shenoy 2021-12-15 11:52 ` Gautham R. Shenoy 2021-12-15 12:25 ` Mel Gorman 2021-12-16 18:33 ` Gautham R. Shenoy 2021-12-20 11:12 ` Mel Gorman 2021-12-21 15:03 ` Gautham R. Shenoy 2021-12-21 17:13 ` Vincent Guittot 2021-12-22 8:52 ` Jirka Hladky 2022-01-04 19:52 ` Jirka Hladky 2022-01-05 10:42 ` Mel Gorman 2022-01-05 10:49 ` Mel Gorman 2022-01-10 15:53 ` Vincent Guittot 2022-01-12 10:24 ` Mel Gorman 2021-12-17 19:54 ` Gautham R. Shenoy -- strict thread matches above, loose matches on Subject: below -- 2022-02-08 9:43 [PATCH v6 0/2] Adjust NUMA imbalance for " Mel Gorman 2022-02-08 9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-08 16:19 ` Gautham R. Shenoy 2022-02-09 5:10 ` K Prateek Nayak 2022-02-09 10:33 ` Mel Gorman 2022-02-11 19:02 ` Jirka Hladky 2022-02-14 10:27 ` Srikar Dronamraju 2022-02-14 11:03 ` Vincent Guittot 2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for " Mel Gorman 2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2022-02-04 1:30 ` kernel test robot 2022-02-04 7:06 ` Srikar Dronamraju 2022-02-04 9:04 ` Mel Gorman 2022-02-04 15:07 ` Nayak, KPrateek (K Prateek) 2022-02-04 16:45 ` Mel Gorman 2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for " Mel Gorman 2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2021-12-03 8:15 ` Barry Song 2021-12-03 10:50 ` Mel Gorman 2021-12-03 11:14 ` Barry Song 2021-12-03 13:27 ` Mel Gorman 2021-12-04 10:40 ` Peter Zijlstra 2021-12-06 8:48 ` Gautham R. Shenoy 2021-12-06 14:51 ` Peter Zijlstra 2021-12-06 15:12 ` Mel Gorman 2021-12-09 14:23 ` Valentin Schneider 2021-12-09 15:43 ` Mel Gorman 2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for " Mel Gorman 2021-11-25 15:19 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman 2021-11-26 23:22 ` kernel test robot