From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Valentin Schneider <Valentin.Schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Gautham Shenoy <gautham.shenoy@amd.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Mon, 14 Feb 2022 12:03:05 +0100	[thread overview]
Message-ID: <CAKfTPtD=ROi0mH0Z6N_wF+z2D2PoOM7ZTRtqRxWHTdi3gmzSYQ@mail.gmail.com> (raw)
In-Reply-To: <20220208094334.16379-3-mgorman@techsingularity.net>

On Tue, 8 Feb 2022 at 10:44, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and, due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one
> instance per LLC and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
>
> STREAM can use directives to force the spread if the OpenMP runtime is
> new enough, but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
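
A minimal sketch of the kind of OpenMP spread placement referred to here
(illustrative only; it assumes a runtime with OpenMP 4.0 proc_bind support,
and 4.5 for omp_get_place_num(), none of which is taken from the patch):

  /*
   * Ask the runtime to spread threads across places instead of packing
   * them, the equivalent of running with:
   *   OMP_PLACES=cores OMP_PROC_BIND=spread ./stream
   */
  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
          #pragma omp parallel proc_bind(spread)
          printf("thread %d on place %d\n",
                 omp_get_thread_num(), omp_get_place_num());
          return 0;
  }
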
>
> Coremark is a CPU- and cache-intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch:
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
> Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
> Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
> Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
> CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed. Even in cases where
> the average performance is neutral, the results are more stable.
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v6
> Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
> Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
> Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
> Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
> Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
> Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
> Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
> Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
> Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
> Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
> Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
> Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
> Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
> Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
> Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
> Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
> Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
> Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
> Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
> Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
> Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
> Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
> Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
> Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
> Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
> Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
> Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
> Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
> Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
> Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
> Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
> Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
> Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
> Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
>
> Similarly, for embarrassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLCs when the machine is not
> fully utilised.
>
>                               vanilla       sched-numaimb-v6
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
> Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
> Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Good to see that you have been able to move to SD_NUMA instead of
SD_PREFER_SIBLING.
The allowed imbalance also looks more consistent whatever the number of LLCs.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr running tasks that allows a NUMA imbalance */
>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..538756bd8e7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running, int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -       return (running < (weight >> 2));
> +       return running <= imb_numa_nr;
>  }
>
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * allowed. If there is a real need of migration,
>                          * periodic load balance will take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
>         /*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               local->sum_nr_running + 1, local->group_weight);
> +                               local->sum_nr_running + 1, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               struct sched_domain *top, *top_p;
> +                               unsigned int nr_llcs;
> +
> +                               /*
> +                                * For a single LLC per node, allow an
> +                                * imbalance up to 25% of the node. This is an
> +                                * arbitrary cutoff based on SMT-2 to balance
> +                                * between memory bandwidth and avoiding
> +                                * premature sharing of HT resources and SMT-4
> +                                * or SMT-8 *may* benefit from a different
> +                                * cutoff.
> +                                *
> +                                * For multiple LLCs, allow an imbalance
> +                                * until multiple tasks would share an LLC
> +                                * on one node while LLCs on another node
> +                                * remain idle.
> +                                */
> +                               nr_llcs = sd->span_weight / child->span_weight;
> +                               if (nr_llcs == 1)
> +                                       imb = sd->span_weight >> 2;
> +                               else
> +                                       imb = nr_llcs;
> +                               sd->imb_numa_nr = imb;
> +
> +                               /* Set span based on the first NUMA domain. */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && !(top_p->flags & SD_NUMA)) {
> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>
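
For reference, a stand-alone user-space sketch of how the imb_numa_nr
calculation above works out, using made-up topology numbers (4 LLCs of
16 CPUs per node, a 128-CPU first NUMA level and a 256-CPU top level;
the figures are illustrative only, not from a particular machine):

  #include <stdio.h>

  int main(void)
  {
          unsigned int llc_weight  = 16;  /* CPUs per LLC (SD_SHARE_PKG_RESOURCES child) */
          unsigned int node_weight = 64;  /* CPUs in the first !SD_SHARE_PKG_RESOURCES domain */
          unsigned int imb_span    = 128; /* span of the first SD_NUMA parent */
          unsigned int top_weight  = 256; /* span of the widest NUMA domain */
          unsigned int nr_llcs, imb, factor;

          /* One node, multiple LLCs: allow one task per LLC before balancing. */
          nr_llcs = node_weight / llc_weight;
          imb = (nr_llcs == 1) ? node_weight >> 2 : nr_llcs;
          printf("node-level imb_numa_nr = %u\n", imb);          /* 4 */

          /* Wider NUMA domains scale imb by their size relative to imb_span. */
          factor = top_weight / imb_span;
          if (!factor)
                  factor = 1;
          printf("top-level imb_numa_nr  = %u\n", imb * factor); /* 8 */

          return 0;
  }
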
