From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Valentin Schneider <Valentin.Schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Gautham Shenoy <gautham.shenoy@amd.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Mon, 14 Feb 2022 12:03:05 +0100	[thread overview]
Message-ID: <CAKfTPtD=ROi0mH0Z6N_wF+z2D2PoOM7ZTRtqRxWHTdi3gmzSYQ@mail.gmail.com> (raw)
In-Reply-To: <20220208094334.16379-3-mgorman@techsingularity.net>

On Tue, 8 Feb 2022 at 10:44, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and, due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one
> instance per LLC and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
>
> STREAM can use directives to force the spread if the OpenMP runtime is
> new enough, but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
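
A minimal sketch of the kind of OpenMP spread placement referred to here
(illustrative only; it assumes a runtime with OpenMP 4.0 proc_bind support,
and 4.5 for omp_get_place_num(), none of which is taken from the patch):

  /*
   * Ask the runtime to spread threads across places instead of packing
   * them, the equivalent of running with:
   *   OMP_PLACES=cores OMP_PROC_BIND=spread ./stream
   */
  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
          #pragma omp parallel proc_bind(spread)
          printf("thread %d on place %d\n",
                 omp_get_thread_num(), omp_get_place_num());
          return 0;
  }
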
>
> Coremark is a CPU- and cache-intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch:
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
> Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
> Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
> Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
> CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed. Even in cases where
> the average performance is neutral, the results are more stable.
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v6
> Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
> Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
> Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
> Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
> Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
> Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
> Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
> Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
> Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
> Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
> Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
> Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
> Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
> Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
> Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
> Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
> Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
> Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
> Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
> Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
> Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
> Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
> Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
> Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
> Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
> Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
> Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
> Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
> Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
> Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
> Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
> Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
> Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
> Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
>
> Similarly, for embarrassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLCs when the machine is not
> fully utilised.
>
>                               vanilla       sched-numaimb-v6
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
> Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
> Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Good to see that you have been able to move to SD_NUMA instead of
SD_PREFER_SIBLING.
The allowed imbalance also looks more consistent whatever the number of LLCs.

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr running tasks that allows a NUMA imbalance */
>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..538756bd8e7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running, int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -       return (running < (weight >> 2));
> +       return running <= imb_numa_nr;
>  }
>
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * allowed. If there is a real need of migration,
>                          * periodic load balance will take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
>         /*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               local->sum_nr_running + 1, local->group_weight);
> +                               local->sum_nr_running + 1, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               struct sched_domain *top, *top_p;
> +                               unsigned int nr_llcs;
> +
> +                               /*
> +                                * For a single LLC per node, allow an
> +                                * imbalance up to 25% of the node. This is an
> +                                * arbitrary cutoff based on SMT-2 to balance
> +                                * between memory bandwidth and avoiding
> +                                * premature sharing of HT resources and SMT-4
> +                                * or SMT-8 *may* benefit from a different
> +                                * cutoff.
> +                                *
> +                                * For multiple LLCs, allow an imbalance
> +                                * until multiple tasks would share an LLC
> +                                * on one node while LLCs on another node
> +                                * remain idle.
> +                                */
> +                               nr_llcs = sd->span_weight / child->span_weight;
> +                               if (nr_llcs == 1)
> +                                       imb = sd->span_weight >> 2;
> +                               else
> +                                       imb = nr_llcs;
> +                               sd->imb_numa_nr = imb;
> +
> +                               /* Set span based on the first NUMA domain. */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && !(top_p->flags & SD_NUMA)) {
> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>
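
For reference, a stand-alone user-space sketch of how the imb_numa_nr
calculation above works out, using made-up topology numbers (4 LLCs of
16 CPUs per node, a 128-CPU first NUMA level and a 256-CPU top level;
the figures are illustrative only, not from a particular machine):

  #include <stdio.h>

  int main(void)
  {
          unsigned int llc_weight  = 16;  /* CPUs per LLC (SD_SHARE_PKG_RESOURCES child) */
          unsigned int node_weight = 64;  /* CPUs in the first !SD_SHARE_PKG_RESOURCES domain */
          unsigned int imb_span    = 128; /* span of the first SD_NUMA parent */
          unsigned int top_weight  = 256; /* span of the widest NUMA domain */
          unsigned int nr_llcs, imb, factor;

          /* One node, multiple LLCs: allow one task per LLC before balancing. */
          nr_llcs = node_weight / llc_weight;
          imb = (nr_llcs == 1) ? node_weight >> 2 : nr_llcs;
          printf("node-level imb_numa_nr = %u\n", imb);          /* 4 */

          /* Wider NUMA domains scale imb by their size relative to imb_span. */
          factor = top_weight / imb_span;
          if (!factor)
                  factor = 1;
          printf("top-level imb_numa_nr  = %u\n", imb * factor); /* 8 */

          return 0;
  }
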
