Re: [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time

From: Vincent Guittot <vincent.guittot@linaro.org>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time
Date: Fri, 20 Nov 2020 14:33:25 +0100	[thread overview]
Message-ID: <CAKfTPtDGDEHpOFC29+o-K434CB4jBT6JON07DR5Hhar7wPyybw@mail.gmail.com> (raw)
In-Reply-To: <20201120090630.3286-5-mgorman@techsingularity.net>

On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> At fork time currently, a local node can be allowed to fill completely
> and allow the periodic load balancer to fix the problem. This can be
> problematic in cases where a task creates lots of threads that idle until
> woken as part of a worker poll causing a memory bandwidth problem.
>
> However, a "real" workload suffers badly from this behaviour. The workload
> in question is mostly NUMA aware but spawns large numbers of threads
> that act as a worker pool that can be called from anywhere. These need
> to spread early to get reasonable behaviour.
>
> This patch limits how much a local node can fill before spilling over
> to another node and it will not be a universal win. Specifically,
> very short-lived workloads that fit within a NUMA node would prefer
> the memory bandwidth.
>
> As I cannot describe the "real" workload, the best proxy measure I found
> for illustration was a page fault microbenchmark. It's not representative
> of the workload but demonstrates the hazard of the current behaviour.
>
> pft timings
>                                  5.10.0-rc2             5.10.0-rc2
>                           imbalancefloat-v2          forkspread-v2
> Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
> Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
> Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
> Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
> Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
> Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
> Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
> Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
> Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*
>
> This is showing the time to fault regions belonging to threads. The target
> machine has 80 logical CPUs and two nodes. Note the ~30% gain when the
> machine is approximately the point where one node becomes fully utilised.
> The slower results are borderline noise.
>
> Kernel building shows similar benefits around the same balance point.
> Generally performance was either neutral or better in the tests conducted.
> The main consideration with this patch is the point where fork stops
> spreading a task so some workloads may benefit from different balance
> points but it would be a risky tuning parameter.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e17e6c5da1d5..6d1c24708664 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8761,6 +8761,16 @@ static bool update_pick_idlest(struct sched_group *idlest,
>         return true;
>  }
>
> +/*
> + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
> + * This is an approximation as the number of running tasks may not be
> + * related to the number of busy CPUs due to sched_setaffinity.
> + */
> +static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +{
> +       return (dst_running < (dst_weight >> 2));
> +}
> +
>  /*
>   * find_idlest_group() finds and returns the least busy CPU group within the
>   * domain.
> @@ -8893,7 +8903,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * a real need of migration, periodic load balance will
>                          * take care of it.
>                          */
> -                       if (local_sgs.idle_cpus)
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
>                                 return NULL;
>                 }
>
> @@ -9000,11 +9010,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  static inline long adjust_numa_imbalance(int imbalance,
>                                 int dst_running, int dst_weight)
>  {
> +       if (!allow_numa_imbalance(dst_running, dst_weight))
> +               return imbalance;
> +
>         /*
>          * Allow a small imbalance based on a simple pair of communicating
>          * tasks that remain local when the destination is lightly loaded.
>          */
> -       if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
> +       if (imbalance <= NUMA_IMBALANCE_MIN)
>                 return 0;
>
>         return imbalance;
> --
> 2.26.2
>