References: <20211210093307.31701-1-mgorman@techsingularity.net>
        <20211210093307.31701-3-mgorman@techsingularity.net>
        <20211213130131.GQ3366@techsingularity.net>
        <20211215122550.GR3366@techsingularity.net>
        <20211220111243.GS3366@techsingularity.net>
In-Reply-To: <20211220111243.GS3366@techsingularity.net>
From: Vincent Guittot
Date: Tue, 21 Dec 2021 18:13:15 +0100
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
To: Mel Gorman
Cc: "Gautham R. Shenoy", Peter Zijlstra, Ingo Molnar, Valentin Schneider,
        Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 20 Dec 2021 at 12:12, Mel Gorman wrote:
>
> (sorry for the delay, was offline for a few days)
>
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> >
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > > > [..SNIP..]
>
> [snip]
>
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>          unsigned int busy_factor;       /* less balancing by factor if busy */
>          unsigned int imbalance_pct;     /* No balance until over watermark */
>          unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +        unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */

So now you compute an allowed imbalance level instead of using 25% of
sd->span_weight or 25% of busiest->group_weight, and you adjust this new
imb_numa_nr according to the topology. That makes sense.
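Just to be sure I read the effect right, here is a toy userspace comparison
of the old and new gate with made-up numbers (assuming a 128-CPU node,
16-CPU LLCs and an imb_numa_nr of 2; not patch code):

/* Toy model of the old vs new gate; all numbers are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

static bool old_allow(int dst_running, int dst_weight)
{
        return dst_running < (dst_weight >> 2);    /* 25% of the node span */
}

static bool new_allow(int dst_running, int imb_numa_nr)
{
        return dst_running < imb_numa_nr;          /* derived from the LLC layout */
}

int main(void)
{
        int dst_running = 4;                       /* hypothetical local load */

        printf("old: %d new: %d\n",
               old_allow(dst_running, 128),        /* 1: imbalance still allowed */
               new_allow(dst_running, 2));         /* 0: spread to the other node */
        return 0;
}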
>
>          int nohz_idle;                  /* NOHZ IDLE status */
>          int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..df0e84462e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>          int src_cpu, src_nid;
>          int dst_cpu, dst_nid;
> +        int imb_numa_nr;
>
>          struct numa_stats src_stats, dst_stats;
>
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                        int dst_running, int dst_weight);
> +                                        int dst_running,
> +                                        int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>          dst_running = env->dst_stats.nr_running + 1;
>          imbalance = max(0, dst_running - src_running);
>          imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                          env->dst_stats.weight);
> +                                          env->imb_numa_nr);
>
>          /* Use idle CPU if there is no imbalance */
>          if (!imbalance) {
> @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
>           */
>          rcu_read_lock();
>          sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -        if (sd)
> +        if (sd) {
>                  env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +                env.imb_numa_nr = sd->imb_numa_nr;
> +        }
>          rcu_read_unlock();
>
>          /*
> @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
>  {
> -        return (dst_running < (dst_weight >> 2));
> +        return dst_running < imb_numa_nr;
>  }
>
>  /*
> @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          return idlest;
>  #endif
>                  /*
> -                 * Otherwise, keep the task on this node to stay close
> -                 * its wakeup source and improve locality. If there is
> -                 * a real need of migration, periodic load balance will
> -                 * take care of it.
> +                 * Otherwise, keep the task on this node to stay local
> +                 * to its wakeup source if the number of running tasks
> +                 * are below the allowed imbalance. If there is a real
> +                 * need of migration, periodic load balance will take
> +                 * care of it.
>                   */
> -                if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +                if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
>                          return NULL;
>          }
>
> @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>                  }
>          }
>
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                int dst_running, int dst_weight)
> +                                int dst_running, int imb_numa_nr)
>  {
> -        if (!allow_numa_imbalance(dst_running, dst_weight))
> +        if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                  return imbalance;
>
> -        /*
> -         * Allow a small imbalance based on a simple pair of communicating
> -         * tasks that remain local when the destination is lightly loaded.
> -         */
> -        if (imbalance <= NUMA_IMBALANCE_MIN)
> +        if (imbalance <= imb_numa_nr)

Isn't this always true? imbalance is "always" < dst_running, as imbalance
is usually the number of those tasks that we would like to migrate.
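A standalone sketch of the task_numa_find_cpu() path with the hunks above
applied, using hypothetical numbers, to show why the test looks redundant
on that path (not patch code):

/* Standalone model of adjust_numa_imbalance() as in the hunk above. */
#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

static long adjust_numa_imbalance(int imbalance, int dst_running, int imb_numa_nr)
{
        if (!(dst_running < imb_numa_nr))          /* allow_numa_imbalance() */
                return imbalance;

        /*
         * We only get here when dst_running < imb_numa_nr, and on the
         * task_numa_find_cpu() path imbalance = max(0, dst_running - src_running),
         * i.e. imbalance <= dst_running, so this test can never fail there.
         */
        if (imbalance <= imb_numa_nr)
                return 0;

        return imbalance;
}

int main(void)
{
        int src_running = 0, dst_running = 1, imb_numa_nr = 2;   /* hypothetical */
        int imbalance = max(0, dst_running - src_running);

        printf("%ld\n", adjust_numa_imbalance(imbalance, dst_running, imb_numa_nr));
        return 0;
}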
>                  return 0;
>
>          return imbalance;
> @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>          /* Consider allowing a small imbalance between NUMA groups */
>          if (env->sd->flags & SD_NUMA) {
>                  env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                        busiest->sum_nr_running, env->sd->span_weight);
> +                        busiest->sum_nr_running, env->sd->imb_numa_nr);
>          }
>
>          return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                  }
>          }
>
> +        /*
> +         * Calculate an allowed NUMA imbalance such that LLCs do not get
> +         * imbalanced.
> +         */
> +        for_each_cpu(i, cpu_map) {
> +                unsigned int imb = 0;
> +                unsigned int imb_span = 1;
> +
> +                for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                        struct sched_domain *child = sd->child;
> +
> +                        if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                            (child->flags & SD_SHARE_PKG_RESOURCES)) {

sched_domains have not been degenerated yet, so what you find here is the
DIE domain.

> +                                struct sched_domain *top, *top_p;
> +                                unsigned int llc_sq;
> +
> +                                /*
> +                                 * nr_llcs = (sd->span_weight / llc_weight);
> +                                 * imb = (llc_weight / nr_llcs) >> 2

It would be good to add a comment to explain why 25% of LLC weight /
number of LLCs in a node is the right value. For example, why is it
better than just 25% of the LLC weight? Do you want to allow the same
imbalance at node level whatever the number of LLCs in the node?
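For instance, with a hypothetical node of 128 CPUs split into 16-CPU LLCs
(8 LLCs per node), the candidate thresholds come out as follows (standalone
arithmetic, not patch code):

/* Hypothetical topology, only to make the question concrete. */
#include <stdio.h>

int main(void)
{
        unsigned int node_weight = 128;                 /* sd->span_weight (assumed) */
        unsigned int llc_weight = 16;                   /* child->span_weight (assumed) */
        unsigned int llc_sq = llc_weight * llc_weight;  /* 256 */
        unsigned int imb = (llc_sq / node_weight) >> 2; /* (256 / 128) >> 2 = 0 */

        if (imb < 2)                                    /* max(2U, ...) from the hunk */
                imb = 2;

        printf("imb_numa_nr (patch) = %u\n", imb);                  /*  2 */
        printf("25%% of LLC weight   = %u\n", llc_weight >> 2);     /*  4 */
        printf("25%% of node weight  = %u\n", node_weight >> 2);    /* 32 */
        return 0;
}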
> +                                 *
> +                                 * is equivalent to
> +                                 *
> +                                 * imb = (llc_weight^2 / sd->span_weight) >> 2
> +                                 *
> +                                 */
> +                                llc_sq = child->span_weight * child->span_weight;
> +
> +                                imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> +                                sd->imb_numa_nr = imb;
> +
> +                                /*
> +                                 * Set span based on top domain that places
> +                                 * tasks in sibling domains.
> +                                 */
> +                                top = sd;
> +                                top_p = top->parent;
> +                                while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {

Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA? Apart from
heterogeneous domains (SD_ASYM_CPUCAPACITY), and I'm not sure that you
want to take that case into account, only NUMA domains don't have
SD_PREFER_SIBLING.

> +                                        top = top->parent;
> +                                        top_p = top->parent;
> +                                }
> +                                imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                        } else {
> +                                int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                                sd->imb_numa_nr = imb * factor;
> +                        }
> +                }
> +        }
> +
>          /* Calculate CPU capacity for physical packages and nodes */
>          for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                  if (!cpumask_test_cpu(i, cpu_map))