From: Vincent Guittot
Date: Wed, 12 Feb 2020 16:03:28 +0100
Subject: Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
To: Mel Gorman
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah, Valentin Schneider
In-Reply-To: <20200212133715.GU3420@suse.de>

On Wed, 12 Feb 2020 at 14:37, Mel Gorman wrote:
>
> On Tue, Feb 11, 2020 at 06:46:49PM +0100, Vincent Guittot wrote:
> > Similarly to what has been done for the normal load balance, we can
> > replace runnable_load_avg by load_avg in numa load balancing and track
> > other statistics like the utilization and the number of running tasks to
> > get a better view of the current state of a node.
> >
> > Signed-off-by: Vincent Guittot
> > ---
> >  kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
> >  1 file changed, 75 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a1ea02b5362e..6e4c2b012c48 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >  		group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> >  }
> >
> > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > -
> > -static unsigned long cpu_runnable_load(struct rq *rq)
> > -{
> > -	return cfs_rq_runnable_load_avg(&rq->cfs);
> > -}
> > +/*
> > + * 'numa_type' describes the node at the moment of load balancing.
> > + */
> > +enum numa_type {
> > +	/* The node has spare capacity that can be used to run more tasks. */
> > +	node_has_spare = 0,
> > +	/*
> > +	 * The node is fully used and the tasks don't compete for more CPU
> > +	 * cycles. Nevertheless, some tasks might wait before running.
> > +	 */
> > +	node_fully_busy,
> > +	/*
> > +	 * The node is overloaded and can't provide expected CPU cycles to all
> > +	 * tasks.
> > +	 */
> > +	node_overloaded
> > +};
> >
>
> Ok, to reconcile this with the load balancer, it would need to account
> for imbalanced but that's ok in the context of this patch.
>
> >  /* Cached statistics for all CPUs within a node */
> >  struct numa_stats {
> >  	unsigned long load;
> > -
> > +	unsigned long util;
> >  	/* Total compute capacity of CPUs on a node */
> >  	unsigned long compute_capacity;
> > +	unsigned int nr_running;
> > +	unsigned int weight;
> > +	enum numa_type node_type;
> >  };
> >
> > -/*
> > - * XXX borrowed from update_sg_lb_stats
> > - */
> > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > -{
> > -	int cpu;
> > -
> > -	memset(ns, 0, sizeof(*ns));
> > -	for_each_cpu(cpu, cpumask_of_node(nid)) {
> > -		struct rq *rq = cpu_rq(cpu);
> > -
> > -		ns->load += cpu_runnable_load(rq);
> > -		ns->compute_capacity += capacity_of(cpu);
> > -	}
> > -
> > -}
> > -
> >  struct task_numa_env {
> >  	struct task_struct *p;
> >
> > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> >  	int best_cpu;
> >  };
> >
> > +static unsigned long cpu_load(struct rq *rq);
> > +static unsigned long cpu_util(int cpu);
> > +
> > +static inline enum
> > +numa_type numa_classify(unsigned int imbalance_pct,
> > +			struct numa_stats *ns)
> > +{
> > +	if ((ns->nr_running > ns->weight) &&
> > +	    ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > +		return node_overloaded;
> > +
>
> Ok, so this is essentially group_is_overloaded.
>
> > +	if ((ns->nr_running < ns->weight) ||
> > +	    ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > +		return node_has_spare;
> > +
>
> And this is group_has_capacity. What I did was have a common helper
> for both NUMA and normal load balancing and translated the fields from
> sg_lb_stats and numa_stats into a common helper. This is to prevent them
> getting out of sync. The conversion was incomplete in my case but in
> principle, both NUMA and CPU load balancing should use common helpers or
> they'll get out of sync.

In fact, I wanted to keep this patch simple and readable for the 1st
version so as not to scare people away from reviewing it. That's the
main reason I didn't merge it with load_balance, but I agree that some
common helper function might be possible. Also, struct sg_lb_stats has
a lot more fields than struct numa_stats.

Then, I wonder if we could end up with different rules for NUMA, like
taking into account some NUMA-specific metrics to classify the node.
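
To make that concrete, one possible shape (only an untested sketch to
illustrate the idea; the helper names below are made up) would be a pair
of helpers that take the raw sums as arguments, so that both sg_lb_stats
and numa_stats could feed them without sharing a struct:

static inline bool sched_cpus_overloaded(unsigned int imbalance_pct,
					 unsigned int nr_running,
					 unsigned int weight,
					 unsigned long capacity,
					 unsigned long util)
{
	/* Not overloaded unless there are more runnable tasks than CPUs */
	if (nr_running <= weight)
		return false;

	/* Overloaded when utilization exceeds the capacity margin */
	return (capacity * 100) < (util * imbalance_pct);
}

static inline bool sched_cpus_have_spare(unsigned int imbalance_pct,
					 unsigned int nr_running,
					 unsigned int weight,
					 unsigned long capacity,
					 unsigned long util)
{
	/* Fewer runnable tasks than CPUs always leaves room */
	if (nr_running < weight)
		return true;

	/* Otherwise spare capacity means utilization below the margin */
	return (capacity * 100) > (util * imbalance_pct);
}

numa_classify() and group_classify() would then only translate their own
stats into these calls, so the thresholds stay in one place and
NUMA-specific metrics could still be layered on top on the NUMA side.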

> > +	return node_fully_busy;
> > +}
> > +
> > +/*
> > + * XXX borrowed from update_sg_lb_stats
> > + */
> > +static void update_numa_stats(struct task_numa_env *env,
> > +			      struct numa_stats *ns, int nid)
> > +{
> > +	int cpu;
> > +
> > +	memset(ns, 0, sizeof(*ns));
> > +	for_each_cpu(cpu, cpumask_of_node(nid)) {
> > +		struct rq *rq = cpu_rq(cpu);
> > +
> > +		ns->load += cpu_load(rq);
> > +		ns->util += cpu_util(cpu);
> > +		ns->nr_running += rq->cfs.h_nr_running;
> > +		ns->compute_capacity += capacity_of(cpu);
> > +	}
> > +
> > +	ns->weight = cpumask_weight(cpumask_of_node(nid));
> > +
> > +	ns->node_type = numa_classify(env->imbalance_pct, ns);
> > +}
> > +
>
> Ok, this is more or less what I had except I wedged other stuff in there
> too specific to NUMA balancing to avoid multiple walks of the cpumask.
>
> >  static void task_numa_assign(struct task_numa_env *env,
> >  			     struct task_struct *p, long imp)
> >  {
> > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> >  	long orig_src_load, orig_dst_load;
> >  	long src_capacity, dst_capacity;
> >
> > +
> > +	/* If dst node has spare capacity, there is no real load imbalance */
> > +	if (env->dst_stats.node_type == node_has_spare)
> > +		return false;
> > +
>
> Not exactly what the load balancer thinks though, the load balancer
> may decide to balance the tasks between NUMA groups even when there is
> capacity. That said, what you did here is compatible with the current code.
>
> I'll want to test this further but in general
>
> Acked-by: Mel Gorman
>
> However, I really would have preferred if numa_classify shared common
> code with group_[is_overloaded|has_capacity] instead of having
> equivalent but different implementations.
>
> >  	/*
> >  	 * The load is corrected for the CPU capacity available on each node.
> >  	 *
> > @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
> >  	dist = env.dist = node_distance(env.src_nid, env.dst_nid);
> >  	taskweight = task_weight(p, env.src_nid, dist);
> >  	groupweight = group_weight(p, env.src_nid, dist);
> > -	update_numa_stats(&env.src_stats, env.src_nid);
> > +	update_numa_stats(&env, &env.src_stats, env.src_nid);
> >  	taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
> >  	groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
> > -	update_numa_stats(&env.dst_stats, env.dst_nid);
> > +	update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >
> >  	/* Try to find a spot on the preferred nid. */
> >  	task_numa_find_cpu(&env, taskimp, groupimp);
> > @@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
> >
> >  			env.dist = dist;
> >  			env.dst_nid = nid;
> > -			update_numa_stats(&env.dst_stats, env.dst_nid);
> > +			update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >  			task_numa_find_cpu(&env, taskimp, groupimp);
> >  		}
> >  	}
> > @@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
> >  	return load;
> >  }
> >
> > +static unsigned long cpu_runnable_load(struct rq *rq)
> > +{
> > +	return cfs_rq_runnable_load_avg(&rq->cfs);
> > +}
> > +
> >  static unsigned long capacity_of(int cpu)
> >  {
> >  	return cpu_rq(cpu)->cpu_capacity;
> > --
> > 2.17.1
> >
>
> --
> Mel Gorman
> SUSE Labs