From: Vincent Guittot
Date: Tue, 7 Jan 2020 17:00:29 +0100
Subject: Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
To: Mel Gorman
Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider, Srikar Dronamraju, Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML
In-Reply-To: <20200107115646.GI3466@techsingularity.net>
References: <20191220084252.GL3178@techsingularity.net> <20200103143051.GA3027@techsingularity.net> <20200106145225.GB3466@techsingularity.net> <20200107095655.GF3466@techsingularity.net> <20200107115646.GI3466@techsingularity.net>
X-Mailing-List: linux-kernel@vger.kernel.org
busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;

On Tue, 7 Jan 2020 at 12:56, Mel Gorman wrote:
>
> On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote:
> > On Tue, 7 Jan 2020 at 10:56, Mel Gorman wrote:
> > >
> > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > > > This looks weird to me because you use imbalance_pct, which is
> > > > > > meaningful only to compare a ratio, to define a number that will
> > > > > > then be compared to a number of tasks without taking into account
> > > > > > the weight of the node. So whatever the node size, 32 or 128 CPUs,
> > > > > > the imbalance_adj will be the same: 3 with the default imbalance_pct
> > > > > > of NUMA level which is 125 AFAICT
> > > > > >
> > > > >
> > > > > The intent in this version was to only cover the low utilisation case
> > > > > regardless of the NUMA node size. There were too many corner cases
> > > > > where the tradeoff of local memory latency versus local memory bandwidth
> > > > > cannot be quantified. See Srikar's report as an example but it's a general
> > > > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > > > running tasks where (imbalance_pct - 100) would be 1 running task and limit
> > > >
> > > > But using imbalance_pct alone doesn't mean anything.
> > >
> > > Other than figuring out "how many running tasks are required before
> > > imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
> >
> > sorry, I don't see how you deduce this from only using imbalance_pct,
> > which is mainly there to say how much percent of difference is
> > significant
> >
>
> Because if the difference is 25% then 1 CPU out of 4 active is enough
> for imbalance_pct to potentially be a factor. Anyway, the approach seems
> almost universally disliked so even if I had reasons for not scaling
> this by the group_weight, no one appears to agree with them :P
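A quick worked example of the ratio Mel refers to above may help; this is
purely illustrative and not code from the patch under discussion (the helper
name is made up, and it assumes imbalance_pct > 100):

	/*
	 * At which number of running tasks does a one-task difference reach
	 * the imbalance_pct threshold?  One busy CPU out of N is roughly a
	 * (100 / N)% difference, so it crosses (imbalance_pct - 100) once
	 * N <= 100 / (imbalance_pct - 100).
	 */
	static unsigned int nr_running_where_pct_matters(unsigned int imbalance_pct)
	{
		return 100 / (imbalance_pct - 100);
	}

With the default SD_NUMA imbalance_pct of 125 this gives 4, matching Mel's
"1 CPU out of 4 active" remark.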
> > > it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> > > the tradeoff of cross-domain migration is without considering the hardware.
> > >
> > > > Using similar to the below
> > > >
> > > > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> > > >
> > > > would be more meaningful
> > > >
> > >
> > > It's meaningful in some sense from a conceptual point of view but
> > > setting the low utilisation cutoff depending on the number of CPUs on
> > > the node does not account for any local memory latency vs bandwidth,
> > > i.e. on a small or mid-range machine the cutoff will make sense. On
> > > larger machines, the cutoff could be at the point where memory bandwidth
> > > is saturated, leading to a scenario whereby upgrading to a larger
> > > machine performs worse than the smaller machine.
> > >
> > > Much more importantly, doing what you suggest allows an imbalance
> > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > > caches. That is going to have side-effects. While I also do not account
> > > for the LLC group_weight, it's unlikely the cut-off I used would be
> > > smaller than an LLC cache on a large machine.
> > >
> > > These two points are why I didn't take the group weight into account.
> > >
> > > Now if you want, I can do what you suggest anyway as long as you are happy
> > > that the child domain weight is also taken into account and to bound the
> >
> > Taking into account the child domain makes sense to me, but shouldn't we
> > take into account the number of child groups instead? This should
> > reflect the number of different LLC caches.
>
> I guess it would but why is it inherently better? The number of domains
> would yield a similar result if we assume that all the lower domains
> have equal weight, because it is simply the weight of the SD_NUMA domain
> divided by the number of child domains.

but that's not what you are doing in your proposal. You are directly using
child->span_weight, which reflects the number of CPUs in the child and not
the number of groups

you should do something like

sds->busiest->span_weight / sds->busiest->child->span_weight

which gives you an approximation of the number of independent groups
inside the busiest numa node from a shared resource pov
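A minimal sketch of the approximation Vincent describes here; the helper and
parameter names are made up, and a real implementation would of course use
the scheduler's own domain/group structures rather than bare integers:

	/*
	 * Approximate the number of independent LLC groups inside the busiest
	 * NUMA node: CPUs spanned by the node divided by CPUs spanned by one
	 * child (LLC-level) domain.
	 */
	static unsigned int approx_llc_groups(unsigned int node_span_weight,
					      unsigned int child_span_weight)
	{
		return node_span_weight / child_span_weight;
	}

For example, a node spanning 64 CPUs whose child domains each span 8 CPUs
would count as 8 independent groups from a shared-resource point of view.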
>
> Now, I could be missing something with asymmetric setups. I don't know
> if it's possible for child domains of a NUMA domain to have different
> sizes. I would be somewhat surprised if they did but I also do not work
> on such machines nor have I ever accessed one (to my knowledge).
>
> > IIUC your reasoning, we want to make sure that tasks will not start to
> > fight for using the same resources, which is the connection between LLC
> > cache and memory in this case
> >
>
> Yep. I don't want a case where the allowed imbalance causes the load
> balancer to have to balance between the lower domains. *Maybe* that is
> actually better in some cases but it's far from intuitive so I would
> prefer that change to be a patch on its own with a big fat comment
> explaining the reasoning behind the additional complexity.
>
> > > largest possible allowed imbalance to deal with the case of a node having
> > > multiple small LLC caches. That means that some machines will be using the
> > > size of the node and some machines will use the size of an LLC. It's less
> > > predictable overall as some machines will be "special" relative to others,
> > > making it harder to reproduce certain problems locally, but it would take
> > > imbalance_pct into account in a way that you're happy with.
> > >
> > > Also bear in mind that whether LLC is accounted for or not, the final
> > > result should be halved similar to the other imbalance calculations to
> > > avoid over or under load balancing.
> > >
> > > > Or you could use the util_avg so you will take into account if the
> > > > tasks are short running ones or long running ones
> > > >
> > >
> > > util_avg can be skewed if there are big outliers. Even then, it's not
> > > a great metric for the low utilisation cutoff. Large numbers of mostly
> > > idle but running tasks would be treated similarly to small numbers of
> > > fully active tasks. It's less predictable and harder to reason about how
> >
> > Yes but this also has the advantage of reflecting more accurately how
> > the system is used.
> > with nr_running, we consider that mostly idle and fully active tasks
> > will have the exact same impact on the memory
> >
>
> Maybe, maybe not. When there is spare capacity in the domain overall and
> we are only interested in the low utilisation case, it seems to me that
> we should consider the most obvious and understandable metric. Now, if we
> were talking about a nearly fully loaded domain or an overloaded domain
> then I would fully agree with you as balancing utilisation in that case
> becomes critical.
>
> > > load balancing behaves across a variety of workloads.
> > >
> > > Based on what you suggest, the result looks like this (build tested
> > > only)
> >
> > I'm going to give this patch a try
> >
>
> Thanks. I've queued the same patch on one machine to see what falls out.
> I don't want to tie up half my test grid until we get some sort of
> consensus.
>
> --
> Mel Gorman
> SUSE Labs
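To make the quantities debated in this thread concrete, here is one way the
pieces could fit together: the imbalance_pct scaling Vincent proposed, the
bounding by LLC size that Mel is worried about, and the halving he mentions.
This is an illustration only, not Mel's patch nor a proposed implementation;
all names and the exact combination are assumptions:

	/*
	 * Sketch: how many extra running tasks could be tolerated on the
	 * busiest NUMA node at low utilisation.  node_weight is the number
	 * of CPUs in the node, llc_weight the number of CPUs sharing one
	 * LLC in that node, imbalance_pct the SD_NUMA imbalance_pct (125
	 * by default).
	 */
	static unsigned int allowed_numa_imbalance(unsigned int node_weight,
						   unsigned int llc_weight,
						   unsigned int imbalance_pct)
	{
		unsigned int cutoff;

		/* scale the cutoff by the size of the busiest node */
		cutoff = node_weight * (imbalance_pct - 100) / 100;

		/* never allow more imbalance than a single LLC spans */
		if (cutoff > llc_weight)
			cutoff = llc_weight;

		/* halve, like the other imbalance calculations */
		return cutoff / 2;
	}

With node_weight = 64, llc_weight = 8 and imbalance_pct = 125 this would
allow a difference of up to 4 running tasks before the normal balancing
rules kick in.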