From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Thu, 13 Feb 2020 09:05:35 +0100
Subject: Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg
To: Mel Gorman
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, linux-kernel, Phil Auld, Parth Shah,
    Valentin Schneider
In-Reply-To: <20200212194903.GS3466@techsingularity.net>
References: <20200211174651.10330-1-vincent.guittot@linaro.org>
    <20200211174651.10330-3-vincent.guittot@linaro.org>
    <20200212133715.GU3420@suse.de>
    <20200212194903.GS3466@techsingularity.net>

On Wed, 12 Feb 2020 at 20:49, Mel Gorman wrote:
>
> On Wed, Feb 12, 2020 at 01:37:15PM +0000, Mel Gorman wrote:
> > >  static void task_numa_assign(struct task_numa_env *env,
> > >                               struct task_struct *p, long imp)
> > >  {
> > > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> > >  	long orig_src_load, orig_dst_load;
> > >  	long src_capacity, dst_capacity;
> > >
> > > +
> > > +	/* If dst node has spare capacity, there is no real load imbalance */
> > > +	if (env->dst_stats.node_type == node_has_spare)
> > > +		return false;
> > > +
> >
> > Not exactly what the load balancer thinks, though: the load balancer
> > may decide to balance the tasks between NUMA groups even when there is
> > capacity. That said, what you did here is compatible with the current code.
> >
> > I'll want to test this further, but in general:
> >
> > Acked-by: Mel Gorman
> >
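(For reference, here is roughly where that early return sits; this is a
simplified sketch of load_too_imbalanced() in kernel/sched/fair.c, not
the exact source, and field names may differ between kernel versions.
The point is that the bypass skips the capacity-scaled imbalance
comparison entirely:)

static bool load_too_imbalanced(long src_load, long dst_load,
				struct task_numa_env *env)
{
	long imb, old_imb;
	long orig_src_load, orig_dst_load;
	long src_capacity, dst_capacity;

	/* If dst node has spare capacity, there is no real load imbalance */
	if (env->dst_stats.node_type == node_has_spare)
		return false;

	/* The load is corrected for the CPU capacity available on each node */
	src_capacity = env->src_stats.compute_capacity;
	dst_capacity = env->dst_stats.compute_capacity;

	/* Capacity-scaled imbalance after the proposed task move */
	imb = abs(dst_load * src_capacity - src_load * dst_capacity);

	orig_src_load = env->src_stats.load;
	orig_dst_load = env->dst_stats.load;

	/* Capacity-scaled imbalance before the proposed task move */
	old_imb = abs(orig_dst_load * src_capacity -
		      orig_src_load * dst_capacity);

	/* Would this change make things worse? */
	return (imb > old_imb);
}

(So once the destination node is classified node_has_spare, a move is
never considered too imbalanced, no matter how uneven it would leave
the two nodes.)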
> And I was wrong. I worried that the conflict with the load balancer would
> be a problem when I wrote this comment. That was based on finding that
> I had to account for the load balancer when using capacity to decide
> whether an idle CPU can be used. When I didn't, performance went to
> hell, but I thought that maybe you had somehow avoided the same problem.
> Unfortunately, testing indicates that the same, or a similar, problem is
> here.
>
> This is specjbb running two JVMs on a 2-socket Haswell machine with 48
> cores in total. This is just your first two patches; I have found that
> the group classify patches do not recover it.

Thanks for the test results. Unfortunately, I don't have specjbb to
analyse the metrics and study what is going wrong.

> Hmean     tput-1      39748.93 (   0.00%)    39568.77 (  -0.45%)
> Hmean     tput-2      88648.59 (   0.00%)    85302.00 (  -3.78%)
> Hmean     tput-3     136285.01 (   0.00%)   131280.85 (  -3.67%)
> Hmean     tput-4     181312.69 (   0.00%)   179903.17 (  -0.78%)
> Hmean     tput-5     228725.85 (   0.00%)   223499.65 (  -2.28%)
> Hmean     tput-6     273246.83 (   0.00%)   268984.52 (  -1.56%)
> Hmean     tput-7     317708.89 (   0.00%)   314593.10 (  -0.98%)
> Hmean     tput-8     362378.08 (   0.00%)   337205.87 *  -6.95%*
> Hmean     tput-9     403792.00 (   0.00%)   349014.39 * -13.57%*
> Hmean     tput-10    446000.88 (   0.00%)   379242.88 * -14.97%*
> Hmean     tput-11    486188.58 (   0.00%)   341951.27 * -29.67%*
> Hmean     tput-12    522288.84 (   0.00%)   388877.07 * -25.54%*
> Hmean     tput-13    532394.06 (   0.00%)   437005.34 * -17.92%*
> Hmean     tput-14    539440.66 (   0.00%)   467479.65 * -13.34%*
> Hmean     tput-15    541843.50 (   0.00%)   455477.03 ( -15.94%)
> Hmean     tput-16    546510.71 (   0.00%)   483599.58 ( -11.51%)
> Hmean     tput-17    544501.21 (   0.00%)   506189.64 (  -7.04%)
> Hmean     tput-18    544802.98 (   0.00%)   535869.05 (  -1.64%)
> Hmean     tput-19    545265.27 (   0.00%)   533050.57 (  -2.24%)
> Hmean     tput-20    543284.33 (   0.00%)   528100.03 (  -2.79%)
> Hmean     tput-21    543375.11 (   0.00%)   528692.97 (  -2.70%)
> Hmean     tput-22    542536.60 (   0.00%)   527160.27 (  -2.83%)
> Hmean     tput-23    536402.28 (   0.00%)   521023.49 (  -2.87%)
> Hmean     tput-24    532307.76 (   0.00%)   519208.85 *  -2.46%*
> Stddev    tput-1       1426.23 (   0.00%)      464.57 (  67.43%)
> Stddev    tput-2       4437.10 (   0.00%)     1489.17 (  66.44%)
> Stddev    tput-3       3021.47 (   0.00%)      750.95 (  75.15%)
> Stddev    tput-4       4098.39 (   0.00%)     1678.67 (  59.04%)
> Stddev    tput-5       3524.22 (   0.00%)     1025.30 (  70.91%)
> Stddev    tput-6       3237.13 (   0.00%)     2198.39 (  32.09%)
> Stddev    tput-7       2534.27 (   0.00%)     3261.18 ( -28.68%)
> Stddev    tput-8       3847.37 (   0.00%)     4355.78 ( -13.21%)
> Stddev    tput-9       5278.55 (   0.00%)     4145.06 (  21.47%)
> Stddev    tput-10      5311.08 (   0.00%)    10274.26 ( -93.45%)
> Stddev    tput-11      7537.76 (   0.00%)    16882.17 (-123.97%)
> Stddev    tput-12      5023.29 (   0.00%)    12735.70 (-153.53%)
> Stddev    tput-13      3852.32 (   0.00%)    15385.23 (-299.38%)
> Stddev    tput-14     11859.59 (   0.00%)    24273.56 (-104.67%)
> Stddev    tput-15     16298.10 (   0.00%)    45409.69 (-178.62%)
> Stddev    tput-16      9041.77 (   0.00%)    58839.77 (-550.75%)
> Stddev    tput-17      9322.50 (   0.00%)    77205.45 (-728.16%)
> Stddev    tput-18     16040.01 (   0.00%)    15021.07 (   6.35%)
> Stddev    tput-19      8814.09 (   0.00%)    27509.99 (-212.11%)
> Stddev    tput-20      7812.82 (   0.00%)    31578.68 (-304.19%)
> Stddev    tput-21      6584.58 (   0.00%)     3639.48 (  44.73%)
> Stddev    tput-22      8294.36 (   0.00%)     3033.49 (  63.43%)
> Stddev    tput-23      6887.93 (   0.00%)     5450.38 (  20.87%)
> Stddev    tput-24      6081.83 (   0.00%)      390.32 (  93.58%)
>
> Note how, as the node approaches the point of being almost fully
> utilised (near tput-12), performance drops. Graphing mpstat on a
> per-node basis shows there is imbalance in the CPU utilisation
> between nodes.

OK. So the node_has_spare condition that I added in load_too_imbalanced()
is probably too aggressive, although the goal was to let tasks move to
their preferred node.

> NUMA locality is not bad, and scanning and migrations are a bit higher,
> but overall this is a problem.
>
> On a 4-socket machine, I see slightly different results -- there is still
> a big drop, although not a statistically significant one. However, the
> per-node CPU utilisation is skewed as nodes approach being fully utilised.
>
> I see similarly bad results on a 2-socket Broadwell machine.
>
> I don't have other results yet for other workloads, but I'm not optimistic
> that I'll see anything different. I still think that this probably didn't
> show up with hackbench because it's too short-lived for NUMA balancing
> to be a factor.
>
> --
> Mel Gorman
> SUSE Labs
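One possible way to make the condition less aggressive (an untested
sketch only, to illustrate the idea; it relies on the env->p and
env->dst_nid fields of struct task_numa_env) would be to bypass the
imbalance check only when the task is actually moving to its preferred
node, rather than for any destination with spare capacity:

	/*
	 * Untested sketch: skip the imbalance check only when the task
	 * is moving to its preferred node and that node has spare
	 * capacity, instead of for any node with spare capacity.
	 */
	if (env->dst_stats.node_type == node_has_spare &&
	    env->dst_nid == env->p->numa_preferred_nid)
		return false;

That would keep the intended behaviour of letting tasks reach their
preferred node, while still letting the normal imbalance check reject
other moves.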