Date: Thu, 3 Sep 2020 19:10:12 +0200 (CEST)
From: Julia Lawall
To: Valentin Schneider
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
    linux-kernel@vger.kernel.org, Gilles Muller
Subject: Re: SD_LOAD_BALANCE

On Thu, 3 Sep 2020, Valentin Schneider wrote:

> Hi Julia,
>
> On 03/09/20 15:09, Julia Lawall wrote:
> > Uses of SD_LOAD_BALANCE were removed in commit e669ac8ab952 (first
> > released in v5.8), with the comment:
> >
> >     The SD_LOAD_BALANCE flag is set unconditionally for all domains in
> >     sd_init().
> >
> > I have the impression that this was not quite true.  The NUMA domain
> > was not initialized with sd_init, and didn't have the SD_LOAD_BALANCE
> > flag set.
>
> Did you check the contents of
>
>   /proc/sys/kernel/sched_domain/cpu*/domain*/flags
>
> (requires CONFIG_SCHED_DEBUG)?  If the LSB is set, it would mean
> SD_LOAD_BALANCE is set.

No, I didn't check that.  Thanks for the pointer.

>
> The sched_domain construction flow isn't the easiest thing to follow,
> but NUMA domains *have* to go through sd_init().
>
> What happens is we first go through sched_init_numa(), and there we add
> some more topology levels on top of the default ones (or the
> arch-defined ones if using an arch-defined topology hierarchy) by using
> the NUMA distance table.
>
> We then build the actual domains in sched_init_domains(), and that goes
> through a loop that looks like
>
>   for_each_cpu() {
>           for_each_sd_topology() {
>                   build_sched_domain() -> sd_init()
>           }
>   }
>
> where the SD topology loop is going to iterate over the newly-added
> NUMA-specific topology levels.  Since that used to unconditionally set
> SD_LOAD_BALANCE, NUMA domains really ought to have it.
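OK.  To double-check the flags on our test machine, I will probably just
dump those flags files with a quick userspace test along the lines of the
sketch below.  It is only a convenience sketch of my own, assuming the
v5.7/v5.8 behaviour where each flags file contains a single number and
SD_LOAD_BALANCE is bit 0 (the LSB you mention); CONFIG_SCHED_DEBUG is
needed for the files to exist at all.

    /* Sketch: dump each domain's flags and whether bit 0 (SD_LOAD_BALANCE
     * in v5.7/v5.8) is set.  One flags file per (cpu, domain) pair. */
    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
            glob_t g;
            size_t i;

            if (glob("/proc/sys/kernel/sched_domain/cpu*/domain*/flags",
                     0, NULL, &g))
                    return 1;

            for (i = 0; i < g.gl_pathc; i++) {
                    FILE *f = fopen(g.gl_pathv[i], "r");
                    unsigned long flags;

                    if (!f)
                            continue;
                    if (fscanf(f, "%lu", &flags) == 1)
                            printf("%s: %#lx (SD_LOAD_BALANCE %s)\n",
                                   g.gl_pathv[i], flags,
                                   (flags & 0x1UL) ? "set" : "NOT set");
                    fclose(f);
            }
            globfree(&g);
            return 0;
    }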
>
> If that wasn't the case, we would have fired the (now removed) warning
> in sched_domain_debug_one() that would do:
>
>   if (!(sd->flags & SD_LOAD_BALANCE)) {
>           printk("does not load-balance\n");
>           if (sd->parent)
>                   printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
>           return -1;
>   }

OK, thanks for all of this information.  I will study the code and
execution some more, and let you know what I find out.  For reference, I
have put a rough sketch of the v5.7 select_task_rq_fair loop discussed
below at the end of this message.

julia

>
> > The effect is that in v5.8, the for_each_domain loop in
> > select_task_rq_fair can always end up at the global NUMA domain, and
> > thus consider any pair of waking cpu (cpu) and previous cpu of the
> > wakee (prev_cpu) as arguments to wake_affine.  Up to v5.7, this was
> > only possible if cpu and prev_cpu were together in some lower-level
> > domain, i.e., sharing the LLC.  The effect is that in v5.8 wake_affine
> > can often end up choosing as a target a core that does not share the
> > LLC with the core where the thread ran previously.  Threads then move
> > around a lot between the different sockets.
> >
> > Was this intentional?
>
> AFAICT it isn't forbidden for the logic here to peek outside of the
> previous LLC.  The NUMA reclaim distance thing says we allow affine
> wakeups and fork/exec balancing to move a task to a CPU at most
> RECLAIM_DISTANCE away (in NUMA distance values).  However, I don't
> remember any patch changing this between v5.7 and v5.8.
>
> Briefly glancing over the kernel/sched log between v5.7 and v5.8, I
> don't see any obvious culprits.  Did you try to bisect this?  If it
> indeed comes down to the SD_LOAD_BALANCE thing, well, I'll be off eating
> my keyboard.
>
> > The effect can be seen in the traces of the parsec vips benchmark at
> > the following URL:
> >
> >   https://pages.lip6.fr/Julia.Lawall/vips.pdf
> >
> > The first two graphs (complete run and small fragment) are Linux v5.7
> > and the next two are Linux v5.8.  The machine has 160 hardware threads
> > organized in 4 sockets, and the colors are by socket.  In the small
> > fragment for v5.7 (second graph), one can see that a given pid pretty
> > much stays on the same socket, while in the corresponding fragment for
> > v5.8 (fourth graph), the pids move around between the sockets.  The
> > x's mark the unblocks that result in a migration.  A pink x means that
> > the migration stays within the same socket, while a blue x means that
> > the migration is to another socket.  It's not apparent from the
> > graphs, but by adding some tracing, it seems that the new socket is
> > always that of the core that handles the wakeup.
>
> Interesting graphs, thanks for sharing!
>
> > I haven't yet studied the early part of the execution of vips in
> > detail, but I suspect that the same issue causes all of the threads to
> > be initially on the same socket in v5.7, while in v5.8 they are more
> > quickly dispersed to other sockets.
> >
> > My impression from the parsec and the NAS benchmarks is that the v5.8
> > performance is a bit better than v5.7, probably because of getting
> > more threads to more different sockets earlier, but other benchmarks
> > might rely more on locality and might react less well to threads
> > moving around so much in this way.
> >
> > julia
>
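P.S.  For anyone else following the thread, the loop I am referring to
looks roughly like the sketch below in v5.7.  This is trimmed from my
reading of kernel/sched/fair.c, so treat the details as approximate; the
point is only the SD_LOAD_BALANCE check at the top of the loop, which
breaks out at the first domain without the flag (whether such a domain
ever existed in practice is exactly the question above) and which commit
e669ac8ab952 removes in v5.8.

    /* select_task_rq_fair(), v5.7, trimmed */
    for_each_domain(cpu, tmp) {
            /* check removed in v5.8 by commit e669ac8ab952 */
            if (!(tmp->flags & SD_LOAD_BALANCE))
                    break;

            /*
             * If both 'cpu' and 'prev_cpu' are part of this domain,
             * cpu is a valid SD_WAKE_AFFINE candidate.
             */
            if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
                cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
                    if (cpu != prev_cpu)
                            new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

                    sd = NULL; /* Prefer wake_affine over balance flags */
                    break;
            }

            if (tmp->flags & sd_flag)
                    sd = tmp;
            else if (!want_affine)
                    break;
    }

Without that early break, nothing in this walk stops at the LLC level, so
wake_affine can be reached with cpu and prev_cpu on different sockets,
which is what made me look at this commit in the first place.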