From: "Song Bao Hua (Barry Song)"
To: Valentin Schneider
CC: "linux-kernel@vger.kernel.org", Ingo Molnar, Peter Zijlstra,
    "vincent.guittot@linaro.org", "dietmar.eggemann@arm.com",
    "morten.rasmussen@arm.com", Linuxarm
Subject: RE: [RFC] sched/topology: NUMA topology limitations
Date: Fri, 4 Sep 2020 02:02:25 +0000
Message-ID: <655f90166c2946bc94ec389645694aa7@hisilicon.com>

> -----Original Message-----
> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> Sent: Tuesday, September 1, 2020 9:41 PM
> To: Song Bao Hua (Barry Song)
> Cc: linux-kernel@vger.kernel.org; Ingo Molnar; Peter Zijlstra;
> vincent.guittot@linaro.org; dietmar.eggemann@arm.com;
> morten.rasmussen@arm.com; Linuxarm
> Subject: Re: [RFC] sched/topology: NUMA topology limitations
>
>
> On 31/08/20 11:45, Barry Song wrote:
> >> From: Valentin Schneider [mailto:valentin.schneider@arm.com]
> >>
> >> Ignoring corner cases where task affinity gets in the way, load balance
> >> will always pull tasks to the local CPU (i.e. the CPU whose sched_domain
> >> we are working on).
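
As a side note for anyone following along, the flow described here and
below lives in kernel/sched/fair.c. Roughly, it looks like the sketch
below; this is heavily simplified, with declarations, the lb_env setup
and all error handling elided, so it is not the literal kernel code:

	static int load_balance(int this_cpu, struct rq *this_rq,
				struct sched_domain *sd, ...)
	{
		/* We may only balance among CPUs in the domain's span */
		cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

		/*
		 * Group stats are built over each sched_group's own span,
		 * which may reach outside sched_domain_span(sd).
		 */
		group = find_busiest_group(&env);

		/* ...but the rq we pull from is picked from the
		 * restricted mask above. */
		busiest = find_busiest_queue(&env, group);

		/* Tasks move from @busiest to @this_cpu: always a pull
		 * towards the local CPU. */
		detach_tasks(&env);
		attach_tasks(&env);
	}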
> >>
> >> If we're balancing load for CPU0-domain1, we would be looking at which
> >> CPUs in [0-2] (i.e. the domain's span) we could (if we should) pull
> >> tasks from to migrate them over to CPU0.
> >>
> >> We'll first try to figure out which sched_group has the most load (see
> >> find_busiest_group() & friends), and that's where we may hit issues.
> >>
> >> Consider a scenario where CPU3 is noticeably busier than the other
> >> CPUs. We'll end up marking CPU0-domain1-group2 (1-3) as the busiest
> >> group, and compute an imbalance (i.e. amount of load to pull) mostly
> >> based on the status of CPU3.
> >>
> >> We'll then go to find_busiest_queue(); the mask of CPUs we iterate over
> >> is restricted by the sched_domain_span (i.e. doesn't include CPU3
> >> here), so we'll pull things from either CPU1 or CPU2 based on stats we
> >> built looking at CPU3, which is bound to be pretty bogus.
> >>
> >> To summarise: we won't pull from the "outsider" node(s) (i.e., nodes
> >> included in the sched_groups but not covered by the sched_domain), but
> >> they will influence the stats and heuristics of the load balance.
> >
> > Hi Valentin,
> > Thanks for your clarification. In many scenarios, to achieve good
> > performance, people pin processes to a NUMA node. The priority for
> > pinning would be the local node first, then domain0 with one hop.
> > Domain1 with two hops is actually too far, and domain2 with three hops
> > would be a disaster. If cpu0 pulls a task from cpu2, but the memory is
> > still on CPU2's node, three hops would be a big problem for memory
> > access and page migration.
> >
>
> Did you mean CPU3 here?

Yep. I meant cpu3 here.

> > However, for automatic NUMA balancing, I would agree we need to fix the
> > group layout so that groups stay within the span of the sched_domain.
> > Otherwise, the scheduler is working with incorrect information when it
> > looks for the right cpu to pull tasks from.
> >
> > In case we have:
> > 0 tasks on cpu0
> > 1 task on cpu1
> > 1 task on cpu2
> > 4 tasks on cpu3
> >
> > In sched_domain1, cpu1+cpu3 are busy, so cpu0 would try to pull a task
> > from cpu2 of the group (1-3) because cpu3 is busy, even though cpu3 is
> > an outsider.
> >
>
> Right, we'd pull from either CPU1 or CPU2 (in this case via a tentative
> active load balance) because they are in the same group as CPU3, which
> inflates the sched_group load stats, but we can't pull from it at this
> domain because it's not included in the domain span.
>
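To make that concrete with the task counts above, here is a
back-of-envelope walk-through, assuming each task contributes the
default NICE_0 load of 1024 and ignoring all other load scaling:

	/*
	 * CPU0-domain1 spans {0,1,2}; its groups are {0} and {1,2,3}.
	 *
	 * Group stats sum load over each group's own span:
	 *   group {1,2,3}: 1024 + 1024 + 4*1024 = 6144, inflated by cpu3
	 *
	 * find_busiest_queue() then only scans
	 *   sched_group_span(group) & sched_domain_span(sd) = {1,2}
	 * where each rq runs a single task, so neither one actually holds
	 * the load the imbalance was computed from; we may still end up
	 * actively balancing a task away from cpu1/cpu2 based on load
	 * that really sits on cpu3, which this domain cannot touch.
	 */

Thanks
Barry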