From: "Song Bao Hua (Barry Song)"
To: Dietmar Eggemann, Morten Rasmussen, Tim Chen
Subject: RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Mon, 25 Jan 2021 11:12:04 +0000
Message-ID: <94c2e3b176e542afa03bea4aa0da7c9c@hisilicon.com>
References: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
 <737932c9-846a-0a6b-08b8-e2d2d95b67ce@linux.intel.com>
 <20210108151241.GA47324@e123083-lin>

> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
> Sent: Wednesday, January 13, 2021 1:53 AM
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> and add cluster scheduler
>
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -----Original Message-----
> >> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> >> and add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data, while each cluster
> >>>> has its own local L3 tag. On the other hand, each cluster shares some
> >>>> internal system bus. This means cache is much more affine inside one
> >>>> cluster than across clusters.
> >>>
> >>> There is a similar need for clustering in x86. Some x86 cores could share
> >>> L2 caches in a way that is similar to the cluster in Kunpeng 920 (e.g. on
> >>> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> >>> separate L2, and 24 cores sharing L3). Having a sched domain at the L2
> >>> cluster level helps spread load among the L2 domains. This reduces L2
> >>> cache contention and helps performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches while Barry is after consolidating tasks within the
> >> boundaries of an L3 tag cache. One helps cache utilization, the other
> >> helps communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > We are both actually looking for the same behavior. My patch does the
> > exact same spreading as Tim's patch.
>
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
>
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I put a note for the 1st case: "Case 1. we have two tasks *without*
any relationship running in a system with 2 clusters and 8 cpus", so for
tasks without a wake-up relationship, the current patch only results in
spreading.

Anyway, I will also run Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER to
see what happens. Till now, the benchmark has only covered the case that
shows the benefit of changing the wake-up path. I would also be interested
in figuring out what we gain from the change to load_balance() alone.
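(For reference, a minimal sketch of what the load-balance-only part looks
like: one extra topology level below MC, so that a CLS sched domain is built
per cluster and the regular load balancer spreads tasks across clusters. The
helper names cpu_clustergroup_mask/cpu_cluster_flags, the cluster_sibling
mask and the CONFIG_SCHED_CLUSTER guard are placeholders for this sketch,
not necessarily what the actual patches use.)

static inline const struct cpumask *cpu_clustergroup_mask(int cpu)
{
        /* cpus sharing the same L3 tag / internal bus, e.g. 4 cpus on Kunpeng 920 */
        return &cpu_topology[cpu].cluster_sibling;
}

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
        /* new level: cpus sharing an L2 (x86) or an L3 tag (Kunpeng 920) */
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

With only the table change above, nothing new happens in the wakeup path;
any difference comes purely from load_balance() seeing one more domain
level.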
> It looks like Tim's workload (SPECrate mcf) shows a performance boost
> solely because of the changes the additional MC_L2 SD introduces in load
> balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we have to
> carefully distinguish between packing vs. spreading in wakeup and
> load-balance here.
>
> > Considering the below two cases:
> >
> > Case 1. we have two tasks without any relationship running in a system
> > with 2 clusters and 8 cpus.
> >
> > Without the sched_domain of cluster, these two tasks might be put as below:
> >
> > +-------------------+          +-----------------+
> > | +----+  +----+    |          |                 |
> > | |task|  |task|    |          |                 |
> > | |1   |  |2   |    |          |                 |
> > | +----+  +----+    |          |                 |
> > |                   |          |                 |
> > |     cluster1      |          |    cluster2     |
> > +-------------------+          +-----------------+
> >
> > With the sched_domain of cluster, load balance will spread them as below:
> >
> > +-------------------+          +-----------------+
> > | +----+            |          | +----+          |
> > | |task|            |          | |task|          |
> > | |1   |            |          | |2   |          |
> > | +----+            |          | +----+          |
> > |                   |          |                 |
> > |     cluster1      |          |    cluster2     |
> > +-------------------+          +-----------------+
> >
> > Then task1 and task2 each get more cache and cache contention decreases.
> > They will get better performance.
> >
> > That is what my original patch achieves, and Tim's patch does the same.
> > Once we add a sched_domain, load balance gets involved.
> >
> > Case 2. we have 8 tasks running in a system with 2 clusters and 8 cpus,
> > but they are working in 4 groups:
> >
> > Task1 wakes up task4
> > Task2 wakes up task5
> > Task3 wakes up task6
> > Task7 wakes up task8
> >
> > With my change in select_idle_sibling(), the WAKE_AFFINE mechanism will
> > try to put task1 and 4, task2 and 5, task3 and 6, task7 and 8 in the same
> > clusters rather than scattering them randomly over the 8 cpus. However,
> > the 8 tasks are still spread among the 8 cpus with my change in
> > select_idle_sibling(), as load balance is still working.
> >
> > +---------------------------+    +----------------------+
> > | +----+  +-----+           |    | +----+  +-----+      |
> > | |task|  |task |           |    | |task|  |task |      |
> > | |1   |  |4    |           |    | |2   |  |5    |      |
> > | +----+  +-----+           |    | +----+  +-----+      |
> > |                           |    |                      |
> > |         cluster1          |    |       cluster2       |
> > |                           |    |                      |
> > |                           |    |                      |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > | |task |  |task  |         |    | |task |  |task  |    |
> > | |3    |  |6     |         |    | |7    |  |8     |    |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > +---------------------------+    +----------------------+
>
> Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
> show the benefit of your patch on your specific system (cluster-size =
> 4). IMHO, this extra infrastructure, especially in the wakeup path, should
> show benefits over a range of different benchmarks.
>
> > Let's consider the 3rd case, which is more tricky:
> >
> > task1 and task2 have a close relationship and are a waker-wakee pair.
> > With my current patch, select_idle_sibling() wants to put them in one
> > cluster, while load balance wants to put them in two clusters. Load
> > balance will win. Then maybe we need some mechanism like the one used
> > for adjusting the NUMA imbalance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> > If we permit a light imbalance between clusters, select_idle_sibling()
> > will win, and task1 and task2 get better cache affinity.
>
> This would look weird, to allow this kind of imbalance on CLS (MC_L2) and
> NUMA domains but not on the MC domain for example.

Yes. I guess I actually meant permitting imbalance between the sched_groups
made by the child sched_cluster domains of the parent sched_mc domain:

          sched_mc domain
+----------------------------------+
|  +--------+        +----------+  |
|  |sched_  |        |sched_    |  |
|  |group   |        |group     |  |
|  +--+-----+        +----+-----+  |
|     |   allow small     |        |
|     |   imbalance       |        |
+----------------------------------+
      |                   |
      |                   |
      |                   |
      |                   |
      +                   +
 child domain:       child domain:
 sched_cluster       sched_cluster

For the sched_groups within one sched_cluster domain, we don't allow this
kind of imbalance.

Anyway, I would be happier if this kind of imbalance were only allowed when
we know for sure that two tasks in the cluster have a wake-up relationship
(see the sketch at the end of this mail). Right now, SD_NUMA seems to simply
allow this imbalance without any knowledge of the relationships between the
tasks causing it.

Thanks
Barry
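As a footnote to the imbalance idea above, here is a rough sketch of the
kind of check I have in mind, modeled on the SD_NUMA imbalance handling in
the commit linked earlier. SD_CLUSTER, the use of sd->child and the "two
running tasks" threshold are illustrative assumptions for this sketch, not
code from the patches or from that commit:

/*
 * Sketch only: tolerate a small imbalance between the sched_groups of a
 * parent domain when those groups are built from a cluster child domain,
 * so that wake-affine packing inside an L3-tag cluster is not immediately
 * undone by the periodic load balancer.
 */
static inline long adjust_cluster_imbalance(struct sched_domain *sd,
                                            long imbalance,
                                            unsigned int busiest_nr_running)
{
        /* Only relax balancing across groups made by a cluster child domain. */
        if (!sd->child || !(sd->child->flags & SD_CLUSTER))
                return imbalance;

        /*
         * With only a couple of runnable tasks per cluster, ignore the
         * small imbalance and leave the waker/wakee pair where wakeup
         * placement put them.
         */
        if (busiest_nr_running <= 2)
                return 0;

        return imbalance;
}

Ideally this would also check that the tasks involved actually have a
wake-up relationship, which is the part still missing from the SD_NUMA
variant.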