From: "Song Bao Hua (Barry Song)"
To: Dietmar Eggemann, Morten Rasmussen, Tim Chen
Subject: RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Mon, 25 Jan 2021 11:12:04 +0000
Message-ID: <94c2e3b176e542afa03bea4aa0da7c9c@hisilicon.com>
References: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
 <737932c9-846a-0a6b-08b8-e2d2d95b67ce@linux.intel.com>
 <20210108151241.GA47324@e123083-lin>

> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
> Sent: Wednesday, January 13, 2021 1:53 AM
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> and add cluster scheduler
>
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -----Original Message-----
> >> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
> >> and add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data, while each cluster
> >>>> has its own local L3 tag. On the other hand, each cluster shares some
> >>>> internal system bus. This means cache is much more affine inside one
> >>>> cluster than across clusters.
> >>>
> >>> There is a similar need for clustering in x86. Some x86 cores could share
> >>> L2 caches in a way that is similar to the cluster in Kunpeng 920 (e.g. on
> >>> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> >>> separate L2, and 24 cores sharing L3). Having a sched domain at the L2
> >>> cluster level helps spread load among the L2 domains. This reduces L2
> >>> cache contention and helps performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches while Barry is after consolidating tasks within the
> >> boundaries of an L3 tag cache. One helps cache utilization, the other
> >> helps communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > We are both actually looking for the same behavior. My patch does the
> > exact same spreading as Tim's patch.
>
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
>
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I put a note for the 1st case: "Case 1. we have two tasks *without*
any relationship running in a system with 2 clusters and 8 cpus", so for
tasks without a wake-up relationship, the current patch only results in
spreading.

Anyway, I will also run Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER to
see what happens. Till now, the benchmark has only covered the case that
shows the benefit of changing the wake-up path. I would also be interested
in figuring out what we gain from the change to load_balance() alone.
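(For reference, a minimal sketch of what the load-balance-only part looks
like: one extra topology level below MC, so that a CLS sched domain is built
per cluster and the regular load balancer spreads tasks across clusters. The
helper names cpu_clustergroup_mask/cpu_cluster_flags, the cluster_sibling
mask and the CONFIG_SCHED_CLUSTER guard are placeholders for this sketch,
not necessarily what the actual patches use.)

static inline const struct cpumask *cpu_clustergroup_mask(int cpu)
{
        /* cpus sharing the same L3 tag / internal bus, e.g. 4 cpus on Kunpeng 920 */
        return &cpu_topology[cpu].cluster_sibling;
}

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_CLUSTER
        /* new level: cpus sharing an L2 (x86) or an L3 tag (Kunpeng 920) */
        { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
#endif
#ifdef CONFIG_SCHED_MC
        { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

With only the table change above, nothing new happens in the wakeup path;
any difference comes purely from load_balance() seeing one more domain
level.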
> It looks like Tim's workload (SPECrate mcf) shows a performance boost
> solely because of the changes the additional MC_L2 SD introduces in load
> balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we have to
> carefully distinguish between packing vs. spreading in wakeup and
> load-balance here.
>
> > Considering the below two cases:
> >
> > Case 1. we have two tasks without any relationship running in a system
> > with 2 clusters and 8 cpus.
> >
> > Without the sched_domain of cluster, these two tasks might be put as below:
> >
> > +-------------------+          +-----------------+
> > | +----+  +----+    |          |                 |
> > | |task|  |task|    |          |                 |
> > | |1   |  |2   |    |          |                 |
> > | +----+  +----+    |          |                 |
> > |                   |          |                 |
> > |     cluster1      |          |    cluster2     |
> > +-------------------+          +-----------------+
> >
> > With the sched_domain of cluster, load balance will spread them as below:
> >
> > +-------------------+          +-----------------+
> > | +----+            |          | +----+          |
> > | |task|            |          | |task|          |
> > | |1   |            |          | |2   |          |
> > | +----+            |          | +----+          |
> > |                   |          |                 |
> > |     cluster1      |          |    cluster2     |
> > +-------------------+          +-----------------+
> >
> > Then task1 and task2 each get more cache and cache contention decreases.
> > They will get better performance.
> >
> > That is what my original patch achieves, and Tim's patch does the same.
> > Once we add a sched_domain, load balance gets involved.
> >
> > Case 2. we have 8 tasks running in a system with 2 clusters and 8 cpus,
> > but they are working in 4 groups:
> >
> > Task1 wakes up task4
> > Task2 wakes up task5
> > Task3 wakes up task6
> > Task7 wakes up task8
> >
> > With my change in select_idle_sibling(), the WAKE_AFFINE mechanism will
> > try to put task1 and 4, task2 and 5, task3 and 6, task7 and 8 in the same
> > clusters rather than scattering them randomly over the 8 cpus. However,
> > the 8 tasks are still spread among the 8 cpus with my change in
> > select_idle_sibling(), as load balance is still working.
> >
> > +---------------------------+    +----------------------+
> > | +----+  +-----+           |    | +----+  +-----+      |
> > | |task|  |task |           |    | |task|  |task |      |
> > | |1   |  |4    |           |    | |2   |  |5    |      |
> > | +----+  +-----+           |    | +----+  +-----+      |
> > |                           |    |                      |
> > |         cluster1          |    |       cluster2       |
> > |                           |    |                      |
> > |                           |    |                      |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > | |task |  |task  |         |    | |task |  |task  |    |
> > | |3    |  |6     |         |    | |7    |  |8     |    |
> > | +-----+  +------+         |    | +-----+  +------+    |
> > +---------------------------+    +----------------------+
>
> Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
> show the benefit of your patch on your specific system (cluster-size =
> 4). IMHO, this extra infrastructure, especially in the wakeup path, should
> show benefits over a range of different benchmarks.
>
> > Let's consider the 3rd case, which is more tricky:
> >
> > task1 and task2 have a close relationship and are a waker-wakee pair.
> > With my current patch, select_idle_sibling() wants to put them in one
> > cluster, while load balance wants to put them in two clusters. Load
> > balance will win. Then maybe we need some mechanism like the one used
> > for adjusting the NUMA imbalance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> > If we permit a light imbalance between clusters, select_idle_sibling()
> > will win, and task1 and task2 get better cache affinity.
>
> This would look weird, to allow this kind of imbalance on CLS (MC_L2) and
> NUMA domains but not on the MC domain for example.

Yes. I guess I actually meant permitting imbalance between the sched_groups
made by the child sched_cluster domains of the parent sched_mc domain:

          sched_mc domain
+----------------------------------+
|  +--------+        +----------+  |
|  |sched_  |        |sched_    |  |
|  |group   |        |group     |  |
|  +--+-----+        +----+-----+  |
|     |   allow small     |        |
|     |   imbalance       |        |
+----------------------------------+
      |                   |
      |                   |
      |                   |
      |                   |
      +                   +
 child domain:       child domain:
 sched_cluster       sched_cluster

For the sched_groups within one sched_cluster domain, we don't allow this
kind of imbalance.

Anyway, I would be happier if this kind of imbalance were only allowed when
we know for sure that two tasks in the cluster have a wake-up relationship
(see the sketch at the end of this mail). Right now, SD_NUMA seems to simply
allow this imbalance without any knowledge of the relationships between the
tasks causing it.

Thanks
Barry
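As a footnote to the imbalance idea above, here is a rough sketch of the
kind of check I have in mind, modeled on the SD_NUMA imbalance handling in
the commit linked earlier. SD_CLUSTER, the use of sd->child and the "two
running tasks" threshold are illustrative assumptions for this sketch, not
code from the patches or from that commit:

/*
 * Sketch only: tolerate a small imbalance between the sched_groups of a
 * parent domain when those groups are built from a cluster child domain,
 * so that wake-affine packing inside an L3-tag cluster is not immediately
 * undone by the periodic load balancer.
 */
static inline long adjust_cluster_imbalance(struct sched_domain *sd,
                                            long imbalance,
                                            unsigned int busiest_nr_running)
{
        /* Only relax balancing across groups made by a cluster child domain. */
        if (!sd->child || !(sd->child->flags & SD_CLUSTER))
                return imbalance;

        /*
         * With only a couple of runnable tasks per cluster, ignore the
         * small imbalance and leave the waker/wakee pair where wakeup
         * placement put them.
         */
        if (busiest_nr_running <= 2)
                return 0;

        return imbalance;
}

Ideally this would also check that the tasks involved actually have a
wake-up relationship, which is the part still missing from the SD_NUMA
variant.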