Linux-ACPI Archive on lore.kernel.org
From: "Song Bao Hua (Barry Song)" <song.bao.hua@hisilicon.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Tim Chen <tim.c.chen@linux.intel.com>
Cc: "valentin.schneider@arm.com" <valentin.schneider@arm.com>,
	"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	"will@kernel.org" <will@kernel.org>,
	"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"juri.lelli@redhat.com" <juri.lelli@redhat.com>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"bsegall@google.com" <bsegall@google.com>,
	"mgorman@suse.de" <mgorman@suse.de>,
	"mark.rutland@arm.com" <mark.rutland@arm.com>,
	"sudeep.holla@arm.com" <sudeep.holla@arm.com>,
	"aubrey.li@linux.intel.com" <aubrey.li@linux.intel.com>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linuxarm@openeuler.org" <linuxarm@openeuler.org>,
	"xuwei (O)" <xuwei5@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	"tiantao (H)" <tiantao6@hisilicon.com>
Subject: RE: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Mon, 25 Jan 2021 11:12:04 +0000
Message-ID: <94c2e3b176e542afa03bea4aa0da7c9c@hisilicon.com>
In-Reply-To: <a5dfcbf6-84f4-0c72-3a88-62926f1f351d@arm.com>



> -----Original Message-----
> From: Dietmar Eggemann [mailto:dietmar.eggemann@arm.com]
> Sent: Wednesday, January 13, 2021 1:53 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>; Morten Rasmussen
> <morten.rasmussen@arm.com>; Tim Chen <tim.c.chen@linux.intel.com>
> Cc: valentin.schneider@arm.com; catalin.marinas@arm.com; will@kernel.org;
> rjw@rjwysocki.net; vincent.guittot@linaro.org; lenb@kernel.org;
> gregkh@linuxfoundation.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
> mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
> rostedt@goodmis.org; bsegall@google.com; mgorman@suse.de;
> mark.rutland@arm.com; sudeep.holla@arm.com; aubrey.li@linux.intel.com;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org;
> linux-acpi@vger.kernel.org; linuxarm@openeuler.org; xuwei (O)
> <xuwei5@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; tiantao (H)
> <tiantao6@hisilicon.com>
> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> add cluster scheduler
> 
> On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote:
> >
> >> -----Original Message-----
> >> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com]
> >> Sent: Saturday, January 9, 2021 4:13 AM
> >> To: Tim Chen <tim.c.chen@linux.intel.com>
> >> Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com>;
> >> valentin.schneider@arm.com; catalin.marinas@arm.com; will@kernel.org;
> >> rjw@rjwysocki.net; vincent.guittot@linaro.org; lenb@kernel.org;
> >> gregkh@linuxfoundation.org; Jonathan Cameron <jonathan.cameron@huawei.com>;
> >> mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com;
> >> dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com;
> >> mgorman@suse.de; mark.rutland@arm.com; sudeep.holla@arm.com;
> >> aubrey.li@linux.intel.com; linux-arm-kernel@lists.infradead.org;
> >> linux-kernel@vger.kernel.org; linux-acpi@vger.kernel.org;
> >> linuxarm@openeuler.org; xuwei (O) <xuwei5@huawei.com>; Zengtao (B)
> >> <prime.zeng@hisilicon.com>; tiantao (H) <tiantao6@hisilicon.com>
> >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and
> >> add cluster scheduler
> >>
> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >>> On 1/6/21 12:30 AM, Barry Song wrote:
> >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>>> has local L3 tag. On the other hand, each cluster will share some
> >>>> internal system bus. This means cache is much more affine inside one cluster
> >>>> than across clusters.
> >>>
> >>> There is a similar need for clustering in x86.  Some x86 cores could share
> >>> L2 caches, which is similar to the cluster in Kunpeng 920 (e.g. on
> >>> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> >>> separate L2, and 24 cores sharing L3).
> >>> Having a sched domain at the L2 cluster helps spread load among
> >>> L2 domains.  This will reduce L2 cache contention and help with
> >>> performance for low to moderate load scenarios.
> >>
> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> >> between L2 caches while Barry is after consolidating tasks within the
> >> boundaries of a L3 tag cache. One helps cache utilization, the other
> >> communication latency between tasks. Am I missing something?
> >
> > Morten, this is not true.
> >
> > We are both actually looking for the same behavior. My patch does
> > exactly the same spreading as Tim's patch.
> 
> That's the case for the load-balance path because of the extra Sched
> Domain (SD) (CLS/MC_L2) below MC.
> 
> But in wakeup you add code which leads to a different packing strategy.

Yes, but I added a note to the first case:
"Case 1. we have two tasks *without* any relationship running in a system
with 2 clusters and 8 cpus"

So for tasks without a wake-up relationship, the current patch only
results in spreading.
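
To make Case 1 concrete, here is a toy Python model (not kernel code) of
what load balancing between cluster sched_groups does to unrelated tasks.
The function name balance_clusters and the plain task-count representation
are assumptions made up for this illustration; the real load_balance()
works on load and utilization statistics rather than on task counts.

```python
def balance_clusters(load):
    """Toy stand-in for load balancing between cluster sched_groups.

    `load` maps a (hypothetical) cluster name to the number of runnable
    tasks in that cluster. Tasks are moved one at a time from the busiest
    cluster to the idlest one until no two clusters differ by more than
    one task.
    """
    load = dict(load)  # work on a copy, leave the caller's mapping alone
    while True:
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        if load[busiest] - load[idlest] < 2:
            return load
        # Migrate one task from the busiest to the idlest cluster.
        load[busiest] -= 1
        load[idlest] += 1
```

Starting from {"cluster1": 2, "cluster2": 0}, the model ends with one task
in each cluster, which is the spreading shown in the second diagram of
Case 1.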

Anyway, I will also run Tim's benchmark on Kunpeng 920 with SCHED_CLUSTER
to see what happens. So far, my benchmarks have only covered the benefit
of changing the wake-up path. I would also like to find out what we gain
from the change to load_balance().

> 
> It looks like Tim's workload (SPECrate mcf) shows a performance
> boost solely because of the changes the additional MC_L2 SD introduces
> in load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we
> have to carefully distinguish between packing vs. spreading in wakeup
> and load-balance here.
> 
> > Considering the below two cases:
> > Case 1. we have two tasks without any relationship running in a system with
> > 2 clusters and 8 cpus.
> >
> > Without the sched_domain of cluster, these two tasks might be put as below:
> > +-------------------+            +-----------------+
> > | +----+   +----+   |            |                 |
> > | |task|   |task|   |            |                 |
> > | |1   |   |2   |   |            |                 |
> > | +----+   +----+   |            |                 |
> > |                   |            |                 |
> > |       cluster1    |            |     cluster2    |
> > +-------------------+            +-----------------+
> > With the sched_domain of cluster, load balance will spread them as below:
> > +-------------------+            +-----------------+
> > | +----+            |            | +----+          |
> > | |task|            |            | |task|          |
> > | |1   |            |            | |2   |          |
> > | +----+            |            | +----+          |
> > |                   |            |                 |
> > |       cluster1    |            |     cluster2    |
> > +-------------------+            +-----------------+
> >
> > Then task1 and task2 get more cache and less cache contention.
> > They will get better performance.
> >
> > That is what my original patch does, and what Tim's patch does
> > too. Once we add a sched_domain, load balance will get
> > involved.
> >
> >
> > Case 2. we have 8 tasks, running in a system with 2 clusters and 8 cpus.
> > But they are working in 4 groups:
> > Task1 wakes up task4
> > Task2 wakes up task5
> > Task3 wakes up task6
> > Task4 wakes up task7
> >
> > With my change in select_idle_sibling, the WAKE_AFFINE mechanism will
> > try to put task1 and 4, task2 and 5, task3 and 6, task4 and 7 in the same
> > cluster rather than putting each of them on a random one of the 8 cpus.
> > However, the 8 tasks are still spread among the 8 cpus with my change in
> > select_idle_sibling, as load balance is still working.
> >
> > +---------------------------+    +----------------------+
> > | +----+        +-----+     |    | +----+      +-----+  |
> > | |task|        |task |     |    | |task|      |task |  |
> > | |1   |        | 4   |     |    | |2   |      |5    |  |
> > | +----+        +-----+     |    | +----+      +-----+  |
> > |                           |    |                      |
> > |       cluster1            |    |     cluster2         |
> > |                           |    |                      |
> > |                           |    |                      |
> > | +-----+       +------+    |    | +-----+     +------+ |
> > | |task |       | task |    |    | |task |     |task  | |
> > | |3    |       |  6   |    |    | |4    |     |8     | |
> > | +-----+       +------+    |    | +-----+     +------+ |
> > +---------------------------+    +----------------------+
> 
> Your use-case (#tasks, runtime/period) seems to be perfectly crafted to
> show the benefit of your patch on your specific system (cluster-size =
> 4). IMHO, this extra infrastructure especially in the wakeup path should
> show benefits over a range of different benchmarks.
> 
> > Let's consider the 3rd case, that one would be more tricky:
> >
> > task1 and task2 have a close relationship and they are a waker-wakee pair.
> > With my current patch, select_idle_sibling() wants to put them in one
> > cluster, while load balance wants to put them in two clusters. Load balance
> > will win.
> > Then maybe we need a mechanism similar to the one adjusting NUMA imbalance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> > If we permit a light imbalance between clusters, select_idle_sibling()
> > will win, and task1 and task2 get better cache affinity.
> 
> This would look weird to allow this kind of imbalance on CLS (MC_L2) and
> NUMA domains but not on the MC domain for example.

Yes. What I actually meant was permitting imbalance between the
sched_groups that the parent sched_mc domain builds from its child
sched_cluster domains.

sched_mc domain

+----------------------------------+
|   +--------+     +----------+    |
|   |sched_  |     |sched_    |    |
|   |group   |     |group     |    |
|   +--+-----+     +----+-----+    |
|      |  allow small   |          |
|      |  imbalance     |          |
+----------------------------------+
       |                |
       |                |
       |                |
       |                |
       |                |
       +                +
   child domain:     child domain:
   sched_cluster     sched_cluster

For the sched_groups within one sched_cluster domain, we don't allow this
kind of imbalance.

Anyway, I would be happier if this kind of imbalance were only
allowed when we know for certain that the two tasks in the cluster
have a wake-up relationship. Right now, SD_NUMA seems to simply
allow this imbalance without knowing anything about the
relationship between the tasks that cause it.
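
The policy I have in mind could be sketched roughly as below. This is a
toy Python model, not a patch: the function name tolerate_imbalance, the
string domain labels, the `allowed` threshold and its default of 2 are all
hypothetical, and how the scheduler would actually learn that two tasks
are a waker-wakee pair is left open.

```python
def tolerate_imbalance(domain, imbalance, tasks_wake_related, allowed=2):
    """Decide whether load balance should leave an imbalance alone.

    domain             -- "mc" when the groups being compared are the
                          clusters inside one sched_mc domain, "cluster"
                          when they are groups inside one sched_cluster
                          domain (labels invented for this sketch)
    imbalance          -- difference in runnable tasks between the groups
    tasks_wake_related -- True if the tasks causing the imbalance are
                          known to be a waker-wakee pair
    allowed            -- hypothetical upper bound on tolerated imbalance
    """
    # Inside one cluster, never tolerate imbalance.
    if domain != "mc":
        return False
    # Between clusters, tolerate it only for tasks that wake each other.
    if not tasks_wake_related:
        return False
    return imbalance <= allowed
```

With this rule, the waker-wakee pair of the third case (two tasks in one
cluster, none in the other) is left where select_idle_sibling() put it,
while two unrelated tasks are still spread across clusters as in Case 1.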

Thanks
Barry
