From: Barry Song <song.bao.hua@hisilicon.com>
To: <tim.c.chen@linux.intel.com>, <catalin.marinas@arm.com>, <will@kernel.org>, <rjw@rjwysocki.net>, <vincent.guittot@linaro.org>, <bp@alien8.de>, <tglx@linutronix.de>, <mingo@redhat.com>, <lenb@kernel.org>, <peterz@infradead.org>, <dietmar.eggemann@arm.com>, <rostedt@goodmis.org>, <bsegall@google.com>, <mgorman@suse.de>
Cc: <msys.mizuma@gmail.com>, <valentin.schneider@arm.com>, <gregkh@linuxfoundation.org>, <jonathan.cameron@huawei.com>, <juri.lelli@redhat.com>, <mark.rutland@arm.com>, <sudeep.holla@arm.com>, <aubrey.li@linux.intel.com>, <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <linux-acpi@vger.kernel.org>, <x86@kernel.org>, <xuwei5@huawei.com>, <prime.zeng@hisilicon.com>, <guodong.xu@linaro.org>, <yangyicong@huawei.com>, <liguozhu@hisilicon.com>, <linuxarm@openeuler.org>, <hpa@zytor.com>, Barry Song <song.bao.hua@hisilicon.com>
Subject: [RFC PATCH v5 0/4] scheduler: expose the topology of clusters and add cluster scheduler
Date: Fri, 19 Mar 2021 17:16:14 +1300
Message-ID: <20210319041618.14316-1-song.bao.hua@hisilicon.com>

The ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and
each cluster has 4 CPUs. All clusters share the L3 cache data, while each
cluster has its own local L3 tag. In addition, each cluster shares some
internal system bus. This means cache is much more affine inside one cluster
than across clusters.

[Diagram: clusters of four CPUs (CPU0-CPU3 in the first cluster), each cluster
attached to its own L3 tag, with all L3 tags attached to one shared L3 data
block.]

There is a similar need for clustering on x86. Some x86 cores share L2 caches
in a way that is similar to the clusters in Kunpeng 920 (e.g. on Jacobsville
there are 6 clusters of 4 Atom cores, each cluster sharing a separate L2, and
24 cores sharing L3).

Having a sched_domain for clusters will bring two aspects of improvement:

1. spreading unrelated tasks among clusters, which decreases the contention
of resources and improves the throughput.

Unrelated tasks might be put randomly without cluster sched_domain:

+-------------------+     +-----------------+
| +----+    +----+  |     |                 |
| |task|    |task|  |     |                 |
| |1   |    |2   |  |     |                 |
| +----+    +----+  |     |                 |
|                   |     |                 |
|      cluster1     |     |     cluster2    |
+-------------------+     +-----------------+

but with cluster sched_domain, they are likely to spread due to LB (load
balancing):

+-------------------+     +-----------------+
| +----+            |     | +----+          |
| |task|            |     | |task|          |
| |1   |            |     | |2   |          |
| +----+            |     | +----+          |
|                   |     |                 |
|      cluster1     |     |     cluster2    |
+-------------------+     +-----------------+

2. gathering related tasks within a cluster, which improves the cache
affinity of tasks talking with each other.

Without cluster sched_domain, related tasks might be put randomly.
Suppose task1-8 have the relationships below:

Task1 wakes up task4
Task2 wakes up task5
Task3 wakes up task6
Task4 wakes up task7

With the tuning of select_idle_cpu() to scan the local cluster first, those
tasks might get a chance to be gathered like:

+---------------------------+    +----------------------+
| +----+        +-----+     |    | +----+      +-----+  |
| |task|        |task |     |    | |task|      |task |  |
| |1   |        |4    |     |    | |2   |      |5    |  |
| +----+        +-----+     |    | +----+      +-----+  |
|                           |    |                      |
|       cluster1            |    |     cluster2         |
|                           |    |                      |
|                           |    |                      |
| +-----+       +------+    |    | +-----+     +------+ |
| |task |       |task  |    |    | |task |     |task  | |
| |3    |       |6     |    |    | |7    |     |8     | |
| +-----+       +------+    |    | +-----+     +------+ |
+---------------------------+    +----------------------+

Otherwise, the result might be:

+---------------------------+    +----------------------+
| +----+        +-----+     |    | +----+      +-----+  |
| |task|        |task |     |    | |task|      |task |  |
| |1   |        |2    |     |    | |5   |      |6    |  |
| +----+        +-----+     |    | +----+      +-----+  |
|                           |    |                      |
|       cluster1            |    |     cluster2         |
|                           |    |                      |
|                           |    |                      |
| +-----+       +------+    |    | +-----+     +------+ |
| |task |       |task  |    |    | |task |     |task  | |
| |3    |       |4     |    |    | |7    |     |8     | |
| +-----+       +------+    |    | +-----+     +------+ |
+---------------------------+    +----------------------+

-v5:
  * split "add scheduler level for clusters" into two patches to evaluate the
    impact of spreading and gathering separately;
  * add a tracepoint of select_idle_cpu for debug purpose; add bcc script in
    commit log;
  * add cluster_id = -1 in reset_cpu_topology()
  * rebased to tip/sched/core

-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key
    (addressing Vincent's comment)
  * used acpi_cpu_id for acpi_find_processor_node (addressing Masa's comment)

Barry Song (2):
  scheduler: add scheduler level for clusters
  scheduler: scan idle cpu in cluster before scanning the whole llc

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++--
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 +
 arch/x86/Kconfig                          |  8 ++++
 arch/x86/include/asm/smp.h                |  7 ++++
 arch/x86/include/asm/topology.h           |  1 +
 arch/x86/kernel/cpu/cacheinfo.c           |  1 +
 arch/x86/kernel/cpu/common.c              |  3 ++
 arch/x86/kernel/smpboot.c                 | 43 ++++++++++++++++++++-
 drivers/acpi/pptt.c                       | 63 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 15 ++++++++
 drivers/base/topology.c                   | 10 +++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/cluster.h             | 19 ++++++++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 +++++++
 include/trace/events/sched.h              | 22 +++++++++++
 kernel/sched/core.c                       | 20 ++++++++++
 kernel/sched/fair.c                       | 36 +++++++++++++++++-
 kernel/sched/sched.h                      |  1 +
 kernel/sched/topology.c                   |  5 +++
 22 files changed, 313 insertions(+), 6 deletions(-)
 create mode 100644 include/linux/sched/cluster.h

--
1.8.3.1
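
For readers who want a concrete picture of the "scan the local cluster first" idea described for select_idle_cpu(), here is a minimal user-space sketch. It is not the code from this series: the helper names first_cpu() and pick_idle_cpu() are hypothetical, CPUs are modelled as bits in a 64-bit mask instead of a struct cpumask, and the sketch simply takes the lowest-numbered idle CPU instead of following the kernel's real scan order.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the lowest set bit in mask, or -1 if mask is empty. */
static int first_cpu(uint64_t mask)
{
    for (int cpu = 0; cpu < 64; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}

/*
 * Hypothetical helper, for illustration only (NOT the kernel's
 * select_idle_cpu()):
 *   idle    - bitmask of currently idle CPUs
 *   cluster - bitmask of CPUs in the waker's cluster
 *   llc     - bitmask of CPUs sharing the last-level cache
 * Prefer an idle CPU inside the cluster; only if none is found, fall back
 * to the rest of the LLC. Returns -1 when no idle CPU exists in the LLC.
 */
static int pick_idle_cpu(uint64_t idle, uint64_t cluster, uint64_t llc)
{
    int cpu;

    /* Assumption of this sketch: the cluster is a subset of the LLC. */
    assert((cluster & ~llc) == 0);

    /* Pass 1: the local cluster, where cache affinity is best. */
    cpu = first_cpu(idle & cluster);
    if (cpu >= 0)
        return cpu;

    /* Pass 2: the remaining CPUs of the LLC. */
    return first_cpu(idle & llc & ~cluster);
}
```

For example, with CPUs 0-7 in the LLC (llc = 0xFF), CPUs 0-3 in the local cluster (cluster = 0x0F) and CPUs 2, 4 and 5 idle (idle = 0x34), pick_idle_cpu() returns CPU 2 from the local cluster even though CPUs 4 and 5 are also idle.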