From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 47316C433ED for ; Tue, 20 Apr 2021 00:26:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 16D646135F for ; Tue, 20 Apr 2021 00:26:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229880AbhDTA0d (ORCPT ); Mon, 19 Apr 2021 20:26:33 -0400 Received: from szxga04-in.huawei.com ([45.249.212.190]:16136 "EHLO szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229758AbhDTA0c (ORCPT ); Mon, 19 Apr 2021 20:26:32 -0400 Received: from DGGEMS407-HUB.china.huawei.com (unknown [172.30.72.59]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FPPXj076pzmdWf; Tue, 20 Apr 2021 08:23:01 +0800 (CST) Received: from SWX921481.china.huawei.com (10.126.200.79) by DGGEMS407-HUB.china.huawei.com (10.3.19.207) with Microsoft SMTP Server id 14.3.498.0; Tue, 20 Apr 2021 08:25:52 +0800 From: Barry Song To: , , , , , , , , , , , , , CC: , , , , , , , , , , , , , , , , , , , Barry Song Subject: [RFC PATCH v6 0/4] scheduler: expose the topology of clusters and add cluster scheduler Date: Tue, 20 Apr 2021 12:18:40 +1200 Message-ID: <20210420001844.9116-1-song.bao.hua@hisilicon.com> X-Mailer: git-send-email 2.21.0.windows.1 MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII X-Originating-IP: [10.126.200.79] X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data while each cluster has local L3 tag. On the other hand, each cluster will share some internal system bus. This means cache is much more affine inside one cluster than across clusters. +-----------------------------------+ +---------+ | +------+ +------+ +---------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | ++ +-----------+ | | | +------+ +------+ |---------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +---------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | There is a similar need for clustering in x86. Some x86 cores could share L2 caches that is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing L3). Having a sched_domain for clusters will bring two aspects of improvement: 1. spreading unrelated tasks among clusters, which decreases the contention of resources and improve the throughput. unrelated tasks might be put randomly without cluster sched_domain: +-------------------+ +-----------------+ | +----+ +----+ | | | | |task| |task| | | | | |1 | |2 | | | | | +----+ +----+ | | | | | | | | cluster1 | | cluster2 | +-------------------+ +-----------------+ but with cluster sched_domain, they are likely to spread due to LB: +-------------------+ +-----------------+ | +----+ | | +----+ | | |task| | | |task| | | |1 | | | |2 | | | +----+ | | +----+ | | | | | | cluster1 | | cluster2 | +-------------------+ +-----------------+ 2. gathering related tasks within a cluster, which improves the cache affinity of tasks talking with each other. Without cluster sched_domain, related tasks might be put randomly. In case task1-8 have relationship as below: Task1 talks with task5 Task2 talks with task6 Task3 talks with task7 Task4 talks with task8 With the tuning of select_idle_cpu() to scan local cluster first, those tasks might get a chance to be gathered like: +---------------------------+ +----------------------+ | +----+ +-----+ | | +----+ +-----+ | | |task| |task | | | |task| |task | | | |1 | | 5 | | | |3 | |7 | | | +----+ +-----+ | | +----+ +-----+ | | | | | | cluster1 | | cluster2 | | | | | | | | | | +-----+ +------+ | | +-----+ +------+ | | |task | | task | | | |task | |task | | | |2 | | 6 | | | |4 | |8 | | | +-----+ +------+ | | +-----+ +------+ | +---------------------------+ +----------------------+ Otherwise, the result might be: +---------------------------+ +----------------------+ | +----+ +-----+ | | +----+ +-----+ | | |task| |task | | | |task| |task | | | |1 | | 2 | | | |5 | |6 | | | +----+ +-----+ | | +----+ +-----+ | | | | | | cluster1 | | cluster2 | | | | | | | | | | +-----+ +------+ | | +-----+ +------+ | | |task | | task | | | |task | |task | | | |3 | | 4 | | | |7 | |8 | | | +-----+ +------+ | | +-----+ +------+ | +---------------------------+ +----------------------+ -v6: * added topology_cluster_cpumask() for x86, code provided by Tim. * emulated a two-level spreading/packing heuristic by only scanning cluster in wake_affine path for tasks running in same LLC(also NUMA). This partially addressed Dietmar's comment in RFC v3: "In case we would like to further distinguish between llc-packing and even narrower (cluster or MC-L2)-packing, we would introduce a 2. level packing vs. spreading heuristic further down in sis(). IMHO, Barry's current implementation doesn't do this right now. Instead he's trying to pack on cluster first and if not successful look further among the remaining llc CPUs for an idle CPU." * adjusted the hackbench parameter to make relatively low and high load. previous patchsets with "-f 10" ran under an extremely high load with hundreds of threads, which seems not real use cases. This also addressed Vincent's question in RFC v4: "In particular, I'm still not convinced that the modification of the wakeup path is the root of the hackbench improvement; especially with g=14 where there should not be much idle CPUs with 14*40 tasks on at most 32 CPUs." -v5: * split "add scheduler level for clusters" into two patches to evaluate the impact of spreading and gathering separately; * add a tracepoint of select_idle_cpu for debug purpose; add bcc script in commit log; * add cluster_id = -1 in reset_cpu_topology() * rebased to tip/sched/core -v4: * rebased to tip/sched/core with the latest unified code of select_idle_cpu * added Tim's patch for x86 Jacobsville * also added benchmark data of spreading unrelated tasks * avoided the iteration of sched_domain by moving to static_key(addressing Vincent's comment * used acpi_cpu_id for acpi_find_processor_node(addressing Masa's comment) Barry Song (2): scheduler: add scheduler level for clusters scheduler: scan idle cpu in cluster for tasks within one LLC Jonathan Cameron (1): topology: Represent clusters of CPUs within a die Tim Chen (1): scheduler: Add cluster scheduler level for x86 Documentation/admin-guide/cputopology.rst | 26 +++++++++++-- arch/arm64/Kconfig | 7 ++++ arch/arm64/kernel/topology.c | 2 + arch/x86/Kconfig | 8 ++++ arch/x86/include/asm/smp.h | 7 ++++ arch/x86/include/asm/topology.h | 2 + arch/x86/kernel/cpu/cacheinfo.c | 1 + arch/x86/kernel/cpu/common.c | 3 ++ arch/x86/kernel/smpboot.c | 43 ++++++++++++++++++++- block/blk-mq.c | 2 +- drivers/acpi/pptt.c | 63 +++++++++++++++++++++++++++++++ drivers/base/arch_topology.c | 15 ++++++++ drivers/base/topology.c | 10 +++++ include/linux/acpi.h | 5 +++ include/linux/arch_topology.h | 5 +++ include/linux/sched/cluster.h | 19 ++++++++++ include/linux/sched/sd_flags.h | 9 +++++ include/linux/sched/topology.h | 12 +++++- include/linux/topology.h | 13 +++++++ kernel/sched/core.c | 29 ++++++++++++-- kernel/sched/fair.c | 51 +++++++++++++++---------- kernel/sched/sched.h | 4 ++ kernel/sched/topology.c | 18 +++++++++ 23 files changed, 324 insertions(+), 30 deletions(-) create mode 100644 include/linux/sched/cluster.h -- 1.8.3.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 05B31C43460 for ; Tue, 20 Apr 2021 00:28:49 +0000 (UTC) Received: from desiato.infradead.org (desiato.infradead.org [90.155.92.199]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 6D882610CC for ; Tue, 20 Apr 2021 00:28:48 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6D882610CC Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=hisilicon.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding :Content-Type:List-Subscribe:List-Help:List-Post:List-Archive: List-Unsubscribe:List-Id:MIME-Version:Message-ID:Date:Subject:CC:To:From: Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From:Resent-Sender :Resent-To:Resent-Cc:Resent-Message-ID:In-Reply-To:References:List-Owner; bh=VbbdJftCgdtj5Sd1J5D0Z1fkloM9SxAJf6UL+t/BjuA=; b=jrP9yFqTuFqwE7OskF7ryZuKLH uT3sGi10RG1FPiDOU62wI/mMyz+4w9ruV0mCNeyInzGNExWM4rfl9n0Aq59wBAm3KW6XmBnpespF2 22QDRSwjCcWyAiq1DQ2blGjYhFr8cFTeJt8ThjFWfC/01xPaKUZW/s0SlEwEgm9N27XDltXlsmUta 3eGl8qWCPGZqr2B6szMEc0zqaKF2OgLVyR2Ur3PIr7q2+xxBs3MEC+k1ZJ+qWehuCRs/6246HOZNp sr9BCHgAT1qqUt7Nib9Yc+9TVnSL8oZ9+T64KA0P7mrOwQanP0MfAM+iz0Yku72TEwrivGqLyZnF1 aLwigf1w==; Received: from localhost ([::1] helo=desiato.infradead.org) by desiato.infradead.org with esmtp (Exim 4.94 #2 (Red Hat Linux)) id 1lYeDh-00Aqf3-Li; Tue, 20 Apr 2021 00:26:25 +0000 Received: from bombadil.infradead.org ([2607:7c80:54:e::133]) by desiato.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux)) id 1lYeDc-00Aqec-22 for linux-arm-kernel@desiato.infradead.org; Tue, 20 Apr 2021 00:26:20 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Content-Type: Content-Transfer-Encoding:MIME-Version:Message-ID:Date:Subject:CC:To:From: Sender:Reply-To:Content-ID:Content-Description:In-Reply-To:References; bh=TrfgBZ48fujO97kxSwn+/WrT6MZGI2MhME1+NwxxW7s=; b=OWmJO/VkNaAFJj1seDsuZEPupC x2yoMVmL5wN0jfqNKyomrNtj3RMR+KHr5god57UdO2cZKkSkuPngr1DmjsegeG3bnG4I5vwOEFYTO H5gZ1lQ/r4OYPMq7UYSLx8tme8Qka0QUg2GSGWkSq9L768QVMU42d82DyzGSmu7yEQTmcRgsmIjuM 09Z3LpNx+HnHI5gkZf9DMtHNBYxySgOMO7ilX3fnGKhcIeHtI+vxXtkVLjN1o5j4+T2ZHzbghChnw k5qOq3MOnMzj5xKBGtPXh59Eh2qGf8uowyKGcCHsYhNmN3U4EbtiJpH5KCLgWOxFMsbNoLYVKQYUv 86bC0g1Q==; Received: from szxga04-in.huawei.com ([45.249.212.190]) by bombadil.infradead.org with esmtps (Exim 4.94 #2 (Red Hat Linux)) id 1lYeDU-00Bk1v-TM for linux-arm-kernel@lists.infradead.org; Tue, 20 Apr 2021 00:26:17 +0000 Received: from DGGEMS407-HUB.china.huawei.com (unknown [172.30.72.59]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4FPPXj076pzmdWf; Tue, 20 Apr 2021 08:23:01 +0800 (CST) Received: from SWX921481.china.huawei.com (10.126.200.79) by DGGEMS407-HUB.china.huawei.com (10.3.19.207) with Microsoft SMTP Server id 14.3.498.0; Tue, 20 Apr 2021 08:25:52 +0800 From: Barry Song To: , , , , , , , , , , , , , CC: , , , , , , , , , , , , , , , , , , , Barry Song Subject: [RFC PATCH v6 0/4] scheduler: expose the topology of clusters and add cluster scheduler Date: Tue, 20 Apr 2021 12:18:40 +1200 Message-ID: <20210420001844.9116-1-song.bao.hua@hisilicon.com> X-Mailer: git-send-email 2.21.0.windows.1 MIME-Version: 1.0 X-Originating-IP: [10.126.200.79] X-CFilter-Loop: Reflected X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20210419_172613_292905_9A9E92D0 X-CRM114-Status: GOOD ( 14.11 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data while each cluster has local L3 tag. On the other hand, each cluster will share some internal system bus. This means cache is much more affine inside one cluster than across clusters. +-----------------------------------+ +---------+ | +------+ +------+ +---------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | ++ +-----------+ | | | +------+ +------+ |---------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +---------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | There is a similar need for clustering in x86. Some x86 cores could share L2 caches that is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing L3). Having a sched_domain for clusters will bring two aspects of improvement: 1. spreading unrelated tasks among clusters, which decreases the contention of resources and improve the throughput. unrelated tasks might be put randomly without cluster sched_domain: +-------------------+ +-----------------+ | +----+ +----+ | | | | |task| |task| | | | | |1 | |2 | | | | | +----+ +----+ | | | | | | | | cluster1 | | cluster2 | +-------------------+ +-----------------+ but with cluster sched_domain, they are likely to spread due to LB: +-------------------+ +-----------------+ | +----+ | | +----+ | | |task| | | |task| | | |1 | | | |2 | | | +----+ | | +----+ | | | | | | cluster1 | | cluster2 | +-------------------+ +-----------------+ 2. gathering related tasks within a cluster, which improves the cache affinity of tasks talking with each other. Without cluster sched_domain, related tasks might be put randomly. In case task1-8 have relationship as below: Task1 talks with task5 Task2 talks with task6 Task3 talks with task7 Task4 talks with task8 With the tuning of select_idle_cpu() to scan local cluster first, those tasks might get a chance to be gathered like: +---------------------------+ +----------------------+ | +----+ +-----+ | | +----+ +-----+ | | |task| |task | | | |task| |task | | | |1 | | 5 | | | |3 | |7 | | | +----+ +-----+ | | +----+ +-----+ | | | | | | cluster1 | | cluster2 | | | | | | | | | | +-----+ +------+ | | +-----+ +------+ | | |task | | task | | | |task | |task | | | |2 | | 6 | | | |4 | |8 | | | +-----+ +------+ | | +-----+ +------+ | +---------------------------+ +----------------------+ Otherwise, the result might be: +---------------------------+ +----------------------+ | +----+ +-----+ | | +----+ +-----+ | | |task| |task | | | |task| |task | | | |1 | | 2 | | | |5 | |6 | | | +----+ +-----+ | | +----+ +-----+ | | | | | | cluster1 | | cluster2 | | | | | | | | | | +-----+ +------+ | | +-----+ +------+ | | |task | | task | | | |task | |task | | | |3 | | 4 | | | |7 | |8 | | | +-----+ +------+ | | +-----+ +------+ | +---------------------------+ +----------------------+ -v6: * added topology_cluster_cpumask() for x86, code provided by Tim. * emulated a two-level spreading/packing heuristic by only scanning cluster in wake_affine path for tasks running in same LLC(also NUMA). This partially addressed Dietmar's comment in RFC v3: "In case we would like to further distinguish between llc-packing and even narrower (cluster or MC-L2)-packing, we would introduce a 2. level packing vs. spreading heuristic further down in sis(). IMHO, Barry's current implementation doesn't do this right now. Instead he's trying to pack on cluster first and if not successful look further among the remaining llc CPUs for an idle CPU." * adjusted the hackbench parameter to make relatively low and high load. previous patchsets with "-f 10" ran under an extremely high load with hundreds of threads, which seems not real use cases. This also addressed Vincent's question in RFC v4: "In particular, I'm still not convinced that the modification of the wakeup path is the root of the hackbench improvement; especially with g=14 where there should not be much idle CPUs with 14*40 tasks on at most 32 CPUs." -v5: * split "add scheduler level for clusters" into two patches to evaluate the impact of spreading and gathering separately; * add a tracepoint of select_idle_cpu for debug purpose; add bcc script in commit log; * add cluster_id = -1 in reset_cpu_topology() * rebased to tip/sched/core -v4: * rebased to tip/sched/core with the latest unified code of select_idle_cpu * added Tim's patch for x86 Jacobsville * also added benchmark data of spreading unrelated tasks * avoided the iteration of sched_domain by moving to static_key(addressing Vincent's comment * used acpi_cpu_id for acpi_find_processor_node(addressing Masa's comment) Barry Song (2): scheduler: add scheduler level for clusters scheduler: scan idle cpu in cluster for tasks within one LLC Jonathan Cameron (1): topology: Represent clusters of CPUs within a die Tim Chen (1): scheduler: Add cluster scheduler level for x86 Documentation/admin-guide/cputopology.rst | 26 +++++++++++-- arch/arm64/Kconfig | 7 ++++ arch/arm64/kernel/topology.c | 2 + arch/x86/Kconfig | 8 ++++ arch/x86/include/asm/smp.h | 7 ++++ arch/x86/include/asm/topology.h | 2 + arch/x86/kernel/cpu/cacheinfo.c | 1 + arch/x86/kernel/cpu/common.c | 3 ++ arch/x86/kernel/smpboot.c | 43 ++++++++++++++++++++- block/blk-mq.c | 2 +- drivers/acpi/pptt.c | 63 +++++++++++++++++++++++++++++++ drivers/base/arch_topology.c | 15 ++++++++ drivers/base/topology.c | 10 +++++ include/linux/acpi.h | 5 +++ include/linux/arch_topology.h | 5 +++ include/linux/sched/cluster.h | 19 ++++++++++ include/linux/sched/sd_flags.h | 9 +++++ include/linux/sched/topology.h | 12 +++++- include/linux/topology.h | 13 +++++++ kernel/sched/core.c | 29 ++++++++++++-- kernel/sched/fair.c | 51 +++++++++++++++---------- kernel/sched/sched.h | 4 ++ kernel/sched/topology.c | 18 +++++++++ 23 files changed, 324 insertions(+), 30 deletions(-) create mode 100644 include/linux/sched/cluster.h -- 1.8.3.1 _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel