From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <song.bao.hua@hisilicon.com>
Subject: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Wed, 6 Jan 2021 21:30:24 +1300
Message-ID: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
X-Mailer: git-send-email 2.21.0.windows.1
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailing-List: linux-acpi@vger.kernel.org

The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and
each cluster has 4 CPUs. All clusters share the L3 cache data, while
each cluster has its own local L3 tag. The CPUs within each cluster
also share some internal system bus. This means cache is much more
affine inside one cluster than across clusters.
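The first patch exposes this cluster topology to user space through
sysfs. As a quick sanity check, a small program like the one below can
read a CPU's cluster siblings. This is a minimal sketch: the
cluster_cpus_list attribute name follows the cputopology documentation
this series updates, so treat the exact name and path as an assumption
against this revision.

#include <stdio.h>

int main(void)
{
	char buf[256];
	/*
	 * cluster_cpus_list is the attribute this series documents;
	 * the exact name may differ in other revisions.
	 */
	FILE *fp = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), fp))
		printf("cpu0 cluster siblings: %s", buf);
	fclose(fp);
	return 0;
}

On this machine one would expect it to print "0-3", since CPUs 0-3
form the first cluster. One NUMA node looks like this: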
+-----------------------------------+                 +---------+
|  +------+    +------+             |                 |         |
|  | CPU0 |    | CPU1 |             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |         |
|             cluster               +--+    L3     +--+         |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU2 |    | CPU3 |             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 |         |
+-----------------------------------+                 |         |
|  +------+    +------+             |                 |         |
|  | CPU4 |    | CPU5 |             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |   L3    |
|                                   +--+    L3     +--+  data   |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU6 |    | CPU7 |             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 |         |
                 ...                                  |         |
+-----------------------------------+                 |         |
|  +------+    +------+             |                 |         |
|  | CPU20|    | CPU21|             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |         |
|                                   +--+    L3     +--+         |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU22|    | CPU23|             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 +---------+

The following small program shows the performance impact of running it
within one cluster versus across two clusters:

#include <pthread.h>

/*
 * x and y share one cache line, so the reading thread and the writing
 * thread keep bouncing that line between their CPUs.
 */
struct foo {
	int x;
	int y;
} f;

void *thread1_fun(void *param)
{
	int s = 0;

	for (int i = 0; i < 0xfffffff; i++)
		s += f.x;
	return NULL;
}

void *thread2_fun(void *param)
{
	for (int i = 0; i < 0xfffffff; i++)
		f.y++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid1, tid2;

	pthread_create(&tid1, NULL, thread1_fun, NULL);
	pthread_create(&tid2, NULL, thread2_fun, NULL);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);
	return 0;
}

Running this program within one cluster takes:

$ time taskset -c 0,1 ./a.out
real    0m0.832s
user    0m1.649s
sys     0m0.004s

In contrast, it takes much longer when the same program runs across
two clusters:

$ time taskset -c 0,4 ./a.out
real    0m1.133s
user    0m1.960s
sys     0m0.000s

0.832s / 1.133s = 73%; that is a huge difference.

hackbench running on 4 CPUs within a single cluster versus 4 CPUs in
different clusters also shows a large contrast:

* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

The scores are 4.285 vs. 5.524; a shorter time means better
performance. All this testing implies that we should let the Linux
scheduler use this topology to make better load-balancing and
WAKE_AFFINE decisions. However, the current scheduler has no idea of
clusters at all.

This patchset first exposes the cluster topology, then adds a sched
domain for clusters. While it is named "cluster", architectures and
machines can define the exact meaning of a cluster as long as some
resource is shared below the LLC and the affinity of that resource can
be leveraged to achieve better scheduling performance.
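As a side note, the same pinning can also be done inside the test
program itself rather than via taskset. Below is a minimal sketch
using the glibc-specific pthread_setaffinity_np(); the pin_to_cpu()
helper name is only for illustration and is not part of this series.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin an already-created thread to a single CPU. */
static void pin_to_cpu(pthread_t tid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(tid, sizeof(set), &set))
		fprintf(stderr, "failed to pin thread to cpu%d\n", cpu);
}

Calling pin_to_cpu(tid1, 0) and pin_to_cpu(tid2, 1) right after the
two pthread_create() calls reproduces the same-cluster case above,
while CPUs 0 and 4 reproduce the cross-cluster case.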
-v3:
  - rebased against 5.11-rc2
  - addressed the comments of Valentin Schneider, Peter Zijlstra,
    Vincent Guittot, Mel Gorman and others:
    * moved the scheduler changes from arm64 to the common place for
      all architectures;
    * added the SD_SHARE_CLS_RESOURCES sd_flag, specifying the
      sched_domain from which select_idle_cpu() should begin to scan;
    * removed the redundant select_idle_cluster() function since all
      of its code is in select_idle_cpu() now; this also avoids
      scanning cluster cpus twice, as the v2 code did;
    * redid the hackbench test within one NUMA node after the above
      changes.

Valentin suggested that select_idle_cpu() could begin to scan from the
domain with SD_SHARE_PKG_RESOURCES. Such a change might be too
aggressive and limit the spreading of tasks. Thus, this patch lets
architectures and machines decide where to start by adding the new
SD_SHARE_CLS_RESOURCES flag.

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 ++
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++++
 drivers/base/topology.c                   | 10 ++++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/fair.c                       | 27 ++++++++++----
 kernel/sched/topology.c                   |  6 ++++
 13 files changed, 181 insertions(+), 10 deletions(-)

-- 
2.7.4