From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <song.bao.hua@hisilicon.com>
Subject: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Wed, 6 Jan 2021 21:30:24 +1300
Message-ID: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
X-Mailer: git-send-email 2.21.0.windows.1
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
X-Mailing-List: linux-acpi@vger.kernel.org

The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and
each cluster has 4 CPUs. All clusters share the L3 cache data, while
each cluster has its own local L3 tag. The CPUs within each cluster
also share some internal system bus. This means cache is much more
affine inside one cluster than across clusters.
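The first patch exposes this cluster topology to user space through
sysfs. As a quick sanity check, a small program like the one below can
read a CPU's cluster siblings. This is a minimal sketch: the
cluster_cpus_list attribute name follows the cputopology documentation
this series updates, so treat the exact name and path as an assumption
against this revision.

#include <stdio.h>

int main(void)
{
	char buf[256];
	/*
	 * cluster_cpus_list is the attribute this series documents;
	 * the exact name may differ in other revisions.
	 */
	FILE *fp = fopen("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list", "r");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), fp))
		printf("cpu0 cluster siblings: %s", buf);
	fclose(fp);
	return 0;
}

On this machine one would expect it to print "0-3", since CPUs 0-3
form the first cluster. One NUMA node looks like this: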
+-----------------------------------+                 +---------+
|  +------+    +------+             |                 |         |
|  | CPU0 |    | CPU1 |             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |         |
|             cluster               +--+    L3     +--+         |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU2 |    | CPU3 |             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 |         |
+-----------------------------------+                 |         |
|  +------+    +------+             |                 |         |
|  | CPU4 |    | CPU5 |             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |   L3    |
|                                   +--+    L3     +--+  data   |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU6 |    | CPU7 |             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 |         |
                 ...                                  |         |
+-----------------------------------+                 |         |
|  +------+    +------+             |                 |         |
|  | CPU20|    | CPU21|             |  +-----------+  |         |
|  +------+    +------+             |  |           |  |         |
|                                   +--+    L3     +--+         |
|  +------+    +------+             |  |    tag    |  |         |
|  | CPU22|    | CPU23|             |  |           |  |         |
|  +------+    +------+             |  +-----------+  |         |
+-----------------------------------+                 +---------+

The following small program shows the performance impact of running it
within one cluster versus across two clusters:

#include <pthread.h>

/*
 * x and y share one cache line, so the reading thread and the writing
 * thread keep bouncing that line between their CPUs.
 */
struct foo {
	int x;
	int y;
} f;

void *thread1_fun(void *param)
{
	int s = 0;

	for (int i = 0; i < 0xfffffff; i++)
		s += f.x;
	return NULL;
}

void *thread2_fun(void *param)
{
	for (int i = 0; i < 0xfffffff; i++)
		f.y++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid1, tid2;

	pthread_create(&tid1, NULL, thread1_fun, NULL);
	pthread_create(&tid2, NULL, thread2_fun, NULL);
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);
	return 0;
}

Running this program within one cluster takes:

$ time taskset -c 0,1 ./a.out
real    0m0.832s
user    0m1.649s
sys     0m0.004s

In contrast, it takes much longer when the same program runs across
two clusters:

$ time taskset -c 0,4 ./a.out
real    0m1.133s
user    0m1.960s
sys     0m0.000s

0.832s / 1.133s = 73%; that is a huge difference.

hackbench running on 4 CPUs within a single cluster versus 4 CPUs in
different clusters also shows a large contrast:

* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

The scores are 4.285 vs. 5.524; a shorter time means better
performance. All this testing implies that we should let the Linux
scheduler use this topology to make better load-balancing and
WAKE_AFFINE decisions. However, the current scheduler has no idea of
clusters at all.

This patchset first exposes the cluster topology, then adds a sched
domain for clusters. While it is named "cluster", architectures and
machines can define the exact meaning of a cluster as long as some
resource is shared below the LLC and the affinity of that resource can
be leveraged to achieve better scheduling performance.
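As a side note, the same pinning can also be done inside the test
program itself rather than via taskset. Below is a minimal sketch
using the glibc-specific pthread_setaffinity_np(); the pin_to_cpu()
helper name is only for illustration and is not part of this series.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin an already-created thread to a single CPU. */
static void pin_to_cpu(pthread_t tid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(tid, sizeof(set), &set))
		fprintf(stderr, "failed to pin thread to cpu%d\n", cpu);
}

Calling pin_to_cpu(tid1, 0) and pin_to_cpu(tid2, 1) right after the
two pthread_create() calls reproduces the same-cluster case above,
while CPUs 0 and 4 reproduce the cross-cluster case.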
-v3:
  - rebased against 5.11-rc2
  - addressed the comments of Valentin Schneider, Peter Zijlstra,
    Vincent Guittot, Mel Gorman and others:
    * moved the scheduler changes from arm64 to the common place for
      all architectures;
    * added the SD_SHARE_CLS_RESOURCES sd_flag, specifying the
      sched_domain from which select_idle_cpu() should begin to scan;
    * removed the redundant select_idle_cluster() function since all
      of its code is in select_idle_cpu() now; this also avoids
      scanning cluster cpus twice, as the v2 code did;
    * redid the hackbench test within one NUMA node after the above
      changes.

Valentin suggested that select_idle_cpu() could begin to scan from the
domain with SD_SHARE_PKG_RESOURCES. Such a change might be too
aggressive and limit the spreading of tasks. Thus, this patch lets
architectures and machines decide where to start by adding the new
SD_SHARE_CLS_RESOURCES flag.

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 ++
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++++
 drivers/base/topology.c                   | 10 ++++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/fair.c                       | 27 ++++++++++----
 kernel/sched/topology.c                   |  6 ++++
 13 files changed, 181 insertions(+), 10 deletions(-)

-- 
2.7.4