[RFC,v4,3/3] scheduler: Add cluster scheduler level for x86

Message ID 20210301225940.16728-4-song.bao.hua@hisilicon.com
State New, archived
Series
  • scheduler: expose the topology of clusters and add cluster scheduler

Commit Message

Song Bao Hua (Barry Song) March 1, 2021, 10:59 p.m. UTC
From: Tim Chen <tim.c.chen@linux.intel.com>

There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
is shared among a cluster of cores instead of being exclusive
to one single core.

To prevent oversubscription of L2 cache, load should be
balanced between such L2 clusters, especially for tasks with
no shared data.

Also, with a cluster scheduling policy where tasks are woken up
in the same L2 cluster, we benefit from keeping related tasks,
which are likely to share data, in the same L2 cluster.

Add CPU masks of CPUs sharing the L2 cache so we can build such
an L2 cluster scheduler domain.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/x86/Kconfig                |  8 ++++++++
 arch/x86/include/asm/smp.h      |  7 +++++++
 arch/x86/include/asm/topology.h |  1 +
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c    |  3 +++
 arch/x86/kernel/smpboot.c       | 43 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 62 insertions(+), 1 deletion(-)
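
The core of the patch is the per-CPU L2 sibling mask: two CPUs land in the same mask when their L2 cache IDs match, and a CPU with no valid ID only ever matches itself. A minimal userspace sketch of that grouping rule follows. It uses plain 64-bit bitmasks instead of struct cpumask, and build_l2c_masks and BAD_L2_ID are hypothetical names for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's "no valid ID" sentinel (assumption). */
#define BAD_L2_ID 0xFFFFu

/*
 * Build one sibling bitmask per CPU: bit j of masks[i] is set when
 * CPU j shares an L2 cache with CPU i. A CPU is always present in its
 * own mask, even when its L2 ID is unknown. Supports up to 64 CPUs.
 */
static void build_l2c_masks(const uint16_t *l2_id, int ncpus, uint64_t *masks)
{
	for (int i = 0; i < ncpus; i++) {
		masks[i] = UINT64_C(1) << i;

		/* An unknown L2 ID never matches another CPU. */
		if (l2_id[i] == BAD_L2_ID)
			continue;

		for (int j = 0; j < ncpus; j++) {
			if (l2_id[i] == l2_id[j])
				masks[i] |= UINT64_C(1) << j;
		}
	}
}
```

With eight CPUs and four cores per L2, CPUs 0-3 would all get mask 0x0F and CPUs 4-7 would get 0xF0.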

Comments

Peter Zijlstra March 2, 2021, 10:30 a.m. UTC | #1
On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> is shared among a cluster of cores instead of being exclusive
> to one single core.

Isn't that most atoms one way or another? Tremont seems to have it per 4
cores, but earlier it was per 2 cores.
Tim Chen March 3, 2021, 6:34 p.m. UTC | #2
On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
>> is shared among a cluster of cores instead of being exclusive
>> to one single core.
> 
> Isn't that most atoms one way or another? Tremont seems to have it per 4
> cores, but earlier it was per 2 cores.
> 

Yes, older Atoms have 2 cores sharing L2.  I probably should
rephrase my comments to not leave the impression that sharing
L2 among cores is new for Atoms.

Tremont-based Atom CPUs increase the possible load imbalance further,
with 4 cores per L2 instead of 2.  Also, with more cores overall on a die,
the chance increases of packing running tasks on a few clusters while
leaving others empty on lightly or moderately loaded systems.  We did see
this effect on Jacobsville.

So load balancing between the L2 clusters is more
useful on Tremont-based Atom CPUs than on the older Atoms.

Tim
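
To make the packing concern concrete, here is a toy userspace model of spreading tasks across L2 clusters. It is only an illustration of the imbalance described above, not the kernel's load balancer; the function names and the per-cluster task counts are hypothetical.

```c
#include <assert.h>

/* Index of the cluster with the fewest running tasks; ties go to the lowest index. */
static int least_loaded_cluster(const int *nr_running, int nclusters)
{
	int best = 0;

	for (int i = 1; i < nclusters; i++) {
		if (nr_running[i] < nr_running[best])
			best = i;
	}
	return best;
}

/* Place ntasks one at a time, each on the currently least loaded cluster. */
static void spread_tasks(int *nr_running, int nclusters, int ntasks)
{
	for (int t = 0; t < ntasks; t++)
		nr_running[least_loaded_cluster(nr_running, nclusters)]++;
}

/* Count clusters that have no running task at all. */
static int idle_clusters(const int *nr_running, int nclusters)
{
	int idle = 0;

	for (int i = 0; i < nclusters; i++) {
		if (nr_running[i] == 0)
			idle++;
	}
	return idle;
}
```

With four clusters and six tasks, a packed placement such as {4, 2, 0, 0} leaves two L2 caches unused, while spread_tasks starting from an empty system ends at {2, 2, 1, 1} with none idle.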
Song Bao Hua (Barry Song) March 8, 2021, 10:30 p.m. UTC | #3
> On Thursday, March 4, 2021 7:34 AM, Tim Chen wrote:
>
> On 3/2/21 2:30 AM, Peter Zijlstra wrote:
> > On Tue, Mar 02, 2021 at 11:59:40AM +1300, Barry Song wrote:
> >> From: Tim Chen <tim.c.chen@linux.intel.com>
> >>
> >> There are x86 CPU architectures (e.g. Jacobsville) where L2 cache
> >> is shared among a cluster of cores instead of being exclusive
> >> to one single core.
> >
> > Isn't that most atoms one way or another? Tremont seems to have it per 4
> > cores, but earlier it was per 2 cores.
> >
> 
> Yes, older Atoms have 2 cores sharing L2.  I probably should
> rephrase my comments to not leave the impression that sharing
> L2 among cores is new for Atoms.
> 
> Tremont based Atom CPUs increases the possible load imbalance more
> with 4 cores per L2 instead of 2.  And also with more overall cores on a die,
> the chance increases for packing running tasks on a few clusters while leaving
> others empty on light/medium loaded systems.  We did see
> this effect on Jacobsville.
> 
> So load balancing between the L2 clusters is more
> useful on Tremont based Atom CPUs compared to the older Atoms.

It seems sensible that the more CPUs we get in a cluster, the more
we need the kernel to be aware of its existence.

Tim, is it possible for you to bring up the cpu_cluster_mask and
cluster_sibling for x86 so that the topology can be represented
in sysfs and be used by the scheduler? It seems your patch lacks this
part.

BTW, I wonder if x86 can make some improvement to your KMP_AFFINITY
by leveraging the cluster topology level.
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html

KMP_AFFINITY has thread affinity modes like compact and scatter.
It seems "compact" and "scatter" could also use the cluster
information; as you can see, we are also struggling with the "compact"
and "scatter" issues in this patchset :-)

Thanks
Barry
Tim Chen March 15, 2021, 8:53 p.m. UTC | #4
> It seems sensible the more CPU we get in the cluster, the more
> we need the kernel to be aware of its existence.
> 
> Tim, it is possible for you to bring up the cpu_cluster_mask and
> cluster_sibling for x86 so that the topology can be represented
> in sysfs and be used by scheduler? It seems your patch lacks this
> part.

You mean exposing cluster information under
/sys/devices/system/cpu/cpu0/topology so that an external program can
affinitize to a cluster if it prefers to do so?

Tim  

> 
> BTW, I wonder if x86 can do some improvement on your KMP_AFFINITY
> by leveraging the cluster topology level.
> https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html
> 
> KMP_AFFINITY has thread affinity modes like compact and scatter,
> it seems this "compact" and "scatter" can also use the cluster
> information as you see we are also struggling with the "compact"
> and "scatter" issues here in this patchset :-)
> 
> Thanks
> Barry
>
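
As a rough illustration of the use case raised in this exchange, an external program could read a CPU list for its cluster and pin itself to it. The sketch below parses a sysfs-style list such as "0-3,8" and applies it with sched_setaffinity(). The helper names are hypothetical, and the thread does not settle which file under the topology directory would carry the cluster list, so the caller is assumed to obtain the string from somewhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style CPU list such as "0-3,8" into a cpu_set_t.
 * Returns 0 on success, -1 on malformed input or out-of-range CPUs.
 */
static int parse_cpulist(const char *s, cpu_set_t *set)
{
	CPU_ZERO(set);
	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			return -1;
		if (*end == '-') {
			/* Range form "lo-hi". */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s)
				return -1;
		}
		if (lo < 0 || hi < lo || hi >= CPU_SETSIZE)
			return -1;
		for (long cpu = lo; cpu <= hi; cpu++)
			CPU_SET((int)cpu, set);
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return 0;
}

/* Pin the calling thread to the CPUs named in 'list' (hypothetical helper). */
static int pin_to_cpulist(const char *list)
{
	cpu_set_t set;

	if (parse_cpulist(list, &set) != 0)
		return -1;
	return sched_setaffinity(0, sizeof(set), &set);
}
```

A caller would read one line from the chosen sysfs file and pass it to pin_to_cpulist(); the parser accepts the trailing newline that sysfs files normally end with.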

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d3338a8..40110de 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1009,6 +1009,14 @@  config NR_CPUS
 	  This is purely to save memory: each supported CPU adds about 8KB
 	  to the kernel image.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	default n
+	help
+	 Cluster scheduler support improves the CPU scheduler's decision
+	 making when dealing with machines that have clusters of CPUs
+	 sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
 	def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index c0538f8..9cbc4ae 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@ 
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@  static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+	return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399..2a11ccc 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,6 +103,7 @@  static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index 3ca9be4..0d03a71 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@  void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 		l2 = new_l2;
 #ifdef CONFIG_SMP
 		per_cpu(cpu_llc_id, cpu) = l2_id;
+		per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
 	}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 35ad848..fb08c73 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -78,6 +78,9 @@ 
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 02813a7..c85ffa8 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@ 
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -501,6 +503,21 @@  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return topology_sane(c, o, "llc");
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	/* Do not match if we do not have a valid L2 cache ID for cpu: */
+	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if L2 cache id does not match: */
+	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+		return false;
+
+	return topology_sane(c, o, "l2c");
+}
+
 /*
  * Unlike the other levels, we do not enforce keeping a
  * multicore group inside a NUMA node.  If this happens, we will
@@ -522,7 +539,7 @@  static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 }
 
 
-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
 	return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -540,12 +557,21 @@  static int x86_smt_flags(void)
 	return cpu_smt_flags() | x86_sched_itmt_flags();
 }
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+	return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
 #endif
 
 static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -556,6 +582,9 @@  static int x86_smt_flags(void)
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -583,6 +612,7 @@  void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
 		c->booted_cores = 1;
@@ -598,6 +628,8 @@  void set_cpu_sibling_map(int cpu)
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
 
+		if ((i == cpu) || (has_mp && match_l2c(c, o)))
+			link_mask(cpu_l2c_shared_mask, cpu, i);
 	}
 
 	/*
@@ -649,6 +681,11 @@  const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return cpu_l2c_shared_mask(cpu);
+}
+
 static void impress_friends(void)
 {
 	int cpu;
@@ -1332,6 +1369,7 @@  void __init native_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
 	}
 
 	/*
@@ -1556,7 +1594,10 @@  static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+	for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+		cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
+	cpumask_clear(cpu_l2c_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
 	cpumask_clear(topology_die_cpumask(cpu));
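
The patch places the new CLS level between SMT and MC in both topology tables. The scheduler generally expects each level's span to be contained in the span of the next larger level, so a CPU's L2 mask should be a subset of its LLC mask. A minimal userspace sketch of such a containment check is shown below, with 64-bit bitmasks standing in for struct cpumask. first_unnested_level and the flat array layout are hypothetical and not taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * span[l * ncpus + cpu] is the set of CPUs that 'cpu' is grouped with at
 * topology level l, ordered from the smallest level to the largest.
 * Returns the index of the first level whose span is not contained in
 * the next level's span for some CPU, or -1 if all levels nest.
 */
static int first_unnested_level(const uint64_t *span, int nlevels, int ncpus)
{
	for (int l = 0; l + 1 < nlevels; l++) {
		for (int cpu = 0; cpu < ncpus; cpu++) {
			uint64_t child = span[l * ncpus + cpu];
			uint64_t parent = span[(l + 1) * ncpus + cpu];

			/* Any CPU in the child but not in the parent breaks nesting. */
			if (child & ~parent)
				return l;
		}
	}
	return -1;
}
```

For eight CPUs with cluster masks 0x0F and 0xF0 under a single 0xFF LLC mask, the check returns -1; if the larger level only covered 0x3F, it would report level 0 as the offender.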