linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v4 0/5] rework sched_domain topology description
@ 2014-04-11  9:44 Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 1/5] sched: rework of sched_domain topology definition Vincent Guittot
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

This patchset was previously part of the larger tasks packing patchset [1].
I have split the latter into 3 different patchsets (at least) to make
things easier:
- configuration of sched_domain topology (this patchset)
- update and consolidation of cpu_power
- tasks packing algorithm

Based on Peter Z's proposal [2][3], this patchset modifies the way the
sched_domain levels are configured in order to let architectures add
specific levels, like the current BOOK level or the proposed power-gating
level for the ARM architecture.

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2013/11/5/239
[3] https://lkml.org/lkml/2013/11/5/449
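
For illustration, a minimal sketch of the new flow, distilled from patches
1/5 and 5/5 below (the table name my_arch_topology and the choice of levels
are made up for this example):

	/* arch-specific table: an MC level plus the default DIE level */
	static struct sched_domain_topology_level my_arch_topology[] = {
		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
		{ NULL, },
	};

	/* called once during boot, e.g. from the arch's topology init */
	set_sched_topology(my_arch_topology);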

Change since v3:
- remove last patch which was adding SD_SHARE_POWER_DOMAIN for powerpc's SMT
  level

Change since v2:
- remove patch 1 as it has been already applied to metag tree for v3.15
- updates some commit messages
- add new flag description in TOPOLOGY_SD_FLAGS description

Change since v1:
- move sched_domains_curr_level back under #ifdef CONFIG_NUMA
- use function pointer to set flag instead of a plain value.
- add list of tunable flags in the commit message of patch 2
- add SD_SHARE_POWER_DOMAIN flag for powerpc's SMT level

Vincent Guittot (5):
  sched: rework of sched_domain topology definition
  sched: s390: create a dedicated topology table
  sched: powerpc: create a dedicated topology table
  sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  sched: ARM: create a dedicated scheduler topology table

 arch/arm/kernel/topology.c       |  26 +++++
 arch/ia64/include/asm/topology.h |  24 ----
 arch/powerpc/kernel/smp.c        |  31 +++--
 arch/s390/include/asm/topology.h |  13 +--
 arch/s390/kernel/topology.c      |  20 ++++
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  49 +++++++-
 include/linux/topology.h         | 128 +++-----------------
 kernel/sched/core.c              | 244 ++++++++++++++++++++-------------------
 9 files changed, 255 insertions(+), 313 deletions(-)

-- 
1.9.1

* [PATCH v4 1/5] sched: rework of sched_domain topology definition
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
@ 2014-04-11  9:44 ` Vincent Guittot
  2014-04-18 10:56   ` Peter Zijlstra
  2014-04-11  9:44 ` [PATCH v4 2/5] sched: s390: create a dedicated topology table Vincent Guittot
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

We replace the old way to configure the scheduler topology with a new method
which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms
that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
be overridden by an arch which either wants to add a new level where load
balancing makes sense, like a BOOK or power-gating level, or wants to change
the flags configuration of some levels.

For each level, we need a function pointer that returns the cpumask for each
cpu, a function pointer that returns the flags for the level, and a name. Only
flags that describe the topology can be set by an architecture. The current
topology flags are:
 SD_SHARE_CPUPOWER
 SD_SHARE_PKG_RESOURCES
 SD_NUMA
 SD_ASYM_PACKING

Each level must be a subset of the next one. The build sequence of the
sched_domains will take care of removing useless levels like those with 1 CPU
and those with the same CPU span and no more relevant information for
load balancing than their children.
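
To make the subset rule concrete, here is a hypothetical 4-CPU machine with
2 cores of 2 SMT threads each (the masks are illustrative, not taken from a
real platform):

	/*
	 * Bottom-up, each level's span must be contained in the next one's:
	 *
	 *   SMT: cpu_smt_mask(0)       = {0,1}
	 *   MC:  cpu_coregroup_mask(0) = {0,1,2,3}
	 *   DIE: cpu_cpu_mask(0)       = {0,1,2,3}
	 *
	 * MC and DIE span the same CPUs here and DIE adds no information
	 * for load balancing, so the build sequence will degenerate the
	 * DIE level away.
	 */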

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 arch/ia64/include/asm/topology.h |  24 ----
 arch/s390/include/asm/topology.h |   2 -
 arch/tile/include/asm/topology.h |  33 ------
 include/linux/sched.h            |  48 ++++++++
 include/linux/topology.h         | 128 +++------------------
 kernel/sched/core.c              | 235 ++++++++++++++++++++-------------------
 6 files changed, 183 insertions(+), 287 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 5cb55a1..3202aa7 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
 };
 #endif
 
-#define SD_BOOK_INIT	SD_CPU_INIT
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_S390_TOPOLOGY_H */
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..9383118 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus)		((void)(bus), -1)
 
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 4,					\
-	.max_interval		= 128,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 32,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7cb07fd..3161701 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -874,6 +874,20 @@ enum cpu_idle_type {
 
 extern int __weak arch_sd_sibiling_asym_packing(void);
 
+#ifdef CONFIG_SCHED_SMT
+static inline const int cpu_smt_flags(void)
+{
+	return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
+#ifdef CONFIG_SCHED_MC
+static inline const int cpu_core_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 struct sched_domain_attr {
 	int relax_domain_level;
 };
@@ -980,6 +994,38 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
+typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+typedef const int (*sched_domain_flags_f)(void);
+
+#define SDTL_OVERLAP	0x01
+
+struct sd_data {
+	struct sched_domain **__percpu sd;
+	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
+};
+
+struct sched_domain_topology_level {
+	sched_domain_mask_f mask;
+	sched_domain_flags_f sd_flags;
+	int		    flags;
+	int		    numa_level;
+	struct sd_data      data;
+#ifdef CONFIG_SCHED_DEBUG
+	char                *name;
+#endif
+};
+
+extern struct sched_domain_topology_level *sched_domain_topology;
+
+extern void set_sched_topology(struct sched_domain_topology_level *tl);
+
+#ifdef CONFIG_SCHED_DEBUG
+# define SD_INIT_NAME(type)		.name = #type
+#else
+# define SD_INIT_NAME(type)
+#endif
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -995,6 +1041,8 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline void set_sched_topology(struct sched_domain_topology_level *tl) { }
+
 #endif	/* !CONFIG_SMP */
 
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..3a9db05 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
@@ -295,4 +180,17 @@ static inline int cpu_to_mem(int cpu)
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	return topology_thread_cpumask(cpu);
+}
+#endif
+
+static inline const struct cpumask *cpu_cpu_mask(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+
+
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9d9776..0714574 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5600,17 +5600,6 @@ static int __init isolated_cpu_setup(char *str)
 
 __setup("isolcpus=", isolated_cpu_setup);
 
-static const struct cpumask *cpu_cpu_mask(int cpu)
-{
-	return cpumask_of_node(cpu_to_node(cpu));
-}
-
-struct sd_data {
-	struct sched_domain **__percpu sd;
-	struct sched_group **__percpu sg;
-	struct sched_group_power **__percpu sgp;
-};
-
 struct s_data {
 	struct sched_domain ** __percpu sd;
 	struct root_domain	*rd;
@@ -5623,21 +5612,6 @@ enum s_alloc {
 	sa_none,
 };
 
-struct sched_domain_topology_level;
-
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-
-#define SDTL_OVERLAP	0x01
-
-struct sched_domain_topology_level {
-	sched_domain_init_f init;
-	sched_domain_mask_f mask;
-	int		    flags;
-	int		    numa_level;
-	struct sd_data      data;
-};
-
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
@@ -5866,34 +5840,6 @@ int __weak arch_sd_sibling_asym_packing(void)
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
  */
 
-#ifdef CONFIG_SCHED_DEBUG
-# define SD_INIT_NAME(sd, type)		sd->name = #type
-#else
-# define SD_INIT_NAME(sd, type)		do { } while (0)
-#endif
-
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
@@ -5981,97 +5927,156 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 }
 
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *cpu_smt_mask(int cpu)
-{
-	return topology_thread_cpumask(cpu);
-}
-#endif
-
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
-#endif
-#ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
-#endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
-
 #ifdef CONFIG_NUMA
-
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
 static int sched_domains_curr_level;
+#endif
 
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
-
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight, sd_flags = 0;
+
+#ifdef CONFIG_NUMA
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+#endif
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (tl->sd_flags)
+		sd_flags = (*tl->sd_flags)();
+	if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| sd_flags
 					,
+
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+		.smt_gain		= 0,
+		.max_newidle_lb_cost	= 0,
+		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
 
 	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
+	 * Convert topological properties into behaviour.
 	 */
-	sched_domains_curr_level = tl->numa_level;
+
+	if (sd->flags & SD_SHARE_CPUPOWER) {
+		sd->imbalance_pct = 110;
+		sd->smt_gain = 1178; /* ~15% */
+		sd->flags |= arch_sd_sibling_asym_packing();
+
+	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->imbalance_pct = 117;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+#ifdef CONFIG_NUMA
+	} else if (sd->flags & SD_NUMA) {
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
+
+		sd->flags |= SD_SERIALIZE;
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+			sd->flags &= ~(SD_BALANCE_EXEC |
+				       SD_BALANCE_FORK |
+				       SD_WAKE_AFFINE);
+		}
+
+#endif
+	} else {
+		sd->flags |= SD_PREFER_SIBLING;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
+	}
+
+	sd->private = &tl->data;
 
 	return sd;
 }
 
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+	sched_domain_topology = tl;
+}
+
+#ifdef CONFIG_NUMA
+
 static const struct cpumask *sd_numa_mask(int cpu)
 {
 	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
@@ -6215,7 +6220,10 @@ static void sched_init_numa(void)
 		}
 	}
 
-	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+	/* Compute default topology size */
+	for (i = 0; sched_domain_topology[i].mask; i++);
+
+	tl = kzalloc((i + level) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
@@ -6223,18 +6231,19 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
-		tl[i] = default_topology[i];
+	for (i = 0; sched_domain_topology[i].mask; i++)
+		tl[i] = sched_domain_topology[i];
 
 	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
+			.sd_flags = SD_NUMA,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
+			SD_INIT_NAME(NUMA)
 		};
 	}
 
@@ -6392,7 +6401,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
 
-- 
1.9.1

* [PATCH v4 2/5] sched: s390: create a dedicated topology table
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 1/5] sched: rework of sched_domain topology definition Vincent Guittot
@ 2014-04-11  9:44 ` Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 3/5] sched: powerpc: " Vincent Guittot
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

The BOOK level is only relevant for s390, so we create a dedicated topology
table with the BOOK level and remove it from the default table.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/s390/include/asm/topology.h | 11 +----------
 arch/s390/kernel/topology.c      | 20 ++++++++++++++++++++
 kernel/sched/core.c              |  3 ---
 3 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 07763bd..56af530 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS];
 
 #define mc_capable() 1
 
-static inline const struct cpumask *cpu_coregroup_mask(int cpu)
-{
-	return &cpu_topology[cpu].core_mask;
-}
-
-static inline const struct cpumask *cpu_book_mask(int cpu)
-{
-	return &cpu_topology[cpu].book_mask;
-}
-
 int topology_cpu_init(struct cpu *);
 int topology_set_cpu_management(int fc);
 void topology_schedule_update(void);
 void store_topology(struct sysinfo_15_1_x *info);
 void topology_expect_change(void);
+const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #else /* CONFIG_SCHED_BOOK */
 
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 4b2e3e3..ceddd77 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -443,6 +443,23 @@ int topology_cpu_init(struct cpu *cpu)
 	return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group);
 }
 
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].core_mask;
+}
+
+static const struct cpumask *cpu_book_mask(int cpu)
+{
+	return &cpu_topology[cpu].book_mask;
+}
+
+static struct sched_domain_topology_level s390_topology[] = {
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 static int __init topology_init(void)
 {
 	if (!MACHINE_HAS_TOPOLOGY) {
@@ -452,6 +469,9 @@ static int __init topology_init(void)
 	set_topology_timer();
 out:
 	update_cpu_masks();
+
+	set_sched_topology(s390_topology);
+
 	return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching);
 }
 device_initcall(topology_init);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0714574..b15b994 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6058,9 +6058,6 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-#ifdef CONFIG_SCHED_BOOK
-	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
-#endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
-- 
1.9.1

* [PATCH v4 3/5] sched: powerpc: create a dedicated topology table
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 1/5] sched: rework of sched_domain topology definition Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 2/5] sched: s390: create a dedicated topology table Vincent Guittot
@ 2014-04-11  9:44 ` Vincent Guittot
  2014-04-11  9:44 ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

Create a dedicated topology table for handling the asymmetric SMT feature of powerpc.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 31 +++++++++++++++++++++++--------
 include/linux/sched.h     |  2 --
 kernel/sched/core.c       |  6 ------
 3 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..c9cade5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -755,6 +755,28 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static const int powerpc_smt_flags(void)
+{
+	int flags = SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
+
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		flags |= SD_ASYM_PACKING;
+	}
+	return flags;
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -779,15 +801,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-}
+	set_sched_topology(powerpc_topology);
 
-int arch_sd_sibling_asym_packing(void)
-{
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
-	}
-	return 0;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3161701..7df3ca2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -872,8 +872,6 @@ enum cpu_idle_type {
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
 
-extern int __weak arch_sd_sibiling_asym_packing(void);
-
 #ifdef CONFIG_SCHED_SMT
 static inline const int cpu_smt_flags(void)
 {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b15b994..3743a6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5830,11 +5830,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
-int __weak arch_sd_sibling_asym_packing(void)
-{
-       return 0*SD_ASYM_PACKING;
-}
-
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6015,7 +6010,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	if (sd->flags & SD_SHARE_CPUPOWER) {
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->flags |= arch_sd_sibling_asym_packing();
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->imbalance_pct = 117;
-- 
1.9.1

* [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
                   ` (2 preceding siblings ...)
  2014-04-11  9:44 ` [PATCH v4 3/5] sched: powerpc: " Vincent Guittot
@ 2014-04-11  9:44 ` Vincent Guittot
  2014-04-18 10:58   ` Peter Zijlstra
  2014-04-11  9:44 ` [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
  2014-04-12 12:56 ` [PATCH v4 0/5] rework sched_domain topology description Dietmar Eggemann
  5 siblings, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs
in a sched_domain level can reach different power states or not. As an example,
the flag should be cleared at CPU level if groups of cores can be power gated
independently. This information can be used in the load balance decision or to
add a load balancing level between groups of CPUs that can power gate
independently.
This flag is part of the topology flags that can be set by an arch.
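
As a sketch, an arch-level flags function for a level whose members share
both package resources and a power domain could look like this (it mirrors
the cpu_corepower_flags() that patch 5/5 adds for ARM):

	static inline const int cpu_corepower_flags(void)
	{
		/* cores in this group share cache and a power domain */
		return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
	}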

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 10 +++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7df3ca2..18464cb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -865,6 +865,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3743a6c..a30b0b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5295,7 +5295,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
-			 SD_SHARE_PKG_RESOURCES)) {
+			 SD_SHARE_PKG_RESOURCES |
+			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5326,7 +5327,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
-				SD_PREFER_SIBLING);
+				SD_PREFER_SIBLING |
+				SD_SHARE_POWERDOMAIN);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5935,6 +5937,7 @@ static int sched_domains_curr_level;
  * SD_SHARE_CPUPOWER      - describes SMT topologies
  * SD_SHARE_PKG_RESOURCES - describes shared caches
  * SD_NUMA                - describes NUMA topologies
+ * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
  * Odd one out:
  * SD_ASYM_PACKING        - describes SMT quirks
@@ -5943,7 +5946,8 @@ static int sched_domains_curr_level;
 	(SD_SHARE_CPUPOWER |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
-	 SD_ASYM_PACKING)
+	 SD_ASYM_PACKING |		\
+	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
-- 
1.9.1

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
                   ` (3 preceding siblings ...)
  2014-04-11  9:44 ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
@ 2014-04-11  9:44 ` Vincent Guittot
  2014-04-23 11:46   ` Dietmar Eggemann
  2014-04-12 12:56 ` [PATCH v4 0/5] rework sched_domain topology description Dietmar Eggemann
  5 siblings, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-11  9:44 UTC (permalink / raw)
  To: linux-arm-kernel

Create a dedicated topology table for ARM which adds a new level to
differentiate CPUs that can power gate independently from those that cannot.

The patch gives an example of how to add a domain that will take advantage of
SD_SHARE_POWERDOMAIN.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..71e1fec 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+/*
+ * The current assumption is that we can power gate each core independently.
+ * This will be superseded by DT binding once available.
+ */
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	return &cpu_topology[cpu].thread_sibling;
+}
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+static inline const int cpu_corepower_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+}
+
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
 	smp_wmb();
 
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology(arm_topology);
 }
-- 
1.9.1

* [PATCH v4 0/5] rework sched_domain topology description
  2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
                   ` (4 preceding siblings ...)
  2014-04-11  9:44 ` [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
@ 2014-04-12 12:56 ` Dietmar Eggemann
  2014-04-14  7:29   ` Vincent Guittot
  2014-04-15  7:53   ` Peter Zijlstra
  5 siblings, 2 replies; 25+ messages in thread
From: Dietmar Eggemann @ 2014-04-12 12:56 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/04/14 10:44, Vincent Guittot wrote:
> This patchset was previously part of the larger tasks packing patchset [1].
> I have split the latter into 3 different patchsets (at least) to make
> things easier:
> - configuration of sched_domain topology (this patchset)
> - update and consolidation of cpu_power
> - tasks packing algorithm
>
> Based on Peter Z's proposal [2][3], this patchset modifies the way the
> sched_domain levels are configured in order to let architectures add
> specific levels, like the current BOOK level or the proposed power-gating
> level for the ARM architecture.
>
> [1] https://lkml.org/lkml/2013/10/18/121
> [2] https://lkml.org/lkml/2013/11/5/239
> [3] https://lkml.org/lkml/2013/11/5/449

Hi Vincent,

given the discussion we had for v1-v3 and a short boot test of v4:

For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT 
machine):

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Cheers,

-- Dietmar

>
> Change since v3:
> - remove last patch which was adding SD_SHARE_POWER_DOMAIN for powerpc's SMT
>    level
>
> Change since v2:
> - remove patch 1 as it has been already applied to metag tree for v3.15
> - updates some commit messages
> - add new flag description in TOPOLOGY_SD_FLAGS description
>
> Change since v1:
> - move sched_domains_curr_level back under #ifdef CONFIG_NUMA
> - use function pointer to set flag instead of a plain value.
> - add list of tunable flags in the commit message of patch 2
> - add SD_SHARE_POWER_DOMAIN flag for powerpc's SMT level
>
> Vincent Guittot (5):
>    sched: rework of sched_domain topology definition
>    sched: s390: create a dedicated topology table
>    sched: powerpc: create a dedicated topology table
>    sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
>    sched: ARM: create a dedicated scheduler topology table
>
>   arch/arm/kernel/topology.c       |  26 +++++
>   arch/ia64/include/asm/topology.h |  24 ----
>   arch/powerpc/kernel/smp.c        |  31 +++--
>   arch/s390/include/asm/topology.h |  13 +--
>   arch/s390/kernel/topology.c      |  20 ++++
>   arch/tile/include/asm/topology.h |  33 ------
>   include/linux/sched.h            |  49 +++++++-
>   include/linux/topology.h         | 128 +++-----------------
>   kernel/sched/core.c              | 244 ++++++++++++++++++++-------------------
>   9 files changed, 255 insertions(+), 313 deletions(-)
>

* [PATCH v4 0/5] rework sched_domain topology description
  2014-04-12 12:56 ` [PATCH v4 0/5] rework sched_domain topology description Dietmar Eggemann
@ 2014-04-14  7:29   ` Vincent Guittot
  2014-04-15  7:53   ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-14  7:29 UTC (permalink / raw)
  To: linux-arm-kernel

On 12 April 2014 14:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 11/04/14 10:44, Vincent Guittot wrote:
>>
>> This patchset was previously part of the larger tasks packing patchset
>> [1].
>> I have split the latter into 3 different patchsets (at least) to make
>> things easier:
>> - configuration of sched_domain topology (this patchset)
>> - update and consolidation of cpu_power
>> - tasks packing algorithm
>>
>> Based on Peter Z's proposal [2][3], this patchset modifies the way the
>> sched_domain levels are configured in order to let architectures add
>> specific levels, like the current BOOK level or the proposed power-gating
>> level for the ARM architecture.
>>
>> [1] https://lkml.org/lkml/2013/10/18/121
>> [2] https://lkml.org/lkml/2013/11/5/239
>> [3] https://lkml.org/lkml/2013/11/5/449
>
>
> Hi Vincent,
>
> given the discussion we had for v1-v3 and a short boot test of v4:
>
> For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT
> machine):
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>

Thanks

> Cheers,
>
> -- Dietmar
>
>
>>
>> Change since v3:
>> - remove last patch which was adding SD_SHARE_POWER_DOMAIN for powerpc's
>> SMT
>>    level
>>
>> Change since v2:
>> - remove patch 1 as it has been already applied to metag tree for v3.15
>> - updates some commit messages
>> - add new flag description in TOPOLOGY_SD_FLAGS description
>>
>> Change since v1:
>> - move sched_domains_curr_level back under #ifdef CONFIG_NUMA
>> - use function pointer to set flag instead of a plain value.
>> - add list of tunable flags in the commit message of patch 2
>> - add SD_SHARE_POWER_DOMAIN flag for powerpc's SMT level
>>
>> Vincent Guittot (5):
>>    sched: rework of sched_domain topology definition
>>    sched: s390: create a dedicated topology table
>>    sched: powerpc: create a dedicated topology table
>>    sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
>>    sched: ARM: create a dedicated scheduler topology table
>>
>>   arch/arm/kernel/topology.c       |  26 +++++
>>   arch/ia64/include/asm/topology.h |  24 ----
>>   arch/powerpc/kernel/smp.c        |  31 +++--
>>   arch/s390/include/asm/topology.h |  13 +--
>>   arch/s390/kernel/topology.c      |  20 ++++
>>   arch/tile/include/asm/topology.h |  33 ------
>>   include/linux/sched.h            |  49 +++++++-
>>   include/linux/topology.h         | 128 +++-----------------
>>   kernel/sched/core.c              | 244
>> ++++++++++++++++++++-------------------
>>   9 files changed, 255 insertions(+), 313 deletions(-)
>>
>
>

* [PATCH v4 0/5] rework sched_domain topology description
  2014-04-12 12:56 ` [PATCH v4 0/5] rework sched_domain topology description Dietmar Eggemann
  2014-04-14  7:29   ` Vincent Guittot
@ 2014-04-15  7:53   ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-15  7:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Apr 12, 2014 at 01:56:16PM +0100, Dietmar Eggemann wrote:
> On 11/04/14 10:44, Vincent Guittot wrote:
> >This patchset was previously part of the larger tasks packing patchset [1].
> >I have split the latter into 3 different patchsets (at least) to make
> >things easier:
> >- configuration of sched_domain topology (this patchset)
> >- update and consolidation of cpu_power
> >- tasks packing algorithm
> >
> >Based on Peter Z's proposal [2][3], this patchset modifies the way the
> >sched_domain levels are configured in order to let architectures add
> >specific levels, like the current BOOK level or the proposed power-gating
> >level for the ARM architecture.
> >
> >[1] https://lkml.org/lkml/2013/10/18/121
> >[2] https://lkml.org/lkml/2013/11/5/239
> >[3] https://lkml.org/lkml/2013/11/5/449
> 
> Hi Vincent,
> 
> given the discussion we had for v1-v3 and a short boot test of v4:
> 
> For patch 1/5, 4/5, 5/5 on ARM TC2 (heterogeneous dual socket w/o SMT
> machine):
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks guys!

* [PATCH v4 1/5] sched: rework of sched_domain topology definition
  2014-04-11  9:44 ` [PATCH v4 1/5] sched: rework of sched_domain topology definition Vincent Guittot
@ 2014-04-18 10:56   ` Peter Zijlstra
  2014-04-18 11:34     ` [PATCH] fix: " Vincent Guittot
  2014-04-18 11:34     ` [PATCH v4 1/5] " Vincent Guittot
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-18 10:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Apr 11, 2014 at 11:44:37AM +0200, Vincent Guittot wrote:
> We replace the old way to configure the scheduler topology with a new method
> which enables a platform to declare additional levels (if needed).
>
> We still have a default topology table definition that can be used by platforms
> that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
> be overridden by an arch which either wants to add a new level where load
> balancing makes sense, like a BOOK or power-gating level, or wants to change
> the flags configuration of some levels.
>
> For each level, we need a function pointer that returns the cpumask for each
> cpu, a function pointer that returns the flags for the level, and a name. Only
> flags that describe the topology can be set by an architecture. The current
> topology flags are:
>  SD_SHARE_CPUPOWER
>  SD_SHARE_PKG_RESOURCES
>  SD_NUMA
>  SD_ASYM_PACKING
>
> Each level must be a subset of the next one. The build sequence of the
> sched_domains will take care of removing useless levels like those with 1 CPU
> and those with the same CPU span and no more relevant information for
> load balancing than their children.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

On x86_64 defconfig this gets me:

kernel/sched/core.c: In function 'sched_init_numa':
kernel/sched/core.c:6197:4: warning: initialization makes pointer from integer without a cast [enabled by default]
kernel/sched/core.c:6197:4: warning: (near initialization for '(anonymous).sd_flags') [enabled by default]

* [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  2014-04-11  9:44 ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
@ 2014-04-18 10:58   ` Peter Zijlstra
  2014-04-18 11:54     ` [PATCH] fix: sched: rework of sched_domain topology definition Vincent Guittot
  2014-04-18 11:54     ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
  0 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-18 10:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Apr 11, 2014 at 11:44:40AM +0200, Vincent Guittot wrote:
> A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs
> in a sched_domain level can reach different power states or not. As an example,
> the flag should be cleared at CPU level if groups of cores can be power gated
> independently. This information can be used in the load balance decision or to
> add a load balancing level between groups of CPUs that can power gate
> independently.
> This flag is part of the topology flags that can be set by an arch.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

Building a cris-defconfig this gets me an endless stream of misery:

In file included from /usr/src/linux-2.6/init/do_mounts.c:12:0:
/usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: 'struct sched_domain_topology_level' declared inside parameter list [enabled by default]
/usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
In file included from /usr/src/linux-2.6/include/linux/utsname.h:5:0,
                 from /usr/src/linux-2.6/include/linux/init_task.h:6,
                 from /usr/src/linux-2.6/init/init_task.c:1:

Might be something about UP builds, didn't check x86-UP.

* [PATCH] fix: sched: rework of sched_domain topology definition
  2014-04-18 10:56   ` Peter Zijlstra
@ 2014-04-18 11:34     ` Vincent Guittot
  2014-04-18 11:39       ` Peter Zijlstra
  2014-04-18 11:34     ` [PATCH v4 1/5] " Vincent Guittot
  1 sibling, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-18 11:34 UTC (permalink / raw)
  To: linux-arm-kernel

Fix the missing change from int to function pointer for the NUMA flags.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 7 +++++++
 kernel/sched/core.c   | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18464cb..d2b981c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -887,6 +887,13 @@ static inline const int cpu_core_flags(void)
 }
 #endif
 
+#ifdef CONFIG_NUMA
+static inline const int cpu_numa_flags(void)
+{
+	return SD_NUMA;
+}
+#endif
+
 struct sched_domain_attr {
 	int relax_domain_level;
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a30b0b4..24d0831 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6235,7 +6235,7 @@ static void sched_init_numa(void)
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
 			.mask = sd_numa_mask,
-			.sd_flags = SD_NUMA,
+			.sd_flags = cpu_numa_flags,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
 			SD_INIT_NAME(NUMA)
-- 
1.9.1

* [PATCH v4 1/5] sched: rework of sched_domain topology definition
  2014-04-18 10:56   ` Peter Zijlstra
  2014-04-18 11:34     ` [PATCH] fix: " Vincent Guittot
@ 2014-04-18 11:34     ` Vincent Guittot
  1 sibling, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-18 11:34 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 April 2014 12:56, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 11, 2014 at 11:44:37AM +0200, Vincent Guittot wrote:
>> We replace the old way to configure the scheduler topology with a new method
>> which enables a platform to declare additional levels (if needed).
>>
>> We still have a default topology table definition that can be used by platforms
>> that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
>> be overridden by an arch which either wants to add a new level where load
>> balancing makes sense, like a BOOK or power-gating level, or wants to change
>> the flags configuration of some levels.
>>
>> For each level, we need a function pointer that returns the cpumask for each
>> cpu, a function pointer that returns the flags for the level, and a name. Only
>> flags that describe the topology can be set by an architecture. The current
>> topology flags are:
>>  SD_SHARE_CPUPOWER
>>  SD_SHARE_PKG_RESOURCES
>>  SD_NUMA
>>  SD_ASYM_PACKING
>>
>> Each level must be a subset of the next one. The build sequence of the
>> sched_domains will take care of removing useless levels like those with 1 CPU
>> and those with the same CPU span and no more relevant information for
>> load balancing than their children.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>
> On x86_64 defconfig this gets me:
>
> kernel/sched/core.c: In function 'sched_init_numa':
> kernel/sched/core.c:6197:4: warning: initialization makes pointer from integer without a cast [enabled by default]
> kernel/sched/core.c:6197:4: warning: (near initialization for '(anonymous).sd_flags') [enabled by default]

Sorry, while moving from int to function pointer, I forgot to change
the NUMA part.

I'm going to send a fix right now.



* [PATCH] fix: sched: rework of sched_domain topology definition
  2014-04-18 11:34     ` [PATCH] fix: " Vincent Guittot
@ 2014-04-18 11:39       ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-18 11:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Apr 18, 2014 at 01:34:06PM +0200, Vincent Guittot wrote:
> fix missing change from int to pointer function for numa flags
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks, folded it into the original patch.

* [PATCH] fix: sched: rework of sched_domain topology definition
  2014-04-18 10:58   ` Peter Zijlstra
@ 2014-04-18 11:54     ` Vincent Guittot
  2014-04-18 11:54     ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
  1 sibling, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-18 11:54 UTC (permalink / raw)
  To: linux-arm-kernel

set_sched_topology() doesn't need to be defined for non-SMP builds because
there is no sched_domain there.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2b981c..674b1fa 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,8 +1047,6 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
-static inline void set_sched_topology(struct sched_domain_topology_level *tl) { }
-
 #endif	/* !CONFIG_SMP */
 
 
-- 
1.9.1

* [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  2014-04-18 10:58   ` Peter Zijlstra
  2014-04-18 11:54     ` [PATCH] fix: sched: rework of sched_domain topology definition Vincent Guittot
@ 2014-04-18 11:54     ` Vincent Guittot
  1 sibling, 0 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-18 11:54 UTC (permalink / raw)
  To: linux-arm-kernel

On 18 April 2014 12:58, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 11, 2014 at 11:44:40AM +0200, Vincent Guittot wrote:
>> A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs
>> in a sched_domain level can reach different power states or not. As an example,
>> the flag should be cleared at CPU level if groups of cores can be power gated
>> independently. This information can be used in the load balance decision or to
>> add a load balancing level between groups of CPUs that can power gate
>> independently.
>> This flag is part of the topology flags that can be set by an arch.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>
> Building a cris-defconfig this gets me an endless stream of misery:
>
> In file included from /usr/src/linux-2.6/init/do_mounts.c:12:0:
> /usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: 'struct sched_domain_topology_level' declared inside parameter list [enabled by default]
> /usr/src/linux-2.6/include/linux/sched.h:1048:46: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
> In file included from /usr/src/linux-2.6/include/linux/utsname.h:5:0,
>                  from /usr/src/linux-2.6/include/linux/init_task.h:6,
>                  from /usr/src/linux-2.6/init/init_task.c:1:
>
> Might be something about UP builds, didn't check x86-UP.

Yes, you're right.
An empty set_sched_topology() function had been defined for UP, whereas it
can't be, because its sched_domain_topology_level parameter is only
available in SMP.

I'm going to send a fix.

Sorry for the dumb error.

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-11  9:44 ` [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
@ 2014-04-23 11:46   ` Dietmar Eggemann
  2014-04-23 14:46     ` Vincent Guittot
  0 siblings, 1 reply; 25+ messages in thread
From: Dietmar Eggemann @ 2014-04-23 11:46 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

I'm trying to use this approach of specifying different per-cpu views on
sd flags at DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and cluster
1 w/ CPU2/3/4, w/o SMT). It doesn't work like it does for the GMC/MC
sd level.

If I use the following patch (just to illustrate the issue) on top
(returning SD_SHARE_POWERDOMAIN in cpu_cpupower_flags() is bogus ... I
just needed a flags function for the GDIE level):

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec6d31a..803330d89c09 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -194,6 +194,14 @@ const struct cpumask *cpu_corepower_mask(int cpu)
 	return &cpu_topology[cpu].thread_sibling;
 }

+const struct cpumask *cpu_cpupower_mask(int cpu)
+{
+	return cpu_topology[cpu].socket_id ?
+			cpumask_of_node(cpu_to_node(cpu)) :
+			&cpu_topology[cpu].core_sibling;
+}
+
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -280,11 +288,18 @@ static inline const int cpu_corepower_flags(void)
 	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
 }

+static inline const int cpu_cpupower_flags(void)
+{
+	return SD_SHARE_POWERDOMAIN;
+}
+
+
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
+	{ cpu_cpupower_mask, cpu_cpupower_flags, SD_INIT_NAME(GDIE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };

so I get the following topology:

CPU0: cpu_corepower_mask=0   (GMC)
CPU0: cpu_coregroup_mask=0-1 (MC)
CPU0:  cpu_cpupower_mask=0-1 (GDIE)
CPU0:       cpu_cpu_mask=0-4 (DIE)
CPU1: cpu_corepower_mask=1    ...
CPU1: cpu_coregroup_mask=0-1
CPU1:  cpu_cpupower_mask=0-1
CPU1:       cpu_cpu_mask=0-4
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU2:  cpu_cpupower_mask=0-4
CPU2:       cpu_cpu_mask=0-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU3:  cpu_cpupower_mask=0-4
CPU3:       cpu_cpu_mask=0-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
CPU4:  cpu_cpupower_mask=0-4
CPU4:       cpu_cpu_mask=0-4

Firstly, I had to get rid of the cpumask_equal(cpu_map,
sched_domain_span(sd)) condition in build_sched_domains() so that I
can have two sd levels which span CPUs 0-4 (for CPU2/3/4).

But it still doesn't work correctly:

dmesg snippet 2:

CPU0 attaching sched-domain:
 domain 0: span 0-1 level MC
  groups: 0 1
  domain 1: span 0-4 level DIE     <-- error (there's only one group)
   groups: 0-4 (cpu_power = 2048)
...
CPU2 attaching sched-domain:
 domain 0: span 2-4 level MC
  groups: 2 3 4
  domain 1: span 0-4 level GDIE
ERROR: domain->groups does not contain CPU2
   groups:
ERROR: domain->cpu_power not set

ERROR: groups don't span domain->span
...

It turns out that get_group(), which is used a couple of times in
build_sched_groups(), references sd->child. Even though the degenerate
functionality gets rid of GDIE for CPU0/1 and of DIE for CPU2/3/4, the
group set-up doesn't work as expected, since sd->child for DIE is now
GDIE and not MC any more.
So it looks like the GMC/MC level is somehow special (GMC has no
sd->child on TC2, or GMC/MC contains only one cpu per group?).
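
(For reference, get_group() at this point looks roughly as follows -- a
sketch from the ~3.15 code, details elided; the sd->child dereference is
what trips up the folded GDIE/DIE case:)

static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
{
	struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
	struct sched_domain *child = sd->child;

	if (child)
		/* groups at this level are built from the child's span */
		cpu = cpumask_first(sched_domain_span(child));

	if (sg) {
		*sg = *per_cpu_ptr(sdd->sg, cpu);
		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
	}

	return cpu;
}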

Although this problem does not affect the current patch-set, people
might think that they can apply this degenerate trick for other sd
levels as well.

I'm trying to fix get_group()/build_sched_groups() in such a way that my
example would work but so far I haven't succeeded. The question for me
remains ... is this application of the degenerate trick feasible at all
in all sd levels, i.e. does it scale? What about platforms w/ SMT sd
level which want to use this trick in GMC/MC level?

Any hints are highly appreciated here.

-- Dietmar

On 11/04/14 10:44, Vincent Guittot wrote:
> Create a dedicated topology table for ARM which will create a new level to
> differentiate CPUs that can or cannot power gate independently from others.
> 
> The patch gives an example of how to add a domain that will take advantage of
> SD_SHARE_POWERDOMAIN.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 0bc94b1..71e1fec 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>  	return &cpu_topology[cpu].core_sibling;
>  }
>  
> +/*
> + * The current assumption is that we can power gate each core independently.
> + * This will be superseded by DT binding once available.
> + */
> +const struct cpumask *cpu_corepower_mask(int cpu)
> +{
> +	return &cpu_topology[cpu].thread_sibling;
> +}
> +
>  static void update_siblings_masks(unsigned int cpuid)
>  {
>  	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
>  		cpu_topology[cpuid].socket_id, mpidr);
>  }
>  
> +static inline const int cpu_corepower_flags(void)
> +{
> +	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
> +}
> +
> +static struct sched_domain_topology_level arm_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> +	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> +	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +#endif
> +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +	{ NULL, },
> +};
> +
>  /*
>   * init_cpu_topology is called at boot when only one cpu is running
>   * which prevent simultaneous write access to cpu_topology array
> @@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
>  	smp_wmb();
>  
>  	parse_dt_topology();
> +
> +	/* Set scheduler topology descriptor */
> +	set_sched_topology(arm_topology);
>  }
> 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-23 11:46   ` Dietmar Eggemann
@ 2014-04-23 14:46     ` Vincent Guittot
  2014-04-23 15:26       ` Dietmar Eggemann
  0 siblings, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-23 14:46 UTC (permalink / raw)
  To: linux-arm-kernel

On 23 April 2014 13:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi,
>
> I'm trying to use this approach of specifying different per-cpu views
> on sd flags at DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and
> cluster 1 w/ CPU2/3/4, w/o SMT). It doesn't work the way it does for
> the GMC/MC sd level.
>
> If I use the following patch (just to illustrate the issue) on top
> (returning SD_SHARE_POWERDOMAIN in cpu_cpupower_flags() is bogus ... I
> just needed a flag function for GDIE level):
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 71e1fec6d31a..803330d89c09 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -194,6 +194,14 @@ const struct cpumask *cpu_corepower_mask(int cpu)
>         return &cpu_topology[cpu].thread_sibling;
>  }
>
> +const struct cpumask *cpu_cpupower_mask(int cpu)
> +{
> +       return cpu_topology[cpu].socket_id ?
> +                       cpumask_of_node(cpu_to_node(cpu)) :
> +                       &cpu_topology[cpu].core_sibling;
> +}
> +
> +
>  static void update_siblings_masks(unsigned int cpuid)
>  {
>         struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -280,11 +288,18 @@ static inline const int cpu_corepower_flags(void)
>         return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>  }
>
> +static inline const int cpu_cpupower_flags(void)
> +{
> +       return SD_SHARE_POWERDOMAIN;
> +}
> +
> +
>  static struct sched_domain_topology_level arm_topology[] = {
>  #ifdef CONFIG_SCHED_MC
>         { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
>         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>  #endif
> +       { cpu_cpupower_mask, cpu_cpupower_flags, SD_INIT_NAME(GDIE) },
>         { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>         { NULL, },
>  };
>
> so I get the following topology:
>
> CPU0: cpu_corepower_mask=0   (GMC)
> CPU0: cpu_coregroup_mask=0-1 (MC)
> CPU0:  cpu_cpupower_mask=0-1 (GDIE)
> CPU0:       cpu_cpu_mask=0-4 (DIE)
> CPU1: cpu_corepower_mask=1    ...
> CPU1: cpu_coregroup_mask=0-1
> CPU1:  cpu_cpupower_mask=0-1
> CPU1:       cpu_cpu_mask=0-4
> CPU2: cpu_corepower_mask=2
> CPU2: cpu_coregroup_mask=2-4
> CPU2:  cpu_cpupower_mask=0-4
> CPU2:       cpu_cpu_mask=0-4
> CPU3: cpu_corepower_mask=3
> CPU3: cpu_coregroup_mask=2-4
> CPU3:  cpu_cpupower_mask=0-4
> CPU3:       cpu_cpu_mask=0-4
> CPU4: cpu_corepower_mask=4
> CPU4: cpu_coregroup_mask=2-4
> CPU4:  cpu_cpupower_mask=0-4
> CPU4:       cpu_cpu_mask=0-4

You  have an inconsistency in your topology description:

At GDIE level:
CPU1 cpu_cpupower_mask=0-1
but
CPU2 cpu_cpupower_mask=0-4

so CPU2 says that it shares a power domain with CPU0, but CPU1 says the opposite

Regards
Vincent

>
> Firstly, I had to get rid of the cpumask_equal(cpu_map,
> sched_domain_span(sd)) condition in build_sched_domains() so that I can
> have two sd levels which span CPUs 0-4 (for CPU2/3/4).
>
> But it still doesn't work correctly:
>
> dmesg snippet 2:
>
> CPU0 attaching sched-domain:
>  domain 0: span 0-1 level MC
>   groups: 0 1
>   domain 1: span 0-4 level DIE     <-- error (there's only one group)
>    groups: 0-4 (cpu_power = 2048)
> ...
> CPU2 attaching sched-domain:
>  domain 0: span 2-4 level MC
>   groups: 2 3 4
>   domain 1: span 0-4 level GDIE
> ERROR: domain->groups does not contain CPU2
>    groups:
> ERROR: domain->cpu_power not set
>
> ERROR: groups don't span domain->span
> ...
>
> It turns out that get_group(), which is used a couple of times in
> build_sched_groups(), references sd->child. Even though the degenerate
> functionality gets rid of GDIE for CPU0/1 and of DIE for CPU2/3/4, the
> group set-up doesn't work as expected, since sd->child for DIE is now
> GDIE and not MC any more.
> So it looks like the GMC/MC level is somehow special (GMC has no
> sd->child on TC2, or GMC/MC contains only one cpu per group?).
>
> Although this problem does not affect the current patch-set, people
> might think that they can apply this degenerate trick for other sd
> levels as well.
>
> I'm trying to fix get_group()/build_sched_groups() in such a way that my
> example would work but so far I haven't succeeded. The question for me
> remains ... is this application of the degenerate trick feasible at all
> in all sd levels, i.e. does it scale? What about platforms w/ SMT sd
> level which want to use this trick in GMC/MC level?
>
> Any hints are highly appreciated here.
>
> -- Dietmar
>
> On 11/04/14 10:44, Vincent Guittot wrote:
>> Create a dedicated topology table for ARM which will create a new level to
>> differentiate CPUs that can or cannot power gate independently from others.
>>
>> The patch gives an example of how to add a domain that will take advantage of
>> SD_SHARE_POWERDOMAIN.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index 0bc94b1..71e1fec 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>       return &cpu_topology[cpu].core_sibling;
>>  }
>>
>> +/*
>> + * The current assumption is that we can power gate each core independently.
>> + * This will be superseded by DT binding once available.
>> + */
>> +const struct cpumask *cpu_corepower_mask(int cpu)
>> +{
>> +     return &cpu_topology[cpu].thread_sibling;
>> +}
>> +
>>  static void update_siblings_masks(unsigned int cpuid)
>>  {
>>       struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
>> @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
>>               cpu_topology[cpuid].socket_id, mpidr);
>>  }
>>
>> +static inline const int cpu_corepower_flags(void)
>> +{
>> +     return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>> +}
>> +
>> +static struct sched_domain_topology_level arm_topology[] = {
>> +#ifdef CONFIG_SCHED_MC
>> +     { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
>> +     { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>> +#endif
>> +     { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +     { NULL, },
>> +};
>> +
>>  /*
>>   * init_cpu_topology is called at boot when only one cpu is running
>>   * which prevent simultaneous write access to cpu_topology array
>> @@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
>>       smp_wmb();
>>
>>       parse_dt_topology();
>> +
>> +     /* Set scheduler topology descriptor */
>> +     set_sched_topology(arm_topology);
>>  }
>>
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-23 14:46     ` Vincent Guittot
@ 2014-04-23 15:26       ` Dietmar Eggemann
  2014-04-24  7:30         ` Vincent Guittot
  0 siblings, 1 reply; 25+ messages in thread
From: Dietmar Eggemann @ 2014-04-23 15:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 23/04/14 15:46, Vincent Guittot wrote:
> On 23 April 2014 13:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> Hi,
>>
>> I'm trying to use this approach of specifying different per-cpu views
>> on sd flags at DIE level on a TC2 platform (cluster 0 w/ CPU0/1 and
>> cluster 1 w/ CPU2/3/4, w/o SMT). It doesn't work the way it does for
>> the GMC/MC sd level.
>>
>> If I use the following patch (just to illustrate the issue) on top
>> (returning SD_SHARE_POWERDOMAIN in cpu_cpupower_flags() is bogus ... I
>> just needed a flag function for GDIE level):
>>
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index 71e1fec6d31a..803330d89c09 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -194,6 +194,14 @@ const struct cpumask *cpu_corepower_mask(int cpu)
>>         return &cpu_topology[cpu].thread_sibling;
>>  }
>>
>> +const struct cpumask *cpu_cpupower_mask(int cpu)
>> +{
>> +       return cpu_topology[cpu].socket_id ?
>> +                       cpumask_of_node(cpu_to_node(cpu)) :
>> +                       &cpu_topology[cpu].core_sibling;
>> +}
>> +
>> +
>>  static void update_siblings_masks(unsigned int cpuid)
>>  {
>>         struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
>> @@ -280,11 +288,18 @@ static inline const int cpu_corepower_flags(void)
>>         return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>>  }
>>
>> +static inline const int cpu_cpupower_flags(void)
>> +{
>> +       return SD_SHARE_POWERDOMAIN;
>> +}
>> +
>> +
>>  static struct sched_domain_topology_level arm_topology[] = {
>>  #ifdef CONFIG_SCHED_MC
>>         { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
>>         { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>>  #endif
>> +       { cpu_cpupower_mask, cpu_cpupower_flags, SD_INIT_NAME(GDIE) },
>>         { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>         { NULL, },
>>  };
>>
>> so I get the following topology:
>>
>> CPU0: cpu_corepower_mask=0   (GMC)
>> CPU0: cpu_coregroup_mask=0-1 (MC)
>> CPU0:  cpu_cpupower_mask=0-1 (GDIE)
>> CPU0:       cpu_cpu_mask=0-4 (DIE)
>> CPU1: cpu_corepower_mask=1    ...
>> CPU1: cpu_coregroup_mask=0-1
>> CPU1:  cpu_cpupower_mask=0-1
>> CPU1:       cpu_cpu_mask=0-4
>> CPU2: cpu_corepower_mask=2
>> CPU2: cpu_coregroup_mask=2-4
>> CPU2:  cpu_cpupower_mask=0-4
>> CPU2:       cpu_cpu_mask=0-4
>> CPU3: cpu_corepower_mask=3
>> CPU3: cpu_coregroup_mask=2-4
>> CPU3:  cpu_cpupower_mask=0-4
>> CPU3:       cpu_cpu_mask=0-4
>> CPU4: cpu_corepower_mask=4
>> CPU4: cpu_coregroup_mask=2-4
>> CPU4:  cpu_cpupower_mask=0-4
>> CPU4:       cpu_cpu_mask=0-4
> 
> You  have an inconsistency in your topology description:

That's true functional-wise but I think that this is not the reason why
the code in get_group()/build_sched_groups() isn't working correctly any
more with this set-up.

Like I said above the cpu_cpupower_flags() is just bogus and I only
added it to illustrate the issue (Shouldn't have used
SD_SHARE_POWERDOMAIN in the first place).
Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE
related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.

I thought so far that I could achieve that by getting rid of the GDIE
sd level for CPU0/1, simply by choosing the cpu_foo_mask() function in
such a way that it returns the same cpu mask as its child sd level (MC),
and of the DIE sd level for CPU2/3/4, because there it returns the same
cpu mask as its child sd level (GDIE). This will let sd degenerate do
its job of folding sd levels, which it does. The only problem I have is
that the groups are not created correctly any more.

I don't see right now how the flag SD_SHARE_FOO affects the code in
get_group()/build_sched_groups().

Think of SD_SHARE_FOO as something I would like to have for all sd's of
the CPUs of cluster 1 (CPU2/3/4) and not of cluster 0 (CPU0/1), in the
sd level where each CPU sees two groups (group0 containing CPU0/1 and
group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).

-- Dietmar

> 
> At GDIE level:
> CPU1 cpu_cpupower_mask=0-1
> but
> CPU2 cpu_cpupower_mask=0-4
> 
> so CPU2 says that it shares a power domain with CPU0, but CPU1 says the opposite
> 
> Regards
> Vincent
> 
>>
>> Firstly, I had to get rid of the cpumask_equal(cpu_map,
>> sched_domain_span(sd)) condition in build_sched_domains() so that I can
>> have two sd levels which span CPUs 0-4 (for CPU2/3/4).
>>
>> But it still doesn't work correctly:
>>
>> dmesg snippet 2:
>>
>> CPU0 attaching sched-domain:
>>  domain 0: span 0-1 level MC
>>   groups: 0 1
>>   domain 1: span 0-4 level DIE     <-- error (there's only one group)
>>    groups: 0-4 (cpu_power = 2048)
>> ...
>> CPU2 attaching sched-domain:
>>  domain 0: span 2-4 level MC
>>   groups: 2 3 4
>>   domain 1: span 0-4 level GDIE
>> ERROR: domain->groups does not contain CPU2
>>    groups:
>> ERROR: domain->cpu_power not set
>>
>> ERROR: groups don't span domain->span
>> ...
>>
>> It turns out that get_group(), which is used a couple of times in
>> build_sched_groups(), references sd->child. Even though the degenerate
>> functionality gets rid of GDIE for CPU0/1 and of DIE for CPU2/3/4, the
>> group set-up doesn't work as expected, since sd->child for DIE is now
>> GDIE and not MC any more.
>> So it looks like the GMC/MC level is somehow special (GMC has no
>> sd->child on TC2, or GMC/MC contains only one cpu per group?).
>>
>> Although this problem does not affect the current patch-set, people
>> might think that they can apply this degenerate trick for other sd
>> levels as well.
>>
>> I'm trying to fix get_group()/build_sched_groups() in such a way that my
>> example would work but so far I haven't succeeded. The question for me
>> remains ... is this application of the degenerate trick feasible at all
>> in all sd levels, i.e. does it scale? What about platforms w/ SMT sd
>> level which want to use this trick in GMC/MC level?
>>
>> Any hints are highly appreciated here.
>>
>> -- Dietmar
>>
>> On 11/04/14 10:44, Vincent Guittot wrote:
>>> Create a dedicated topology table for ARM which will create a new level to
>>> differentiate CPUs that can or cannot power gate independently from others.
>>>
>>> The patch gives an example of how to add a domain that will take advantage of
>>> SD_SHARE_POWERDOMAIN.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>  arch/arm/kernel/topology.c | 26 ++++++++++++++++++++++++++
>>>  1 file changed, 26 insertions(+)
>>>
>>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>>> index 0bc94b1..71e1fec 100644
>>> --- a/arch/arm/kernel/topology.c
>>> +++ b/arch/arm/kernel/topology.c
>>> @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>>>       return &cpu_topology[cpu].core_sibling;
>>>  }
>>>
>>> +/*
>>> + * The current assumption is that we can power gate each core independently.
>>> + * This will be superseded by DT binding once available.
>>> + */
>>> +const struct cpumask *cpu_corepower_mask(int cpu)
>>> +{
>>> +     return &cpu_topology[cpu].thread_sibling;
>>> +}
>>> +
>>>  static void update_siblings_masks(unsigned int cpuid)
>>>  {
>>>       struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
>>> @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
>>>               cpu_topology[cpuid].socket_id, mpidr);
>>>  }
>>>
>>> +static inline const int cpu_corepower_flags(void)
>>> +{
>>> +     return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>>> +}
>>> +
>>> +static struct sched_domain_topology_level arm_topology[] = {
>>> +#ifdef CONFIG_SCHED_MC
>>> +     { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
>>> +     { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>>> +#endif
>>> +     { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>> +     { NULL, },
>>> +};
>>> +
>>>  /*
>>>   * init_cpu_topology is called at boot when only one cpu is running
>>>   * which prevent simultaneous write access to cpu_topology array
>>> @@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
>>>       smp_wmb();
>>>
>>>       parse_dt_topology();
>>> +
>>> +     /* Set scheduler topology descriptor */
>>> +     set_sched_topology(arm_topology);
>>>  }
>>>
>>
>>
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-23 15:26       ` Dietmar Eggemann
@ 2014-04-24  7:30         ` Vincent Guittot
  2014-04-24 12:48           ` Dietmar Eggemann
  0 siblings, 1 reply; 25+ messages in thread
From: Vincent Guittot @ 2014-04-24  7:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 23 April 2014 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/04/14 15:46, Vincent Guittot wrote:
>> On 23 April 2014 13:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> Hi,
>>>

[snip]

>> You  have an inconsistency in your topology description:
>
> That's true functional-wise but I think that this is not the reason why
> the code in get_group()/build_sched_groups() isn't working correctly any
> more with this set-up.
>
> Like I said above the cpu_cpupower_flags() is just bogus and I only
> added it to illustrate the issue (Shouldn't have used
> SD_SHARE_POWERDOMAIN in the first place).

More than the flag that is used for the example, it's about the
cpumasks, which are inconsistent across CPUs for the same level; the
build_sched_domain sequence relies on this consistency to build the
sched_groups.

> Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE
> related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.
>
> I thought so far that I could achieve that by getting rid of the GDIE
> sd level for CPU0/1, simply by choosing the cpu_foo_mask() function in
> such a way that it returns the same cpu mask as its child sd level (MC),
> and of the DIE sd level for CPU2/3/4, because there it returns the same
> cpu mask as its child sd level (GDIE). This will let sd degenerate do
> its job of folding sd levels, which it does. The only problem I have is
> that the groups are not created correctly any more.
>
> I don't see right now how the flag SD_SHARE_FOO affects the code in
> get_group()/build_sched_groups().
>
> Think of SD_SHARE_FOO as something I would like to have for all sd's of
> the CPUs of cluster 1 (CPU2/3/4) and not of cluster 0 (CPU0/1), in the
> sd level where each CPU sees two groups (group0 containing CPU0/1 and
> group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).

I'm not sure that it's feasible, because it's not possible from a
topology pov to have different flags if the span includes all cpus.
Could you give us more details about what you want to achieve with
this flag?

Vincent

>
> -- Dietmar
>
>>
>> At GDIE level:
>> CPU1 cpu_cpupower_mask=0-1
>> but
>> CPU2 cpu_cpupower_mask=0-4
>>
>> so CPU2 says that it shares a power domain with CPU0, but CPU1 says the opposite
>>
>> Regards
>> Vincent
>>
>>>
>>> Firstly, I had to get rid of the cpumask_equal(cpu_map,
>>> sched_domain_span(sd)) condition in build_sched_domains() so that I can
>>> have two sd levels which span CPUs 0-4 (for CPU2/3/4).
>>>
>>> But it still doesn't work correctly:
>>>
>>> dmesg snippet 2:
>>>
>>> CPU0 attaching sched-domain:
>>>  domain 0: span 0-1 level MC
>>>   groups: 0 1
>>>   domain 1: span 0-4 level DIE     <-- error (there's only one group)
>>>    groups: 0-4 (cpu_power = 2048)
>>> ...
>>> CPU2 attaching sched-domain:
>>>  domain 0: span 2-4 level MC
>>>   groups: 2 3 4
>>>   domain 1: span 0-4 level GDIE
>>> ERROR: domain->groups does not contain CPU2
>>>    groups:
>>> ERROR: domain->cpu_power not set
>>>
>>> ERROR: groups don't span domain->span
>>> ...
>>>

[snip]

>>>>
>>>
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-24  7:30         ` Vincent Guittot
@ 2014-04-24 12:48           ` Dietmar Eggemann
  2014-04-25  7:45             ` Vincent Guittot
  0 siblings, 1 reply; 25+ messages in thread
From: Dietmar Eggemann @ 2014-04-24 12:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 24/04/14 08:30, Vincent Guittot wrote:
> On 23 April 2014 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/04/14 15:46, Vincent Guittot wrote:
>>> On 23 April 2014 13:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> Hi,

[...]

> 
> More than the flag that is used for the example, it's about the
> cpumasks, which are inconsistent across CPUs for the same level; the
> build_sched_domain sequence relies on this consistency to build the
> sched_groups.

Now I'm lost here. I thought so far that by specifying different cpu
masks per CPU in an sd level, we get the sd level folding functionality
in sd degenerate?

We discussed this here for an example on TC2 for the GMC level:
https://lkml.org/lkml/2014/3/21/126

Back then I had
  CPU0: cpu_corepower_mask=0-1
  CPU2: cpu_corepower_mask=2
so for GMC level the cpumasks are inconsistent across CPUs and it worked.

The header of '[PATCH v4 1/5] sched: rework of sched_domain topology
definition' mentions only the requirement "Then, each level must be a
subset on the next one" and this one I haven't broken w/ my
GMC/MC/GDIE/DIE set-up.

Do I miss something else here?

> 
>> Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE
>> related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.
>>
>> I thought so far that I could achieve that by getting rid of the GDIE
>> sd level for CPU0/1, simply by choosing the cpu_foo_mask() function in
>> such a way that it returns the same cpu mask as its child sd level (MC),
>> and of the DIE sd level for CPU2/3/4, because there it returns the same
>> cpu mask as its child sd level (GDIE). This will let sd degenerate do
>> its job of folding sd levels, which it does. The only problem I have is
>> that the groups are not created correctly any more.
>>
>> I don't see right now how the flag SD_SHARE_FOO affects the code in
>> get_group()/build_sched_groups().
>>
>> Think of SD_SHARE_FOO as something I would like to have for all sd's of
>> the CPUs of cluster 1 (CPU2/3/4) and not of cluster 0 (CPU0/1), in the
>> sd level where each CPU sees two groups (group0 containing CPU0/1 and
>> group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).
> 
>> I'm not sure that it's feasible, because it's not possible from a
>> topology pov to have different flags if the span includes all cpus.
>> Could you give us more details about what you want to achieve with
>> this flag?

IMHO, the flag is not important for this discussion.  OTOH, information
like 'you can't use the sd degenerate functionality to fold adjacent sd
levels (GFOO/FOO) on an sd level which spans all CPUs' would be.  I want
to make sure we understand the limitations of folding adjacent sd levels
based on per-cpu differences in the return value of the cpu_mask
functions.

-- Dietmar

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-24 12:48           ` Dietmar Eggemann
@ 2014-04-25  7:45             ` Vincent Guittot
  2014-04-25 15:55               ` Dietmar Eggemann
  2014-04-25 16:04               ` Peter Zijlstra
  0 siblings, 2 replies; 25+ messages in thread
From: Vincent Guittot @ 2014-04-25  7:45 UTC (permalink / raw)
  To: linux-arm-kernel

On 24 April 2014 14:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 24/04/14 08:30, Vincent Guittot wrote:
>> On 23 April 2014 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 23/04/14 15:46, Vincent Guittot wrote:
>>>> On 23 April 2014 13:46, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>> Hi,
>
> [...]
>
>>
>> More than the flag that is used for the example, it's about the
>> cpumasks, which are inconsistent across CPUs for the same level; the
>> build_sched_domain sequence relies on this consistency to build the
>> sched_groups.
>
> Now I'm lost here. I thought so far that by specifying different cpu
> masks per CPU in an sd level, we get the sd level folding functionality
> in sd degenerate?
>
> We discussed this here for an example on TC2 for the GMC level:
> https://lkml.org/lkml/2014/3/21/126
>
> Back then I had
>   CPU0: cpu_corepower_mask=0-1
>   CPU2: cpu_corepower_mask=2
> so for GMC level the cpumasks are inconsistent across CPUs and it worked.

The example above is consistent because CPU2 mask and CPU0 mask are
fully exclusive

so
CPU0: cpu_corepower_mask=0-1
CPU2: cpu_corepower_mask=2
are consistent

CPU0: cpu_corepower_mask=0-2
CPU2: cpu_corepower_mask=0-2
are also consistent

but

CPU0: cpu_corepower_mask=0-1
CPU2: cpu_corepower_mask=0-2
are not consistent

and your example uses the last configuration
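
(Stated as code, the rule is that the masks at one level must be
pairwise identical or fully disjoint across CPUs -- a hypothetical
helper, purely to make the invariant concrete:)

static bool tl_masks_consistent(const struct cpumask *a,
				const struct cpumask *b)
{
	return cpumask_equal(a, b) || !cpumask_intersects(a, b);
}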

To be more precise, the rule above applies to the default SDT
definition, but the flag SD_OVERLAP enables this kind of overlap
between groups. Have you tried it?

Vincent

>
> The header of '[PATCH v4 1/5] sched: rework of sched_domain topology
> definition' mentions only the requirement "Then, each level must be a
> subset on the next one" and this one I haven't broken w/ my
> GMC/MC/GDIE/DIE set-up.
>
> Do I miss something else here?
>
>>
>>> Essentially what I want to do is bind an SD_SHARE_*FOO* flag to the GDIE
>>> related sd's of CPU2/3/4 and not to the DIE related sd's of CPU0/1.
>>>
>>> I thought so far that I could achieve that by getting rid of the GDIE
>>> sd level for CPU0/1, simply by choosing the cpu_foo_mask() function in
>>> such a way that it returns the same cpu mask as its child sd level (MC),
>>> and of the DIE sd level for CPU2/3/4, because there it returns the same
>>> cpu mask as its child sd level (GDIE). This will let sd degenerate do
>>> its job of folding sd levels, which it does. The only problem I have is
>>> that the groups are not created correctly any more.
>>>
>>> I don't see right now how the flag SD_SHARE_FOO affects the code in
>>> get_group()/build_sched_groups().
>>>
>>> Think of SD_SHARE_FOO as something I would like to have for all sd's of
>>> the CPUs of cluster 1 (CPU2/3/4) and not of cluster 0 (CPU0/1), in the
>>> sd level where each CPU sees two groups (group0 containing CPU0/1 and
>>> group1 containing CPU2/3/4, or vice versa) (GDIE/DIE).
>>
>> I'm not sure that it's feasible, because it's not possible from a
>> topology pov to have different flags if the span includes all cpus.
>> Could you give us more details about what you want to achieve with
>> this flag?
>
> IMHO, the flag is not important for this discussion.  OTOH, information
> like 'you can't use the sd degenerate functionality to fold adjacent sd
> levels (GFOO/FOO) on an sd level which spans all CPUs' would be.  I want
> to make sure we understand the limitations of folding adjacent sd levels
> based on per-cpu differences in the return value of the cpu_mask
> functions.
>
> -- Dietmar
>
> [...]
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-25  7:45             ` Vincent Guittot
@ 2014-04-25 15:55               ` Dietmar Eggemann
  2014-04-25 16:04               ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Dietmar Eggemann @ 2014-04-25 15:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 25/04/14 08:45, Vincent Guittot wrote:

[...]

>>
>> Back then I had
>>   CPU0: cpu_corepower_mask=0-1
>>   CPU2: cpu_corepower_mask=2
>> so for GMC level the cpumasks are inconsistent across CPUs and it worked.
> 
> The example above is consistent because CPU2 mask and CPU0 mask are
> fully exclusive

OK, got it now. The cpu mask functions on an sd level can either return
different (but then mutually exclusive) cpu masks, or they all return
the same cpu mask (the DIE level in the example). Like you said, we
still have to respect the topology of the system.

This essentially excludes the DIE level (i.e. the sd level which spans
all CPUs) from playing this 'sd level folding via sd degenerate' game
on a system which sets FORCE_SD_OVERLAP to false or doesn't use the
SDTL_OVERLAP tl flag.

> 
> so
> CPU0: cpu_corepower_mask=0-1
> CPU2: cpu_corepower_mask=2
> are consistent
> 
> CPU0: cpu_corepower_mask=0-2
> CPU2: cpu_corepower_mask=0-2
> are also consistent
> 
> but
> 
> CPU0: cpu_corepower_mask=0-1
> CPU2: cpu_corepower_mask=0-2
> are not consistent
> 
> and your example uses the last configuration
> 
> To be more precise, the rule above applies to the default SDT
> definition, but the flag SD_OVERLAP enables this kind of overlap
> between groups. Have you tried it?

Setting FORCE_SD_OVERLAP indeed changes the scenario a bit (we're now
using build_overlap_sched_groups() instead of build_sched_groups()). It
looks better, but the groups for CPU0/1 in DIE level are wrong (to get
this far I still have to comment out the 'if cpu_map is equal to the
span of sd then break' check in build_sched_domains() though).

dmesg snippet:

 CPU0 attaching sched-domain:
  domain 0: span 0-1 level MC
   groups: 0 1
   domain 1: span 0-4 level DIE
    groups: 0-4 (cpu_power = 5120) 0-1 (cpu_power = 2048) <-- error !!!
 CPU1 attaching sched-domain:
  domain 0: span 0-1 level MC
   groups: 1 0
   domain 1: span 0-4 level DIE
    groups: 0-1 (cpu_power = 2048) 0-4 (cpu_power = 5120) <-- error !!!
 CPU2 attaching sched-domain:
  domain 0: span 2-4 level GMC
   groups: 2 3 4
   domain 1: span 0-4 level GDIE
    groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
 ...

The feature I'm currently working on is to add sd energy information to
the sd levels of the sd topology level table. I essentially added
another column of sd-energy-related func ptrs (next to the flags-related
one) and wanted to use 'sd level folding via sd degenerate' in MC and
DIE level to have different sd energy information per CPU.
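
(A purely hypothetical shape of such an extended table entry, just to
make this concrete -- the extra column and the cpu_core_energy() /
cpu_cluster_energy() helpers do not exist in this patch set:)

static struct sched_domain_topology_level arm_topology[] = {
	{ cpu_coregroup_mask, cpu_core_flags, cpu_core_energy, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, NULL, cpu_cluster_energy, SD_INIT_NAME(DIE) },
	{ NULL, },
};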

In any case, this dependency on FORCE_SD_OVERLAP would be less than nice
for this feature though :-( A way out would be an 'int cpu' parameter,
but we already discussed this back then for the flag function.

Thanks,

-- Dietmar

> 
> Vincent
> 

[...]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-25  7:45             ` Vincent Guittot
  2014-04-25 15:55               ` Dietmar Eggemann
@ 2014-04-25 16:04               ` Peter Zijlstra
  2014-04-25 16:05                 ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-25 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

> The example above is consistent because CPU2 mask and CPU0 mask are
> fully exclusive
> 
> so
> CPU0: cpu_corepower_mask=0-1
> CPU2: cpu_corepower_mask=2
> are consistent
> 
> CPU0: cpu_corepower_mask=0-2
> CPU2: cpu_corepower_mask=0-2
> are also consistent
> 
> but
> 
> CPU0: cpu_corepower_mask=0-1
> CPU2: cpu_corepower_mask=0-2
> are not consistent
> 
> and your example uses the last configuration
> 
> To be more precise, the rule above applies to the default SDT
> definition, but the flag SD_OVERLAP enables this kind of overlap
> between groups. Have you tried it?

I've never tried degenerate stuff with SD_OVERLAP, it might horribly
explode -- it's not actually meant to work.

The SD_OVERLAP comes from not fully connected NUMA topologies; suppose
something like:

        0------1
        |      |
        |      |
	2------3

or:

 ( 10 20 20  0 )
 ( 20 10  0 20 )
 ( 20  0 10 20 )
 (  0 20 20 10 )

Your domain level that models the single-hop/20 distance has overlapping
masks:

N0: 0-2
N1: 0,1,3
N2: 0,2,3
N3: 1-3

I've never tried to construct a NUMA topology that would be overlapping
and have redundant bits in.
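
(To make the connection explicit: each node's mask at that level simply
collects every node within one hop. A sketch, assuming the usual
node_distance() helper and a hypothetical span[] array of cpumasks:)

	for_each_node(i) {
		cpumask_clear(span[i]);
		for_each_node(j)
			if (node_distance(i, j) <= 20)	/* self (10) or one hop (20) */
				cpumask_or(span[i], span[i], cpumask_of_node(j));
	}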

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table
  2014-04-25 16:04               ` Peter Zijlstra
@ 2014-04-25 16:05                 ` Peter Zijlstra
  0 siblings, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2014-04-25 16:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Apr 25, 2014 at 06:04:19PM +0200, Peter Zijlstra wrote:
> > The example above is consistent because CPU2 mask and CPU0 mask are
> > fully exclusive
> > 
> > so
> > CPU0: cpu_corepower_mask=0-1
> > CPU2: cpu_corepower_mask=2
> > are consistent
> > 
> > CPU0: cpu_corepower_mask=0-2
> > CPU2: cpu_corepower_mask=0-2
> > are also consistent
> > 
> > but
> > 
> > CPU0: cpu_corepower_mask=0-1
> > CPU2: cpu_corepower_mask=0-2
> > are not consistent
> > 
> > and your example uses the last configuration
> > 
> > To be more precise, the rule above applies to the default SDT
> > definition, but the flag SD_OVERLAP enables this kind of overlap
> > between groups. Have you tried it?
> 
> I've never tried degenerate stuff with SD_OVERLAP, it might horribly
> explode -- it's not actually meant to work.
> 
> The SD_OVERLAP comes from not fully connected NUMA topologies; suppose
> something like:
> 
>         0------1
>         |      |
>         |      |
>         2------3
> 
> or:
> 
>  ( 10 20 20  0 )
>  ( 20 10  0 20 )
>  ( 20  0 10 20 )
>  (  0 20 20 10 )

d'0h: s/0/30/

0 <-> 3 is 2 hops, too focused on the single hop case

> Your domain level that models the single-hop/20 distance has overlapping
> masks:
> 
> N0: 0-2
> N1: 0,1,3
> N2: 0,2,3
> N3: 1-3
> 
> I've never tried to construct a NUMA topology that would be overlapping
> and have redundant bits in.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-04-25 16:05 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-11  9:44 [PATCH v4 0/5] rework sched_domain topology description Vincent Guittot
2014-04-11  9:44 ` [PATCH v4 1/5] sched: rework of sched_domain topology definition Vincent Guittot
2014-04-18 10:56   ` Peter Zijlstra
2014-04-18 11:34     ` [PATCH] fix: " Vincent Guittot
2014-04-18 11:39       ` Peter Zijlstra
2014-04-18 11:34     ` [PATCH v4 1/5] " Vincent Guittot
2014-04-11  9:44 ` [PATCH v4 2/5] sched: s390: create a dedicated topology table Vincent Guittot
2014-04-11  9:44 ` [PATCH v4 3/5] sched: powerpc: " Vincent Guittot
2014-04-11  9:44 ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
2014-04-18 10:58   ` Peter Zijlstra
2014-04-18 11:54     ` [PATCH] fix: sched: rework of sched_domain topology definition Vincent Guittot
2014-04-18 11:54     ` [PATCH v4 4/5] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
2014-04-11  9:44 ` [PATCH v4 5/5] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
2014-04-23 11:46   ` Dietmar Eggemann
2014-04-23 14:46     ` Vincent Guittot
2014-04-23 15:26       ` Dietmar Eggemann
2014-04-24  7:30         ` Vincent Guittot
2014-04-24 12:48           ` Dietmar Eggemann
2014-04-25  7:45             ` Vincent Guittot
2014-04-25 15:55               ` Dietmar Eggemann
2014-04-25 16:04               ` Peter Zijlstra
2014-04-25 16:05                 ` Peter Zijlstra
2014-04-12 12:56 ` [PATCH v4 0/5] rework sched_domain topology description Dietmar Eggemann
2014-04-14  7:29   ` Vincent Guittot
2014-04-15  7:53   ` Peter Zijlstra
