linux-kernel.vger.kernel.org archive mirror
* [RFC 0/6] rework sched_domain topology description
@ 2014-03-05  7:18 Vincent Guittot
  2014-03-05  7:18 ` [RFC 1/6] sched: remove unused SCHED_INIT_NODE Vincent Guittot
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

This patchset was previously part of the larger tasks packing patchset [1].
I have split the latter into (at least) 3 different patchsets to make
things easier:
 - configuration of sched_domain topology (this patchset)
 - update and consolidation of cpu_power
 - tasks packing algorithm

Based on Peter Z's proposal [2][3], this patchset modifies the way the
sched_domain levels are configured in order to let architectures add
specific levels, like the current BOOK level or the proposed power-gating
level for the ARM architecture.

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2013/11/5/239
[3] https://lkml.org/lkml/2013/11/5/449

Vincent Guittot (6):
  sched: remove unused SCHED_INIT_NODE
  sched: rework of sched_domain topology definition
  sched: s390: create a dedicated topology table
  sched: powerpc: create a dedicated topology table
  sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  sched: ARM: create a dedicated scheduler topology table

 arch/arm/kernel/topology.c        |   26 ++++
 arch/ia64/include/asm/topology.h  |   24 ----
 arch/metag/include/asm/topology.h |   27 -----
 arch/powerpc/kernel/smp.c         |   35 ++++--
 arch/s390/include/asm/topology.h  |   13 +-
 arch/s390/kernel/topology.c       |   25 ++++
 arch/tile/include/asm/topology.h  |   33 ------
 include/linux/sched.h             |   30 +++++
 include/linux/topology.h          |  128 ++------------------
 kernel/sched/core.c               |  235 ++++++++++++++++++-------------------
 10 files changed, 237 insertions(+), 339 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC 1/6] sched: remove unused SCHED_INIT_NODE
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-05  7:18 ` [PATCH 2/6] sched: rework of sched_domain topology definition Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

Not used since the new NUMA scheduler init sequence.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/metag/include/asm/topology.h |   27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/arch/metag/include/asm/topology.h b/arch/metag/include/asm/topology.h
index 8e9c0b3..e95f874 100644
--- a/arch/metag/include/asm/topology.h
+++ b/arch/metag/include/asm/topology.h
@@ -3,33 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for Meta machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-	.max_newidle_lb_cost	= 0,			\
-	.next_decay_max_lb_cost	= jiffies,		\
-}
-
 #define cpu_to_node(cpu)	((void)(cpu), 0)
 #define parent_node(node)	((void)(node), 0)
 
-- 
1.7.9.5



* [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
  2014-03-05  7:18 ` [RFC 1/6] sched: remove unused SCHED_INIT_NODE Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-05 17:09   ` Dietmar Eggemann
  2014-03-05  7:18 ` [RFC 3/6] sched: s390: create a dedicated topology table Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

We replace the old way to configure the scheduler topology with a new
method which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by
platforms that don't want more levels than the SMT, MC, CPU and NUMA ones.
This table can be overwritten by an arch which wants to add new levels
where load balancing makes sense, like the BOOK or power-gating levels.

For each level, we need a function pointer that returns the cpumask for
each cpu, the flags configuration and a name. Each level must be a subset
of the next one. The build sequence of the sched_domain will take care of
removing useless levels like those that span only 1 CPU and those with the
same CPU span and no more relevant load-balancing information than their
child.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/ia64/include/asm/topology.h |   24 ----
 arch/s390/include/asm/topology.h |    2 -
 arch/tile/include/asm/topology.h |   33 ------
 include/linux/sched.h            |   29 +++++
 include/linux/topology.h         |  128 +++------------------
 kernel/sched/core.c              |  227 +++++++++++++++++++-------------------
 6 files changed, 156 insertions(+), 287 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..20d12fa 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -46,30 +46,6 @@
 
 void build_cpu_to_node_map(void);
 
-#define SD_CPU_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 1,			\
-	.max_interval		= 4,			\
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 2,			\
-	.idle_idx		= 1,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_WAKE_AFFINE,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 05425b1..07763bd 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
 };
 #endif
 
-#define SD_BOOK_INIT	SD_CPU_INIT
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_S390_TOPOLOGY_H */
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index d15c0d8..9383118 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 /* For now, use numa node -1 for global allocation. */
 #define pcibus_to_node(bus)		((void)(bus), -1)
 
-/*
- * TILE architecture has many cores integrated in one processor, so we need
- * setup bigger balance_interval for both CPU/NODE scheduling domains to
- * reduce process scheduling costs.
- */
-
-/* sched_domains SD_CPU_INIT for TILE architecture */
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 4,					\
-	.max_interval		= 128,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 32,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 825ed83..dbc35dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -976,6 +976,35 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
+typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
+
+#define SDTL_OVERLAP	0x01
+
+struct sd_data {
+	struct sched_domain **__percpu sd;
+	struct sched_group **__percpu sg;
+	struct sched_group_power **__percpu sgp;
+};
+
+struct sched_domain_topology_level {
+	sched_domain_mask_f mask;
+	int		    sd_flags;
+	int		    flags;
+	int		    numa_level;
+	struct sd_data      data;
+#ifdef CONFIG_SCHED_DEBUG
+	char                *name;
+#endif
+};
+
+extern struct sched_domain_topology_level *sched_domain_topology;
+
+#ifdef CONFIG_SCHED_DEBUG
+# define SD_INIT_NAME(type)		.name = #type
+#else
+# define SD_INIT_NAME(type)
+#endif
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..3a9db05 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
 #define PENALTY_FOR_NODE_WITH_CPUS	(1)
 #endif
 
-/*
- * Below are the 3 major initializers used in building sched_domains:
- * SD_SIBLING_INIT, for SMT domains
- * SD_CPU_INIT, for SMP domains
- *
- * Any architecture that cares to do any tuning to these values should do so
- * by defining their own arch-specific initializer in include/asm/topology.h.
- * A definition there will automagically override these default initializers
- * and allow arch-specific performance tuning of sched_domains.
- * (Only non-zero and non-null fields need be specified.)
- */
-
-#ifdef CONFIG_SCHED_SMT
-/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
- * so can't we drop this in favor of CONFIG_SCHED_SMT?
- */
-#define ARCH_HAS_SCHED_WAKE_IDLE
-/* Common values for SMT siblings */
-#ifndef SD_SIBLING_INIT
-#define SD_SIBLING_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 2,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 110,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 1*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				| arch_sd_sibling_asym_packing()	\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.smt_gain		= 1178,	/* 15% */			\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_SMT */
-
-#ifdef CONFIG_SCHED_MC
-/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
-#ifndef SD_MC_INIT
-#define SD_MC_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 1*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-#endif /* CONFIG_SCHED_MC */
-
-/* Common values for CPUs */
-#ifndef SD_CPU_INIT
-#define SD_CPU_INIT (struct sched_domain) {				\
-	.min_interval		= 1,					\
-	.max_interval		= 4,					\
-	.busy_factor		= 64,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 2,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 0*SD_SERIALIZE			\
-				| 1*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-	.max_newidle_lb_cost	= 0,					\
-	.next_decay_max_lb_cost	= jiffies,				\
-}
-#endif
-
-#ifdef CONFIG_SCHED_BOOK
-#ifndef SD_BOOK_INIT
-#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
-#endif
-#endif /* CONFIG_SCHED_BOOK */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
@@ -295,4 +180,17 @@ static inline int cpu_to_mem(int cpu)
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
 
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	return topology_thread_cpumask(cpu);
+}
+#endif
+
+static inline const struct cpumask *cpu_cpu_mask(int cpu)
+{
+	return cpumask_of_node(cpu_to_node(cpu));
+}
+
+
 #endif /* _LINUX_TOPOLOGY_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee8004c..6f6a8f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5588,17 +5588,6 @@ static int __init isolated_cpu_setup(char *str)
 
 __setup("isolcpus=", isolated_cpu_setup);
 
-static const struct cpumask *cpu_cpu_mask(int cpu)
-{
-	return cpumask_of_node(cpu_to_node(cpu));
-}
-
-struct sd_data {
-	struct sched_domain **__percpu sd;
-	struct sched_group **__percpu sg;
-	struct sched_group_power **__percpu sgp;
-};
-
 struct s_data {
 	struct sched_domain ** __percpu sd;
 	struct root_domain	*rd;
@@ -5611,21 +5600,6 @@ enum s_alloc {
 	sa_none,
 };
 
-struct sched_domain_topology_level;
-
-typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
-typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-
-#define SDTL_OVERLAP	0x01
-
-struct sched_domain_topology_level {
-	sched_domain_init_f init;
-	sched_domain_mask_f mask;
-	int		    flags;
-	int		    numa_level;
-	struct sd_data      data;
-};
-
 /*
  * Build an iteration mask that can exclude certain CPUs from the upwards
  * domain traversal.
@@ -5854,34 +5828,6 @@ int __weak arch_sd_sibling_asym_packing(void)
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
  */
 
-#ifdef CONFIG_SCHED_DEBUG
-# define SD_INIT_NAME(sd, type)		sd->name = #type
-#else
-# define SD_INIT_NAME(sd, type)		do { } while (0)
-#endif
-
-#define SD_INIT_FUNC(type)						\
-static noinline struct sched_domain *					\
-sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
-{									\
-	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);	\
-	*sd = SD_##type##_INIT;						\
-	SD_INIT_NAME(sd, type);						\
-	sd->private = &tl->data;					\
-	return sd;							\
-}
-
-SD_INIT_FUNC(CPU)
-#ifdef CONFIG_SCHED_SMT
- SD_INIT_FUNC(SIBLING)
-#endif
-#ifdef CONFIG_SCHED_MC
- SD_INIT_FUNC(MC)
-#endif
-#ifdef CONFIG_SCHED_BOOK
- SD_INIT_FUNC(BOOK)
-#endif
-
 static int default_relax_domain_level = -1;
 int sched_domain_level_max;
 
@@ -5969,97 +5915,148 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
 }
 
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *cpu_smt_mask(int cpu)
-{
-	return topology_thread_cpumask(cpu);
-}
-#endif
-
-/*
- * Topology list, bottom-up.
- */
-static struct sched_domain_topology_level default_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ sd_init_SIBLING, cpu_smt_mask, },
-#endif
-#ifdef CONFIG_SCHED_MC
-	{ sd_init_MC, cpu_coregroup_mask, },
-#endif
-#ifdef CONFIG_SCHED_BOOK
-	{ sd_init_BOOK, cpu_book_mask, },
-#endif
-	{ sd_init_CPU, cpu_cpu_mask, },
-	{ NULL, },
-};
-
-static struct sched_domain_topology_level *sched_domain_topology = default_topology;
-
-#define for_each_sd_topology(tl)			\
-	for (tl = sched_domain_topology; tl->init; tl++)
+static int sched_domains_curr_level;
 
 #ifdef CONFIG_NUMA
-
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
-static int sched_domains_curr_level;
-
-static inline int sd_local_flags(int level)
-{
-	if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
-		return 0;
+#endif
 
-	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
-}
+/*
+ * SD_flags allowed in topology descriptions.
+ *
+ * SD_SHARE_CPUPOWER      - describes SMT topologies
+ * SD_SHARE_PKG_RESOURCES - describes shared caches
+ * SD_NUMA                - describes NUMA topologies
+ *
+ * Odd one out:
+ * SD_ASYM_PACKING        - describes SMT quirks
+ */
+#define TOPOLOGY_SD_FLAGS		\
+	(SD_SHARE_CPUPOWER |		\
+	 SD_SHARE_PKG_RESOURCES |	\
+	 SD_NUMA |			\
+	 SD_ASYM_PACKING)
 
 static struct sched_domain *
-sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+sd_init(struct sched_domain_topology_level *tl, int cpu)
 {
 	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
-	int level = tl->numa_level;
-	int sd_weight = cpumask_weight(
-			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+	int sd_weight;
+
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	sd_weight = cpumask_weight(tl->mask(cpu));
+
+	if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
+			"wrong sd_flags in topology description\n"))
+		tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
 	*sd = (struct sched_domain){
 		.min_interval		= sd_weight,
 		.max_interval		= 2*sd_weight,
 		.busy_factor		= 32,
 		.imbalance_pct		= 125,
-		.cache_nice_tries	= 2,
-		.busy_idx		= 3,
-		.idle_idx		= 2,
+
+		.cache_nice_tries	= 0,
+		.busy_idx		= 0,
+		.idle_idx		= 0,
 		.newidle_idx		= 0,
 		.wake_idx		= 0,
 		.forkexec_idx		= 0,
 
 		.flags			= 1*SD_LOAD_BALANCE
 					| 1*SD_BALANCE_NEWIDLE
-					| 0*SD_BALANCE_EXEC
-					| 0*SD_BALANCE_FORK
+					| 1*SD_BALANCE_EXEC
+					| 1*SD_BALANCE_FORK
 					| 0*SD_BALANCE_WAKE
-					| 0*SD_WAKE_AFFINE
+					| 1*SD_WAKE_AFFINE
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
-					| 1*SD_SERIALIZE
+					| 0*SD_SERIALIZE
 					| 0*SD_PREFER_SIBLING
-					| 1*SD_NUMA
-					| sd_local_flags(level)
+					| 0*SD_NUMA
+					| tl->sd_flags
 					,
+
 		.last_balance		= jiffies,
 		.balance_interval	= sd_weight,
+		.smt_gain		= 0,
+		.max_newidle_lb_cost	= 0,
+		.next_decay_max_lb_cost	= jiffies,
+#ifdef CONFIG_SCHED_DEBUG
+		.name			= tl->name,
+#endif
 	};
-	SD_INIT_NAME(sd, NUMA);
-	sd->private = &tl->data;
 
 	/*
-	 * Ugly hack to pass state to sd_numa_mask()...
+	 * Convert topological properties into behaviour.
 	 */
-	sched_domains_curr_level = tl->numa_level;
+
+	if (sd->flags & SD_SHARE_CPUPOWER) {
+		sd->imbalance_pct = 110;
+		sd->smt_gain = 1178; /* ~15% */
+		sd->flags |= arch_sd_sibling_asym_packing();
+
+	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+		sd->imbalance_pct = 117;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+
+#ifdef CONFIG_NUMA
+	} else if (sd->flags & SD_NUMA) {
+		sd->cache_nice_tries = 2;
+		sd->busy_idx = 3;
+		sd->idle_idx = 2;
+
+		sd->flags |= SD_SERIALIZE;
+		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
+			sd->flags &= ~(SD_BALANCE_EXEC |
+				       SD_BALANCE_FORK |
+				       SD_WAKE_AFFINE);
+		}
+
+#endif
+	} else {
+		sd->flags |= SD_PREFER_SIBLING;
+		sd->cache_nice_tries = 1;
+		sd->busy_idx = 2;
+		sd->idle_idx = 1;
+	}
+
+	sd->private = &tl->data;
 
 	return sd;
 }
 
+/*
+ * Topology list, bottom-up.
+ */
+static struct sched_domain_topology_level default_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+#ifdef CONFIG_SCHED_BOOK
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+struct sched_domain_topology_level *sched_domain_topology = default_topology;
+
+#define for_each_sd_topology(tl)			\
+	for (tl = sched_domain_topology; tl->mask; tl++)
+
+#ifdef CONFIG_NUMA
+
 static const struct cpumask *sd_numa_mask(int cpu)
 {
 	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
@@ -6203,7 +6200,10 @@ static void sched_init_numa(void)
 		}
 	}
 
-	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+	/* Compute default topology size */
+	for (i = 0; sched_domain_topology[i].mask; i++);
+
+	tl = kzalloc((i + level) *
 			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
 	if (!tl)
 		return;
@@ -6211,18 +6211,19 @@ static void sched_init_numa(void)
 	/*
 	 * Copy the default topology bits..
 	 */
-	for (i = 0; default_topology[i].init; i++)
-		tl[i] = default_topology[i];
+	for (i = 0; sched_domain_topology[i].mask; i++)
+		tl[i] = sched_domain_topology[i];
 
 	/*
 	 * .. and append 'j' levels of NUMA goodness.
 	 */
 	for (j = 0; j < level; i++, j++) {
 		tl[i] = (struct sched_domain_topology_level){
-			.init = sd_numa_init,
 			.mask = sd_numa_mask,
+			.sd_flags = SD_NUMA,
 			.flags = SDTL_OVERLAP,
 			.numa_level = j,
+			SD_INIT_NAME(NUMA)
 		};
 	}
 
@@ -6380,7 +6381,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
 		const struct cpumask *cpu_map, struct sched_domain_attr *attr,
 		struct sched_domain *child, int cpu)
 {
-	struct sched_domain *sd = tl->init(tl, cpu);
+	struct sched_domain *sd = sd_init(tl, cpu);
 	if (!sd)
 		return child;
 
-- 
1.7.9.5



* [RFC 3/6] sched: s390: create a dedicated topology table
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
  2014-03-05  7:18 ` [RFC 1/6] sched: remove unused SCHED_INIT_NODE Vincent Guittot
  2014-03-05  7:18 ` [PATCH 2/6] sched: rework of sched_domain topology definition Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-05  7:18 ` [RFC 4/6] sched: powerpc: " Vincent Guittot
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

The BOOK level is only relevant for s390, so we create a dedicated
topology table with the BOOK level and remove it from the default table.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/s390/include/asm/topology.h |   11 +----------
 arch/s390/kernel/topology.c      |   25 +++++++++++++++++++++++++
 kernel/sched/core.c              |    3 ---
 3 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
index 07763bd..56af530 100644
--- a/arch/s390/include/asm/topology.h
+++ b/arch/s390/include/asm/topology.h
@@ -26,21 +26,12 @@ extern struct cpu_topology_s390 cpu_topology[NR_CPUS];
 
 #define mc_capable() 1
 
-static inline const struct cpumask *cpu_coregroup_mask(int cpu)
-{
-	return &cpu_topology[cpu].core_mask;
-}
-
-static inline const struct cpumask *cpu_book_mask(int cpu)
-{
-	return &cpu_topology[cpu].book_mask;
-}
-
 int topology_cpu_init(struct cpu *);
 int topology_set_cpu_management(int fc);
 void topology_schedule_update(void);
 void store_topology(struct sysinfo_15_1_x *info);
 void topology_expect_change(void);
+const struct cpumask *cpu_coregroup_mask(int cpu);
 
 #else /* CONFIG_SCHED_BOOK */
 
diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
index 4b2e3e3..8a7ac47 100644
--- a/arch/s390/kernel/topology.c
+++ b/arch/s390/kernel/topology.c
@@ -443,6 +443,28 @@ int topology_cpu_init(struct cpu *cpu)
 	return sysfs_create_group(&cpu->dev.kobj, &topology_cpu_attr_group);
 }
 
+const struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].core_mask;
+}
+
+static const struct cpumask *cpu_book_mask(int cpu)
+{
+	return &cpu_topology[cpu].book_mask;
+}
+
+static struct sched_domain_topology_level s390_topology[] = {
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = s390_topology;
+}
+
 static int __init topology_init(void)
 {
 	if (!MACHINE_HAS_TOPOLOGY) {
@@ -452,6 +474,9 @@ static int __init topology_init(void)
 	set_topology_timer();
 out:
 	update_cpu_masks();
+
+	set_sched_topology();
+
 	return device_create_file(cpu_subsys.dev_root, &dev_attr_dispatching);
 }
 device_initcall(topology_init);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6f6a8f0..3479467 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6043,9 +6043,6 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
 #endif
-#ifdef CONFIG_SCHED_BOOK
-	{ cpu_book_mask, SD_INIT_NAME(BOOK) },
-#endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
-- 
1.7.9.5



* [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
                   ` (2 preceding siblings ...)
  2014-03-05  7:18 ` [RFC 3/6] sched: s390: create a dedicated topology table Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-11 10:08   ` Preeti U Murthy
  2014-03-05  7:18 ` [RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

Create a dedicated topology table for handling the asymmetric SMT feature.
The current proposal creates a new level which describes which groups of
CPUs take advantage of SD_ASYM_PACKING. The useless level will be removed
during the build of the sched_domain topology.

Another solution would be to set SD_ASYM_PACKING in the sd_flags of the
SMT level during the boot sequence, before the build of the sched_domain
topology.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/powerpc/kernel/smp.c |   35 +++++++++++++++++++++++++++--------
 kernel/sched/core.c       |    6 ------
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ac2621a..75da054 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static const struct cpumask *cpu_asmt_mask(int cpu)
+{
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		return topology_thread_cpumask(cpu);
+	}
+	return cpumask_of(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = powerpc_topology;
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	cpumask_var_t old_mask;
@@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-}
+	set_sched_topology();
 
-int arch_sd_sibling_asym_packing(void)
-{
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		return SD_ASYM_PACKING;
-	}
-	return 0;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3479467..7606de0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
-int __weak arch_sd_sibling_asym_packing(void)
-{
-       return 0*SD_ASYM_PACKING;
-}
-
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	if (sd->flags & SD_SHARE_CPUPOWER) {
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
-		sd->flags |= arch_sd_sibling_asym_packing();
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
 		sd->imbalance_pct = 117;
-- 
1.7.9.5



* [RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
                   ` (3 preceding siblings ...)
  2014-03-05  7:18 ` [RFC 4/6] sched: powerpc: " Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-05  7:18 ` [RFC 6/6] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
  2014-03-05 23:17 ` [RFC 0/6] rework sched_domain topology description Dietmar Eggemann
  6 siblings, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

A new flag SD_SHARE_POWERDOMAIN is created to reflect whether the groups
of CPUs in a sched_domain level can reach different power states or not.
As an example, the flag should be cleared at CPU level if groups of cores
can be power gated independently. This information can be used to add a
load-balancing level between groups of CPUs that can power gate
independently. The default behavior of the scheduler is to spread tasks
across CPUs and groups of CPUs, so the flag is set in all sched_domains.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dbc35dd..182a080 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -861,6 +861,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
 #define SD_SHARE_CPUPOWER	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800  /* Place busy groups earlier in the domain */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7606de0..b28cff0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5283,7 +5283,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
-			 SD_SHARE_PKG_RESOURCES)) {
+			 SD_SHARE_PKG_RESOURCES |
+			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5314,7 +5315,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_EXEC |
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
-				SD_PREFER_SIBLING);
+				SD_PREFER_SIBLING |
+				SD_SHARE_POWERDOMAIN);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -5932,7 +5934,8 @@ static struct cpumask ***sched_domains_numa_masks;
 	(SD_SHARE_CPUPOWER |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
-	 SD_ASYM_PACKING)
+	 SD_ASYM_PACKING |		\
+	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC 6/6] sched: ARM: create a dedicated scheduler topology table
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
                   ` (4 preceding siblings ...)
  2014-03-05  7:18 ` [RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
@ 2014-03-05  7:18 ` Vincent Guittot
  2014-03-05 22:38   ` Dietmar Eggemann
  2014-03-05 23:17 ` [RFC 0/6] rework sched_domain topology description Dietmar Eggemann
  6 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-05  7:18 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel
  Cc: dietmar.eggemann, preeti, linaro-kernel, Vincent Guittot

Create a dedicated topology table for ARM which will create a new level to
differentiate CPUs that can or cannot power gate independently from others.

The patch gives an example of how to add a domain that will take advantage of
SD_SHARE_POWERDOMAIN.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0bc94b1..ae8ffbc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return &cpu_topology[cpu].core_sibling;
 }
 
+/*
+ * The current assumption is that we can power gate each core independently.
+ * This will be superseded by DT binding once available.
+ */
+const struct cpumask *cpu_corepower_mask(int cpu)
+{
+	return &cpu_topology[cpu].thread_sibling;
+}
+
 static void update_siblings_masks(unsigned int cpuid)
 {
 	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+static struct sched_domain_topology_level arm_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+	sched_domain_topology = arm_topology;
+}
+
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
 	smp_wmb();
 
 	parse_dt_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology();
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-05  7:18 ` [PATCH 2/6] sched: rework of sched_domain topology definition Vincent Guittot
@ 2014-03-05 17:09   ` Dietmar Eggemann
  2014-03-06  8:32     ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-05 17:09 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, tony.luck,
	fenghua.yu, schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel
  Cc: preeti, linaro-kernel

On 05/03/14 07:18, Vincent Guittot wrote:
> We replace the old way to configure the scheduler topology with a new method
> which enables a platform to declare additional levels (if needed).
>
> We still have a default topology table definition that can be used by platforms
> that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
> be overwritten by an arch which wants to add new levels where a load balance
> makes sense, like a BOOK or powergating level.
>
> For each level, we need a function pointer that returns the cpumask for each
> cpu, the flags configuration and a name. Each level must be a subset of

Maybe it's worth mentioning here that those flags are from the set of 
the sd topology flags, to distinguish them from the set of sd 
behavioural flags. The latter can't be set via this interface.

> the next one. The build sequence of the sched_domains will take care of
> removing useless levels, like those with only 1 CPU and those with the same
> CPU span and no more relevant load balancing information than their child.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   arch/ia64/include/asm/topology.h |   24 ----
>   arch/s390/include/asm/topology.h |    2 -
>   arch/tile/include/asm/topology.h |   33 ------
>   include/linux/sched.h            |   29 +++++
>   include/linux/topology.h         |  128 +++------------------
>   kernel/sched/core.c              |  227 +++++++++++++++++++-------------------
>   6 files changed, 156 insertions(+), 287 deletions(-)
>
> diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
> index a2496e4..20d12fa 100644
> --- a/arch/ia64/include/asm/topology.h
> +++ b/arch/ia64/include/asm/topology.h
> @@ -46,30 +46,6 @@
>
>   void build_cpu_to_node_map(void);
>
> -#define SD_CPU_INIT (struct sched_domain) {            \
> -       .parent                 = NULL,                 \
> -       .child                  = NULL,                 \
> -       .groups                 = NULL,                 \
> -       .min_interval           = 1,                    \
> -       .max_interval           = 4,                    \
> -       .busy_factor            = 64,                   \
> -       .imbalance_pct          = 125,                  \
> -       .cache_nice_tries       = 2,                    \
> -       .busy_idx               = 2,                    \
> -       .idle_idx               = 1,                    \
> -       .newidle_idx            = 0,                    \
> -       .wake_idx               = 0,                    \
> -       .forkexec_idx           = 0,                    \
> -       .flags                  = SD_LOAD_BALANCE       \
> -                               | SD_BALANCE_NEWIDLE    \
> -                               | SD_BALANCE_EXEC       \
> -                               | SD_BALANCE_FORK       \
> -                               | SD_WAKE_AFFINE,       \
> -       .last_balance           = jiffies,              \
> -       .balance_interval       = 1,                    \
> -       .nr_balance_failed      = 0,                    \
> -}
> -
>   #endif /* CONFIG_NUMA */
>
>   #ifdef CONFIG_SMP
> diff --git a/arch/s390/include/asm/topology.h b/arch/s390/include/asm/topology.h
> index 05425b1..07763bd 100644
> --- a/arch/s390/include/asm/topology.h
> +++ b/arch/s390/include/asm/topology.h
> @@ -64,8 +64,6 @@ static inline void s390_init_cpu_topology(void)
>   };
>   #endif
>
> -#define SD_BOOK_INIT   SD_CPU_INIT
> -
>   #include <asm-generic/topology.h>
>
>   #endif /* _ASM_S390_TOPOLOGY_H */
> diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
> index d15c0d8..9383118 100644
> --- a/arch/tile/include/asm/topology.h
> +++ b/arch/tile/include/asm/topology.h
> @@ -44,39 +44,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
>   /* For now, use numa node -1 for global allocation. */
>   #define pcibus_to_node(bus)            ((void)(bus), -1)
>
> -/*
> - * TILE architecture has many cores integrated in one processor, so we need
> - * setup bigger balance_interval for both CPU/NODE scheduling domains to
> - * reduce process scheduling costs.
> - */
> -
> -/* sched_domains SD_CPU_INIT for TILE architecture */
> -#define SD_CPU_INIT (struct sched_domain) {                            \
> -       .min_interval           = 4,                                    \
> -       .max_interval           = 128,                                  \
> -       .busy_factor            = 64,                                   \
> -       .imbalance_pct          = 125,                                  \
> -       .cache_nice_tries       = 1,                                    \
> -       .busy_idx               = 2,                                    \
> -       .idle_idx               = 1,                                    \
> -       .newidle_idx            = 0,                                    \
> -       .wake_idx               = 0,                                    \
> -       .forkexec_idx           = 0,                                    \
> -                                                                       \
> -       .flags                  = 1*SD_LOAD_BALANCE                     \
> -                               | 1*SD_BALANCE_NEWIDLE                  \
> -                               | 1*SD_BALANCE_EXEC                     \
> -                               | 1*SD_BALANCE_FORK                     \
> -                               | 0*SD_BALANCE_WAKE                     \
> -                               | 0*SD_WAKE_AFFINE                      \
> -                               | 0*SD_SHARE_CPUPOWER                   \
> -                               | 0*SD_SHARE_PKG_RESOURCES              \
> -                               | 0*SD_SERIALIZE                        \
> -                               ,                                       \
> -       .last_balance           = jiffies,                              \
> -       .balance_interval       = 32,                                   \
> -}
> -
>   /* By definition, we create nodes based on online memory. */
>   #define node_has_online_mem(nid) 1
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 825ed83..dbc35dd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -976,6 +976,35 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>
>   bool cpus_share_cache(int this_cpu, int that_cpu);
>
> +typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> +
> +#define SDTL_OVERLAP   0x01
> +
> +struct sd_data {
> +       struct sched_domain **__percpu sd;
> +       struct sched_group **__percpu sg;
> +       struct sched_group_power **__percpu sgp;
> +};
> +
> +struct sched_domain_topology_level {
> +       sched_domain_mask_f mask;
> +       int                 sd_flags;
> +       int                 flags;
> +       int                 numa_level;
> +       struct sd_data      data;
> +#ifdef CONFIG_SCHED_DEBUG
> +       char                *name;
> +#endif
> +};
> +
> +extern struct sched_domain_topology_level *sched_domain_topology;
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +# define SD_INIT_NAME(type)            .name = #type
> +#else
> +# define SD_INIT_NAME(type)
> +#endif
> +
>   #else /* CONFIG_SMP */
>
>   struct sched_domain_attr;
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..3a9db05 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -66,121 +66,6 @@ int arch_update_cpu_topology(void);
>   #define PENALTY_FOR_NODE_WITH_CPUS     (1)
>   #endif
>
> -/*
> - * Below are the 3 major initializers used in building sched_domains:
> - * SD_SIBLING_INIT, for SMT domains
> - * SD_CPU_INIT, for SMP domains
> - *
> - * Any architecture that cares to do any tuning to these values should do so
> - * by defining their own arch-specific initializer in include/asm/topology.h.
> - * A definition there will automagically override these default initializers
> - * and allow arch-specific performance tuning of sched_domains.
> - * (Only non-zero and non-null fields need be specified.)
> - */
> -
> -#ifdef CONFIG_SCHED_SMT
> -/* MCD - Do we really need this?  It is always on if CONFIG_SCHED_SMT is,
> - * so can't we drop this in favor of CONFIG_SCHED_SMT?
> - */
> -#define ARCH_HAS_SCHED_WAKE_IDLE
> -/* Common values for SMT siblings */
> -#ifndef SD_SIBLING_INIT
> -#define SD_SIBLING_INIT (struct sched_domain) {                                \
> -       .min_interval           = 1,                                    \
> -       .max_interval           = 2,                                    \
> -       .busy_factor            = 64,                                   \
> -       .imbalance_pct          = 110,                                  \
> -                                                                       \
> -       .flags                  = 1*SD_LOAD_BALANCE                     \
> -                               | 1*SD_BALANCE_NEWIDLE                  \
> -                               | 1*SD_BALANCE_EXEC                     \
> -                               | 1*SD_BALANCE_FORK                     \
> -                               | 0*SD_BALANCE_WAKE                     \
> -                               | 1*SD_WAKE_AFFINE                      \
> -                               | 1*SD_SHARE_CPUPOWER                   \
> -                               | 1*SD_SHARE_PKG_RESOURCES              \
> -                               | 0*SD_SERIALIZE                        \
> -                               | 0*SD_PREFER_SIBLING                   \
> -                               | arch_sd_sibling_asym_packing()        \
> -                               ,                                       \
> -       .last_balance           = jiffies,                              \
> -       .balance_interval       = 1,                                    \
> -       .smt_gain               = 1178, /* 15% */                       \
> -       .max_newidle_lb_cost    = 0,                                    \
> -       .next_decay_max_lb_cost = jiffies,                              \
> -}
> -#endif
> -#endif /* CONFIG_SCHED_SMT */
> -
> -#ifdef CONFIG_SCHED_MC
> -/* Common values for MC siblings. for now mostly derived from SD_CPU_INIT */
> -#ifndef SD_MC_INIT
> -#define SD_MC_INIT (struct sched_domain) {                             \
> -       .min_interval           = 1,                                    \
> -       .max_interval           = 4,                                    \
> -       .busy_factor            = 64,                                   \
> -       .imbalance_pct          = 125,                                  \
> -       .cache_nice_tries       = 1,                                    \
> -       .busy_idx               = 2,                                    \
> -       .wake_idx               = 0,                                    \
> -       .forkexec_idx           = 0,                                    \
> -                                                                       \
> -       .flags                  = 1*SD_LOAD_BALANCE                     \
> -                               | 1*SD_BALANCE_NEWIDLE                  \
> -                               | 1*SD_BALANCE_EXEC                     \
> -                               | 1*SD_BALANCE_FORK                     \
> -                               | 0*SD_BALANCE_WAKE                     \
> -                               | 1*SD_WAKE_AFFINE                      \
> -                               | 0*SD_SHARE_CPUPOWER                   \
> -                               | 1*SD_SHARE_PKG_RESOURCES              \
> -                               | 0*SD_SERIALIZE                        \
> -                               ,                                       \
> -       .last_balance           = jiffies,                              \
> -       .balance_interval       = 1,                                    \
> -       .max_newidle_lb_cost    = 0,                                    \
> -       .next_decay_max_lb_cost = jiffies,                              \
> -}
> -#endif
> -#endif /* CONFIG_SCHED_MC */
> -
> -/* Common values for CPUs */
> -#ifndef SD_CPU_INIT
> -#define SD_CPU_INIT (struct sched_domain) {                            \
> -       .min_interval           = 1,                                    \
> -       .max_interval           = 4,                                    \
> -       .busy_factor            = 64,                                   \
> -       .imbalance_pct          = 125,                                  \
> -       .cache_nice_tries       = 1,                                    \
> -       .busy_idx               = 2,                                    \
> -       .idle_idx               = 1,                                    \
> -       .newidle_idx            = 0,                                    \
> -       .wake_idx               = 0,                                    \
> -       .forkexec_idx           = 0,                                    \
> -                                                                       \
> -       .flags                  = 1*SD_LOAD_BALANCE                     \
> -                               | 1*SD_BALANCE_NEWIDLE                  \
> -                               | 1*SD_BALANCE_EXEC                     \
> -                               | 1*SD_BALANCE_FORK                     \
> -                               | 0*SD_BALANCE_WAKE                     \
> -                               | 1*SD_WAKE_AFFINE                      \
> -                               | 0*SD_SHARE_CPUPOWER                   \
> -                               | 0*SD_SHARE_PKG_RESOURCES              \
> -                               | 0*SD_SERIALIZE                        \
> -                               | 1*SD_PREFER_SIBLING                   \
> -                               ,                                       \
> -       .last_balance           = jiffies,                              \
> -       .balance_interval       = 1,                                    \
> -       .max_newidle_lb_cost    = 0,                                    \
> -       .next_decay_max_lb_cost = jiffies,                              \
> -}
> -#endif
> -
> -#ifdef CONFIG_SCHED_BOOK
> -#ifndef SD_BOOK_INIT
> -#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
> -#endif
> -#endif /* CONFIG_SCHED_BOOK */
> -
>   #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
>   DECLARE_PER_CPU(int, numa_node);
>
> @@ -295,4 +180,17 @@ static inline int cpu_to_mem(int cpu)
>   #define topology_core_cpumask(cpu)             cpumask_of(cpu)
>   #endif
>
> +#ifdef CONFIG_SCHED_SMT
> +static inline const struct cpumask *cpu_smt_mask(int cpu)
> +{
> +       return topology_thread_cpumask(cpu);
> +}
> +#endif
> +
> +static inline const struct cpumask *cpu_cpu_mask(int cpu)
> +{
> +       return cpumask_of_node(cpu_to_node(cpu));
> +}
> +
> +
>   #endif /* _LINUX_TOPOLOGY_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee8004c..6f6a8f0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5588,17 +5588,6 @@ static int __init isolated_cpu_setup(char *str)
>
>   __setup("isolcpus=", isolated_cpu_setup);
>
> -static const struct cpumask *cpu_cpu_mask(int cpu)
> -{
> -       return cpumask_of_node(cpu_to_node(cpu));
> -}
> -
> -struct sd_data {
> -       struct sched_domain **__percpu sd;
> -       struct sched_group **__percpu sg;
> -       struct sched_group_power **__percpu sgp;
> -};
> -
>   struct s_data {
>          struct sched_domain ** __percpu sd;
>          struct root_domain      *rd;
> @@ -5611,21 +5600,6 @@ enum s_alloc {
>          sa_none,
>   };
>
> -struct sched_domain_topology_level;
> -
> -typedef struct sched_domain *(*sched_domain_init_f)(struct sched_domain_topology_level *tl, int cpu);
> -typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> -
> -#define SDTL_OVERLAP   0x01
> -
> -struct sched_domain_topology_level {
> -       sched_domain_init_f init;
> -       sched_domain_mask_f mask;
> -       int                 flags;
> -       int                 numa_level;
> -       struct sd_data      data;
> -};
> -
>   /*
>    * Build an iteration mask that can exclude certain CPUs from the upwards
>    * domain traversal.
> @@ -5854,34 +5828,6 @@ int __weak arch_sd_sibling_asym_packing(void)
>    * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
>    */
>
> -#ifdef CONFIG_SCHED_DEBUG
> -# define SD_INIT_NAME(sd, type)                sd->name = #type
> -#else
> -# define SD_INIT_NAME(sd, type)                do { } while (0)
> -#endif
> -
> -#define SD_INIT_FUNC(type)                                             \
> -static noinline struct sched_domain *                                  \
> -sd_init_##type(struct sched_domain_topology_level *tl, int cpu)        \
> -{                                                                      \
> -       struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);       \
> -       *sd = SD_##type##_INIT;                                         \
> -       SD_INIT_NAME(sd, type);                                         \
> -       sd->private = &tl->data;                                        \
> -       return sd;                                                      \
> -}
> -
> -SD_INIT_FUNC(CPU)
> -#ifdef CONFIG_SCHED_SMT
> - SD_INIT_FUNC(SIBLING)
> -#endif
> -#ifdef CONFIG_SCHED_MC
> - SD_INIT_FUNC(MC)
> -#endif
> -#ifdef CONFIG_SCHED_BOOK
> - SD_INIT_FUNC(BOOK)
> -#endif
> -
>   static int default_relax_domain_level = -1;
>   int sched_domain_level_max;
>
> @@ -5969,97 +5915,148 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
>                  *per_cpu_ptr(sdd->sgp, cpu) = NULL;
>   }
>
> -#ifdef CONFIG_SCHED_SMT
> -static const struct cpumask *cpu_smt_mask(int cpu)
> -{
> -       return topology_thread_cpumask(cpu);
> -}
> -#endif
> -
> -/*
> - * Topology list, bottom-up.
> - */
> -static struct sched_domain_topology_level default_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> -       { sd_init_SIBLING, cpu_smt_mask, },
> -#endif
> -#ifdef CONFIG_SCHED_MC
> -       { sd_init_MC, cpu_coregroup_mask, },
> -#endif
> -#ifdef CONFIG_SCHED_BOOK
> -       { sd_init_BOOK, cpu_book_mask, },
> -#endif
> -       { sd_init_CPU, cpu_cpu_mask, },
> -       { NULL, },
> -};
> -
> -static struct sched_domain_topology_level *sched_domain_topology = default_topology;
> -
> -#define for_each_sd_topology(tl)                       \
> -       for (tl = sched_domain_topology; tl->init; tl++)

Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?

> +static int sched_domains_curr_level;
>
>   #ifdef CONFIG_NUMA
> -
>   static int sched_domains_numa_levels;
>   static int *sched_domains_numa_distance;
>   static struct cpumask ***sched_domains_numa_masks;
> -static int sched_domains_curr_level;
> -
> -static inline int sd_local_flags(int level)
> -{
> -       if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
> -               return 0;
> +#endif
>
> -       return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
> -}
> +/*
> + * SD_flags allowed in topology descriptions.
> + *
> + * SD_SHARE_CPUPOWER      - describes SMT topologies
> + * SD_SHARE_PKG_RESOURCES - describes shared caches
> + * SD_NUMA                - describes NUMA topologies
> + *
> + * Odd one out:
> + * SD_ASYM_PACKING        - describes SMT quirks
> + */
> +#define TOPOLOGY_SD_FLAGS              \
> +       (SD_SHARE_CPUPOWER |            \
> +        SD_SHARE_PKG_RESOURCES |       \
> +        SD_NUMA |                      \
> +        SD_ASYM_PACKING)
>
>   static struct sched_domain *
> -sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
> +sd_init(struct sched_domain_topology_level *tl, int cpu)
>   {
>          struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
> -       int level = tl->numa_level;
> -       int sd_weight = cpumask_weight(
> -                       sched_domains_numa_masks[level][cpu_to_node(cpu)]);
> +       int sd_weight;
> +

The next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef 
CONFIG_NUMA later in sd_init() though.
> +       /*
> +        * Ugly hack to pass state to sd_numa_mask()...
> +        */
> +       sched_domains_curr_level = tl->numa_level;
> +
> +       sd_weight = cpumask_weight(tl->mask(cpu));
> +
> +       if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
> +                       "wrong sd_flags in topology description\n"))
> +               tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>
>          *sd = (struct sched_domain){
>                  .min_interval           = sd_weight,
>                  .max_interval           = 2*sd_weight,
>                  .busy_factor            = 32,
>                  .imbalance_pct          = 125,
> -               .cache_nice_tries       = 2,
> -               .busy_idx               = 3,
> -               .idle_idx               = 2,
> +
> +               .cache_nice_tries       = 0,
> +               .busy_idx               = 0,
> +               .idle_idx               = 0,
>                  .newidle_idx            = 0,
>                  .wake_idx               = 0,
>                  .forkexec_idx           = 0,

Why do we want to explicitly set those indexes to 0 here? IMHO, the memory 
for *sd is zeroed out anyway by the compound-literal assignment. The same is 
true for all data members which are set to 0 later in this function, 
including the | 0*SD_FOO terms; dropping them would IMHO make the code 
more readable.

>
>                  .flags                  = 1*SD_LOAD_BALANCE
>                                          | 1*SD_BALANCE_NEWIDLE
> -                                       | 0*SD_BALANCE_EXEC
> -                                       | 0*SD_BALANCE_FORK
> +                                       | 1*SD_BALANCE_EXEC
> +                                       | 1*SD_BALANCE_FORK
>                                          | 0*SD_BALANCE_WAKE
> -                                       | 0*SD_WAKE_AFFINE
> +                                       | 1*SD_WAKE_AFFINE
>                                          | 0*SD_SHARE_CPUPOWER
>                                          | 0*SD_SHARE_PKG_RESOURCES
> -                                       | 1*SD_SERIALIZE
> +                                       | 0*SD_SERIALIZE
>                                          | 0*SD_PREFER_SIBLING
> -                                       | 1*SD_NUMA
> -                                       | sd_local_flags(level)
> +                                       | 0*SD_NUMA
> +                                       | tl->sd_flags
>                                          ,
> +
>                  .last_balance           = jiffies,
>                  .balance_interval       = sd_weight,
> +               .smt_gain               = 0,
> +               .max_newidle_lb_cost    = 0,
> +               .next_decay_max_lb_cost = jiffies,
> +#ifdef CONFIG_SCHED_DEBUG
> +               .name                   = tl->name,
> +#endif
>          };
> -       SD_INIT_NAME(sd, NUMA);
> -       sd->private = &tl->data;
>
>          /*
> -        * Ugly hack to pass state to sd_numa_mask()...
> +        * Convert topological properties into behaviour.
>           */
> -       sched_domains_curr_level = tl->numa_level;
> +
> +       if (sd->flags & SD_SHARE_CPUPOWER) {
> +               sd->imbalance_pct = 110;
> +               sd->smt_gain = 1178; /* ~15% */
> +               sd->flags |= arch_sd_sibling_asym_packing();
> +
> +       } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> +               sd->imbalance_pct = 117;
> +               sd->cache_nice_tries = 1;
> +               sd->busy_idx = 2;
> +
> +#ifdef CONFIG_NUMA
> +       } else if (sd->flags & SD_NUMA) {
> +               sd->cache_nice_tries = 2;
> +               sd->busy_idx = 3;
> +               sd->idle_idx = 2;
> +
> +               sd->flags |= SD_SERIALIZE;
> +               if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
> +                       sd->flags &= ~(SD_BALANCE_EXEC |
> +                                      SD_BALANCE_FORK |
> +                                      SD_WAKE_AFFINE);
> +               }
> +
> +#endif
> +       } else {
> +               sd->flags |= SD_PREFER_SIBLING;
> +               sd->cache_nice_tries = 1;
> +               sd->busy_idx = 2;
> +               sd->idle_idx = 1;
> +       }
> +
> +       sd->private = &tl->data;
>
>          return sd;
>   }
>
> +/*
> + * Topology list, bottom-up.
> + */
> +static struct sched_domain_topology_level default_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
> +#endif
> +#ifdef CONFIG_SCHED_MC
> +       { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
> +#endif
> +#ifdef CONFIG_SCHED_BOOK
> +       { cpu_book_mask, SD_INIT_NAME(BOOK) },
> +#endif

I never got why CPU was renamed to DIE. It might confuse people when they use 
/proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().

> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +       { NULL, },
> +};
> +
> +struct sched_domain_topology_level *sched_domain_topology = default_topology;
> +
> +#define for_each_sd_topology(tl)                       \
> +       for (tl = sched_domain_topology; tl->mask; tl++)
> +
> +#ifdef CONFIG_NUMA
> +
>   static const struct cpumask *sd_numa_mask(int cpu)
>   {
>          return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
> @@ -6203,7 +6200,10 @@ static void sched_init_numa(void)
>                  }
>          }
>
> -       tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
> +       /* Compute default topology size */
> +       for (i = 0; sched_domain_topology[i].mask; i++);
> +
> +       tl = kzalloc((i + level) *
>                          sizeof(struct sched_domain_topology_level), GFP_KERNEL);
>          if (!tl)
>                  return;
> @@ -6211,18 +6211,19 @@ static void sched_init_numa(void)
>          /*
>           * Copy the default topology bits..
>           */
> -       for (i = 0; default_topology[i].init; i++)
> -               tl[i] = default_topology[i];
> +       for (i = 0; sched_domain_topology[i].mask; i++)
> +               tl[i] = sched_domain_topology[i];
>
>          /*
>           * .. and append 'j' levels of NUMA goodness.
>           */
>          for (j = 0; j < level; i++, j++) {
>                  tl[i] = (struct sched_domain_topology_level){
> -                       .init = sd_numa_init,
>                          .mask = sd_numa_mask,
> +                       .sd_flags = SD_NUMA,
>                          .flags = SDTL_OVERLAP,
>                          .numa_level = j,
> +                       SD_INIT_NAME(NUMA)
>                  };
>          }
>
> @@ -6380,7 +6381,7 @@ struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
>                  const struct cpumask *cpu_map, struct sched_domain_attr *attr,
>                  struct sched_domain *child, int cpu)
>   {
> -       struct sched_domain *sd = tl->init(tl, cpu);
> +       struct sched_domain *sd = sd_init(tl, cpu);
>          if (!sd)
>                  return child;
>
> --
> 1.7.9.5
>
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 6/6] sched: ARM: create a dedicated scheduler topology table
  2014-03-05  7:18 ` [RFC 6/6] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
@ 2014-03-05 22:38   ` Dietmar Eggemann
  2014-03-06  8:42     ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-05 22:38 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, tony.luck,
	fenghua.yu, schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel
  Cc: preeti, linaro-kernel

On 05/03/14 07:18, Vincent Guittot wrote:
> Create a dedicated topology table for ARM which will create a new level to
> differentiate CPUs that can or cannot powergate independently from others.
> 
> The patch gives an example of how to add a domain that will take advantage of
> SD_SHARE_POWERDOMAIN.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>   arch/arm/kernel/topology.c |   26 ++++++++++++++++++++++++++
>   1 file changed, 26 insertions(+)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 0bc94b1..ae8ffbc 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -185,6 +185,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>   	return &cpu_topology[cpu].core_sibling;
>   }
>   
> +/*
> + * The current assumption is that we can power gate each core independently.
> + * This will be superseded by DT binding once available.
> + */
> +const struct cpumask *cpu_corepower_mask(int cpu)
> +{
> +	return &cpu_topology[cpu].thread_sibling;
> +}

Although you already explained this to me in a private conversation,
it's worth noting that when running this set-up on a dual cluster TC2
(2 Cortex-A15 + 3 Cortex-A7, no independent core power gating) we
don't see the SD_SHARE_POWERDOMAIN topology flag set in the sd's at MC
level, because this patch-set doesn't contain the appropriate DT
parsing. Like you said back then, the comment above mentions this.

> +
>   static void update_siblings_masks(unsigned int cpuid)
>   {
>   	struct cputopo_arm *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -266,6 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
>   		cpu_topology[cpuid].socket_id, mpidr);
>   }
>   
> +static struct sched_domain_topology_level arm_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> +	{ cpu_corepower_mask, SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
> +	{ cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
> +#endif
> +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +	{ NULL, },
> +};
> +
> +static void __init set_sched_topology(void)
> +{
> +	sched_domain_topology = arm_topology;
> +}
> +
>   /*
>    * init_cpu_topology is called at boot when only one cpu is running
>    * which prevent simultaneous write access to cpu_topology array
> @@ -289,4 +312,7 @@ void __init init_cpu_topology(void)
>   	smp_wmb();
>   
>   	parse_dt_topology();
> +
> +	/* Set scheduler topology descriptor */
> +	set_sched_topology();
>   }
> 

How about the core scheduler providing an interface set_sched_topology()
instead of each arch having its own __init function? Sketched out below
for ARM.

-- >8 --
Subject: [PATCH] sched: set_sched_topology() as an interface of core scheduler

---
 arch/arm/kernel/topology.c | 7 +------
 include/linux/sched.h      | 2 +-
 kernel/sched/core.c        | 5 +++++
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ae8ffbc..89d5592 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -284,11 +284,6 @@ static struct sched_domain_topology_level arm_topology[] = {
 	{ NULL, },
 };
 
-static void __init set_sched_topology(void)
-{
-	sched_domain_topology = arm_topology;
-}
-
 /*
  * init_cpu_topology is called at boot when only one cpu is running
  * which prevent simultaneous write access to cpu_topology array
@@ -314,5 +309,5 @@ void __init init_cpu_topology(void)
 	parse_dt_topology();
 
 	/* Set scheduler topology descriptor */
-	set_sched_topology();
+	set_sched_topology(arm_topology);
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8831413..fefd4e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -998,7 +998,7 @@ struct sched_domain_topology_level {
 #endif
 };
 
-extern struct sched_domain_topology_level *sched_domain_topology;
+extern void set_sched_topology(struct sched_domain_topology_level *tl);
 
 #ifdef CONFIG_SCHED_DEBUG
 # define SD_INIT_NAME(type)            .name = #type
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b4fb0df..a748c92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6105,6 +6105,11 @@ static struct sched_domain_topology_level default_topology[] = {
 
 struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+void set_sched_topology(struct sched_domain_topology_level *tl)
+{
+	sched_domain_topology = tl;
+}
+
 #define for_each_sd_topology(tl)                       \
        for (tl = sched_domain_topology; tl->mask; tl++)
 
-- 
1.8.3.2






^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
                   ` (5 preceding siblings ...)
  2014-03-05  7:18 ` [RFC 6/6] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
@ 2014-03-05 23:17 ` Dietmar Eggemann
  2014-03-06  9:04   ` Vincent Guittot
  6 siblings, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-05 23:17 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, tony.luck,
	fenghua.yu, schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel
  Cc: preeti, linaro-kernel

On 05/03/14 07:18, Vincent Guittot wrote:
> This patchset was previously part of the larger tasks packing patchset [1].
> I have split the latter into 3 different patchsets (at least) to make the
> whole thing easier.
> -configuration of sched_domain topology (this patchset)
> -update and consolidation of cpu_power
> -tasks packing algorithm
>
> Based on Peter Z's proposal [2][3], this patchset modifies the way to configure
> the sched_domain levels in order to let architectures add specific levels like
> the current BOOK level or the proposed power gating level for the ARM
> architecture.
>
> [1] https://lkml.org/lkml/2013/10/18/121
> [2] https://lkml.org/lkml/2013/11/5/239
> [3] https://lkml.org/lkml/2013/11/5/449
>
> Vincent Guittot (6):
>    sched: remove unused SCHED_INIT_NODE
>    sched: rework of sched_domain topology definition
>    sched: s390: create a dedicated topology table
>    sched: powerpc: create a dedicated topology table
>    sched: add a new SD_SHARE_POWERDOMAIN for sched_domain
>    sched: ARM: create a dedicated scheduler topology table
>
>   arch/arm/kernel/topology.c        |   26 ++++
>   arch/ia64/include/asm/topology.h  |   24 ----
>   arch/metag/include/asm/topology.h |   27 -----
>   arch/powerpc/kernel/smp.c         |   35 ++++--
>   arch/s390/include/asm/topology.h  |   13 +-
>   arch/s390/kernel/topology.c       |   25 ++++
>   arch/tile/include/asm/topology.h  |   33 ------
>   include/linux/sched.h             |   30 +++++
>   include/linux/topology.h          |  128 ++------------------
>   kernel/sched/core.c               |  235 ++++++++++++++++++-------------------
>   10 files changed, 237 insertions(+), 339 deletions(-)
>

Hi Vincent,

I reviewed your patch-set carefully (including test runs on TC2), 
especially due to the fact that we want to build our sd_energy stuff on 
top of it.


One thing I'm still not convinced of is the fact that specifying 
additional sd levels in the struct sched_domain_topology_level table has 
an advantage over a function pointer for sd topology flags similar to 
the one we're already using for the cpu mask in struct 
sched_domain_topology_level.

   int (*sched_domain_flags_f)(int cpu);

This function pointer would be simply another member of struct 
sched_domain_topology_level and would replace int sd_flags.  AFAICS, you 
have to create additional cpu mask functions anyway for the additional 
sd levels, like cpu_corepower_mask() for the GMC level in the ARM case.
There could be a set of standard sd topology flags functions for the
default sd layer and archs which want to pass in something special 
define those function locally since they will use them only in their 
arch specific struct sched_domain_topology_level table anyway.  I know 
that you use the existing sd degenerate functionality for this and that 
the freeing of the redundant data structures (sched_domain, sched_group 
and sched_group_power) is there too but it still doesn't seem to me to 
be the right thing to do.

The problem that we now expose internal data structures (struct sd_data 
and struct sched_domain_topology_level) could be dealt with later.
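For reference, a minimal sketch of the alternative described above, with the
static sd_flags int replaced by a per-cpu flags function. The flag values and
the trimmed-down struct are illustrative stand-ins, not the kernel's actual
definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's sd topology flags; the real
 * definitions live in include/linux/sched.h. */
#define SD_SHARE_PKG_RESOURCES 0x0200
#define SD_SHARE_POWERDOMAIN   0x0400

struct cpumask;

/* Trimmed-down topology level: the topology flags become a function of
 * the cpu instead of a static int sd_flags. */
struct sched_domain_topology_level {
	const struct cpumask *(*mask)(int cpu);
	int (*sched_domain_flags_f)(int cpu);
	/* name, numa_level, sd_data ... elided */
};

/* A "standard" flags function the default MC level could use; an arch
 * that needs per-cpu variation would define its own function locally. */
static int core_flags(int cpu)
{
	(void)cpu;
	return SD_SHARE_PKG_RESOURCES;
}
```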

-- Dietmar


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-05 17:09   ` Dietmar Eggemann
@ 2014-03-06  8:32     ` Vincent Guittot
  2014-03-11 10:31       ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-06  8:32 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 6 March 2014 01:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/03/14 07:18, Vincent Guittot wrote:
>>
>> We replace the old way to configure the scheduler topology with a new
>> method
>> which enables a platform to declare additional levels (if needed).
>>
>> We still have a default topology table definition that can be used by
>> platforms
>> that don't want more levels than the SMT, MC, CPU and NUMA ones. This table
>> can
>> be overwritten by an arch which wants to add a new level where a load
>> balance
>> makes sense, like a BOOK or powergating level.
>>
>> For each level, we need a function pointer that returns cpumask for each
>> cpu,
>> the flags configuration and a name. Each level must be a subset of
>
>
> Maybe it's worth mentioning here that those flags are from the set of the sd
> topology flags to distinguish them from the set of sd behavioural flags.
> The latter can't be set via this interface.

Yes, I will add the list of flags that can be set via the table.

>
>
>> the next one. The build sequence of the sched_domain will take care of
>> removing useless levels like those with 1 CPU and those with the same CPU
>> span
>> and no more relevant information for load balancing than their child.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>   arch/ia64/include/asm/topology.h |   24 ----
>>   arch/s390/include/asm/topology.h |    2 -
>>   arch/tile/include/asm/topology.h |   33 ------
>>   include/linux/sched.h            |   29 +++++
>>   include/linux/topology.h         |  128 +++------------------
>>   kernel/sched/core.c              |  227
>> +++++++++++++++++++-------------------
>>   6 files changed, 156 insertions(+), 287 deletions(-)
>>

[snip]

>> -
>> -#define for_each_sd_topology(tl)                       \
>> -       for (tl = sched_domain_topology; tl->init; tl++)
>
>
> Why is sched_domains_curr_level now outside #ifdef CONFIG_NUMA?

It should not be, nor should its use in sd_init.

>
>
>> +static int sched_domains_curr_level;
>>
>>   #ifdef CONFIG_NUMA
>> -
>>   static int sched_domains_numa_levels;
>>   static int *sched_domains_numa_distance;
>>   static struct cpumask ***sched_domains_numa_masks;
>> -static int sched_domains_curr_level;
>> -
>> -static inline int sd_local_flags(int level)
>> -{
>> -       if (sched_domains_numa_distance[level] > RECLAIM_DISTANCE)
>> -               return 0;
>> +#endif
>>
>> -       return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
>> -}
>> +/*
>> + * SD_flags allowed in topology descriptions.
>> + *
>> + * SD_SHARE_CPUPOWER      - describes SMT topologies
>> + * SD_SHARE_PKG_RESOURCES - describes shared caches
>> + * SD_NUMA                - describes NUMA topologies
>> + *
>> + * Odd one out:
>> + * SD_ASYM_PACKING        - describes SMT quirks
>> + */
>> +#define TOPOLOGY_SD_FLAGS              \
>> +       (SD_SHARE_CPUPOWER |            \
>> +        SD_SHARE_PKG_RESOURCES |       \
>> +        SD_NUMA |                      \
>> +        SD_ASYM_PACKING)
>>
>>   static struct sched_domain *
>> -sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
>> +sd_init(struct sched_domain_topology_level *tl, int cpu)
>>   {
>>          struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
>> -       int level = tl->numa_level;
>> -       int sd_weight = cpumask_weight(
>> -
>> sched_domains_numa_masks[level][cpu_to_node(cpu)]);
>> +       int sd_weight;
>> +
>
>
> Next line could be guarded by #ifdef CONFIG_NUMA. We still use #ifdef
> CONFIG_NUMA later in sd_init() though.
>
>> +       /*
>> +        * Ugly hack to pass state to sd_numa_mask()...
>> +        */
>> +       sched_domains_curr_level = tl->numa_level;
>> +
>> +       sd_weight = cpumask_weight(tl->mask(cpu));
>> +
>> +       if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>> +                       "wrong sd_flags in topology description\n"))
>> +               tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>
>>          *sd = (struct sched_domain){
>>                  .min_interval           = sd_weight,
>>                  .max_interval           = 2*sd_weight,
>>                  .busy_factor            = 32,
>>                  .imbalance_pct          = 125,
>> -               .cache_nice_tries       = 2,
>> -               .busy_idx               = 3,
>> -               .idle_idx               = 2,
>> +
>> +               .cache_nice_tries       = 0,
>> +               .busy_idx               = 0,
>> +               .idle_idx               = 0,
>>                  .newidle_idx            = 0,
>>                  .wake_idx               = 0,
>>                  .forkexec_idx           = 0,
>
>
> Why do we want to explicitly set those indexes to 0 here? IMHO, the memory for
> *sd is zeroed out beforehand. This is true for all data members which are set to
> 0 later in this function, including the | 0*SD_FOO. IMHO, dropping them would
> make the code more readable.

I would say that it makes the configuration more readable and
modifiable, because you have the full list of possible flags to set
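
Vincent's point can be seen in miniature below; the flag names and values are
illustrative stand-ins, not the kernel's actual ones:

```c
#include <assert.h>

/* Illustrative flag values, not the kernel's actual ones. */
#define SD_LOAD_BALANCE   0x0001
#define SD_BALANCE_EXEC   0x0004
#define SD_SERIALIZE      0x0800

/* Multiplying each flag by 1 or 0 keeps the full menu of settable flags
 * visible at the initialisation site: enabling or disabling one is a
 * single-character change, and a reader sees at a glance which flags
 * are deliberately off. */
static const int mc_flags = 1*SD_LOAD_BALANCE
			  | 1*SD_BALANCE_EXEC
			  | 0*SD_SERIALIZE;
```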

>
>
>>
>>                  .flags                  = 1*SD_LOAD_BALANCE
>>                                          | 1*SD_BALANCE_NEWIDLE
>> -                                       | 0*SD_BALANCE_EXEC
>> -                                       | 0*SD_BALANCE_FORK
>> +                                       | 1*SD_BALANCE_EXEC
>> +                                       | 1*SD_BALANCE_FORK
>>                                          | 0*SD_BALANCE_WAKE
>> -                                       | 0*SD_WAKE_AFFINE
>> +                                       | 1*SD_WAKE_AFFINE
>>                                          | 0*SD_SHARE_CPUPOWER
>>                                          | 0*SD_SHARE_PKG_RESOURCES
>> -                                       | 1*SD_SERIALIZE
>> +                                       | 0*SD_SERIALIZE
>>                                          | 0*SD_PREFER_SIBLING
>> -                                       | 1*SD_NUMA
>> -                                       | sd_local_flags(level)
>> +                                       | 0*SD_NUMA
>> +                                       | tl->sd_flags
>>                                          ,
>> +
>>                  .last_balance           = jiffies,
>>                  .balance_interval       = sd_weight,
>> +               .smt_gain               = 0,
>> +               .max_newidle_lb_cost    = 0,
>> +               .next_decay_max_lb_cost = jiffies,
>> +#ifdef CONFIG_SCHED_DEBUG
>> +               .name                   = tl->name,
>> +#endif
>>          };
>> -       SD_INIT_NAME(sd, NUMA);
>> -       sd->private = &tl->data;
>>

[snip]

>>
>> +/*
>> + * Topology list, bottom-up.
>> + */
>> +static struct sched_domain_topology_level default_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) },
>> +#endif
>> +#ifdef CONFIG_SCHED_MC
>> +       { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },
>> +#endif
>> +#ifdef CONFIG_SCHED_BOOK
>> +       { cpu_book_mask, SD_INIT_NAME(BOOK) },
>> +#endif
>
>
> Never got the new name DIE for CPU? Might confuse people when they use
> /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().

In fact, CPU is also confusing because it's used for different things.
But if it makes things even more confusing, I can go back to CPU.

>
>
>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +       { NULL, },
>> +};
>> +
>> +struct sched_domain_topology_level *sched_domain_topology =
>> default_topology;
>> +

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 6/6] sched: ARM: create a dedicated scheduler topology table
  2014-03-05 22:38   ` Dietmar Eggemann
@ 2014-03-06  8:42     ` Vincent Guittot
  0 siblings, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-06  8:42 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 6 March 2014 06:38, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/03/14 07:18, Vincent Guittot wrote:
>> Create a dedicated topology table for ARM which will create a new level to
>> differentiate CPUs that can or cannot powergate independently from others.
>>
>> The patch gives an example of how to add domain that will take advantage of
>> SD_SHARE_POWERDOMAIN.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

[snip]

>>
>>       parse_dt_topology();
>> +
>> +     /* Set scheduler topology descriptor */
>> +     set_sched_topology();
>>   }
>>
>
> How about the core scheduler provides an interface set_sched_topology()
> instead of each arch has its own __init function? Sketched out below
> for ARM.

Yes, I will add it.

[snip]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-05 23:17 ` [RFC 0/6] rework sched_domain topology description Dietmar Eggemann
@ 2014-03-06  9:04   ` Vincent Guittot
  2014-03-06 12:31     ` Dietmar Eggemann
  0 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-06  9:04 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 6 March 2014 07:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 05/03/14 07:18, Vincent Guittot wrote:
[snip]
>
> Hi Vincent,
>
> I reviewed your patch-set carefully (including test runs on TC2), especially
> due to the fact that we want to build our sd_energy stuff on top of it.

Thanks

>
>
> One thing I'm still not convinced of is the fact that specifying additional
> sd levels in the struct sched_domain_topology_level table has an advantage
> over a function pointer for sd topology flags similar to the one we're
> already using for the cpu mask in struct sched_domain_topology_level.
>
>   int (*sched_domain_flags_f)(int cpu);
>

We have to create an additional level for some kinds of topology, as
described in my trial https://lkml.org/lkml/2013/12/18/279, which is
not possible with a function pointer.

Have you got a situation in mind where it will be necessary to use the
function pointer with the cpu number as an argument?

In the current example of this patchset, the flags are statically set
in the table, but nothing prevents an architecture from updating the
flags before the table is given to the scheduler.
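
A hedged sketch of what that could look like; the struct is trimmed down, the
flag values are illustrative, and the helper name arch_fixup_topology() and its
cores_can_powergate parameter are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values, not the kernel's actual ones. */
#define SD_SHARE_PKG_RESOURCES 0x0200
#define SD_SHARE_POWERDOMAIN   0x0400

struct cpumask;

/* Trimmed-down level from this patchset: flags are a static int. */
struct sched_domain_topology_level {
	const struct cpumask *(*mask)(int cpu);
	int sd_flags;
};

static struct sched_domain_topology_level arch_topology[] = {
	{ NULL /* cpu_coregroup_mask */, SD_SHARE_PKG_RESOURCES }, /* MC  */
	{ NULL /* cpu_cpu_mask       */, 0 },                      /* DIE */
	{ NULL, 0 },
};

/* Hypothetical arch init step: patch the static flags from DT/firmware
 * knowledge before handing the table to set_sched_topology(). */
static void arch_fixup_topology(int cores_can_powergate)
{
	if (!cores_can_powergate)
		arch_topology[0].sd_flags |= SD_SHARE_POWERDOMAIN;
}
```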

> This function pointer would be simply another member of struct
> sched_domain_topology_level and would replace int sd_flags.  AFAICS, you
> have to create additional cpu mask functions anyway for the additional sd
> levels, like cpu_corepower_mask() for the  GMC level in the ARM case.  There
> could be a set of standard sd topology flags function for the default sd
> layer and archs which want to pass in something special define those
> function locally since they will use them only in their arch specific struct
> sched_domain_topology_level table anyway.  I know that you use the existing
> sd degenerate functionality for this and that the freeing of the redundant
> data structures (sched_domain, sched_group and sched_group_power) is there
> too but it still doesn't seem to me to be the right thing to do.
>
> The problem that we now expose internal data structures (struct sd_data and
> struct sched_domain_topology_level) could be dealt with later.
>
> -- Dietmar
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-06  9:04   ` Vincent Guittot
@ 2014-03-06 12:31     ` Dietmar Eggemann
  2014-03-07  2:47       ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-06 12:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 06/03/14 09:04, Vincent Guittot wrote:
> On 6 March 2014 07:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 05/03/14 07:18, Vincent Guittot wrote:
[snip]
>>
>> Hi Vincent,
>>
>> I reviewed your patch-set carefully (including test runs on TC2), especially
>> due to the fact that we want to build our sd_energy stuff on top of it.
>
> Thanks
>
>>
>>
>> One thing I'm still not convinced of is the fact that specifying additional
>> sd levels in the struct sched_domain_topology_level table has an advantage
>> over a function pointer for sd topology flags similar to the one we're
>> already using for the cpu mask in struct sched_domain_topology_level.
>>
>>    int (*sched_domain_flags_f)(int cpu);
>>
>
> We have to create additional level for some kind of topology as
> described in my trial https://lkml.org/lkml/2013/12/18/279 which is
> not possible with function pointer.

This is what I don't understand at the moment. In your example in the
link above (2 clusters of 4 cores with SMT, where cpus 0-7 can
powergate while cpus 8-15 can't), why can't we have

static inline int cpu_coregroup_flags(int cpu)
{
     int flags = SD_SHARE_PKG_RESOURCES;

     if (cpu >= 8)
         flags |= SD_SHARE_POWERDOMAIN;

     return flags;
}

inside the arch specific topology file and use it in the MC level as the 
int (*sched_domain_flags_f)(int cpu) member of struct 
sched_domain_topology_level?
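
Spelled out, the suggestion above would wire that function straight into the
MC entry. This is a sketch under the assumption that the flags member becomes
a function pointer; the flag values are illustrative and the mask pointers are
stubbed to NULL here:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values, not the kernel's actual ones. */
#define SD_SHARE_PKG_RESOURCES 0x0200
#define SD_SHARE_POWERDOMAIN   0x0400

struct cpumask;

/* Topology level with the proposed function-pointer member in place of
 * the static sd_flags int. */
struct sched_domain_topology_level {
	const struct cpumask *(*mask)(int cpu);
	int (*sched_domain_flags_f)(int cpu);
};

/* Arch-local flags function: in the thread's example, cpus >= 8 sit in
 * the cluster that cannot powergate its cores independently. */
static int cpu_coregroup_flags(int cpu)
{
	int flags = SD_SHARE_PKG_RESOURCES;

	if (cpu >= 8)
		flags |= SD_SHARE_POWERDOMAIN;

	return flags;
}

/* A single MC level then covers both clusters; no extra GMC level. */
static struct sched_domain_topology_level arch_topology[] = {
	{ NULL /* cpu_coregroup_mask */, cpu_coregroup_flags }, /* MC  */
	{ NULL /* cpu_cpu_mask       */, NULL },                /* DIE */
	{ NULL, NULL },
};
```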

>
> Have you got a situation in mind where it will be necessary to use the
> function pointer with cpu number as an argument ?

The one above. Fundamentally, all situations where you want to set sd
topology flags differently for different cpus in the same sd level.
big.LITTLE is another example, but it's the same as
powergated/!powergated from this perspective.

>
> In the current example of this patchset, the flags are statically set
> in the table, but nothing prevents an architecture from updating the
> flags before the table is given to the scheduler.

What will the infrastructure be if the arch wants to do so?

Thanks,

-- Dietmar

[snip]



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-06 12:31     ` Dietmar Eggemann
@ 2014-03-07  2:47       ` Vincent Guittot
  2014-03-08 12:40         ` Dietmar Eggemann
  2014-03-11 13:08         ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-07  2:47 UTC (permalink / raw)
  To: Dietmar Eggemann, peterz
  Cc: mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 6 March 2014 20:31, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 06/03/14 09:04, Vincent Guittot wrote:
>>
>> On 6 March 2014 07:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 05/03/14 07:18, Vincent Guittot wrote:
[snip]
>>>
>>> Hi Vincent,
>>>
>>> I reviewed your patch-set carefully (including test runs on TC2),
>>> especially
>>> due to the fact that we want to build our sd_energy stuff on top of it.
>>
>>
>> Thanks
>>
>>>
>>>
>>> One thing I'm still not convinced of is the fact that specifying
>>> additional
>>> sd levels in the struct sched_domain_topology_level table has an
>>> advantage
>>> over a function pointer for sd topology flags similar to the one we're
>>> already using for the cpu mask in struct sched_domain_topology_level.
>>>
>>>    int (*sched_domain_flags_f)(int cpu);
>>>
>>
>> We have to create an additional level for some kinds of topology, as
>> described in my trial https://lkml.org/lkml/2013/12/18/279, which is
>> not possible with a function pointer.
>
>
> This is what I don't understand at the moment. In your example in the link
> above (2 clusters of 4 cores with SMT, cpus 0-7 can powergate while cpus
> 8-15 can't), why can't we have

The 2nd example is a bit more complicated and needs an additional
level to describe the topology

>
> static inline int cpu_coregroup_flags(int cpu)
> {
>     int flags = SD_SHARE_PKG_RESOURCES;
>
>     if (cpu >= 8)
>         flags |= SD_SHARE_POWERDOMAIN;
>
>     return flags;
> }
>
> inside the arch specific topology file and use it in the MC level as the int
> (*sched_domain_flags_f)(int cpu) member of struct
> sched_domain_topology_level?
>
>
>>
>> Have you got a situation in mind where it will be necessary to use the
>> function pointer with cpu number as an argument ?
>
>
> The one above. Fundamentally all situations where you want to set sd
> topology flags differently for different cpus in the same sd level.
> big.LITTLE is another example but it's the same as powergated/!powergated in
> this perspective.

You see the flags of a level as something that can be changed per cpu,
whereas the proposal is to define a number of levels with properties
that are interesting for the scheduler and to associate a mask of cpus
with those properties.

I don't have a strong opinion about using a cpu argument for setting
the flags of a level or not (it was part of the initial proposal before
we started to completely rework the build of sched_domain).
Nevertheless, I see one potential concern: two cpus could end up with
completely different flags configurations at the same sd level.

Peter,

Was the use of the cpu as a parameter in the initialization of
sched_domain's flag a reason for asking for reworking the
initialization of sched_domain ?

>
>
>>
>> In the current example of this patchset, the flags are statically set
>> in the table but nothing prevents an architecture to update the flags
>> value before being given to the scheduler
>
>
> What will be the infrastructure if the arch wants to do so?

I'm not sure I catch your point. The arch is now able to define its
own table and fill it before giving it to the scheduler, so nothing
prevents it from setting/updating the flags field according to the
system configuration during the boot sequence.

Thanks,
Vincent
>
> Thanks,
>
> -- Dietmar
>
>
>>
>>> This function pointer would be simply another member of struct
>>> sched_domain_topology_level and would replace int sd_flags.  AFAICS, you
>>> have to create additional cpu mask functions anyway for the additional sd
>>> levels, like cpu_corepower_mask() for the GMC level in the ARM case. There
>>> could be a set of standard sd topology flags functions for the default sd
>>> layer, and archs which want to pass in something special would define those
>>> functions locally since they will use them only in their arch specific
>>> struct sched_domain_topology_level table anyway.  I know that you use the
>>> existing sd degenerate functionality for this and that the freeing of the
>>> redundant data structures (sched_domain, sched_group and sched_group_power)
>>> is there too, but it still doesn't seem to me to be the right thing to do.
>>>
>>> The problem that we now expose internal data structures (struct sd_data
>>> and struct sched_domain_topology_level) could be dealt with later.
>>>
>>> -- Dietmar
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-07  2:47       ` Vincent Guittot
@ 2014-03-08 12:40         ` Dietmar Eggemann
  2014-03-10 13:21           ` Vincent Guittot
  2014-03-11 13:17           ` Peter Zijlstra
  2014-03-11 13:08         ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-08 12:40 UTC (permalink / raw)
  To: Vincent Guittot, peterz
  Cc: mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 07/03/14 02:47, Vincent Guittot wrote:
> On 6 March 2014 20:31, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 06/03/14 09:04, Vincent Guittot wrote:
>>>
>>> On 6 March 2014 07:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 05/03/14 07:18, Vincent Guittot wrote:
>>>>>
>>>>>
>>>>> [...]
>>>>
>>>> Hi Vincent,
>>>>
>>>> I reviewed your patch-set carefully (including test runs on TC2),
>>>> especially
>>>> due to the fact that we want to build our sd_energy stuff on top of it.
>>>
>>>
>>> Thanks
>>>
>>>>
>>>>
>>>> One thing I'm still not convinced of is the fact that specifying
>>>> additional
>>>> sd levels in the struct sched_domain_topology_level table has an
>>>> advantage
>>>> over a function pointer for sd topology flags similar to the one we're
>>>> already using for the cpu mask in struct sched_domain_topology_level.
>>>>
>>>>     int (*sched_domain_flags_f)(int cpu);
>>>>
>>>
>>> We have to create an additional level for some kinds of topology, as
>>> described in my trial https://lkml.org/lkml/2013/12/18/279, which is
>>> not possible with a function pointer.
>>
>>
>> This is what I don't understand at the moment. In your example in the link
>> above (2 clusters of 4 cores with SMT, cpus 0-7 can powergate while cpus
>> 8-15 can't), why can't we have
>
> The 2nd example is a bit more complicated and needs an additional
> level to describe the topology

I see. In the 2nd example you want to have an additional MC level for cpu
2-7 because you want to do load balancing between those cpus more often
than for cpu 0-7 for dst cpus from the set (2-7). Not sure if such
topologies (only a subset of cpus inside a cluster can't be powergated)
exist today in the real world though?

from https://lkml.org/lkml/2013/12/18/279:

CPU2:
domain 0: span 2-3 level: SMT
     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | 
SD_SHARE_POWERDOMAIN
     groups: 0 1              <-- Doesn't this have to be 'groups: 2 3'
   domain 1: span 2-7 level: MC
       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
       groups: 2-7 4-5 6-7
     domain 2: span 0-7 level: MC
         flags: SD_SHARE_PKG_RESOURCES
         groups: 2-7 0-1
       domain 3: span 0-15 level: CPU
           flags:
           groups: 0-7 8-15

>
>>
>> static inline int cpu_coregroup_flags(int cpu)
>> {
>>      int flags = SD_SHARE_PKG_RESOURCES;
>>
>>      if (cpu >= 8)
>>          flags |= SD_SHARE_POWERDOMAIN;
>>
>>      return flags;
>> }
>>
>> inside the arch specific topology file and use it in the MC level as the int
>> (*sched_domain_flags_f)(int cpu) member of struct
>> sched_domain_topology_level?
>>
>>
>>>
>>> Have you got a situation in mind where it will be necessary to use the
>>> function pointer with cpu number as an argument ?
>>
>>
>> The one above. Fundamentally all situations where you want to set sd
>> topology flags differently for different cpus in the same sd level.
>> big.LITTLE is another example but it's the same as powergated/!powergated in
>> this perspective.
>
> You see the flags of a level as something that can be changed per cpu,
> whereas the proposal is to define a number of levels with properties
> that are interesting for the scheduler and to associate a mask of cpus
> with those properties.

That's right.

>
> I don't have a strong opinion about using a cpu argument for setting
> the flags of a level or not (it was part of the initial proposal before
> we started to completely rework the build of sched_domain).
> Nevertheless, I see one potential concern: two cpus could end up with
> completely different flags configurations at the same sd level.

Could you elaborate a little bit further on the last sentence? Do you
think that those completely different flags configurations would make it
impossible for the load-balance code to work at all at this sd level?

>
> Peter,
>
> Was the use of the cpu as a parameter in the initialization of
> sched_domain's flag a reason for asking for reworking the
> initialization of sched_domain ?
>
>>
>>
>>>
>>> In the current example of this patchset, the flags are statically set
>>> in the table but nothing prevents an architecture to update the flags
>>> value before being given to the scheduler
>>
>>
>> What will be the infrastructure if the arch wants to do so?
>
> I'm not sure I catch your point. The arch is now able to define its
> own table and fill it before giving it to the scheduler, so nothing
> prevents it from setting/updating the flags field according to the
> system configuration during the boot sequence.

Sorry, a misunderstanding on my side in the first place. You just said
that the arch is able to change those flags (via the arch specific
topology table) and I related it to the possibility for the arch to set
the topology flags per cpu.

>
> Thanks,
> Vincent
>>
>> Thanks,
>>
>> -- Dietmar
>>
>>
>>>
>>>> [...]
>>>>
>>>> -- Dietmar
>>>>
>>>
>>
>>
>



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-08 12:40         ` Dietmar Eggemann
@ 2014-03-10 13:21           ` Vincent Guittot
  2014-03-11 13:17           ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-10 13:21 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel, preeti,
	linaro-kernel

On 8 March 2014 13:40, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 07/03/14 02:47, Vincent Guittot wrote:
>>
>> On 6 March 2014 20:31, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 06/03/14 09:04, Vincent Guittot wrote:
>>>>
>>>>
>>>> On 6 March 2014 07:17, Dietmar Eggemann <dietmar.eggemann@arm.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 05/03/14 07:18, Vincent Guittot wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> [...]
>>>>>>
>>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> I reviewed your patch-set carefully (including test runs on TC2),
>>>>> especially
>>>>> due to the fact that we want to build our sd_energy stuff on top of it.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>>
>>>>>
>>>>> One thing I'm still not convinced of is the fact that specifying
>>>>> additional
>>>>> sd levels in the struct sched_domain_topology_level table has an
>>>>> advantage
>>>>> over a function pointer for sd topology flags similar to the one we're
>>>>> already using for the cpu mask in struct sched_domain_topology_level.
>>>>>
>>>>>     int (*sched_domain_flags_f)(int cpu);
>>>>>
>>>>
>>>> We have to create an additional level for some kinds of topology, as
>>>> described in my trial https://lkml.org/lkml/2013/12/18/279, which is
>>>> not possible with a function pointer.
>>>
>>>
>>>
>>> This is what I don't understand at the moment. In your example in the link
>>> above (2 clusters of 4 cores with SMT, cpus 0-7 can powergate while cpus
>>> 8-15 can't), why can't we have
>>
>>
>> The 2nd example is a bit more complicated and needs an additional
>> level to describe the topology
>
>
> I see. In the 2nd example you want to have an additional MC level for cpu 2-7
> because you want to do load balancing between those cpus more often than for
> cpu 0-7 for dst cpus from the set (2-7). Not sure if such topologies (only a
> subset of cpus inside a cluster can't be powergated) exist today in the
> real world though?

Rest assured that such topologies are studied and considered by HW guys

>
> from https://lkml.org/lkml/2013/12/18/279:
>
> CPU2:
> domain 0: span 2-3 level: SMT
>     flags: SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>     groups: 0 1              <-- Doesn't this have to be 'groups: 2 3'
>   domain 1: span 2-7 level: MC
>       flags: SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN
>       groups: 2-7 4-5 6-7
>     domain 2: span 0-7 level: MC
>         flags: SD_SHARE_PKG_RESOURCES
>         groups: 2-7 0-1
>       domain 3: span 0-15 level: CPU
>           flags:
>           groups: 0-7 8-15
>
>
>>
>>>
>>> static inline int cpu_coregroup_flags(int cpu)
>>> {
>>>      int flags = SD_SHARE_PKG_RESOURCES;
>>>
>>>      if (cpu >= 8)
>>>          flags |= SD_SHARE_POWERDOMAIN;
>>>
>>>      return flags;
>>> }
>>>
>>> inside the arch specific topology file and use it in the MC level as the
>>> int
>>> (*sched_domain_flags_f)(int cpu) member of struct
>>> sched_domain_topology_level?
>>>
>>>
>>>>
>>>> Have you got a situation in mind where it will be necessary to use the
>>>> function pointer with cpu number as an argument ?
>>>
>>>
>>>
>>> The one above. Fundamentally all situations where you want to set sd
>>> topology flags differently for different cpus in the same sd level.
>>> big.LITTLE is another example but it's the same as powergated/!powergated
>>> in
>>> this perspective.
>>
>>
>> You see the flags of a level as something that can be changed per cpu,
>> whereas the proposal is to define a number of levels with properties
>> that are interesting for the scheduler and to associate a mask of cpus
>> with those properties.
>
>
> That's right.
>
>
>>
>> I don't have a strong opinion about using a cpu argument for setting
>> the flags of a level or not (it was part of the initial proposal before
>> we started to completely rework the build of sched_domain).
>> Nevertheless, I see one potential concern: two cpus could end up with
>> completely different flags configurations at the same sd level.
>
>
> Could you elaborate a little bit further on the last sentence? Do you
> think that those completely different flags configurations would make it
> impossible for the load-balance code to work at all at this sd level?

With such a function, an arch could set topology flags like
SD_SHARE_PKG_RESOURCES and SD_SHARE_CPUPOWER differently within the
same sd level, which makes the notion of an sd level meaningless.

>
>
>>
>> Peter,
>>
>> Was the use of the cpu as a parameter in the initialization of
>> sched_domain's flag a reason for asking for reworking the
>> initialization of sched_domain ?
>>
>>>
>>>
>>>>
>>>> In the current example of this patchset, the flags are statically set
>>>> in the table but nothing prevents an architecture to update the flags
>>>> value before being given to the scheduler
>>>
>>>
>>>
>>> What will be the infrastructure if the arch wants to do so?
>>
>>
>> I'm not sure I catch your point. The arch is now able to define its
>> own table and fill it before giving it to the scheduler, so nothing
>> prevents it from setting/updating the flags field according to the
>> system configuration during the boot sequence.
>
>
> Sorry, a misunderstanding on my side in the first place. You just said
> that the arch is able to change those flags (via the arch specific
> topology table) and I related it to the possibility for the arch to set
> the topology flags per cpu.

OK, I was not sure whether your point was only about the cpu argument
for the sd flags configuration.

Vincent

>
>
>>
>> Thanks,
>> Vincent
>>>
>>>
>>> Thanks,
>>>
>>> -- Dietmar
>>>
>>>
>>>>
>>>>> [...]
>>>>>
>>>>> -- Dietmar
>>>>>
>>>>
>>>
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-05  7:18 ` [RFC 4/6] sched: powerpc: " Vincent Guittot
@ 2014-03-11 10:08   ` Preeti U Murthy
  2014-03-11 13:18     ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Preeti U Murthy @ 2014-03-11 10:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, benh, linux, linux-arm-kernel,
	dietmar.eggemann, linaro-kernel

Hi Vincent,

On 03/05/2014 12:48 PM, Vincent Guittot wrote:
> Create a dedicated topology table for handling the asymmetric feature.
> The current proposal creates a new level which describes which groups of CPUs
> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
> build of the sched_domain topology.
> 
> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
> during the boot sequence and before the build of the sched_domain topology.

Is the below what you mean by the other solution? If so, I would
strongly recommend this approach rather than adding another level to the
topology table to represent the asymmetric behaviour.

+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};

Regards
Preeti U Murthy
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  arch/powerpc/kernel/smp.c |   35 +++++++++++++++++++++++++++--------
>  kernel/sched/core.c       |    6 ------
>  2 files changed, 27 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index ac2621a..75da054 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier)
>  	return 0;
>  }
> 
> +#ifdef CONFIG_SCHED_SMT
> +/* cpumask of CPUs with asymmetric SMT dependency */
> +static const struct cpumask *cpu_asmt_mask(int cpu)
> +{
> +	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
> +		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
> +		return topology_thread_cpumask(cpu);
> +	}
> +	return cpumask_of(cpu);
> +}
> +#endif
> +
> +static struct sched_domain_topology_level powerpc_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +	{ cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
> +	{ cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
> +#endif
> +	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +	{ NULL, },
> +};
> +
> +static void __init set_sched_topology(void)
> +{
> +	sched_domain_topology = powerpc_topology;
> +}
> +
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>  	cpumask_var_t old_mask;
> @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
> 
>  	dump_numa_cpu_topology();
> 
> -}
> +	set_sched_topology();
> 
> -int arch_sd_sibling_asym_packing(void)
> -{
> -	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
> -		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
> -		return SD_ASYM_PACKING;
> -	}
> -	return 0;
>  }
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3479467..7606de0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>  	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
>  }
> 
> -int __weak arch_sd_sibling_asym_packing(void)
> -{
> -       return 0*SD_ASYM_PACKING;
> -}
> -
>  /*
>   * Initializers for schedule domains
>   * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
> @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>  	if (sd->flags & SD_SHARE_CPUPOWER) {
>  		sd->imbalance_pct = 110;
>  		sd->smt_gain = 1178; /* ~15% */
> -		sd->flags |= arch_sd_sibling_asym_packing();
> 
>  	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>  		sd->imbalance_pct = 117;
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-06  8:32     ` Vincent Guittot
@ 2014-03-11 10:31       ` Peter Zijlstra
  2014-03-11 13:27         ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-03-11 10:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
> > Never got the new name DIE for CPU? Might confuse people when they use
> > /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
> 
> In fact, CPU is also confusing because it's used for different things.
> But if it makes things even more confusing, i can come back to CPU

Yeah, not sure DIE is the right thing either; because there are multi-die
packages that get classified under CPU :-)

Take for example the Core 2 Quad, which was 2 dual core dies glued
together in a single package.

There's also the AMD bulldozer which glued two dies into a single
package; but for those its not a problem because each die is a separate
numa node, so there DIE would actually be the correct term and PACKAGE
would be wrong.

So while CPU sucks, I'm not sure we can come up with anything that's
actually correct. That said; we could try for something less wrong than
CPU :-)

I'm not sure there are a lot of people who see/know the names of these
domains to be bothered by a change in them; it might be limited to just
us for all I know.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-07  2:47       ` Vincent Guittot
  2014-03-08 12:40         ` Dietmar Eggemann
@ 2014-03-11 13:08         ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2014-03-11 13:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On Fri, Mar 07, 2014 at 10:47:49AM +0800, Vincent Guittot wrote:
> Peter,
> 
> Was the use of the cpu as a parameter in the initialization of
> sched_domain's flag a reason for asking for reworking the
> initialization of sched_domain ?

/me tries very hard to remember.. and fails.

Reading back the linked thread also doesn't seem to jog memory.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-08 12:40         ` Dietmar Eggemann
  2014-03-10 13:21           ` Vincent Guittot
@ 2014-03-11 13:17           ` Peter Zijlstra
  2014-03-12 13:28             ` Dietmar Eggemann
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-03-11 13:17 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
> >
> >I don't have a strong opinion about using a cpu argument for setting
> >the flags of a level or not (it was part of the initial proposal before
> >we started to completely rework the build of sched_domain).
> >Nevertheless, I see one potential concern: two cpus could end up with
> >completely different flags configurations at the same sd level.
> 
> Could you elaborate a little bit further on the last sentence? Do you
> think that those completely different flags configurations would make it
> impossible for the load-balance code to work at all at this sd level?

So a problem with such an interface is that it makes it far too easy to
generate completely broken domains.

You can, for two cpus in the same domain, provide different flags; such
a configuration doesn't make any sense at all.

Now I see why people would like to have this; but unless we can make it
robust I'd be very hesitant to go this route.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-11 10:08   ` Preeti U Murthy
@ 2014-03-11 13:18     ` Vincent Guittot
  2014-03-12  4:42       ` Preeti U Murthy
  0 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-11 13:18 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, Benjamin Herrenschmidt,
	Russell King - ARM Linux, LAK, Dietmar Eggemann, linaro-kernel

On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>> Create a dedicated topology table for handling the asymmetric feature.
>> The current proposal creates a new level which describes which groups of CPUs
>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>> build of the sched_domain topology.
>>
>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>> during the boot sequence and before the build of the sched_domain topology.
>
> Is the below what you mean by the other solution? If so, I would
> strongly recommend this approach rather than adding another level to the
> topology table to represent the asymmetric behaviour.
>
> +static struct sched_domain_topology_level powerpc_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
> +#endif
> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +       { NULL, },
> +};

Not exactly like that but something like below

+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+ { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
SD_INIT_NAME(SMT) },
+#endif
+ { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+ { NULL, },
+};
+
+static void __init set_sched_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+ powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
+#endif
+ sched_domain_topology = powerpc_topology;
+}

>
> Regards
> Preeti U Murthy
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  arch/powerpc/kernel/smp.c |   35 +++++++++++++++++++++++++++--------
>>  kernel/sched/core.c       |    6 ------
>>  2 files changed, 27 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> index ac2621a..75da054 100644
>> --- a/arch/powerpc/kernel/smp.c
>> +++ b/arch/powerpc/kernel/smp.c
>> @@ -755,6 +755,32 @@ int setup_profiling_timer(unsigned int multiplier)
>>       return 0;
>>  }
>>
>> +#ifdef CONFIG_SCHED_SMT
>> +/* cpumask of CPUs with asymetric SMT dependancy */
>> +static const struct cpumask *cpu_asmt_mask(int cpu)
>> +{
>> +     if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
>> +             printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
>> +             return topology_thread_cpumask(cpu);
>> +     }
>> +     return cpumask_of(cpu);
>> +}
>> +#endif
>> +
>> +static struct sched_domain_topology_level powerpc_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> +     { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
>> +     { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(SMT) },
>> +#endif
>> +     { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +     { NULL, },
>> +};
>> +
>> +static void __init set_sched_topology(void)
>> +{
>> +     sched_domain_topology = powerpc_topology;
>> +}
>> +
>>  void __init smp_cpus_done(unsigned int max_cpus)
>>  {
>>       cpumask_var_t old_mask;
>> @@ -779,15 +805,8 @@ void __init smp_cpus_done(unsigned int max_cpus)
>>
>>       dump_numa_cpu_topology();
>>
>> -}
>> +     set_sched_topology();
>>
>> -int arch_sd_sibling_asym_packing(void)
>> -{
>> -     if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
>> -             printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
>> -             return SD_ASYM_PACKING;
>> -     }
>> -     return 0;
>>  }
>>
>>  #ifdef CONFIG_HOTPLUG_CPU
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3479467..7606de0 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -5818,11 +5818,6 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
>>       atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
>>  }
>>
>> -int __weak arch_sd_sibling_asym_packing(void)
>> -{
>> -       return 0*SD_ASYM_PACKING;
>> -}
>> -
>>  /*
>>   * Initializers for schedule domains
>>   * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
>> @@ -6000,7 +5995,6 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>>       if (sd->flags & SD_SHARE_CPUPOWER) {
>>               sd->imbalance_pct = 110;
>>               sd->smt_gain = 1178; /* ~15% */
>> -             sd->flags |= arch_sd_sibling_asym_packing();
>>
>>       } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>>               sd->imbalance_pct = 117;
>>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-11 10:31       ` Peter Zijlstra
@ 2014-03-11 13:27         ` Vincent Guittot
  2014-03-11 13:48           ` Dietmar Eggemann
  0 siblings, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-11 13:27 UTC (permalink / raw)
  To: Peter Zijlstra, Dietmar Eggemann
  Cc: fenghua.yu, linaro-kernel, tony.luck, linux, benh, linux-kernel,
	cmetcalf, mingo, preeti, james.hogan, schwidefsky,
	linux-arm-kernel

On 11 March 2014 11:31, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
>> > Never got the new name DIE for CPU? Might confuse people when they use
>> > /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
>>
>> In fact, CPU is also confusing because it's used for different things.
>> But if it makes things even more confusing, i can come back to CPU
>
> Yeah, not sure DIE is the right thing either; because there's multi-die
> packages that get classified under CPU :-)
>
> Take for example the Core 2 Quad, which was 2 dual core dies glued
> together in a single package.
>
> There's also the AMD bulldozer which glued two dies into a single
> package; but for those its not a problem because each die is a separate
> numa node, so there DIE would actually be the correct term and PACKAGE
> would be wrong.
>
> So while CPU sucks, I'm not sure we can come up with anything that's
> actually correct. That said; we could try for something less wrong than
> CPU :-)

OK

Dietmar,

Have you got another name than DIE that could suit better?
Otherwise I will keep it.

Vincent

>
> I'm not sure there are a lot of people who see/know the names of these
> domains to be bothered by a change in them; it might be limited to just
> us for all I know.
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/6] sched: rework of sched_domain topology definition
  2014-03-11 13:27         ` Vincent Guittot
@ 2014-03-11 13:48           ` Dietmar Eggemann
  0 siblings, 0 replies; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-11 13:48 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: fenghua.yu, linaro-kernel, tony.luck, linux, benh, linux-kernel,
	cmetcalf, mingo, preeti, james.hogan, schwidefsky,
	linux-arm-kernel

On 11/03/14 13:27, Vincent Guittot wrote:
> On 11 March 2014 11:31, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Mar 06, 2014 at 04:32:35PM +0800, Vincent Guittot wrote:
>>>> Never got the new name DIE for CPU? Might confuse people when they use
>>>> /proc/sys/kernel/sched_domain/cpuX/domainY/name or sched_domain_debug_one().
>>>
>>> In fact, CPU is also confusing because it's used for different things.
>>> But if it makes things even more confusing, i can come back to CPU
>>
>> Yeah, not sure DIE is the right thing either; because there's multi-die
>> packages that get classified under CPU :-)
>>
>> Take for example the Core 2 Quad, which was 2 dual core dies glued
>> together in a single package.
>>
>> There's also the AMD bulldozer which glued two dies into a single
>> package; but for those its not a problem because each die is a separate
>> numa node, so there DIE would actually be the correct term and PACKAGE
>> would be wrong.
>>
>> So while CPU sucks, I'm not sure we can come up with anything that's
>> actually correct. That said; we could try for something less wrong than
>> CPU :-)
> 
> OK
> 
> Dietmar,
> 
> Have you got another naming that DIE that could suit better ?
> otherwise i will keep it

If backward compatibility is not an issue here, keep it.

-- Dietmar

> 
> Vincent
> 
>>
>> I'm not sure there are a lot of people who see/know the names of these
>> domains to be bothered by a change in them; it might be limited to just
>> us for all I know.
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-11 13:18     ` Vincent Guittot
@ 2014-03-12  4:42       ` Preeti U Murthy
  2014-03-12  7:44         ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Preeti U Murthy @ 2014-03-12  4:42 UTC (permalink / raw)
  To: Vincent Guittot, Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, Russell King - ARM Linux,
	LAK, Dietmar Eggemann, linaro-kernel

On 03/11/2014 06:48 PM, Vincent Guittot wrote:
> On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>>> Create a dedicated topology table for handling the asymmetric feature.
>>> The current proposal creates a new level which describes which groups of CPUs
>>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>>> build of the sched_domain topology.
>>>
>>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>>> during the boot sequence and before the build of the sched_domain topology.
>>
>> Is the below what you mean as the other solution? If it is so, I would
>> strongly recommend this approach rather than adding another level to the
>> topology level to represent the asymmetric behaviour.
>>
>> +static struct sched_domain_topology_level powerpc_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
>> +#endif
>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +       { NULL, },
>> +};
> 
> Not exactly like that but something like below
> 
> +static struct sched_domain_topology_level powerpc_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
> SD_INIT_NAME(SMT) },
> +#endif
> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> + { NULL, },
> +};
> +
> +static void __init set_sched_topology(void)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
> +#endif
> + sched_domain_topology = powerpc_topology;
> +}

I think we can set it in powerpc_topology[] and not bother about setting
additional flags outside of this array. It is clearer this way.

On an additional note, on powerpc we would want to clear the
SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, considering we
have 8 threads per core, we would want to consolidate tasks at least up to
4 threads without significant performance impact before spilling over to
the other cores. By doing so, besides making use of the higher power of
the core, we could do cpuidle management at the core level for the
remaining idle cores as a result of this consolidation.

So the powerpc_topology[] would be something like the below:

+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
+#endif
+       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
+       { NULL, },
+};

The amount of consolidation to the threads in a core we will probably
take care of in cpu capacity or a similar parameter, but the above
topology level would be required to request the scheduler to try
consolidating tasks to cores until the cpu capacity (3/4/5 threads) has
been exceeded.

Regards
Preeti U Murthy



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-12  4:42       ` Preeti U Murthy
@ 2014-03-12  7:44         ` Vincent Guittot
  2014-03-12 11:04           ` Dietmar Eggemann
  2014-03-14  2:14           ` Preeti U Murthy
  0 siblings, 2 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-12  7:44 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, tony.luck, fenghua.yu, schwidefsky, james.hogan,
	cmetcalf, Russell King - ARM Linux, LAK, Dietmar Eggemann,
	linaro-kernel

On 12 March 2014 05:42, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 03/11/2014 06:48 PM, Vincent Guittot wrote:
>> On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi Vincent,
>>>
>>> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>>>> Create a dedicated topology table for handling the asymmetric feature.
>>>> The current proposal creates a new level which describes which groups of CPUs
>>>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>>>> build of the sched_domain topology.
>>>>
>>>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>>>> during the boot sequence and before the build of the sched_domain topology.
>>>
>>> Is the below what you mean as the other solution? If it is so, I would
>>> strongly recommend this approach rather than adding another level to the
>>> topology level to represent the asymmetric behaviour.
>>>
>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>> +#ifdef CONFIG_SCHED_SMT
>>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
>>> +#endif
>>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>> +       { NULL, },
>>> +};
>>
>> Not exactly like that but something like below
>>
>> +static struct sched_domain_topology_level powerpc_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) },
>> +#endif
>> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> + { NULL, },
>> +};
>> +
>> +static void __init set_sched_topology(void)
>> +{
>> +#ifdef CONFIG_SCHED_SMT
>> + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
>> +#endif
>> + sched_domain_topology = powerpc_topology;
>> +}
>
> I think we can set it in powerpc_topology[] and not bother about setting
> additional flags outside of this array. It is clearer this way.

IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at
runtime, which prevents it from being put directly in the table.
Otherwise it means that we should use a function pointer in the table
for setting flags instead of a static value as in the current proposal.

>
> On an additional note, on powerpc we would want to clear the
> SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, considering we
> have 8 threads per core, we would want to consolidate tasks atleast upto
> 4 threads without significant performance impact before spilling over to
> the other cores. By doing so, besides making use of the higher power of
> the core we could do cpuidle management at the core level for the
> remaining idle cores as a result of this consolidation.

OK. I will add the SD_SHARE_POWERDOMAIN flag as below.

Thanks
Vincent

>
> So the powerpc_topology[] would be something like the below:
>
> +static struct sched_domain_topology_level powerpc_topology[] = {
> +#ifdef CONFIG_SCHED_SMT
> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
> +#endif
> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +       { NULL, },
> +};
>
> The amount of consolidation to the threads in a core, we will probably
> take care of it in cpu capacity or a similar parameter, but the above
> topology level would be required to request the scheduler to try
> consolidating tasks to cores till the cpu capacity(3/4/5 threads) has
> exceeded.
>
> Regards
> Preeti U Murthy
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-12  7:44         ` Vincent Guittot
@ 2014-03-12 11:04           ` Dietmar Eggemann
  2014-03-14  2:30             ` Preeti U Murthy
  2014-03-14  2:14           ` Preeti U Murthy
  1 sibling, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-12 11:04 UTC (permalink / raw)
  To: Vincent Guittot, Preeti U Murthy
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, tony.luck, fenghua.yu, schwidefsky, james.hogan,
	cmetcalf, Russell King - ARM Linux, LAK, linaro-kernel

On 12/03/14 07:44, Vincent Guittot wrote:
> On 12 March 2014 05:42, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> On 03/11/2014 06:48 PM, Vincent Guittot wrote:
>>> On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>>> Hi Vincent,
>>>>
>>>> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>>>>> Create a dedicated topology table for handling the asymmetric feature.
>>>>> The current proposal creates a new level which describes which groups of CPUs
>>>>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>>>>> build of the sched_domain topology.
>>>>>
>>>>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>>>>> during the boot sequence and before the build of the sched_domain topology.
>>>>
>>>> Is the below what you mean as the other solution? If it is so, I would
>>>> strongly recommend this approach rather than adding another level to the
>>>> topology level to represent the asymmetric behaviour.
>>>>
>>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
>>>> +#endif
>>>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>>> +       { NULL, },
>>>> +};
>>>
>>> Not exactly like that but something like below
>>>
>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>> +#ifdef CONFIG_SCHED_SMT
>>> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>> SD_INIT_NAME(SMT) },
>>> +#endif
>>> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>> + { NULL, },
>>> +};
>>> +
>>> +static void __init set_sched_topology(void)
>>> +{
>>> +#ifdef CONFIG_SCHED_SMT
>>> + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
>>> +#endif
>>> + sched_domain_topology = powerpc_topology;
>>> +}
>>
>> I think we can set it in powerpc_topology[] and not bother about setting
>> additional flags outside of this array. It is clearer this way.
> 
> IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at
> runtime which prevents it from being put directly in the table. Or it
> means that we should use a function pointer in the table for setting
> flags instead of a static value like the current proposal.

Right,

static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
        { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
        { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(SMT) },
#endif
        { cpu_cpu_mask, SD_INIT_NAME(DIE) },
        { NULL, },
};

does not compile:

  CC      arch/powerpc/kernel/smp.o
arch/powerpc/kernel/smp.c:787:2: error: initializer element is not constant
arch/powerpc/kernel/smp.c:787:2: error: (near initialization for
'powerpc_topology[1].sd_flags')

So I'm in favour of a function pointer, even w/o the 'int cpu' parameter
to circumvent the issue that it is too easy to create broken sd setups.

-- Dietmar

> 
>>
>> On an additional note, on powerpc we would want to clear the
>> SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, considering we
>> have 8 threads per core, we would want to consolidate tasks atleast upto
>> 4 threads without significant performance impact before spilling over to
>> the other cores. By doing so, besides making use of the higher power of
>> the core we could do cpuidle management at the core level for the
>> remaining idle cores as a result of this consolidation.
> 
> OK. i will add the SD_SHARE_POWERDOMAIN like below
> 
> Thanks
> Vincent
> 
>>
>> So the powerpc_topology[] would be something like the below:
>>
>> +static struct sched_domain_topology_level powerpc_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
>> +#endif
>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +       { NULL, },
>> +};
>>
>> The amount of consolidation to the threads in a core, we will probably
>> take care of it in cpu capacity or a similar parameter, but the above
>> topology level would be required to request the scheduler to try
>> consolidating tasks to cores till the cpu capacity(3/4/5 threads) has
>> exceeded.
>>
>> Regards
>> Preeti U Murthy
>>
>>
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-11 13:17           ` Peter Zijlstra
@ 2014-03-12 13:28             ` Dietmar Eggemann
  2014-03-12 13:47               ` Vincent Guittot
  2014-03-17 11:52               ` Peter Zijlstra
  0 siblings, 2 replies; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-12 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On 11/03/14 13:17, Peter Zijlstra wrote:
> On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
>>>
>>> I don't have a strong opinion about using or not a cpu argument for
>>> setting the flags of a level (it was part of the initial proposal
>>> before we start to completely rework the build of sched_domain)
>>> Nevertheless, I see one potential concern that you can have completely
>>> different flags configuration of the same sd level of 2 cpus.
>>
>> Could you elaborate a little bit further regarding the last sentence? Do you
>> think that those completely different flags configuration would make it
>> impossible, that the load-balance code could work at all at this sd?
> 
> So a problem with such an interface is that it makes it far too easy to
> generate completely broken domains.

I see the point. What I'm still struggling with is to understand why
this interface is worse than the one where we set up additional,
adjacent sd levels with new cpu_foo_mask functions plus different static
sd-flags configurations and rely on the sd degenerate functionality in
the core scheduler to fold these levels together to achieve different
per-cpu sd flags configurations.

IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
arch is the reason why the core scheduler has to check if the arch
provides a sane sd setup in both cases.

> 
> You can, for two cpus in the same domain, provide different flags; such
> a configuration doesn't make any sense at all.
> 
> Now I see why people would like to have this; but unless we can make it
> robust I'd be very hesitant to go this route.
> 

By making it robust, I guess you mean that the core scheduler has to
check that the provided set-ups are sane, something like the following
code snippet in sd_init()

if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
		"wrong sd_flags in topology description\n"))
	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;

but for per-cpu set-ups.
Obviously, this check has to be in sync with the usage of these flags in
the core scheduler algorithms. This probably means that a subset of
these topology sd flags has to be set for all cpus in a sd level whereas
others can be set only for some cpus.

> 
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-12 13:28             ` Dietmar Eggemann
@ 2014-03-12 13:47               ` Vincent Guittot
  2014-03-13 14:07                 ` Dietmar Eggemann
  2014-03-17 11:52               ` Peter Zijlstra
  1 sibling, 1 reply; 35+ messages in thread
From: Vincent Guittot @ 2014-03-12 13:47 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On 12 March 2014 14:28, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 11/03/14 13:17, Peter Zijlstra wrote:
>> On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
>>>>
>>>> I don't have a strong opinion about using or not a cpu argument for
>>>> setting the flags of a level (it was part of the initial proposal
>>>> before we start to completely rework the build of sched_domain)
>>>> Nevertheless, I see one potential concern that you can have completely
>>>> different flags configuration of the same sd level of 2 cpus.
>>>
>>> Could you elaborate a little bit further regarding the last sentence? Do you
>>> think that those completely different flags configuration would make it
>>> impossible, that the load-balance code could work at all at this sd?
>>
>> So a problem with such an interface is that it makes it far too easy to
>> generate completely broken domains.
>
> I see the point. What I'm still struggling with is to understand why
> this interface is worse then the one where we set-up additional,
> adjacent sd levels with new cpu_foo_mask functions plus different static
> sd-flags configurations and rely on the sd degenerate functionality in
> the core scheduler to fold these levels together to achieve different
> per cpu sd flags configurations.

The main difference is that all CPUs have got the same levels in the
initial state, and then the degenerate sequence can decide whether it's
worth removing a level and whether doing so will create unusable domains.

>
> IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
> arch is the reason why the core scheduler has to check if the arch
> provides a sane sd setup in both cases.
>
>>
>> You can, for two cpus in the same domain, provide different flags; such
>> a configuration doesn't make any sense at all.
>>
>> Now I see why people would like to have this; but unless we can make it
>> robust I'd be very hesitant to go this route.
>>
>
> By making it robust, I guess you mean that the core scheduler has to
> check that the provided set-ups are sane, something like the following
> code snippet in sd_init()
>
> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>                 "wrong sd_flags in topology description\n"))
>         tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>
> but for per cpu set-up's.
> Obviously, this check has to be in sync with the usage of these flags in
> the core scheduler algorithms. This comprises probably that a subset of
> these topology sd flags has to be set for all cpus in a sd level whereas
> other can be set only for some cpus.
>
>>
>>
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-12 13:47               ` Vincent Guittot
@ 2014-03-13 14:07                 ` Dietmar Eggemann
  0 siblings, 0 replies; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-13 14:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On 12/03/14 13:47, Vincent Guittot wrote:
> On 12 March 2014 14:28, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 11/03/14 13:17, Peter Zijlstra wrote:
>>> On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
>>>>>
>>>>> I don't have a strong opinion about using or not a cpu argument for
>>>>> setting the flags of a level (it was part of the initial proposal
>>>>> before we start to completely rework the build of sched_domain)
>>>>> Nevertheless, I see one potential concern that you can have completely
>>>>> different flags configuration of the same sd level of 2 cpus.
>>>>
>>>> Could you elaborate a little bit further regarding the last sentence? Do you
>>>> think that those completely different flags configuration would make it
>>>> impossible, that the load-balance code could work at all at this sd?
>>>
>>> So a problem with such an interface is that it makes it far too easy to
>>> generate completely broken domains.
>>
>> I see the point. What I'm still struggling with is to understand why
>> this interface is worse then the one where we set-up additional,
>> adjacent sd levels with new cpu_foo_mask functions plus different static
>> sd-flags configurations and rely on the sd degenerate functionality in
>> the core scheduler to fold these levels together to achieve different
>> per cpu sd flags configurations.
> 
> The main difference is that all CPUs have got the same levels in the
> initial state, and then the degenerate sequence can decide whether it's
> worth removing a level and whether doing so will create unusable domains.
>

Agreed. But what I'm trying to say is that using the approach of
multiple adjacent sd levels with different cpu_mask(int cpu) functions
and static sd topology flags will not prevent us from coding the
enforcement of sane sd topology flags set-ups somewhere inside the core
scheduler.

It is possible to easily introduce erroneous set-ups from the standpoint
of sd topology flags with this approach too.

As an example, on the ARM TC2 platform I changed
cpu_corepower_mask(int cpu) [arch/arm/kernel/topology.c] to simulate
that in socket 1 (3 Cortex-A7 cores) the cores can power-gate
individually, whereas in socket 0 (2 Cortex-A15 cores) they can't:

 const struct cpumask *cpu_corepower_mask(int cpu)
 {
-       return &cpu_topology[cpu].thread_sibling;
+       return cpu_topology[cpu].socket_id ?
&cpu_topology[cpu].thread_sibling :
+                       &cpu_topology[cpu].core_sibling;
 }


With this I get the following cpu mask configuration:

dmesg snippet (w/ additional debug in cpu_coregroup_mask(),
cpu_corepower_mask()):

...
CPU0: cpu_corepower_mask=0-1
CPU0: cpu_coregroup_mask=0-1
CPU1: cpu_corepower_mask=0-1
CPU1: cpu_coregroup_mask=0-1
CPU2: cpu_corepower_mask=2
CPU2: cpu_coregroup_mask=2-4
CPU3: cpu_corepower_mask=3
CPU3: cpu_coregroup_mask=2-4
CPU4: cpu_corepower_mask=4
CPU4: cpu_coregroup_mask=2-4
...

And I deliberately introduced the following error into the
arm_topology[] table:

 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-       { cpu_corepower_mask, SD_SHARE_PKG_RESOURCES |
SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
+       { cpu_corepower_mask, SD_SHARE_POWERDOMAIN, SD_INIT_NAME(GMC) },
        { cpu_coregroup_mask, SD_SHARE_PKG_RESOURCES, SD_INIT_NAME(MC) },

With this set-up, I get GMC & DIE level for CPU0,1 and MC & DIE level
for CPU2,3,4, i.e. the SD_SHARE_PKG_RESOURCES flag is only set for
CPU2,3,4 and MC level.

dmesg snippet (w/ adapted sched_domain_debug_one(), only CPU0 and CPU2
shown here):

...
CPU0 attaching sched-domain:
domain 0: span 0-1 level GMC
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_POWERDOMAIN SD_PREFER_SIBLING
groups: 0 1
...
domain 1: span 0-4 level DIE
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_PREFER_SIBLING
groups: 0-1 (cpu_power = 2048) 2-4 (cpu_power = 3072)
...
CPU2 attaching sched-domain:
domain 0: span 2-4 level MC
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_SHARE_PKG_RESOURCES
groups: 2 3 4
...
domain 1: span 0-4 level DIE
SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK
SD_WAKE_AFFINE SD_PREFER_SIBLING
groups: 2-4 (cpu_power = 3072) 0-1 (cpu_power = 2048)
...

What I wanted to say is that, IMHO, it doesn't matter which approach we
take (multiple adjacent sd levels or a per-cpu topo sd flags function);
we have to enforce a sane sd topology flags set-up inside the core
scheduler anyway.

-- Dietmar

>>
>> IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
>> arch is the reason why the core scheduler has to check if the arch
>> provides a sane sd setup in both cases.
>>
>>>
>>> You can, for two cpus in the same domain, provide different flags; such
>>> a configuration doesn't make any sense at all.
>>>
>>> Now I see why people would like to have this; but unless we can make it
>>> robust, I'd be very hesitant to go this route.
>>>
>>
>> By making it robust, I guess you mean that the core scheduler has to
>> check that the provided set-ups are sane, something like the following
>> code snippet in sd_init()
>>
>> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>>                 "wrong sd_flags in topology description\n"))
>>         tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>
>> but for per-cpu set-ups.
>> Obviously, this check has to be in sync with the usage of these flags in
>> the core scheduler algorithms. This probably means that a subset of
>> these topology sd flags has to be set for all cpus in a sd level whereas
>> others can be set only for some cpus.
[...]


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-12  7:44         ` Vincent Guittot
  2014-03-12 11:04           ` Dietmar Eggemann
@ 2014-03-14  2:14           ` Preeti U Murthy
  1 sibling, 0 replies; 35+ messages in thread
From: Preeti U Murthy @ 2014-03-14  2:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Ingo Molnar,
	linux-kernel, tony.luck, fenghua.yu, schwidefsky, james.hogan,
	cmetcalf, Russell King - ARM Linux, LAK, Dietmar Eggemann,
	linaro-kernel

On 03/12/2014 01:14 PM, Vincent Guittot wrote:
> On 12 March 2014 05:42, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> On 03/11/2014 06:48 PM, Vincent Guittot wrote:
>>> On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>>> Hi Vincent,
>>>>
>>>> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>>>>> Create a dedicated topology table for handling asymmetric features.
>>>>> The current proposal creates a new level which describes which groups of CPUs
>>>>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>>>>> build of the sched_domain topology.
>>>>>
>>>>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>>>>> during the boot sequence and before the build of the sched_domain topology.
>>>>
>>>> Is the below what you mean as the other solution? If so, I would
>>>> strongly recommend this approach rather than adding another level to the
>>>> topology table to represent the asymmetric behaviour.
>>>>
>>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
>>>> +#endif
>>>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>>> +       { NULL, },
>>>> +};
>>>
>>> Not exactly like that but something like below
>>>
>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>> +#ifdef CONFIG_SCHED_SMT
>>> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>> SD_INIT_NAME(SMT) },
>>> +#endif
>>> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>> + { NULL, },
>>> +};
>>> +
>>> +static void __init set_sched_topology(void)
>>> +{
>>> +#ifdef CONFIG_SCHED_SMT
>>> + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
>>> +#endif
>>> + sched_domain_topology = powerpc_topology;
>>> +}
>>
>> I think we can set it in powerpc_topology[] and not bother about setting
>> additional flags outside of this array. It is clearer this way.
> 
> IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at
> runtime, which prevents it from being put directly in the table. Or it
> means that we should use a function pointer in the table for setting
> flags instead of a static value, as in the current proposal.

Oh yes, you are right. Then the above looks good to me. So we define the
table as usual and set the asymmetric flag in set_sched_topology().
> 
>>
>> On an additional note, on powerpc we would want to clear the
>> SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, considering we
>> have 8 threads per core, we would want to consolidate tasks at least up to
>> 4 threads without significant performance impact before spilling over to
>> the other cores. By doing so, besides making use of the higher power of
>> the core, we could do cpuidle management at the core level for the
>> remaining idle cores as a result of this consolidation.
> 
> OK. I will add the SD_SHARE_POWERDOMAIN like below

Thanks!

Regards
Preeti U Murthy
> 
> Thanks
> Vincent
> 
>>
>> So the powerpc_topology[] would be something like the below:
>>
>> +static struct sched_domain_topology_level powerpc_topology[] = {
>> +#ifdef CONFIG_SCHED_SMT
>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
>> +#endif
>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>> +       { NULL, },
>> +};
>>
>> The amount of consolidation to the threads in a core we will probably
>> take care of in cpu capacity or a similar parameter, but the above
>> topology level would be required to request the scheduler to try
>> consolidating tasks to cores until the cpu capacity (3/4/5 threads) has
>> been exceeded.
>>
>> Regards
>> Preeti U Murthy
>>
>>
> 



* Re: [RFC 4/6] sched: powerpc: create a dedicated topology table
  2014-03-12 11:04           ` Dietmar Eggemann
@ 2014-03-14  2:30             ` Preeti U Murthy
  0 siblings, 0 replies; 35+ messages in thread
From: Preeti U Murthy @ 2014-03-14  2:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Benjamin Herrenschmidt, Peter Zijlstra,
	Ingo Molnar, linux-kernel, tony.luck, fenghua.yu, schwidefsky,
	james.hogan, cmetcalf, Russell King - ARM Linux, LAK,
	linaro-kernel

On 03/12/2014 04:34 PM, Dietmar Eggemann wrote:
> On 12/03/14 07:44, Vincent Guittot wrote:
>> On 12 March 2014 05:42, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> On 03/11/2014 06:48 PM, Vincent Guittot wrote:
>>>> On 11 March 2014 11:08, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>>>> Hi Vincent,
>>>>>
>>>>> On 03/05/2014 12:48 PM, Vincent Guittot wrote:
>>>>>> Create a dedicated topology table for handling asymmetric features.
>>>>>> The current proposal creates a new level which describes which groups of CPUs
>>>>>> take advantage of SD_ASYM_PACKING. The useless level will be removed during the
>>>>>> build of the sched_domain topology.
>>>>>>
>>>>>> Another solution would be to set SD_ASYM_PACKING in the sd_flags of SMT level
>>>>>> during the boot sequence and before the build of the sched_domain topology.
>>>>>
>>>>> Is the below what you mean as the other solution? If so, I would
>>>>> strongly recommend this approach rather than adding another level to the
>>>>> topology table to represent the asymmetric behaviour.
>>>>>
>>>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>>>> +#ifdef CONFIG_SCHED_SMT
>>>>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>>>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() },
>>>>> +#endif
>>>>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>>>> +       { NULL, },
>>>>> +};
>>>>
>>>> Not exactly like that but something like below
>>>>
>>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> + { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>>> SD_INIT_NAME(SMT) },
>>>> +#endif
>>>> + { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>>> + { NULL, },
>>>> +};
>>>> +
>>>> +static void __init set_sched_topology(void)
>>>> +{
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> + powerpc_topology[0].sd_flags |= arch_sd_sibling_asym_packing();
>>>> +#endif
>>>> + sched_domain_topology = powerpc_topology;
>>>> +}
>>>
>>> I think we can set it in powerpc_topology[] and not bother about setting
>>> additional flags outside of this array. It is clearer this way.
>>
>> IIRC, the arch_sd_sibling_asym_packing of powerpc can be set at
>> runtime, which prevents it from being put directly in the table. Or it
>> means that we should use a function pointer in the table for setting
>> flags instead of a static value, as in the current proposal.
> 
> Right,
> 
> static struct sched_domain_topology_level powerpc_topology[] = {
> #ifdef CONFIG_SCHED_SMT
>         { cpu_asmt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
> SD_ASYM_PACKING, SD_INIT_NAME(ASMT) },
>         { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES |
> arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN, SD_INIT_NAME(SMT) },
> #endif
>         { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>         { NULL, },
> };
> 
> is not compiling:
> 
>   CC      arch/powerpc/kernel/smp.o
> arch/powerpc/kernel/smp.c:787:2: error: initializer element is not constant
> arch/powerpc/kernel/smp.c:787:2: error: (near initialization for
> 'powerpc_topology[1].sd_flags')
> 
> So I'm in favour of a function pointer, even w/o the 'int cpu' parameter,
> to circumvent the issue that it is too easy to create broken sd setups.

Alright, this looks fine to me. You could use the function pointer to
retrieve flags and have all initializations of sched domain features
consolidated in the table.

Regards
Preeti U Murthy
> 
> -- Dietmar
> 
>>
>>>
>>> On an additional note, on powerpc we would want to clear the
>>> SD_SHARE_POWERDOMAIN flag at the CPU domain. On Power8, considering we
>>> have 8 threads per core, we would want to consolidate tasks at least up to
>>> 4 threads without significant performance impact before spilling over to
>>> the other cores. By doing so, besides making use of the higher power of
>>> the core, we could do cpuidle management at the core level for the
>>> remaining idle cores as a result of this consolidation.
>>
>> OK. I will add the SD_SHARE_POWERDOMAIN like below
>>
>> Thanks
>> Vincent
>>
>>>
>>> So the powerpc_topology[] would be something like the below:
>>>
>>> +static struct sched_domain_topology_level powerpc_topology[] = {
>>> +#ifdef CONFIG_SCHED_SMT
>>> +       { cpu_smt_mask, SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES,
>>> SD_INIT_NAME(SMT) | arch_sd_sibling_asym_packing() | SD_SHARE_POWERDOMAIN },
>>> +#endif
>>> +       { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>>> +       { NULL, },
>>> +};
>>>
>>> The amount of consolidation to the threads in a core we will probably
>>> take care of in cpu capacity or a similar parameter, but the above
>>> topology level would be required to request the scheduler to try
>>> consolidating tasks to cores until the cpu capacity (3/4/5 threads) has
>>> been exceeded.
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>>
>>
> 
> 



* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-12 13:28             ` Dietmar Eggemann
  2014-03-12 13:47               ` Vincent Guittot
@ 2014-03-17 11:52               ` Peter Zijlstra
  2014-03-19 19:15                 ` Dietmar Eggemann
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2014-03-17 11:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
> On 11/03/14 13:17, Peter Zijlstra wrote:
> > On Sat, Mar 08, 2014 at 12:40:58PM +0000, Dietmar Eggemann wrote:
> >>>
> >>> I don't have a strong opinion about using a cpu argument or not for
> >>> setting the flags of a level (it was part of the initial proposal
> >>> before we started to completely rework the build of sched_domain).
> >>> Nevertheless, I see one potential concern: you can have completely
> >>> different flags configurations for the same sd level of 2 cpus.
> >>
> >> Could you elaborate a little bit further on the last sentence? Do you
> >> think that those completely different flags configurations would make it
> >> impossible for the load-balance code to work at all at this sd?
> > 
> > So a problem with such an interface is that it makes it far too easy to
> > generate completely broken domains.
> 
> I see the point. What I'm still struggling with is to understand why
> this interface is worse than the one where we set up additional,
> adjacent sd levels with new cpu_foo_mask functions plus different static
> sd-flags configurations and rely on the sd degenerate functionality in
> the core scheduler to fold these levels together to achieve different
> per-cpu sd flags configurations.

Well, the folding of SD levels is 'safe' in that it keeps domains
internally consistent.

> IMHO, exposing struct sched_domain_topology_level bar_topology[] to the
> arch is the reason why the core scheduler has to check if the arch
> provides a sane sd setup in both cases.

Up to a point, yes. On the other hand, the reason we have the degenerate
stuff is that the topology was generic and might contain pointless
levels the architecture didn't actually have.

By moving the topology setup into the arch, that could be made to go
away (not sure you want to do that, but you could).

But yes, by moving the topology setup out of the core code, you need
some extra validation to make sure that whatever you're fed makes some
kind of sense.

> > You can, for two cpus in the same domain, provide different flags; such
> > a configuration doesn't make any sense at all.
> > 
> > Now I see why people would like to have this; but unless we can make it
> > robust, I'd be very hesitant to go this route.
> > 
> 
> By making it robust, I guess you mean that the core scheduler has to
> check that the provided set-ups are sane, something like the following
> code snippet in sd_init()
> 
> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
> 		"wrong sd_flags in topology description\n"))
> 	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
> 
> but for per-cpu set-ups.

So a domain is principally a group of CPUs with the same properties.
However, per-cpu domain attributes allow you to specify different domain
properties within the one domain mask.

That's completely broken.

So the way to validate something like that would be:

	cpu = cpumask_first(tl->mask());
	flags = tl->flags(cpu);

	for (;cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
		BUG_ON(tl->flags(cpu) != flags);

Or something along those lines.

But for me it's far easier to think in the simple one-domain-one-flags
scenario. The whole degenerate folding is a very simple optimization that
simply removes redundant levels.


* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-17 11:52               ` Peter Zijlstra
@ 2014-03-19 19:15                 ` Dietmar Eggemann
  2014-03-20  8:28                   ` Vincent Guittot
  0 siblings, 1 reply; 35+ messages in thread
From: Dietmar Eggemann @ 2014-03-19 19:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On 17/03/14 11:52, Peter Zijlstra wrote:
> On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
[...]
>> By making it robust, I guess you mean that the core scheduler has to
>> check that the provided set-ups are sane, something like the following
>> code snippet in sd_init()
>>
>> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>> 		"wrong sd_flags in topology description\n"))
>> 	tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>
>> but for per-cpu set-ups.
> 
> So a domain is principally a group of CPUs with the same properties.
> However, per-cpu domain attributes allow you to specify different domain
> properties within the one domain mask.
> 
> That's completely broken.
> 
> So the way to validate something like that would be:
> 
> 	cpu = cpumask_first(tl->mask());
> 	flags = tl->flags(cpu);
> 
> 	for (;cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
> 		BUG_ON(tl->flags(cpu) != flags);
> 
> Or something along those lines.

I tried this idea inside sd_init() on top of Vincent's V3 and it's doing
its job. 

> 
> But for me its far easier to think in the simple one domain one flags
> scenario. The whole degenerate folding is a very simple optimization
> simply removing redundant levels.
> 

For me, the approach with the 'int cpu' parameter in the flags function is
easier to understand. One of the things I had to grasp, though, was the fact
that we can only specify SD_SHARE_FOO flags and not SD_NOT_SHARE_FOO per
domain.

-- >8 --

Subject: [PATCH] sched: check that the sd_flags are consistent in one domain

---
 arch/arm/kernel/topology.c |   13 +++++++++----
 include/linux/sched.h      |    6 +++---
 kernel/sched/core.c        |   11 +++++++++--
 3 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec6d31a..425f133c690d 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -275,15 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
-static inline const int cpu_corepower_flags(void)
+//static inline const int cpu_corepower_flags(void)
+//{
+//	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+//}
+
+static inline const int arm_cpu_core_flags(int cpu)
 {
-	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+	return (cpu < 2) ? SD_SHARE_PKG_RESOURCES : SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+//	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
+	{ cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 05ce264e5144..45e5aa3d3e80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -870,14 +870,14 @@ enum cpu_idle_type {
 #define SD_NUMA			0x4000	/* cross-node balancing */
 
 #ifdef CONFIG_SCHED_SMT
-static inline const int cpu_smt_flags(void)
+static inline const int cpu_smt_flags(int cpu)
 {
        return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
 #ifdef CONFIG_SCHED_MC
-static inline const int cpu_core_flags(void)
+static inline const int cpu_core_flags(int cpu)
 {
        return SD_SHARE_PKG_RESOURCES;
 }
@@ -990,7 +990,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 bool cpus_share_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
-typedef const int (*sched_domain_flags_f)(void);
+typedef const int (*sched_domain_flags_f)(int cpu);
 
 #define SDTL_OVERLAP   0x01
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2ee6c72b13a..6b8ba837977c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5968,7 +5968,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
        sd_weight = cpumask_weight(tl->mask(cpu));
 
        if (tl->sd_flags)
-               sd_flags = (*tl->sd_flags)();
+               sd_flags = (*tl->sd_flags)(cpu);
        if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
                        "wrong sd_flags in topology description\n"))
                sd_flags &= ~TOPOLOGY_SD_FLAGS;
@@ -6044,9 +6044,16 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
                sd->idle_idx = 1;
        }
 
+       if (tl->sd_flags) {
+	       int flags = (*tl->sd_flags)(cpumask_first(tl->mask(cpu)));
+
+	       for (;cpu = cpumask_next(cpu, tl->mask(cpu)), cpu < nr_cpu_ids;)
+		       BUG_ON((*tl->sd_flags)(cpu) != flags);
+       }
+
        sd->private = &tl->data;
 
-	return sd;
+       return sd;
 }
 
 /*
-- 
1.7.9.5


* Re: [RFC 0/6] rework sched_domain topology description
  2014-03-19 19:15                 ` Dietmar Eggemann
@ 2014-03-20  8:28                   ` Vincent Guittot
  0 siblings, 0 replies; 35+ messages in thread
From: Vincent Guittot @ 2014-03-20  8:28 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, mingo, linux-kernel, tony.luck, fenghua.yu,
	schwidefsky, james.hogan, cmetcalf, benh, linux,
	linux-arm-kernel, preeti, linaro-kernel

On 19 March 2014 20:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 17/03/14 11:52, Peter Zijlstra wrote:
>> On Wed, Mar 12, 2014 at 01:28:07PM +0000, Dietmar Eggemann wrote:
> [...]
>>> By making it robust, I guess you mean that the core scheduler has to
>>> check that the provided set-ups are sane, something like the following
>>> code snippet in sd_init()
>>>
>>> if (WARN_ONCE(tl->sd_flags & ~TOPOLOGY_SD_FLAGS,
>>>              "wrong sd_flags in topology description\n"))
>>>      tl->sd_flags &= ~TOPOLOGY_SD_FLAGS;
>>>
>>> but for per-cpu set-ups.
>>
>> So a domain is principally a group of CPUs with the same properties.
>> However, per-cpu domain attributes allow you to specify different domain
>> properties within the one domain mask.
>>
>> That's completely broken.
>>
>> So the way to validate something like that would be:
>>
>>       cpu = cpumask_first(tl->mask());
>>       flags = tl->flags(cpu);
>>
>>       for (;cpu = cpumask_next(cpu, tl->mask()), cpu < nr_cpu_ids;)
>>               BUG_ON(tl->flags(cpu) != flags);
>>
>> Or something along those lines.
>
> I tried this idea inside sd_init() on top of Vincent's V3 and it's doing
> its job.
>
>>
>> But for me it's far easier to think in the simple one-domain-one-flags
>> scenario. The whole degenerate folding is a very simple optimization that
>> simply removes redundant levels.
>>
>
> For me, the approach with the 'int cpu' parameter in the flags function is
> easier to understand. One of the things I had to grasp, though, was the fact
> that we can only specify SD_SHARE_FOO flags and not SD_NOT_SHARE_FOO per
> domain.

Looking at your test below, the solution without a cpu argument is more
readable to me because you don't have to handle 2 cpu args that can
vary when setting up your level. I'm afraid that the flags function will
become quite complex and unreadable with a cpu arg. And this
additional cpu arg doesn't give any benefit.

Vincent

>
> -- >8 --
>
> Subject: [PATCH] sched: check that the sd_flags are consistent in one domain
>
> ---
>  arch/arm/kernel/topology.c |   13 +++++++++----
>  include/linux/sched.h      |    6 +++---
>  kernel/sched/core.c        |   11 +++++++++--
>  3 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 71e1fec6d31a..425f133c690d 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -275,15 +275,20 @@ void store_cpu_topology(unsigned int cpuid)
>                 cpu_topology[cpuid].socket_id, mpidr);
>  }
>
> -static inline const int cpu_corepower_flags(void)
> +//static inline const int cpu_corepower_flags(void)
> +//{
> +//     return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
> +//}
> +
> +static inline const int arm_cpu_core_flags(int cpu)
>  {
> -       return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
> +       return (cpu < 2) ? SD_SHARE_PKG_RESOURCES : SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>  }
>
>  static struct sched_domain_topology_level arm_topology[] = {
>  #ifdef CONFIG_SCHED_MC
> -       { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> -       { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> +//     { cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
> +       { cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
>  #endif
>         { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>         { NULL, },
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 05ce264e5144..45e5aa3d3e80 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -870,14 +870,14 @@ enum cpu_idle_type {
>  #define SD_NUMA                        0x4000  /* cross-node balancing */
>
>  #ifdef CONFIG_SCHED_SMT
> -static inline const int cpu_smt_flags(void)
> +static inline const int cpu_smt_flags(int cpu)
>  {
>         return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES;
>  }
>  #endif
>
>  #ifdef CONFIG_SCHED_MC
> -static inline const int cpu_core_flags(void)
> +static inline const int cpu_core_flags(int cpu)
>  {
>         return SD_SHARE_PKG_RESOURCES;
>  }
> @@ -990,7 +990,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
>  bool cpus_share_cache(int this_cpu, int that_cpu);
>
>  typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> -typedef const int (*sched_domain_flags_f)(void);
> +typedef const int (*sched_domain_flags_f)(int cpu);
>
>  #define SDTL_OVERLAP   0x01
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f2ee6c72b13a..6b8ba837977c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5968,7 +5968,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>         sd_weight = cpumask_weight(tl->mask(cpu));
>
>         if (tl->sd_flags)
> -               sd_flags = (*tl->sd_flags)();
> +               sd_flags = (*tl->sd_flags)(cpu);
>         if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
>                         "wrong sd_flags in topology description\n"))
>                 sd_flags &= ~TOPOLOGY_SD_FLAGS;
> @@ -6044,9 +6044,16 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>                 sd->idle_idx = 1;
>         }
>
> +       if (tl->sd_flags) {
> +              int flags = (*tl->sd_flags)(cpumask_first(tl->mask(cpu)));
> +
> +              for (;cpu = cpumask_next(cpu, tl->mask(cpu)), cpu < nr_cpu_ids;)
> +                      BUG_ON((*tl->sd_flags)(cpu) != flags);
> +       }
> +
>         sd->private = &tl->data;
>
> -       return sd;
> +       return sd;
>  }
>
>  /*
> --
> 1.7.9.5
>
>
>
>
>


end of thread, other threads:[~2014-03-20  8:28 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-03-05  7:18 [RFC 0/6] rework sched_domain topology description Vincent Guittot
2014-03-05  7:18 ` [RFC 1/6] sched: remove unused SCHED_INIT_NODE Vincent Guittot
2014-03-05  7:18 ` [PATCH 2/6] sched: rework of sched_domain topology definition Vincent Guittot
2014-03-05 17:09   ` Dietmar Eggemann
2014-03-06  8:32     ` Vincent Guittot
2014-03-11 10:31       ` Peter Zijlstra
2014-03-11 13:27         ` Vincent Guittot
2014-03-11 13:48           ` Dietmar Eggemann
2014-03-05  7:18 ` [RFC 3/6] sched: s390: create a dedicated topology table Vincent Guittot
2014-03-05  7:18 ` [RFC 4/6] sched: powerpc: " Vincent Guittot
2014-03-11 10:08   ` Preeti U Murthy
2014-03-11 13:18     ` Vincent Guittot
2014-03-12  4:42       ` Preeti U Murthy
2014-03-12  7:44         ` Vincent Guittot
2014-03-12 11:04           ` Dietmar Eggemann
2014-03-14  2:30             ` Preeti U Murthy
2014-03-14  2:14           ` Preeti U Murthy
2014-03-05  7:18 ` [RFC 5/6] sched: add a new SD_SHARE_POWERDOMAIN for sched_domain Vincent Guittot
2014-03-05  7:18 ` [RFC 6/6] sched: ARM: create a dedicated scheduler topology table Vincent Guittot
2014-03-05 22:38   ` Dietmar Eggemann
2014-03-06  8:42     ` Vincent Guittot
2014-03-05 23:17 ` [RFC 0/6] rework sched_domain topology description Dietmar Eggemann
2014-03-06  9:04   ` Vincent Guittot
2014-03-06 12:31     ` Dietmar Eggemann
2014-03-07  2:47       ` Vincent Guittot
2014-03-08 12:40         ` Dietmar Eggemann
2014-03-10 13:21           ` Vincent Guittot
2014-03-11 13:17           ` Peter Zijlstra
2014-03-12 13:28             ` Dietmar Eggemann
2014-03-12 13:47               ` Vincent Guittot
2014-03-13 14:07                 ` Dietmar Eggemann
2014-03-17 11:52               ` Peter Zijlstra
2014-03-19 19:15                 ` Dietmar Eggemann
2014-03-20  8:28                   ` Vincent Guittot
2014-03-11 13:08         ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).